The Horseless Library
Digital Library Discussions

Tuesday December 12, 2006

Nifty search tool for Iraq report

if:book has an interesting post on inventive ways to present public-domain texts. http://www.futureofthebook.org/blog/archives/2006/12/how_would_you_design_the_iraq.html

One example comes from Vivisimo, a search engine company, which did the following:

"The search engine crawled the PDF file of the final report issued by
the Iraq Study Group and indexed it by paragraph rather than as a
single document. By breaking the 142 page file into paragraphs, readers
of the report can now search for specific aspects that are of interest
to them instead of having to read through the entire document or
perform a tedious keyword search within the document using the Acrobat
application. When a search query is entered, the search engine returns
the relevant paragraphs in the search results. Additionally, the Velocity Clustering Engine
is used to cluster the search results into related topic areas. The
clusters allow the reader to easily browse related information and
uncover relationships between topics within the report. This demo was
placed online by Vivísimo within minutes of the publication of the
final report, showing the speed with which the Vivísimo Velocity Search Platform can be deployed."
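To make the "index by paragraph" idea concrete, here is a minimal sketch in Python. It is not Vivisimo's implementation, which the quote does not describe. It assumes the PDF has already been converted to a list of paragraph strings, and the function names and sample paragraphs are invented for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase a string and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(paragraphs):
    """Map each term to the set of paragraph numbers that contain it."""
    index = defaultdict(set)
    for number, paragraph in enumerate(paragraphs):
        for term in tokenize(paragraph):
            index[term].add(number)
    return index

def search(index, paragraphs, query):
    """Return the paragraphs that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return [paragraphs[n] for n in sorted(matches)]

# Invented sample paragraphs, not text from the report.
paragraphs = [
    "Training of security forces should be accelerated.",
    "Regional diplomacy should involve neighboring states.",
    "Security forces need better equipment and training.",
]
index = build_index(paragraphs)
results = search(index, paragraphs, "security training")
```

Because each paragraph is its own record, a query returns only the matching passages rather than a single hit for the whole 142-page file.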

I don't intuitively see how they've selected the topics that "cluster" under my search terms, but it's an interesting set of views of this text.
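One plausible way to arrive at topic groupings like these is to label clusters with terms that several result paragraphs share. The sketch below illustrates that general idea only. It is an assumption, not the Velocity Clustering Engine's method, and the stopword list, function names, and sample results are all made up.

```python
import re
from collections import defaultdict

# A tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "should", "be", "is"}

def tokenize(text):
    """Lowercase a string and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cluster_by_shared_terms(results, query, min_size=2):
    """Group result paragraphs under terms that at least min_size of them share."""
    ignored = STOPWORDS | set(tokenize(query))
    groups = defaultdict(list)
    for paragraph in results:
        for term in sorted(set(tokenize(paragraph)) - ignored):
            groups[term].append(paragraph)
    return {term: members for term, members in groups.items()
            if len(members) >= min_size}

# Invented search results for the query "training".
results = [
    "Iraqi police training should be expanded.",
    "Army training needs more embedded advisers.",
    "Police units need embedded advisers.",
]
clusters = cluster_by_shared_terms(results, "training")
```

A paragraph can land in more than one cluster here, which matches the browsing feel of the demo, but a production engine would presumably use phrase extraction and ranking rather than raw shared words.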

I wonder what we could learn from this. Are clustered or faceted views of search results something we could offer with our own digital content and Endeca? I'd like to consider useful ways to offer access to content we create, and would be glad to hear from those with a grasp of the technical challenges.

Posted by Monica McCormick | Dec 12 2006, 06:08:01 PM EST

Comments:

Here is a direct link to the Vivisimo Iraq Study Group demo.

Posted by Tito Sierra on December 13, 2006 at 05:00 PM EST #

You might want to see about clearing out your referrers list... an NCSU-hosted blog is showing referrers of "cream-me.com" and "hotsweetteens.com", which is rather amusing.

As for Endeca's ability to parse data into paragraphs (I'm not an Endeca engineer or employed by them, just a new Endeca user, so take this with a grain of salt), I would say "yes," since you can create record manipulators and data modification scripts using Perl. Perl is outstanding at parsing text, so I would assume that you can convert any document into text (using a record manipulator) and then parse that text into individual paragraph records for inclusion in an index.
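The text-to-paragraph-records step might look roughly like the sketch below. It is written in Python rather than Perl, it does not use Endeca's record manipulator API, and it assumes the document has already been converted to plain text with blank lines between paragraphs. The record field names ("id", "source", "text") are hypothetical.

```python
import re

def paragraph_records(text, source):
    """Split plain text on blank lines into one record per paragraph."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        body = " ".join(block.split())  # rejoin hard-wrapped lines
        if body:
            records.append({
                "id": f"{source}#p{len(records) + 1}",
                "source": source,
                "text": body,
            })
    return records

# Invented sample text and file name.
sample = "First paragraph,\nwrapped over two lines.\n\n\nSecond paragraph.\n"
records = paragraph_records(sample, "report.txt")
```

Each record carries its own identifier and a pointer back to the source document, so an index built from these records can return individual paragraphs while still linking to the full text.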

It just so happens I'm in an Endeca training class right now; maybe I'll ask.

Posted by Havagan on December 14, 2006 at 10:02 AM EST #

From what I understand, Endeca can indeed do this with their Relationship Discovery module...something I look forward to playing around with in version 5.0.

Posted by Andrew Pace on December 14, 2006 at 01:47 PM EST #

I haven't played around with 5.0 yet, but I raised the question of using a Perl manipulator (or Perl inside a record manipulator) with 4.8.2, and the instructor agreed that it could easily be done. The idea intrigues me, so I'll be playing with it over the next couple of weeks, as it would be a nice feature for our document search.

Posted by Havagan on December 14, 2006 at 02:11 PM EST #



Horseless Library image by Herman Berkhoff