
Monday July 10, 2006
Snippets
Text snippets are different from abstracts and summaries because they are algorithmically extracted from the source text, rather than editorially created to function as a summary or teaser. For example, compare the news headline treatments of Google News with The New York Times online. In its headline blurbs, Google News uses the beginning of the source news article, up to a prescribed number of words or characters, as the snippet. The New York Times blurb is hand authored, and functions as a traditional abstract. The Google News approach arguably employs the most common snippet heuristic, one also used in RSS feeds, blog comments, product reviews, etc. The presumption here, I think, is that the beginning of the text is the most useful part of the text to use in the snippet.
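As a rough illustration of the leading-text heuristic, here is a minimal Python sketch. The function name `leading_snippet` and the `max_words` default are hypothetical, chosen only for this example; nothing here reflects how Google News actually builds its blurbs.

```python
def leading_snippet(text: str, max_words: int = 30, ellipsis: str = "...") -> str:
    """Return the first `max_words` words of `text` as a snippet.

    Assumption: words are whitespace-delimited, and a trailing ellipsis
    is appended only when the text was actually truncated.
    """
    words = text.split()
    if len(words) <= max_words:
        # Short texts are returned whole, with whitespace normalized.
        return " ".join(words)
    return " ".join(words[:max_words]) + ellipsis
```

Cutting on a word count rather than a character count avoids chopping a word in half, at the cost of snippets with less predictable lengths.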
Search engines often employ a different method for generating snippets in search results. Google results typically contain auto-generated snippets derived by extracting and combining sentence fragments from the indexed webpage that contain the keyword(s) searched for by the user. This turns out to be a useful method for generating teaser text because it literally puts the keyword in context. A similar method of generating snippets is used in Google Book Search.
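The keyword-in-context idea can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the fragments are fixed-size character windows around each match rather than true sentence fragments, and the names `keyword_snippet`, `window`, and `max_fragments` are made up for illustration. It is not a description of how any search engine really does it.

```python
import re


def keyword_snippet(text, keywords, window=40, max_fragments=3):
    """Join short fragments of `text` that surround the searched keywords.

    Assumptions: matching is case-insensitive, each fragment spans
    `window` characters on either side of a match, and matches that fall
    inside an already extracted fragment are skipped.
    """
    keywords = [k for k in keywords if k]
    if not keywords:
        return ""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    fragments = []
    covered_until = 0  # end offset of the most recent fragment
    for match in pattern.finditer(text):
        if match.start() < covered_until:
            continue  # this keyword already appears in the previous fragment
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        fragments.append(text[start:end].strip())
        covered_until = end
        if len(fragments) == max_fragments:
            break
    return " ... ".join(fragments)
```

A fuller version would probably snap the window edges to word or sentence boundaries and highlight the matched terms.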
Recently I learned of The Final Word, a self-described media experiment that presents New York Times headlines by conjoining each headline with the last paragraph of the Times article. In other words, the "punchline" is used as a teaser for the article. In some cases the last paragraph functions as a true summary. In other cases the last paragraph consists only of a pithy quote. It's unclear to me how useful this is for scanning headlines, but it does make me think that snippet generation is more of an art than a science.
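The last-paragraph approach is easy to mock up. The sketch below assumes paragraphs are separated by blank lines and joins the pieces with a colon; both choices, along with the function name `final_paragraph_teaser`, are guesses made for illustration and say nothing about how The Final Word is actually implemented.

```python
import re


def final_paragraph_teaser(headline: str, article_text: str) -> str:
    """Combine a headline with the last paragraph of the article.

    Assumption: paragraphs are delimited by one or more blank lines.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", article_text) if p.strip()]
    if not paragraphs:
        # Nothing to append, so fall back to the bare headline.
        return headline
    return f"{headline}: {paragraphs[-1]}"
```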
I can imagine a variety of algorithms and heuristics for generating snippets that are more or less useful for specific audiences, or specific types of content. A simple example is competitive intelligence. Corporations have an interest in what their competitors are up to, and are especially interested in news where their own corporation is mentioned, even in cases when they are not the focus of the article. In this context it would be useful to summarize the article by conjoining sentences containing the company names (self and competitors), perhaps highlighting article headlines that contain both. For reviews I wonder if adjectives could play a useful role in snippet generation.
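To make the competitive intelligence example concrete, here is a minimal Python sketch. It assumes a naive sentence splitter (break after `.`, `!`, or `?`) and plain case-insensitive name matching; the function names and the company names in the usage example are hypothetical.

```python
import re


def company_snippet(article, own_name, competitors, max_sentences=3):
    """Conjoin the sentences that mention our company or a competitor."""
    names = [own_name, *competitors]
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b", re.IGNORECASE
    )
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    hits = [s for s in sentences if pattern.search(s)]
    return " ".join(hits[:max_sentences])


def headline_mentions_both(headline, own_name, competitors):
    """True when a headline names our company and at least one competitor."""
    lowered = headline.lower()
    return own_name.lower() in lowered and any(
        c.lower() in lowered for c in competitors
    )
```

For example, `company_snippet("Acme cut prices. Weather was mild. Globex responded quickly.", "Acme", ["Globex"])` keeps the first and third sentences and drops the second. A real system would need better sentence segmentation and some handling of company name variants and abbreviations.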
It also seems to me that there is a big difference between creating summaries and creating teaser text.
Can you think of other methods for generating snippets? Are snippets evil?
Posted by Tito Sierra
| Jul 10 2006, 01:28:25 PM EDT
I'm sure that snippet generation, like so much, depends on the context. Where you're talking about generating snippets from journalistic content, you can be pretty sure that the text will follow a certain reliable pattern.
Imagine if you had to create teasers (or for that matter, summaries) for literary novels or poetry. You'd still have the best results with an algorithm that snipped the beginning or the ending, I think, but I'd argue that those genres are less formulaic and therefore less amenable to an algorithm. For texts like those you'd probably be better off with human intervention.
Nah, snippets aren't evil. But they're like the tips of icebergs that might sink your luxury liner if you don't watch out.
Posted by Amanda French on July 10, 2006 at 11:36 AM EDT #
By the way, you didn't mention this, but to me it looks like the NY Times uses human agency to create its blurb / summary. Is that right?
Posted by Amanda French on July 10, 2006 at 11:38 AM EDT #
Amanda, you've touched on a point I didn't get around to mentioning in the posting, the important role that writing style and structure can play in automated snippet generation. I agree that text that follows a predictable pattern opens up more opportunities for interesting snippet generation than text that does not. I would expect certain types of government bulletins and reports to be ripe for this sort of thing.
Now that you mention it, I wonder if anyone has done research on automated summarization of literary work. With the recent activity in retrospective mass book digitization, I can see some value in providing text summarization tools for works for which no authored description was created. Even if authored descriptions did exist, it might be interesting to compare different works summarized using the same algorithm.
All this reminds me of the Amazon Concordance feature (Ex: Bleak House). Not exactly a snippet, but could be described as a summarization lens.
As for the NYT, you are correct, those are hand authored. I mentioned it as a basis for comparison against Google News, whose snippets are auto-generated. I've clarified the posting to make this distinction clear.
Posted by Tito Sierra on July 10, 2006 at 01:50 PM EDT #