Monday, August 07, 2006

"Text Mining" The New York Times at UC Irvine

From the news release:
Performing what a team of dedicated and bleary-eyed newspaper librarians would need months to do, scientists at UC Irvine have used an up-and-coming technology to complete in hours a complex topic analysis of 330,000 stories published primarily by The New York Times.
Text mining allows a computer to extract useful information from unstructured text. Until recently, text mining required a great deal of preparation before documents could be analyzed in a meaningful way. A new text-mining technique called “topic modeling” – which UCI scientists used in their New York Times experiment – looks for patterns of words that tend to occur together in documents, then automatically categorizes those words into topics – all with minimal human effort.
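
To make the "words that tend to occur together" idea concrete, here is a minimal sketch of one common way to do topic modeling: latent Dirichlet allocation fitted with collapsed Gibbs sampling. The release does not say which algorithm the UCI team used or how they scaled it, so this is only an illustration. The function name `fit_topics`, the toy documents, and the values of `alpha`, `beta`, and `iters` are all assumptions made for the example.

```python
import random
from collections import Counter


def fit_topics(docs, num_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents.

    Hypothetical helper for illustration; not the UCI implementation.
    Returns (topic_word, doc_topic) count tables.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]        # tokens per topic, per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # total tokens per topic
    assignments = []                                     # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # A topic is likely if this document already uses it and
                # if the topic already contains this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return topic_word, doc_topic


# Made-up miniature "archive": two finance-like and two sports-like stories.
docs = [
    "stocks market shares investors market".split(),
    "investors shares stocks trading market".split(),
    "team game season coach players".split(),
    "players coach game team win".split(),
]

topic_word, doc_topic = fit_topics(docs)
for k, counts in enumerate(topic_word):
    print("topic", k, [w for w, _ in counts.most_common(3)])
```

Nobody tells the sampler what the topics are. It only sees which words share documents, and the word groupings fall out of the counts. On four tiny documents the result is noisy and depends on the seed, so treat the printed topics as a demonstration rather than a finding.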

UCI researchers (David Newman, Padhraic Smyth, Mark Steyvers, and Chaitanya Chemudugunta) didn’t invent topic modeling, but they developed a technique that allows the technology to be used on huge document collections. They also are among the first to demonstrate its ease and effectiveness by applying it to a newspaper archive. The results reveal few surprises, but the application demonstrates the ability of topic modeling to spot trends and make connections in a way that could be applied to more complicated and cumbersome documents such as those used by medical researchers and lawyers.

A 13-page paper presenting this research can be found here.
