Evan gave an information-packed, 79-slide keynote address at the recent Semantic Technology Conference in San Jose. During our meeting, we went through some of the slides, and they were fantastic. The slides aren’t publicly posted yet, but I hope they soon will be, and I’ll update this post with a link if and when they are.
He also told me about the New York Times’ recent release of a 1.8M-article corpus to the computer science research community, known as The New York Times Annotated Corpus. The corpus includes nearly every article published in the New York Times over a twenty-year span (1/1/87 through 6/19/07) in XML format (NITF, to be precise), along with various metadata about each article.
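To give a feel for what working with NITF-formatted articles looks like, here’s a minimal sketch in Python. The sample document below is hypothetical: element names like `hedline`, `classifier`, and `block` follow the general NITF convention, but the actual corpus files are far richer, and their exact structure and attribute values may differ from this toy example.

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical NITF-style article. Real corpus files carry
# much more metadata, but the overall shape -- headline, classifier
# metadata, and a body of <p> blocks -- follows this pattern.
SAMPLE = """<?xml version="1.0"?>
<nitf>
  <head>
    <title>Example Headline</title>
    <docdata>
      <identified-content>
        <classifier type="descriptor">Economics</classifier>
        <classifier type="descriptor">Trade</classifier>
      </identified-content>
    </docdata>
  </head>
  <body>
    <body.head>
      <hedline><hl1>Example Headline</hl1></hedline>
    </body.head>
    <body.content>
      <block class="full_text">
        <p>First paragraph of the article.</p>
        <p>Second paragraph of the article.</p>
      </block>
    </body.content>
  </body>
</nitf>"""

def parse_article(xml_text):
    """Pull the headline, descriptor tags, and body paragraphs
    out of a single NITF-style article."""
    root = ET.fromstring(xml_text)
    headline = root.findtext(".//hl1")
    descriptors = [c.text for c in root.iter("classifier")
                   if c.get("type") == "descriptor"]
    paragraphs = [p.text for p in root.iter("p")]
    return {"headline": headline,
            "descriptors": descriptors,
            "paragraphs": paragraphs}

article = parse_article(SAMPLE)
print(article["headline"])     # Example Headline
print(article["descriptors"])  # ['Economics', 'Trade']
```

The point is just that because the corpus is well-formed XML, the standard-library tools in most languages are enough to get at both the article text and the attached metadata.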
They believe the corpus can be a valuable resource for a number of natural language processing research areas, including document summarization, document categorization, and automatic content extraction. I think that’s true not only because it’s real content in real volume, but because that content comes with real, high-quality metadata that you can use to build upon or validate various text processing algorithms.