Semantic Technology at the New York Times

I recently had the pleasure of meeting Evan Sandhaus, semantic technologist at The New York Times R&D, and wanted to highlight and share a few things that we discussed.

Evan gave an information-packed, 79-slide keynote address at the recent Semantic Technology Conference in San Jose. During our meeting, we went through some of the slides and they were fantastic. While the slides aren’t publicly posted, I hope they soon will be and will update this post with a link once and if they are.

He also told me about the New York Times’ recent release of a 1.8M article corpus to the computer science research community, known as The New York Times Annotated Corpus. The corpus includes nearly every article published in the New York Times for twenty years (between 1/1/87 and 6/19/07) in XML format (NITF to be precise) along with various metadata about the articles.

They believe the corpus can can be a valuable resource for a number of natural language processing research areas, including document summarization, document categorization and automatic content extraction. I think that’s true not only because it’s real content in real volume, but because that content comes with real, high-quality metadata that you can use to either build upon and/or validate various text processing algorithms.

Finally, in prepping for the meeting I found this video interview with Evan at the New York Semantic Meetup. Great stuff, embedded below.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s