On Classification: There Are Two Types of People

There's an old (bad) joke that goes: when it comes to classification, there are two types of people -- those who get it and those who don't. (Get it? The two types are the classes.)

Unfortunately, as a content neophyte, I put myself in the second group. I will nevertheless attempt to provide an overview of what I learned during Mary Holstege's talk yesterday at the 3rd "annual" (the first two were bi-annual) Mark Logic user conference. Mary's talk previewed the industry's first XML-aware classifier, which will be rolled out in a soon-to-be-announced, new release of MarkLogic Server.

The new classifier is fairly standard in that it uses a support vector machine (SVM) to classify documents into multiple buckets. Conceptually, given a collection of points (documents), it tries to draw the optimal line between them, the one that leaves the widest margin on either side, to separate them into a defined number of classes. In reality, because the documents are points in many dimensions, it's actually trying to draw a hyperplane between them. The math behind it gets quite hideous quite quickly, so even venturing to the Wikipedia page on SVM is not for the faint of heart.
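To make the "optimal line" idea a little more concrete, here is a minimal sketch of a two-class linear SVM in plain Python, trained by batch subgradient descent on the regularized hinge loss. It is a toy under stated assumptions and not how MarkLogic implements its classifier: the function names, the six two-dimensional points, and the learning-rate and regularization values are all made up for illustration.

```python
def train_linear_svm(points, labels, epochs=200, lr=0.1, lam=0.01):
    """Fit w, b for the separating line w.x + b = 0 (labels are +1 or -1).

    Illustrative only: batch subgradient descent on the regularized hinge loss.
    """
    n = len(points)
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        # Gradient of the regularizer (lam / 2) * |w|^2.
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                grad_w = [g - y * xj / n for g, xj in zip(grad_w, x)]
                grad_b -= y / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical, well-separated toy data: two clusters in the plane.
points = [(1, 1), (2, 1), (1, 2), (-1, -1), (-2, -1), (-1, -2)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
predictions = [predict(w, b, x) for x in points]
```

A real document classifier works the same way in spirit, except that each document becomes a point with thousands of dimensions (roughly one per term), and several two-class separators are combined to handle more than two buckets.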

What's unusual about the classifier is that it's XML-aware. Specifically, it works in conjunction with the MarkLogic XML indexer which understands how to index content (e.g., words, phrases), structure (e.g., XML elements), and most importantly, the combination of the two.

Just watching the crowd, you could see that the old joke is actually quite true. About half the crowd was grooving, really appreciating the power of the innovation. The other half, including me, was holding on for dear life.

For me, the technology was reminiscent of data mining. You feed the classifier a training set -- ideally a statistically significant, random subset of the documents to be classified. You first manually classify the training set into the desired classes (e.g., sort cases into torts, litigation, and bankruptcies). Then you "train" the classifier by feeding it the already-classified training set. It then builds its mathematical model. Then the fun begins -- you feed it the rest of the documents and it classifies them.
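The label-train-classify workflow can be sketched in a few lines of Python. To keep it short, this sketch swaps in a naive Bayes word-count model in place of an SVM, so it illustrates the workflow and not the actual algorithm in the product. The function names and the tiny "legal" training sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Build the model: per-class word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def classify(model, text):
    """Return the most probable class; words never seen in training are ignored."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # class prior
        for word in text.lower().split():
            if word in model["vocab"]:
                # Add-one smoothing so an unseen word/class pair is not fatal.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Step 1: a hand-labeled training set (hypothetical examples).
training_set = [
    ("negligence injury damages duty", "torts"),
    ("injury negligence liability", "torts"),
    ("creditor debtor insolvency chapter", "bankruptcies"),
    ("debtor discharge creditor", "bankruptcies"),
    ("motion discovery trial appeal", "litigation"),
    ("trial motion appeal", "litigation"),
]

# Step 2: train. Step 3: classify documents the model has not seen.
model = train(training_set)
first = classify(model, "the debtor owes each creditor")
second = classify(model, "motion for a new trial")
```

Here `first` comes out as "bankruptcies" and `second` as "litigation". With a real collection you would hold back part of the labeled set to measure how often the model agrees with the human labels.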

There are lots of control knobs that let you tune the algorithm (which, like data mining, seems like a bit of a dark art to configure). There is also plenty of math to tell you how well the classifier thinks it's doing. And, of course, in the end, you can always walk through the collection and spot-check various documents to see if you agree with what it did.

Mary's demo was interesting. She trained the classifier using 1/2 of Shakespeare's works, classifying them into tragedy, comedy, and history. She tuned the algorithm a bit to show the control knobs. Then she demonstrated classifying the rest of Shakespeare using the algorithm. (Yes, it worked well.) Then, showing some XML power, she demonstrated classifying the individual lines of the plays along the same buckets. Then, showing more XML power, she classified characters according to the number of lines of each type that they delivered.
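The last step of the demo, classifying characters by the lines they deliver, amounts to a tally on top of the line-level results. A rough sketch in Python, assuming the line classifier has already produced (speaker, class) pairs; the speaker names and labels below are placeholders, not results from the demo.

```python
from collections import Counter, defaultdict


def classify_speakers(labeled_lines):
    """Assign each speaker the class that most of their lines received."""
    tallies = defaultdict(Counter)
    for speaker, line_class in labeled_lines:
        tallies[speaker][line_class] += 1
    return {
        speaker: counts.most_common(1)[0][0]
        for speaker, counts in tallies.items()
    }


# Hypothetical output of a line-level classifier: (speaker, predicted class).
labeled_lines = [
    ("speaker_a", "tragedy"), ("speaker_a", "tragedy"), ("speaker_a", "comedy"),
    ("speaker_b", "comedy"), ("speaker_b", "comedy"), ("speaker_b", "history"),
]
speaker_classes = classify_speakers(labeled_lines)
```

This only works because the XML markup records which character speaks each line, which is the kind of structure a plain-text classifier never sees.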

Then, for the coup de grace, she showed some real robustness in the system by using the tuned algorithm to classify U2 lyrics. And it still worked reasonably well. (Which technically, it shouldn't have, because the training set was about as far from U2 lyrics as one can imagine.)

At the end of the speech, I tracked down some of the first-group (who got it) people and asked them why they were so excited. This is what they told me. The problem with most classifiers, it seems, is that they only work well for short documents. Newsfeeds were cited as an optimal example -- lots of short articles. In longer works it seems that the differences among them get damped out by the volume of content, so it gets progressively harder for the classifier to do its job.

Then (at least one aspect of) the power of our innovation struck me. By doing XML-based classification, you can eliminate the damp-out problem by ignoring large portions of the documents. For example, you could classify journal articles by their abstracts, or their captions, or their citations. Or (I think it's possible to do this) by the text in the first few and last few paragraphs.
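Here is a small sketch of that idea using Python's standard xml.etree.ElementTree module: pull out only the parts of a document you want the classifier to see, and ignore the rest. This is not MarkLogic's API (the server works in XQuery against its own indexes). The element names (article, abstract, body, p, caption), the helper functions, and the sample document are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET


def element_text(elem):
    """Return all text inside an element, with whitespace normalized."""
    return " ".join("".join(elem.itertext()).split())


def text_of(root, tags):
    """Concatenate, in document order, the text of elements whose tag is in `tags`."""
    return " ".join(element_text(e) for e in root.iter() if e.tag in tags)


def first_and_last_paragraphs(root, n=1, tag="p"):
    """Keep only the first n and last n paragraphs, dropping the long middle."""
    paras = [element_text(p) for p in root.iter(tag)]
    if len(paras) <= 2 * n:
        return " ".join(paras)
    return " ".join(paras[:n] + paras[-n:])


# Hypothetical journal article; real schemas will use different element names.
article_xml = """
<article>
  <abstract>We classify court opinions by area of law.</abstract>
  <body>
    <p>Opening paragraph.</p>
    <p>A very long middle section.</p>
    <p>Closing paragraph.</p>
  </body>
  <caption>Figure 1: accuracy by document length.</caption>
</article>
"""

root = ET.fromstring(article_xml)
abstract_only = text_of(root, {"abstract"})
abstract_and_captions = text_of(root, {"abstract", "caption"})
edges = first_and_last_paragraphs(root, n=1)
```

Any of these three strings could be handed to a classifier in place of the full article, so the bulk of the body never gets a chance to drown out the distinctive parts.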

I'll post more on this in the future as I learn more. But the early indicators are that XML classification may be a big breakthrough in the content classification market.
