On Classification: There Are Two Types of People

There’s an old (bad) joke that goes: when it comes to classification, there are two types of people — those who get it and those who don’t. (Get it? The two types are the classes.)

Unfortunately, as a content neophyte, I put myself in the second group. I will nevertheless attempt to provide an overview of what I learned during Mary Holstege’s talk yesterday at the 3rd “annual” (the first two were bi-annual) Mark Logic user conference. Mary’s talk previewed the industry’s first XML-aware classifier, which will be rolled out in a soon-to-be-announced new release of MarkLogic Server.

The new classifier is fairly standard in that it uses a support vector machine (SVM) to classify documents into multiple buckets. Conceptually, given a collection of points (documents), it tries to draw the optimal line that separates them into a defined number of classes. In reality, because documents are represented in many dimensions, it’s actually trying to draw a hyperplane through the space. The math behind it gets quite hideous quite quickly, so even venturing to the Wikipedia page on SVMs is not for the faint of heart.
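
For the curious, here’s a toy sketch of what an SVM does, written in Python with scikit-learn rather than anything MarkLogic-specific: given labeled points, it fits the hyperplane that best separates the classes.

```python
# A toy sketch (scikit-learn, not MarkLogic) of the SVM idea:
# given labeled points, fit the hyperplane that best separates the classes.
import numpy as np
from sklearn.svm import SVC

points = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])  # "documents" as points
classes = ["comedy", "comedy", "tragedy", "tragedy"]

svm = SVC(kernel="linear").fit(points, classes)
print(svm.coef_, svm.intercept_)  # the separating hyperplane: coef . x + intercept = 0
print(svm.predict([[2.0, 3.0]]))  # which side of the hyperplane a new point falls on
```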

What’s unusual about the classifier is that it’s XML-aware. Specifically, it works in conjunction with the MarkLogic XML indexer, which understands how to index content (e.g., words, phrases), structure (e.g., XML elements), and, most importantly, the combination of the two.
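
To make that concrete, here’s my own rough illustration (not MarkLogic’s indexer) of what an XML-aware feature might look like: index element/word pairs instead of bare words, so the same word in different elements becomes a different feature.

```python
# My own rough illustration of "XML-aware" features (not MarkLogic's indexer):
# index (element, word) pairs instead of bare words, so that "damages" inside
# <abstract> is a different feature from "damages" inside <body>.
from xml.etree import ElementTree as ET

def element_word_features(xml_text):
    features = []
    for elem in ET.fromstring(xml_text).iter():
        for word in (elem.text or "").split():
            features.append(f"{elem.tag}/{word.lower()}")  # structure + content
    return features

doc = "<case><abstract>Damages sought</abstract><body>damages denied</body></case>"
print(element_word_features(doc))
# ['abstract/damages', 'abstract/sought', 'body/damages', 'body/denied']
```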

Just watching the crowd, you could see that the old joke is actually quite true. About half the crowd was grooving, really appreciating the power of the innovation. The other half, including me, was holding on for dear life.

For me, the technology was reminiscent of data mining. You feed the classifier a training set: ideally a statistically significant, random subset of the documents to be classified. You first manually classify the training set into the desired classes (e.g., sorting cases into torts, litigation, and bankruptcies). Then you “train” the classifier by feeding it the already-classified training set, and it builds its mathematical model. Then the fun begins: you feed it the rest of the documents and it classifies them.
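
For readers who, like me, think better in code, here’s a hedged sketch of that workflow, with scikit-learn standing in for the MarkLogic classifier and made-up documents and labels:

```python
# A hedged end-to-end sketch of the workflow, with scikit-learn standing in
# for the MarkLogic classifier. Documents and labels below are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# 1. Manually classify a training set.
train_docs = [
    "plaintiff alleges negligence and seeks damages",
    "motion to compel discovery in the pending suit",
    "debtor petitions for chapter 11 protection",
]
train_labels = ["torts", "litigation", "bankruptcy"]

# 2. "Train" the classifier -- this is where it builds its mathematical model.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)

# 3. The fun begins: feed it the rest of the documents.
print(model.predict(["creditors object to the reorganization plan"]))
```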

There are lots of control knobs that let you tune the algorithm (which, like data mining, seems like a bit of a dark art to configure). There is lots of math to let you know how well the classifier thinks it’s doing. And, of course, in the end, you can always walk through the collection and spot-check various documents to see if you agree with what it did.
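
To picture the control knobs, here’s one generic way of doing this outside MarkLogic: sweep the SVM’s C parameter (which trades margin width against training error) and cross-validate each setting.

```python
# One way to picture the "control knobs": sweep the SVM's C parameter and
# cross-validate each setting. Generic scikit-learn tuning, not MarkLogic's
# actual interface.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=60, n_features=5, random_state=0)  # stand-in data

search = GridSearchCV(LinearSVC(), {"C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)  # "how well it thinks it's doing"
```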

Mary’s demo was interesting. She trained the classifier on half of Shakespeare’s works, classifying them into tragedy, comedy, and history. She tuned the algorithm a bit to show off the control knobs. Then she classified the rest of Shakespeare with the trained algorithm. (Yes, it worked well.) Then, showing some XML power, she classified the individual lines of the plays into the same buckets. Then, showing more XML power, she classified the characters according to the number of lines of each type that they delivered.
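
I didn’t see the demo code, but I’d guess the character step looks something like this sketch: classify each line a character delivers, then label the character by the dominant line type.

```python
# A guess at the character step (I didn't see the demo code): classify each
# line a character delivers, then label the character by the dominant type.
from collections import Counter

def classify_character(lines, model):
    line_types = model.predict(lines)   # e.g. ["tragedy", "tragedy", "comedy"]
    counts = Counter(line_types)
    return counts.most_common(1)[0][0]  # the character's dominant type
```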

Then, for the coup de grâce, she showed some real robustness in the system by using the tuned algorithm to classify U2 lyrics. And it still worked reasonably well, which, technically, it shouldn’t have, because the training set was about as far from U2 lyrics as one can imagine.

At the end of the speech, I tracked down some of the first-group (who got it) people and asked them why they were so excited. This is what they told me. The problem with most classifiers, it seems, is that they only work well for short documents. Newsfeeds were cited as an optimal example: lots of short articles. In longer works, the differences among documents get damped out by the sheer volume of content, so it gets progressively harder for the classifier to do its job.

Then (at least one aspect of) the power of our innovation struck me. By doing XML-based classification, you can eliminate the damp-out problem by ignoring large portions of the documents. For example, you could classify journal articles by their abstracts, or their captions, or their citations. Or (I think it’s possible to do this) by the text in the first few and last few paragraphs.
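
As a sketch of that idea (the element names here are hypothetical), you could pull out just the <abstract> of each article and classify on that alone:

```python
# A sketch of the damp-out fix (element names are hypothetical): pull out just
# the <abstract> of each article and classify on that, ignoring the bulk of
# the document.
from xml.etree import ElementTree as ET

def abstract_only(xml_text):
    node = ET.fromstring(xml_text).find(".//abstract")
    return "".join(node.itertext()) if node is not None else ""

article = "<article><abstract>SVM classification of XML</abstract><body>...</body></article>"
print(abstract_only(article))             # classify on this, not the whole body
# model.predict([abstract_only(article)])  # reusing a pipeline like the one above
```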

I’ll post more on this in the future as I learn more. But the early indicators are that XML classification may be a big breakthrough in the content classification market.

One response to “On Classification: There Are Two Types of People”

  1. Excommunicated by Reverend Bayes

    The reason you cite for the classification working reasonably well is exactly why search vendors in this space who claim to have automated classification and taxonomy generation rely on news feeds for their demos and proofs of concept. Once the check is written, the software delivered, and the actual corpus of content ingested, the classification results are usually nothing short of horrifying, which leaves the account team scrambling for the nearest exit. Unfortunately for me personally, as a purveyor of this type of tech (alongside enterprise search), the net result was the typical refrain: I sold to that customer once… better not go back in there… unless they have a burning mission-critical need to categorize a news feed on the home page of their intranet :)
