As a self-described data turncoat, I confess that one of my goals in hopping the data/content border when I joined Mark Logic was to bring lessons from the data world to the content world. I felt that far too few people had lived in both worlds, and that there would be a big opportunity to leverage my experience from one in the other.
In today’s posting, I’ll share my thoughts on “magic” that I’ve learned from my years on the data side of the data/content border.
I’ll start with an observation from the Innovations in Search conference this week in New York. In practically every session, speakers asked “what is beyond search?” I was first shocked and then puzzled. I realized that in 20+ years, I have never seen a category so preoccupied with its own demise. Database people don’t spend their time asking “what’s beyond databases?” BI people don’t obsess over “what’s beyond BI?” Instead, they worry about improving and evolving their products.
I think this preoccupation reveals that something is fundamentally wrong with search. Customers aren’t happy with it. The sense you get is that while everyone is dissatisfied, no one has found anything better, so we all just muddle through with a least-bad solution. Or, as one speaker bluntly put it, “search sucks.”
I’ll tell you one thing that I think is wrong with search: magic. If there’s one lesson I learned in 20+ years in the data world, it’s that magic doesn’t sell. Customers don’t like magic. Consider the fate of data mining, for example.
Data mining is infinitely more powerful than traditional business intelligence. Yet the BI market today is billions of dollars and growing while the data mining market remains stubbornly stuck in the tens of millions. How can that be? Why is it that the market for manual database query, enterprise reporting, and manual slice/dice/drill analysis is 100 times bigger than the market for neural networking, case-based reasoning, genetic algorithms, decision trees, and the like? I have some direct experience here, which I’ll share.
I launched a data mining product at Business Objects. On the heels of our successful BusinessObjects 4.0 “OLAP for the masses” launch, we decided to put more analytic power on the desktop. We did a thorough evaluation of data mining products, signed an OEM contract with our chosen vendor, re-labeled the product, and launched it as BusinessMiner, under the marketing banner “data mining for the masses.”
As it turned out, the masses wanted nothing to do with data mining. The product died a quiet death a few years after its launch. We understood the technical issues with data mining (e.g., training sets, overtraining, extrapolation) when we launched the product. But those problems didn’t kill BusinessMiner. Magic did.
I chuckle when I see search vendors chest-thumping about Bayesian this-or-that in their marketing. I majored in math at Berkeley and barely understand Bayesian inference. Will your average customer do any better? When vendors think they’re impressing customers with their algorithms, they’re just scaring them.
It only gets worse when you consider that every Internet search engine claims to have the best relevance magic, yet, as another speaker pointed out, the typical overlap among the top three engines’ results is less than 5%.
I’m not naïve. I know that with content the volume of information and the complexity of the task inevitably require some magic. But the goals should be to
- Use as little magic as possible
- Provide as much transparency as possible
Here is where XML comes to the rescue. What I like so much about MarkLogic-based solutions is that they have the right mix of magic and pragmatism. Here’s how they work.
- You start with a collection of content
- If not in XML, we can convert it to XML. This involves low-magic such as identifying basic structure (e.g., titles, sections, paragraphs)
- Then you can enrich that content. This can be done with high-magic such as entity, fact, concept, and sentiment extractors or with linguistic processors. Or this can be done with low-magic, using heuristics and XQuery. Regardless of approach, I call this verifiable magic. It’s not totally black box. The output of the magic is additional XML tags inserted in the original content. So you can read 100 documents and decide if they were correctly classified as positive, negative, or neutral. Or if people and place identification was correct. Or if, in context, Bob really is a noun and chip a verb.
- Then you run XQuery against the enriched content. XQuery is a database query language so there is no magic involved – you can leverage tags created either manually or through magic and run database-style queries against them. (And yes, if you want relevance-ordered results, then there is some low-magic TF/IDF involved in that process.)
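The enrich-then-query flow above can be sketched in miniature. The following is a hypothetical illustration in Python, not MarkLogic’s actual behavior: the tag names, the sample document, and the gazetteer-lookup “extractor” are all invented for the example, and the final lookup uses XPath where MarkLogic would use XQuery. The point it demonstrates is the verifiability: the output of the low-magic step is ordinary XML tags you can read and check.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample content, already converted to basic XML structure.
DOC = "<doc><para>Bob met Alice in Paris.</para><para>Chip stayed in London.</para></doc>"

KNOWN_PLACES = {"Paris", "London"}  # stand-in for a real gazetteer

def enrich(xml_text):
    # Low-magic enrichment: wrap each known place name in a <place> tag.
    # Because the output is plain XML, every inserted tag can be inspected
    # by a human reviewer -- the magic is verifiable, not a black box.
    for place in KNOWN_PLACES:
        xml_text = re.sub(r"\b%s\b" % place, "<place>%s</place>" % place, xml_text)
    return xml_text

enriched = enrich(DOC)
root = ET.fromstring(enriched)

# Database-style query against the inserted tags (XPath here; XQuery in MarkLogic):
places = [p.text for p in root.findall(".//place")]
print(places)  # ['Paris', 'London']
```

Note that nothing in the query step cares whether the `<place>` tags came from a human editor, a cheap heuristic like this one, or an expensive entity extractor; the tags are just tags.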
The thing I like about this model is the near-total separation of magic and non-magic. Use no-magic or low-magic to get XML. Use high-magic (if desired) to insert more tags into it. Then use XQuery (basically no magic) to run database-style queries against the content.
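And the low-magic TF/IDF relevance ranking mentioned above really is low magic; it amounts to a few lines of arithmetic. Here is a minimal sketch (the three-document toy corpus is invented for illustration):

```python
import math

# Toy corpus: each "document" is just a bag of words.
docs = [
    ["xml", "query", "database"],
    ["xml", "content", "search"],
    ["search", "magic", "relevance"],
]

def tf_idf(term, doc, corpus):
    # Term frequency: how often the term appears in this document.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: terms that appear in fewer documents
    # across the corpus score higher.
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

# Score each document for the query term "xml"; documents that don't
# contain the term score zero.
scores = [tf_idf("xml", d, docs) for d in docs]
```

There is nothing inexplicable here: if a customer asks why one result ranked above another, you can show them the two numbers being multiplied.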
I think it’s the right approach to the problem.