Uh, Oh — It’s Magic

As a self-described data turncoat, I confess that one of my goals in hopping the data/content border when I joined Mark Logic was to bring lessons from the data world to the content world. I felt that far too few people had lived in both worlds and there would be a big opportunity to leverage my experience from one in the other.

In today’s posting, I’ll share my thoughts on “magic” that I’ve learned from my years on the data side of the data/content border.

I’ll start with an observation from the Innovations in Search conference this week in New York. In practically every session, speakers asked “what is beyond search?” I was first shocked and then puzzled. I realized that in 20+ years, I have never seen a category so preoccupied with its own demise. Database people don’t spend their time asking “what’s beyond databases?” BI people don’t obsess over “what’s beyond BI?” Instead, they worry about improving and evolving their products.

I think this preoccupation reveals that something is fundamentally wrong with search. Customers aren’t happy with it. The sense you get is that while everyone is dissatisfied, no one’s found anything better, so we all just muddle through with a least-bad solution. Or, as one speaker bluntly put it, “search sucks.”

I’ll tell you one thing that I think is wrong with search: magic. If there’s one lesson I learned in 20+ years in the data world, it’s that magic doesn’t sell. Customers don’t like magic. Consider the fate of data mining, for example.

Data mining is infinitely more powerful than traditional business intelligence. Yet the BI market today is billions of dollars and growing while the data mining market remains stubbornly stuck in the tens of millions. How can that be? Why is it that the market for manual database query, enterprise reporting, and manual slice/dice/drill analysis is 100 times bigger than the market for neural networking, case-based reasoning, genetic algorithms, decision trees, and the like? I have some direct experience here, which I’ll share.

I launched a data mining product at Business Objects. On the heels of our successful BusinessObjects 4.0 “OLAP for the masses” launch, we decided to put more analytic power on the desktop. We did a thorough evaluation of data mining products, signed an OEM contract with our chosen vendor, re-labeled the product, and launched it as BusinessMiner, under the marketing banner “data mining for the masses.”

As it turned out, the masses wanted nothing to do with data mining. The product died a quiet death a few years after its launch. We understood the technical issues with data mining (e.g., training sets, overtraining, extrapolation) when we launched the product. But those problems didn’t kill BusinessMiner. Magic did.

I chuckle when I see search vendors chest-thumping about Bayesian this-or-that in their marketing. I majored in math at Berkeley and barely understand Bayesian inferencing. Will your average customer do any better? When vendors think they’re impressing customers with their algorithms, they’re just scaring them.

It only makes matters worse that every Internet search engine claims to have the best relevance magic, yet, as another speaker pointed out, the typical overlap in results among the top three engines is less than 5%.

I’m not naïve. I know that with content, the volume of information and the complexity of the task inevitably require some magic. But the goals should be to:

  • Use as little magic as possible
  • Provide as much transparency as possible

Here is where XML comes to the rescue. What I like so much about MarkLogic-based solutions is that they have the right mix of magic and pragmatism. Here’s how they work.

  • You start with a collection of content
  • If the content isn’t already in XML, we can convert it. This involves low-magic, such as identifying basic structure (e.g., titles, sections, paragraphs)
  • Then you can enrich that content. This can be done with high-magic such as entity, fact, concept, and sentiment extractors or with linguistic processors. Or this can be done with low-magic, using heuristics and XQuery. Regardless of approach, I call this verifiable magic. It’s not totally black box. The output of the magic is additional XML tags inserted in the original content. So you can read 100 documents and decide if they were correctly classified as positive, negative, or neutral. Or if people and place identification was correct. Or if, in context, Bob really is a noun and chip a verb.
  • Then you run XQuery against the enriched content. XQuery is a database query language so there is no magic involved – you can leverage tags created either manually or through magic and run database-style queries against them. (And yes, if you want relevance-ordered results, then there is some low-magic TF/IDF involved in that process.)

The thing I like about this model is the near-total separation of magic and non-magic. Use no-magic or low-magic to get XML. Use high-magic (if desired) to insert more tags into it. Then use XQuery (basically no magic) to run database-style queries against the content.
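To make that last step concrete, here’s a minimal sketch in XQuery. The collection name and the enrichment tags (company, sentiment) are hypothetical, invented for illustration, but the shape is the point: once the magic has inserted its tags, finding every negative comment about a given company is a plain database-style query.

    (: a sketch against hypothetically enriched content: return the titles
       of documents that mention Acme with negative sentiment :)
    for $doc in collection("feedback")/document
    where $doc//company = "Acme"
      and $doc//sentiment = "negative"
    return $doc/title

And because the tags sit right there in the XML, a wrong answer traces back to a wrong tag, which you can open and verify.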

I think it’s the right approach to the problem.

The Bifurcation of Search

I often hear the question asked: “what’s after search?” “Find,” one would hope. After all, searching’s not the point. Finding is.

So the real question is: “how can we improve find?” And answering it requires separating the cases of Internet and non-Internet search applications. Let’s do a brief history of Internet search to see why.

First-generation Internet search engines analyzed the content of a page to decide its relevance to a particular search. In effect, relevancy was determined by the page’s author, in deciding what words to use in his writing and what meta-tags to add to the page. This wasn’t a bad way to start, but it lent itself to abuses.

It wasn’t long before spammers figured out how to head-fake search engines into thinking their web pages were more relevant than the other guys’. They’d put meta-tags on their pages that would repeat a key word over and over (e.g., “sex”) in order to increase their relevancy to given searches. Thus was born search engine optimization.

Google came along and suggested that the link structure of the web was a better way to determine relevancy than page content. I first read about this in the mid-1990s in an Esther Dyson newsletter and I remember that it struck me as a truly revolutionary idea. In effect, it said that relevancy should be determined by other webmasters. It hobbled the spammers, produced better Internet search results, and, combined with the idea of paid-for results (borrowed from goto.com), propelled Google to great fame and fortune.

But this idea is nearly ten years old today, search engine optimizers are learning how to beat it, and it has proven more effective for common searches than for those in the proverbial “long tail” of search.

Most of the energy in the search community is dedicated to finding the next pagerank, the next magic algorithm that will produce better search results and propel someone else to fame and fortune.

Pundits seem split on what it will be. Various arguments include:

  • Taxonomies and folksonomies that help people navigate to the answer instead of searching for it.
  • Dynamic clustering, a la Vivisimo, where search results are grouped into clusters that can be used to incrementally refine a search.
  • Behavioral techniques that use cookies to watch (some would say “spy on”) users and their surfing habits, applying what is learned to bias (or “personalize”) search results.

As a quick aside, I find it somewhat amazing that Google appears to be doing reasonably well with their enterprise search appliance when the Google “secret sauce” is all about pagerank and pagerank is all about link structure. Why? Because inside an enterprise, you typically don’t have link structure to work with. Instead you have file folders full of marketing brochures, HR policies, sales presentations, contract templates, RFPs, et cetera, and nothing is linked together.

That this is lost on most people is a triumph of branding and makes pagerank-free Google somewhat akin to caffeine-free Jolt as a product.

Searching the Internet is fundamentally different from searching the corpus of content held by an enterprise, a publisher, or a government agency. (The former is Internet search and the latter is enterprise search.)

While I’ll defer the question of how to make Internet search better to those gurus who study it for a living, I’ll propose a simple answer on how to make enterprise search better: make it more like a database. Make it more like a query.

Database people don’t search databases. They query them. They don’t expect 10 links in a magical order that might answer their question. They expect a precise answer to their query. They don’t expect to know there are 10,235,350 results and only be able to see the first 1,000 of them. They don’t expect latency when a new document is added to the system. If a row is committed to the database, it is included in all answers from the moment after the commit. You don’t wait 2 days, or even 10 minutes, to start including it in query results.
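Here’s a sketch of what that looks like against content, assuming hypothetical contract markup and a made-up collection name. Note that it returns an exact number, not ten ranked links:

    (: how many contracts expire on or after January 1, 2006?
       an answer, not a results page :)
    count(
      collection("contracts")//contract[xs:date(expiration) >= xs:date("2006-01-01")]
    )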

I believe the future of enterprise search is here. It’s all about two technologies:

  • XML markup enrichers — the broad category of products that input XML and output better XML, typically inserting markup to indicate things like entities, concepts, linguistics, and sentiment.
  • XQuery-based XML content servers, like MarkLogic.
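To make the first of these concrete, here’s a hypothetical before-and-after. The tag names are invented, but the contract is exactly “XML in, better XML out,” with the improvement visible as new inline markup:

    <!-- before enrichment -->
    <p>Bob praised the new chip from Acme in London.</p>

    <!-- after enrichment (hypothetical tag names) -->
    <p sentiment="positive"><person>Bob</person> praised the new
       <product>chip</product> from <company>Acme</company> in
       <place>London</place>.</p>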

With these two technologies you can take a collection of documents, convert them to XML, enrich the XML markup, and then run powerful queries against them. The best search isn’t search. It’s query.

We Seek The Grail

Everyone knows that content is largely unmanaged, untapped as an enterprise resource. We’ve all heard the soundbite (usually attributed to IDC) that 80% of enterprise information is unstructured and not kept in a database. We all believe there is a lot of value in tapping into that content.

So, we’re all on the same mission, right? We’re all seeking the same Grail. Or are we?

As a long-time data guy, I know well the Grail that data people seek — it’s the data/content integration Grail. The stump speech I used at Business Objects went something like this.

“Oh yeah, there’s that other stuff — content, I think it’s called — that doesn’t fit in your relational databases, so you can’t access it with our BI tools. Well, I guess it’s obvious what you want to do with that content — structure it up and summarize it, so you can shove it in the warehouse right alongside your data … and then make reports on it using our tools.”

When you have a $1B data business, you see content through data-colored glasses.

The example I used in my past life was customer service emails. “What you really want is to take the hundreds of feedback emails that flow in every week and make a cross-tab report that groups them along two dimensions: by product and by tone (i.e., positive, negative, neutral). In that way, you can take a mass of otherwise useless, unstructured content, and use it to enrich your existing dashboards and reports.”
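For what it’s worth, if the emails are kept as enriched XML, that cross-tab is itself just a query. Here’s a sketch, with hypothetical email markup and a made-up collection name:

    (: hypothetical cross-tab: count feedback emails by product and tone :)
    for $product in distinct-values(collection("emails")//email/product)
    return
      <row product="{$product}">{
        for $tone in ("positive", "negative", "neutral")
        return
          <cell tone="{$tone}">{
            count(collection("emails")//email[product = $product][tone = $tone])
          }</cell>
      }</row>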

The cross-tab is a good example. It’s a real example. Lots of people want to do it. But it is not the only one. Integration with data is not the sole reason to unlock content. Many important content applications have little or no data/content integration angle:

  • Custom publishing
  • Content delivery
  • Contract management
  • Content integration
  • Knowledge management
  • RFP management
  • Technical publications
  • Financial publishing
  • Search and discovery
  • Archiving
  • Content intelligence

Just to name a few.

If you have a 2 TB data warehouse and you want to summarize some unstructured content into data and then load it alongside your existing tables, then you should acquire a text analytics tool to do the summarization and then store the resulting data in your data warehouse.

If, on the other hand, you are working on systems that need to query, manipulate, and render content (such as those listed above) then I’d argue you are seeking a different Grail. It’s about content for content’s sake … and not about turning content into data.

So the question is not “do you seek the Grail?” but, indeed, “which Grail do you seek?”