The Bifurcation of Search

I often hear the question asked: “what’s after search?” “Find,” one would hope. After all, searching’s not the point. Finding is.

So the real question is: “how can we improve find?” And answering it requires separating the cases of Internet and non-Internet search applications. Let’s do a brief history of Internet search to see why.

First-generation Internet search engines analyzed the content of a page to decide its relevance to a particular search. In effect, relevancy was determined by the page’s author, in deciding what words to use in his writing and what meta-tags to add to the page. This wasn’t a bad way to start, but it lent itself to abuses.

It wasn’t long before spammers figured out how to head-fake search engines into thinking their web pages were more relevant than the other guys’. They’d put meta-tags on their pages that repeated a keyword over and over (e.g., “sex”) in order to increase their relevancy to given searches. Thus was born search engine optimization.
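To see why content-only relevancy was so easy to game, here is a minimal sketch in Python. It assumes a naive scorer that ranks a page by the fraction of its words matching the query term. The function name and the two sample pages are invented for illustration; no real engine was this simple.

```python
import re
from collections import Counter


def term_frequency_score(page_text: str, query: str) -> float:
    """Return the fraction of the page's words that equal the query term."""
    words = re.findall(r"[a-z0-9]+", page_text.lower())
    if not words:
        return 0.0
    return Counter(words)[query.lower()] / len(words)


# Hypothetical pages: one written for readers, one stuffed with a keyword.
honest_page = "A review of three hiking boots for wet weather, with notes on fit."
stuffed_page = "boots boots boots boots boots buy boots cheap boots boots"

for name, text in [("honest", honest_page), ("stuffed", stuffed_page)]:
    print(name, round(term_frequency_score(text, "boots"), 2))
```

The stuffed page wins easily, because the author controls every input the scorer looks at.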

Google came along and suggested that the link structure of the web was a better way to determine relevancy than page content. I first read about this in the mid-1990s in an Esther Dyson newsletter and I remember that it struck me as a truly revolutionary idea. In effect, it said that relevancy should be determined by other webmasters. It hobbled the spammers, produced better Internet search results, and, combined with the idea of paid-for results (borrowed from goto.com), propelled Google to great fame and fortune.
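The link-structure idea can be sketched as a simplified power iteration in the spirit of PageRank. This is an illustration under stated assumptions, not Google's production algorithm: the tiny web graph, the function name, and the page names are made up, and the damping value of 0.85 is just a commonly cited default.

```python
def link_rank(links: dict[str, list[str]],
              damping: float = 0.85,
              iterations: int = 50) -> dict[str, float]:
    """Score pages by inbound links using a simple power iteration."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical web: several pages link to "hub"; nobody links to "spam".
web = {
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
    "hub": ["a"],
    "spam": ["hub"],
}

ranks = link_rank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:5s} {score:.3f}")
```

Whatever words the "spam" page contains, it cannot raise its own score, because the inputs that matter are links written by other people.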

But this idea is nearly ten years old today. Search engine optimizers are learning how to beat it, and it has proven to be more effective for responding to common searches than for those in the proverbial “long tail” of search.

Most of the energy in the search community is dedicated to finding the next PageRank, the next magic algorithm that will produce better search results and propel someone else to fame and fortune.

Pundits seem split on what it will be. Various arguments include:

  • Taxonomies and folksonomies that help people navigate to the answer instead of searching for it.
  • Dynamic clustering a la Vivisimo, where search results are grouped into clusters that can be used to incrementally refine a search.
  • Behavioral techniques that use cookies to watch (some would say “spy on”) users and their surfing habits, applying what is learned to bias (or “personalize”) search results.

As a quick aside, I find it somewhat amazing that Google appears to be doing reasonably well with its enterprise search appliance when the Google “secret sauce” is all about PageRank and PageRank is all about link structure. Why? Because inside an enterprise, you typically don’t have link structure to work with. Instead you have file folders full of marketing brochures, HR policies, sales presentations, contract templates, RFPs, et cetera, and nothing is linked together.

That this is lost on most people is a triumph of branding and makes PageRank-free Google somewhat akin to caffeine-free Jolt as a product.

Searching the Internet is fundamentally different from searching the corpus of content held by an enterprise, a publisher, or a government agency. (The former is Internet search and the latter is enterprise search.)

While I’ll defer the question of how to make Internet search better to those gurus who study it for a living, I’ll propose a simple answer on how to make enterprise search better: make it more like a database. Make it more like a query.

Database people don’t search databases. They query them. They don’t expect 10 links in a magical order that might answer their question. They expect a precise answer to their query. They don’t expect to know there are 10,235,350 results and only be able to see the first 1,000 of them. They don’t expect latency when a new document is added to the system. If a row is committed to the database, it is included in all answers from the moment after the commit. You don’t wait 2 days, or even 10 minutes, to start including it in query results.
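To make the contrast concrete, here is a small sketch using Python's built-in sqlite3 module. The table, columns, and document titles are invented for illustration. It shows the two properties described above: an exact count rather than an estimate, and a newly committed row that appears in the very next query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, dept TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO documents (dept, title) VALUES (?, ?)",
    [
        ("HR", "Vacation policy"),
        ("HR", "Expense policy"),
        ("Sales", "Q3 pitch deck"),
    ],
)
conn.commit()


def count_by_dept(dept: str) -> int:
    """Return the exact number of documents for a department."""
    row = conn.execute(
        "SELECT COUNT(*) FROM documents WHERE dept = ?", (dept,)
    ).fetchone()
    return row[0]


before = count_by_dept("HR")

# Commit one more document; there is no separate indexing delay to wait for.
conn.execute(
    "INSERT INTO documents (dept, title) VALUES (?, ?)",
    ("HR", "Parental leave policy"),
)
conn.commit()

after = count_by_dept("HR")
print(before, after)
```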

I believe the future of enterprise search is here. It’s all about two technologies:

  • XML markup enrichers — the broad category of products that input XML and output better XML, typically inserting markup to indicate things like entities, concepts, linguistics, and sentiment.
  • XQuery-based XML content servers, like MarkLogic.

With these two technologies you can take a collection of documents, convert them to XML, enrich the XML markup, and then run powerful queries against them. The best search isn’t search. It’s query.
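Here is a rough sketch of that pipeline in Python, using only the standard library. Everything in it is an assumption made for illustration: the element names, the three sample documents, and the tiny lookup table standing in for a real enricher. XQuery is not available in the Python standard library, so the query step uses ElementTree's limited XPath support as a stand-in.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table standing in for a real entity enricher.
GAZETTEER = {
    "MarkLogic": "company",
    "Google": "company",
    "XQuery": "technology",
}


def to_xml(doc_id: str, title: str, body: str) -> ET.Element:
    """Convert a plain document into a small XML element."""
    doc = ET.Element("doc", id=doc_id)
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "body").text = body
    return doc


def enrich(doc: ET.Element) -> ET.Element:
    """Input XML, output better XML: add markup for the entities found."""
    body = doc.findtext("body", default="")
    entities = ET.SubElement(doc, "entities")
    for name, kind in GAZETTEER.items():
        if name in body:
            ET.SubElement(entities, "entity", type=kind).text = name
    return doc


def docs_mentioning(corpus: ET.Element, kind: str, name: str) -> list[str]:
    """Return the ids of documents tagged with the given entity."""
    matches = []
    for doc in corpus.findall("doc"):
        tagged = doc.findall(f"entities/entity[@type='{kind}']")
        if any(entity.text == name for entity in tagged):
            matches.append(doc.get("id"))
    return matches


corpus = ET.Element("corpus")
for doc_id, title, body in [
    ("hr-001", "Travel policy",
     "Employees book travel through the internal portal."),
    ("mkt-007", "Search appliance comparison",
     "Notes comparing the Google appliance with a MarkLogic content server."),
    ("eng-042", "Query language primer",
     "An introduction to XQuery for content developers."),
]:
    corpus.append(enrich(to_xml(doc_id, title, body)))

print(docs_mentioning(corpus, "company", "Google"))
```

The query asks for documents tagged with a particular company, not documents containing a string, and it returns exactly the matching set rather than a ranked guess.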
