Cache as Cache Can

I was on vacation skiing in Tahoe last week and this week I was busy in NYC meeting with customers and analysts, so apologies for the hiatus.

Today’s topic is a spin / perception issue that’s bothered me since joining Mark Logic. It relates to a fundamental difference between database management systems (DBMSs) and search engines.

I’ll make two statements, offer the standard response to each, and then contrast them.

Statement 1: XML Content Server

Hi, at Mark Logic, we make an XML content server, which is a special-purpose DBMS for managing content. As with any DBMS, you load your content, we index it and guarantee that the content and the index stay synchronized, and then you can query it at will.

Response 1: Oh, you mean I need to make a *copy* of my content in your system?

Statement 2: Enterprise Search Engine

Hi, at search vendor X, we make an enterprise search engine. Search engine X indexes content “where it lies.” Because content is merely indexed, and not “loaded” into the system, we do not guarantee basic transactional database capabilities such as:

  • Ensuring that all documents added to the system are immediately returned by all relevant queries
  • Ensuring that an indexed document still contains the supposedly matching word or phrase; it could have been changed and no longer be a valid hit
  • Indeed, ensuring that an indexed document continues to exist at all

Search engine X allows you to run only one kind of query against your content: return a list of links to documents containing a word or phrase, perhaps with some parameterized options thrown in.

While search engine X does not load content into the system as a database would, it does:

  • Load snippets to provide context for search hits. So in addition to getting a list of links to documents that contain “cow” you can see that cow was in the sentence “the cow jumped over the moon” in document 1.
  • Cache copies of the indexed pages in the search index

Response 2: Phew, glad you’re not copying anything. I hate making extra copies of my content.

Comparison / Contrast

Response 2 drives me nuts because it’s such great spin. “Oh, we’re not loading anything or copying anything. We’re just keeping a cached copy in the index and, depending on the implementation, separately copying snippets for each token that’s being indexed.”

I’m not arguing that copies or caches are bad. I am arguing, however, that to think an XML content server makes a copy of your content while a search engine doesn’t is just plain wrong.

In fact, it’s often the opposite. While some customers keep master content in Word or other formats and use MarkLogic to index and manipulate an XML shadow copy of it, many others use MarkLogic as their XML content repository where the one (and only) XML master copy is stored. So instead of having the master copy on the file system and a cache of it in the search index, you have the master copy, once, in MarkLogic.
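To make the guarantees from Statement 2 concrete, here is a minimal Python sketch of the two models. It is a toy, not how MarkLogic or any search engine is implemented: `ExternalIndex`, `ContentStore`, and the sample documents are names I made up for illustration. The external index is built by a one-off crawl over files it does not own, while the content store updates its index on every write.

```python
# Toy model only: ExternalIndex and ContentStore are hypothetical names,
# not the API of MarkLogic or of any search engine.

def words(text):
    """Split text into a set of lowercase words."""
    return set(text.lower().split())


class ExternalIndex:
    """Indexes content 'where it lies': keeps word -> doc ids, owns no content."""

    def __init__(self):
        self.postings = {}

    def crawl(self, files):
        # Rebuild the whole index from whatever is on "disk" right now.
        self.postings = {}
        for doc_id, text in files.items():
            for word in words(text):
                self.postings.setdefault(word, set()).add(doc_id)

    def search(self, word):
        return sorted(self.postings.get(word.lower(), set()))


class ContentStore:
    """Owns the content, so every write updates content and index together."""

    def __init__(self):
        self.docs = {}
        self.postings = {}

    def put(self, doc_id, text):
        self.delete(doc_id)  # drop postings for any previous version
        self.docs[doc_id] = text
        for word in words(text):
            self.postings.setdefault(word, set()).add(doc_id)

    def delete(self, doc_id):
        old_text = self.docs.pop(doc_id, None)
        if old_text is not None:
            for word in words(old_text):
                self.postings[word].discard(doc_id)

    def search(self, word):
        return sorted(self.postings.get(word.lower(), set()))


# The same starting content goes into both models.
files = {"doc1": "the cow jumped over the moon", "doc2": "the dish ran away"}
index = ExternalIndex()
index.crawl(files)
store = ContentStore()
for doc_id, text in files.items():
    store.put(doc_id, text)

# The content changes after the crawl: one edit, one delete, one new document.
files["doc1"] = "the cat jumped over the moon"
del files["doc2"]
files["doc3"] = "a cow in the field"

store.put("doc1", "the cat jumped over the moon")
store.delete("doc2")
store.put("doc3", "a cow in the field")

print(index.search("cow"))   # ['doc1']: no longer contains "cow", and doc3 is missed
print(index.search("dish"))  # ['doc2']: the document no longer exists
print(store.search("cow"))   # ['doc3']
print(store.search("dish"))  # []
```

Until the next crawl, the external index returns a hit that no longer matches, a hit that no longer exists, and misses the new document. Those are exactly the three guarantees the search vendor declined to make.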

And this discussion overlooks two key items:

  • Functionality. With MarkLogic, you can write powerful queries in a high-level, standard language (XQuery). With a search engine, you execute simple queries through proprietary search engine APIs and then finish the job yourself with a DOM tree and Java.
  • Compression. My tech team tells me it’s not unusual for customers to load content into MarkLogic and have it shrink 80%. In cases where our basic indexes are then enabled against that content, the compressed content plus indexes can be smaller than the original source content.
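If the compression point sounds surprising, a rough sketch shows why tag-heavy XML tends to shrink well. This uses Python's standard `zlib` module on a synthetic document I invented for the purpose. It is not MarkLogic's storage format or compression scheme, and the synthetic markup is far more repetitive than real content, so the ratio it prints says nothing about what a real load would achieve.

```python
import zlib

# Synthetic, deliberately repetitive XML; real content will be less uniform.
records = "".join(
    f'<entry id="{i}"><title>Title {i}</title><body>Body text {i}</body></entry>\n'
    for i in range(1000)
)
xml_doc = f"<entries>\n{records}</entries>\n".encode("utf-8")

# General-purpose compression, standing in for whatever a real server does.
compressed = zlib.compress(xml_doc, 9)
shrinkage = 1 - len(compressed) / len(xml_doc)

print(f"original:   {len(xml_doc)} bytes")
print(f"compressed: {len(compressed)} bytes")
print(f"shrinkage:  {shrinkage:.0%}")
```

The repeated element names and structure are what the compressor exploits, which is why a "copy" held in a compressing store need not cost as much space as the word suggests.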

So, don’t be confused by terms and spin. One person’s cache is another’s copy.

More importantly, there’s much more to consider than “how many copies” when finding the best platform for your content applications: functionality, transaction consistency, performance, optimization, compression, productivity, and standardization all come into play.
