Square Pegs and Round Holes

They say if your only tool's a hammer, then every problem looks like a nail. That's certainly a good way to think about how relational DBMS vendors have approached content.

After all, if you're running a multi-billion dollar database business, then you are going to have two built-in biases when thinking about content: (1) you will want to make content look like data and (2) you will assume the only thing people want to do with content is integrate it with data.

This posting addresses the first point -- making content look like data. (Another one will explore the second.)

If your world revolves around square tables, then it's not surprising that your reaction to the content problem is to find a way to squeeze documents into tables. The most obvious approach is to shove them into a column. That's exactly what the RDBMS vendors did, first with BLOBs, and later with CLOBs. Using this approach a document is placed into a table that looks like a bit of metadata plus the document itself, such as bunch-of-docs(doc-id, author, title, date, document).

Given that you can do virtually nothing with a BLOB, and fairly little with a CLOB, most customers who want to do text searches will use an enterprise search engine to index the documents in the database.

The search engine spider walks through the table, and indexes the words in each document. Instead of the search engine index saying that you can find documents containing "mojo" at [list-of-places] on the file system, or at [list-of-URLs] on the Internet, the index will say that you can find them at [list-of-doc-ids]. The documents can be materialized by passing a SQL query the list of doc-ids.

Over time the DBMS vendors have added some text indexing to their products, but customers seem to find it inferior to that provided by enterprise search engines. Most of the customers we work with are not using DBMS-provided text indexing. They are trying to use DBMSs and search engines together. Most of them aren't very happy because they are struggling to get two different technologies to work together, they are doing a lot of preprocessing to get the indexing capabilities they want, and they are licensing two products and paying two maintenance bills when one should do.

The cost and complexity of this approach is, in my opinion, one key reason why 80% of content still lives in files on the file system and not in DBMSs.

That's the state of the market today. Let's talk about the future.

Most of the database vendors have announced XML stories already. Their early approaches involved "shredding" XML documents to tables. That works fine for XML data where you have N instances of the same schema that maps very nicely to a table (think: a stream of purchase orders). It doesn't work at all for documents, where the schema is unknown and time-varying, and where "round-tripping" of a document (taking it to and from the database without alteration) is an absolute requirement.

Given these limitations, the RDBMS vendors are currently working on their second-take XML stories, which are generally all the same, with IBM having one difference.

What's the same? (1) they are all re-using the abstract data type (ADT) approach that was introduced in "universal" databases and largely ignored by the market, and (2) they are all providing a way to embed XQuery inside SQL.

ADTs hearken back to the early days of Postgres, which was the first DBMS to include abstract data types. I was the product marketing manager for Ingres 6.3, the first commercial DBMS to include abstract data types. The idea was to handle new types in RDBMSs by letting people plug them into the server. Ingres 6.3 was launched around 1990. While ADTs were great at generating hype for the marketing launch, virtually no one used them thereafter. Ingres slowly faded away through two acquisitions (first ASK, then CA) and everyone wondered if a good technology had gone down with a bad ship.

Ingres founder Michael Stonebreaker then created Illustra, renamed ADTs to the catchier "datablades," and tried the same thing all over again. Informix snatched up the nascent Illustra for about $400M, thus launching the enormous "universal database" hype of the late 1990s.

When the smoke cleared, every vendor had a universal database with abstract data types. But no one used them – not for storing complex types, and certainly not for storing documents. At XML 2004 a few years back, I watched panelists from the major vendors twitch uncomfortably when asked by an audience member: “if you’re basing your future support for XML on the failed ADT technology of the past, then I have one question -- why should I believe you this time?”

As for embedding XQuery inside SQL, I think (1) that it’s really ugly – go look at some of the samples, and (2) that it is probably useful for some cases of data/content integration, but most certainly not the answer for applications that are purely content-based. In short, why would you want a relational superstructure around a list of documents stored in a table that looks like bunch-of-docs(doc-id, document) and accessed with queries that look like select * from bunch-of-docs where 1 = exists (embedded XQuery statement)?

What's different amongst the RDBMS vendors’ approaches to XML? IBM has announced a "dual core" DBMS which has both a native relational and a native XML storage manager. Microsoft and Oracle, on the other hand, will be both implementing on top of their existing relational storage managers. (As of this writing, none of the second-take products are shipping, by the way. I'm commenting on announced directions and what I've seen in Betas and demos.)

We’ll see which approach works better for them in solving the problem of how to shove documents into relational databases.Having a lot of experience in both relational and XML document management, my money’s on IBM.

But the question I’ll ask is – why shove at all?If you are working with a collection of documents in your application, then why use a server whose data model is based around “table” as its core concept?

Why not use a server whose data model is based around “document” as its core concept?

That’s exactly what we make at Mark Logic:a content server whose native data model is “document” and whose native query language is XQuery. We think it’s a far better paradigm for people with document- or content-based application development problems.

Square Pegs and Round Holes

Read more

Book Review: The Curious Case of Mike Lynch by Katie Prescott

Why I'm Joining the Board of Dreamdata

The Metrics Brothers Hiatus

A Diamond in the Rough: Startup Founder Survival Guide by David Politis