Pimp My Ride: Jacked-Up Relational vs. Special-Purpose DBMSs

One advantage of having kids is that they keep you in touch with pop culture. I’d never have heard of Simple Plan, Nickelback, or Green Day were it not for my kids. Such is also the case with MTV’s Pimp My Ride, the metaphor for today’s post.

On Pimp My Ride, mechanics take old, run-down cars, and make lots of cosmetic changes to them: they jack them up, add spoilers, change the suspension, plug in DVD players, swap the wheels, change the seats, and so on. At its core you have an old car, but it’s been “pimped” by bolting on lots of functionality.

You can’t help but think this is exactly what the relational database vendors have been doing of late. Recall that their car, the relational model, is over 30 years old. It was invented by IBM’s Ted Codd in the late 1960s and first published in a paper in 1970. Oracle (née Relational Software Inc.) was founded in 1977 and was the first company to implement the relational model in a commercial product.

In fact, when talking to Gartner database analyst Donald Feinberg some time ago, I recall that he said: “we don’t even call them RDBMSs anymore; we just call them DBMSs because they’ve long since stopped being relational.”

This raises three key questions for me:

If the great enabler of the RDBMS revolution was, as Codd hoped, the injection of mathematical rigor into commercial database systems, then what effect will a decade of ad hoc, ungrounded changes have on the category?

Just because it may be possible to build an uber-DBMS that handles everything (e.g., data, content, streams, in-memory stores, multimedia, XML, hypercubes, aggregates), does it mean that the uber-DBMS is necessary or desirable?

How far can RDBMSs be stretched before they give way to a generation of special-purpose DBMSs that are not least common denominator solutions, but indeed optimized for specific DBMS challenges?

One way to analyze these questions is to ponder the origins of RDBMS both from the conceptual problem that Codd was trying to solve and from the design assumptions in place when RDBMSs were first implemented.

Conceptually, Codd was trying to separate application and database, arguing that data could be modeled and stored in an application-neutral way. Instead of each individual application defining its own customer information, customer information could be modeled independently, stored in the database, and accessed by the various applications that needed it.

Codd was also trying to inject non-procedurality through a query language that specified “what you wanted” and not “how to get it.” That would isolate applications from underlying data structures and enable the system to include an optimizer that would find the fastest way to process any given query (and that fastest way could indeed change over time as data distributions and parameters changed).
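To make the “what” versus “how” distinction concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The customers table, its rows, and the index name are all invented for illustration. The same query text is planned twice, before and after an index is added, to hint at how the fastest way can change underneath an unchanged application.

```python
import sqlite3

# Hypothetical schema and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ann", "Boston"), (2, "Bob", "Denver"), (3, "Cy", "Boston")],
)

# "How to get it": the application fetches every row and filters by hand.
procedural = sorted(
    name
    for (_id, name, city) in conn.execute("SELECT id, name, city FROM customers")
    if city == "Boston"
)

# "What you want": the query states the result; the engine picks the access path.
QUERY = "SELECT name FROM customers WHERE city = ? ORDER BY name"
declarative = [row[0] for row in conn.execute(QUERY, ("Boston",))]


def plan(sql, params):
    """Return the engine's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(QUERY, ("Boston",))
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")
plan_after = plan(QUERY, ("Boston",))  # same query text, re-planned by the engine
```

Both approaches return the same names. Only the declarative one leaves the engine free to switch from a table scan to an index lookup without any change to application code.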

Finally, Codd was trying to enable real ad hoc queries. The primary database problem of Codd’s era was inflexibility. Hierarchical and network databases were reliable and fast transaction processing engines. But they were extremely inflexible with respect to ad hoc queries. There wasn’t any specific query that couldn't be answered in pre-relational databases. The catch was, however, that you had to know in advance which questions you wanted the answers to.

And Codd help you (pardon the pun) should you need to answer an unanticipated question. Worst case, doing so would require a re-design of the database, and a complete dump and reload. (How important was that question again?)
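A toy sketch of that inflexibility, again in Python. The nested dictionary stands in for a hierarchical store where orders live under their customer; the data and function names are hypothetical and this is not how any real hierarchical DBMS was programmed. The point is only that the anticipated question follows the stored path, while the unanticipated one needs hand-written traversal code, whereas a flat relational table answers both with a query written on the spot.

```python
import sqlite3

# Hypothetical data: orders nested under their customer, the way a
# hierarchical database stores child records under a parent record.
hierarchy = {
    "Ann": [{"po": 101, "product": "widget"}, {"po": 102, "product": "gear"}],
    "Bob": [{"po": 103, "product": "widget"}],
}


def orders_for(customer):
    """The anticipated question: follow the parent-to-child path directly."""
    return [order["po"] for order in hierarchy.get(customer, [])]


def customers_who_bought(product):
    """An unanticipated question: no path leads from product to customer,
    so new code must visit every record under every parent."""
    return sorted(
        customer
        for customer, orders in hierarchy.items()
        if any(order["product"] == product for order in orders)
    )


# The same facts as one flat table; any column can drive a query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (po INTEGER, customer TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO orders (po, customer, product) VALUES (?, ?, ?)",
    [
        (order["po"], customer, order["product"])
        for customer, orders in hierarchy.items()
        for order in orders
    ],
)


def ad_hoc(sql, params=()):
    """Run any single-column query that someone thinks of after the fact."""
    return [row[0] for row in conn.execute(sql, params)]
```

For example, `ad_hoc("SELECT DISTINCT customer FROM orders WHERE product = ? ORDER BY customer", ("widget",))` answers the unanticipated question without any new traversal code.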

But let’s be clear: Codd was thinking about data -- good old fields like addresses, names, social security numbers, PO numbers, phone numbers, and the like. He was not thinking about documents, web content, hypercubes, XML, PDFs, videos, or streams/real-time feeds.

And let’s consider the computing environments in place when RDBMSs were first implemented.

A 256MB disk was considered big in 1985 (when I started using RDBMSs) and, I think, cost around $50K.

A minicomputer (e.g., a VAX) with 8-16MB of memory was considered loaded.

I took great pride in 1987 when I ran a 30-user call center application on a 1 MIPS MicroVAX II with around 512MB of total storage and 16MB of memory. Sobering, I know. And that was already nearly a decade after Oracle and Ingres started their implementations.

You can argue that the DBMS vendors have done a good job adapting as the assumptions around them changed (e.g., faster processors, more memory, more disk, SMP, shared-nothing clusters). But their products still aren’t optimized for all of these changes; the changes happen too fast. Why do you think Oracle bought TimesTen (an in-memory DBMS) a while back?

I believe we are in the midst of a new era of special-purpose DBMSs. Consider these examples:

Stream DBMSs (e.g., Skyler, Exegy, and Streambase). Many of these run queries completely upside-down from the normal RDBMS approach – they flow streams of data through query predicates (i.e., restrictions) and then notify relevant standing queries of new possible results.

Memory-resident DBMSs a la TimesTen, designed with the assumption that the entire DBMS is in memory.

Multi-dimensional DBMSs (OLAP servers). While the RDBMS vendors have either bought, built, or (in the case of Oracle) both bought and built, multi-dimensional capabilities, these are mostly layers on top of an underlying relational core.

XML content servers, a la MarkLogic: special-purpose DBMSs optimized not just for XML, but specifically for content marked-up in XML.

XML databases: special-purpose DBMSs (e.g., Tamino, Ipedo) optimized for the storage of data marked-up in XML, often positioned as middle-tier persistent stores in inter-application communication applications (e.g., EII, EAI).

Data warehouse DBMSs, a la Teradata, Netezza, GreenPlum, or HyperRoll. These are DBMSs optimized specifically for the size/scale and query-intensive nature of data warehouses.
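To illustrate the “upside-down” model described in the stream DBMS example above, here is a minimal Python sketch. It does not represent any vendor’s API; the StandingQuery and StreamEngine classes, the event fields, and the sample trades are all hypothetical. Queries are registered first and stay put, and each arriving event is pushed through their predicates, the reverse of storing data first and running queries against it later.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass
class StandingQuery:
    """A long-lived query: a predicate plus the matches it has been sent."""
    name: str
    predicate: Callable[[Event], bool]
    results: List[Event] = field(default_factory=list)

    def notify(self, event: Event) -> None:
        self.results.append(event)


class StreamEngine:
    """Hypothetical engine: queries are stored, data flows past them."""

    def __init__(self) -> None:
        self.queries: List[StandingQuery] = []

    def register(self, query: StandingQuery) -> StandingQuery:
        self.queries.append(query)
        return query

    def push(self, event: Event) -> None:
        # Test the incoming event against every standing query's predicate
        # and notify the queries it satisfies. The event itself is not stored.
        for query in self.queries:
            if query.predicate(event):
                query.notify(event)


engine = StreamEngine()
big_trades = engine.register(StandingQuery("big_trades", lambda e: e["qty"] >= 1000))
acme = engine.register(StandingQuery("acme", lambda e: e["symbol"] == "ACME"))

# Invented sample feed.
for event in [
    {"symbol": "ACME", "qty": 500},
    {"symbol": "INIT", "qty": 2000},
    {"symbol": "ACME", "qty": 1500},
]:
    engine.push(event)
```

A real stream engine would presumably index the predicates rather than loop over all of them, but the inversion of roles between data and queries is the same.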

So the question is this: if you’re building a content application like SafariU, why use an uber-DBMS that includes a huge amount of functionality for what you aren’t trying to do and, worse yet, isn’t optimized for what you are trying to do?

You can ask the same question for streams, warehouses, XML messages, or any other specific type of data that doesn't fit well into relational systems.
