What's a Column-Oriented DBMS?

One of my memes is the rise of special-purpose database management systems (DBMSs). While I’m obviously a big believer in MarkLogic and XML content servers, I believe that XML content servers are simply one example of a whole new class of DBMSs, each designed and optimized for a specific purpose.
As a technologist, I believe this is because I think it’s neither necessary nor desirable (nor, I might add, possible) to infinitely extend the now quarter-century-old RDBMS.

  • Not necessary because federation and web services make it increasingly easy to dedicate special tasks to special servers.
  • Not desirable because the relational database has certain design assumptions that, while quite useful some applications, are wholly inappropriate for others.

Rather than riff again about XML content servers, I thought today I’d pick a different example of a special-purpose DBMS: the column-oriented database.
I’d recently heard that Michael Stonebraker had founded Vertica, a column-oriented DBMS company (complete with ten-cute-points slogan, “the tables have turned”). So I decided to try and figure out what column-oriented DBMS is and why you might want one. Here’s my answer.

First, look at the above picture, which depicts a row-oriented and a column-oriented DBMS. A row-oriented system stores rows together. A column-oriented system stores columns together. So what?
For people who care about the information in just one column, storing that information in a table with other columns reduces information density. That is, the information you care about is striped across rows loaded with (potentially lots) of non-useful information. (Imagine this when the rows are quite long.) That means it’s inherently inefficient to go pull a large number of values for one column.
But who only cares about information in one column? Wait a minute. Think about the fact table in a data warehouse. Now it all makes sense. Data warehouse queries often care a lot about one-column queries (e.g., the fact table in a star schema) and about joins between the fact table and the normalized dimension tables in a snowflake schema.
So a column-oriented database has a number of advantages here:

  • Higher information density means more efficiency in pulling answers off disk and/or in caching them in memory.
  • Storing the column-tables in row-id, value order (as opposed to value, row-id order) greatly accelerates the performance of sort-merge joins. Why? Because the column-tables you want to join are already sorted by row-id eliminating the costly need to sort before the merge. They claim this can accelerate join performance 100x.

So are you going to rip out Oracle from underneath your production SAP implementation and replace it with a column-oriented database? No. That’s not the point. Column-oriented databases, like other special-purpose DBMSs, are not general-purpose RDBMS replacements.
Special-purpose DBMSs are built to do one thing “really well” meaning either (1) doing something practically impossible in, or (2) do something 10 to 100 times faster than, an equivalent RDBMS implementation.
For more information on column-oriented DBMSs, check out the Wikipedia entry here.
Thanks to Mark Logician Ron Avnur for educating me on this topic.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.