Thanks to two major trends — DBMS specialization and the NoSQL movement — the database management systems space is generating more interest and more innovation than any time I can remember since the 1980s. Ever since around 1990, when the relational database management system (RDBMS) became firmly established, IT has played DBMSroulette: spin the wheel and use the DBMS on which the needle lands — Oracle, DB2, or SQL Server. (If you think this trivializes things, not so fast: a friend who was the lead DBMS analyst at a major analyst firm once quipped to me that this wheel-spinning was his job, circa 1995.)
Obviously, there was always some rational basis for DBMS selection — IBM shops tended to pick DB2, best-of-breed buyers liked Oracle, performance whizzes and finance types often picked Sybase, and frugal shoppers would choose SQL Server, and later MySQL — but there was no differentiation in the model. All these choices were relational database management systems.
Over time, our minds became dulled to orthogonal dimensions of database differentiation:
- The database model. For years, we lived in the database equivalent world of Henry Ford’s Model T: any model you want as long as it’s relational.
- The potential for trade-offs in fundamental database-ness. We became binary and religious about what it meant be a database management system and that attitude blinded us to some fundamental trade-offs that some users might want to make — e.g., trading consistency for scalability, or trading ACID transactions for BASE.
The latter is the domain of Brewer’s CAP theorem which I will not discuss today. The former, the database model, will be the subject of this post.
Every DBMS has some native modeling element (NME). For example, in an RDBMS that NME is the relation (or table). Typically that NME is used to store everything in the DBMS. For example, in an RDBMS:
- User data is stored in tables.
- Indexes are implemented as tables which are joined back to the base tables.
- Administration information is stored in tables.
- Security is usually handled through tables and joins.
- Unusual data types (e.g., XML) are stored in “odd columns” in tables. (If your only model’s a table, every problem looks like a column.)
In general, the more naturally the data you’re storing maps to the paradigm (or NME) of the database, the better things will work. For example, you can model XML documents as tables and store them in an RDBMS, or you can model tables in XML and store them as XML documents, but those approaches will tend to be more difficult to implement and less efficient to process than simply storing tables in an RDBMS and XML documents in an XML server (e.g., MarkLogic).
The question is not whether you can model documents as tables or tables as documents. The answer is almost always yes. Thus, the better question is should you? The most famous example of this type of modeling problem is the storage of hierarchical data in an RDBMS. To quote this article on managing hierarchical data in MySQL:
Most users at one time or another have dealt with hierarchical data in a SQL database and no doubt learned that the management of hierarchical data is not what a relational database is intended for.
(Personally, I blame the failure of Microsoft’s WinFS on this root problem — file systems are inherently hierarchical — but that’s a story for a different day.)
I believe the best way to classify DBMSs is by their native modeling element.
- In hierarchical databases, the NME is the hierarchy. Example: IMS.
- In network databases, it’s the (directed, acyclic) graph. Example: IDMS.
- In relational databases, it’s the relation (or, table). Example: Oracle.
- In object databases, it’s the (typically C++) object class. Example: Versant.
- In multi-dimensional databases, it’s the hypercube. Example: Essbase.
- In document databases, it’s the document. Example: CouchDB.
- In key/value stores, it’s the key/value pair. Example: Redis.
- In XML databases, it’s the XML document. Example: MarkLogic.
The biggest limitation of this approach is that classifying by model fails to capture implementation differences. Some examples:
- I would classify columnar DBMSs (e.g., Vertica) as relational if they model data as tables, and key/value stores (e.g., Hbase) as such if they model data in key/value pairs. This fails to capture the performance advantage that Vertica gets on certain data warehousing problems due to its column orientation.
- I would classify all relational databases as relational, despite implementation optimizations. For example, this approach fails to capture Teradata’s optimizations for large-scale data warehousing, Aster’s optimizations for analytics on big data, or Volt’s optimizations for what Curt Monash calls HVSP.
- I would classify all XML databases as XML databases, despite possible optimization differences for the two basic XML use-cases: (1) XML as message wrapper vs. (2) XML as document markup.
Nevertheless, I believe that DBMSs should be classified first by model and then sub-classified by implementation optimization. For example, a relational database optimized for big data analytics (Aster). An XML database optimized for large amounts of semi-structured information marked in XML (MarkLogic).
In closing, I’d say that we are seeing increasing numbers of customers coming to Mark Logic saying: “well, I suppose we could have modeled this data relationally, but in our business we think of this information as documents and we’ve decided that it’s easier and more natural to manage it that way, so we decided to give you a call.”
After thinking about this for some time, I have one response: keep calling!
No matter how you want to think about MarkLogic Server — an XML server, an XML database, or an XML document database — dare I say an [XML] [document] server|database — it’s definitely a document-oriented, XML-oriented database management system and a great place to put any information that you think is more naturally modeled as documents.