Thanks to two major trends — DBMS specialization and the NoSQL movement — the database management systems space is generating more interest and more innovation than at any time I can remember since the 1980s. Ever since around 1990, when the relational database management system (RDBMS) became firmly established, IT has played DBMS roulette: spin the wheel and use the DBMS on which the needle lands — Oracle, DB2, or SQL Server. (If you think this trivializes things, not so fast: a friend who was the lead DBMS analyst at a major analyst firm once quipped to me that this wheel-spinning was his job, circa 1995.)
Obviously, there was always some rational basis for DBMS selection — IBM shops tended to pick DB2, best-of-breed buyers liked Oracle, performance whizzes and finance types often picked Sybase, and frugal shoppers chose SQL Server and, later, MySQL — but there was no differentiation in the model. All of these choices were relational database management systems.
Over time, our minds became dulled to orthogonal dimensions of database differentiation:
- The database model. For years, we lived in the database equivalent of Henry Ford’s Model T: any model you want, as long as it’s relational.
- The potential for trade-offs in fundamental database-ness. We became binary and religious about what it meant to be a database management system, and that attitude blinded us to some fundamental trade-offs that some users might want to make — e.g., trading consistency for scalability, or trading ACID transactions for BASE.
The latter is the domain of Brewer’s CAP theorem, which I will not discuss today. The former, the database model, is the subject of this post.
Every DBMS has some native modeling element (NME). For example, in an RDBMS that NME is the relation (or table). Typically that NME is used to store everything in the DBMS. For example, in an RDBMS:
- User data is stored in tables.
- Indexes are implemented as tables which are joined back to the base tables.
- Administration information is stored in tables.
- Security is usually handled through tables and joins.
- Unusual data types (e.g., XML) are stored in “odd columns” in tables. (If your only model’s a table, every problem looks like a column.)
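To make the everything-is-a-table point concrete, here is a minimal sketch (mine, not from the post) using Python's built-in sqlite3 module and a hypothetical orders table; note that even the engine's own catalog is just another table, queried with the same SQL as user data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# User data is stored in a table (hypothetical schema).
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
cur.execute("INSERT INTO orders VALUES (1, 'Acme', 99.50)")

# An index is another structure the engine maintains alongside the base table.
cur.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# Administration information, i.e., the catalog itself, is also exposed as a table,
# queried with the very same native modeling element (the relation).
for row in cur.execute("SELECT type, name FROM sqlite_master"):
    print(row)   # ('table', 'orders'), ('index', 'idx_orders_customer')
```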
In general, the more naturally the data you’re storing maps to the paradigm (or NME) of the database, the better things will work. For example, you can model XML documents as tables and store them in an RDBMS, or you can model tables in XML and store them as XML documents, but those approaches will tend to be more difficult to implement and less efficient to process than simply storing tables in an RDBMS and XML documents in an XML server (e.g., MarkLogic).
The question is not whether you can model documents as tables or tables as documents; the answer is almost always yes. The better question is: should you? The most famous example of this type of modeling problem is the storage of hierarchical data in an RDBMS. To quote this article on managing hierarchical data in MySQL:
Most users at one time or another have dealt with hierarchical data in a SQL database and no doubt learned that the management of hierarchical data is not what a relational database is intended for.
(Personally, I blame the failure of Microsoft’s WinFS on this root problem — file systems are inherently hierarchical — but that’s a story for a different day.)
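As a rough illustration of the pain the quote describes (my toy example, with a hypothetical category table), here is a hierarchy flattened into rows and reassembled at query time with a recursive query:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# A hierarchy flattened into the relational NME: each row points at its parent.
cur.execute("CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER)")
cur.executemany("INSERT INTO category VALUES (?, ?, ?)", [
    (1, "Electronics", None),
    (2, "Televisions", 1),
    (3, "LCD", 2),
])

# Walking the tree requires a recursive query: the hierarchy has to be
# reassembled at query time because the model has no native notion of nesting.
rows = cur.execute("""
    WITH RECURSIVE tree(id, name, depth) AS (
        SELECT id, name, 0 FROM category WHERE parent_id IS NULL
        UNION ALL
        SELECT c.id, c.name, t.depth + 1
        FROM category c JOIN tree t ON c.parent_id = t.id
    )
    SELECT name, depth FROM tree
""").fetchall()
print(rows)  # [('Electronics', 0), ('Televisions', 1), ('LCD', 2)]
```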
I believe the best way to classify DBMSs is by their native modeling element.
- In hierarchical databases, the NME is the hierarchy. Example: IMS.
- In network databases, it’s the (directed, acyclic) graph. Example: IDMS.
- In relational databases, it’s the relation (or, table). Example: Oracle.
- In object databases, it’s the (typically C++) object class. Example: Versant.
- In multi-dimensional databases, it’s the hypercube. Example: Essbase.
- In document databases, it’s the document. Example: CouchDB.
- In key/value stores, it’s the key/value pair. Example: Redis.
- In XML databases, it’s the XML document. Example: MarkLogic.
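To make the notion of a native modeling element a little more tangible, here is a small sketch (the field names are hypothetical) of the same information expressed against three of the NMEs above: a relational row, a key/value pair, and an XML document.

```python
import xml.etree.ElementTree as ET

# Relational NME: a row in a table (column name -> atomic value).
row = ("1776-07-04", "Declaration of Independence", "Philadelphia")

# Key/value NME: an opaque value looked up by key.
kv = {"doc:1776": b"Declaration of Independence ..."}

# Document NME: nesting and markup are first-class parts of the model.
doc = ET.fromstring(
    "<declaration date='1776-07-04'>"
    "<title>Declaration of Independence</title>"
    "<place>Philadelphia</place>"
    "</declaration>"
)
print(row[1], list(kv)[0], doc.find("title").text)
```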
The biggest limitation of this approach is that classifying by model fails to capture implementation differences. Some examples:
- I would classify columnar DBMSs (e.g., Vertica) as relational if they model data as tables, and key/value stores (e.g., HBase) as key/value stores if they model data as key/value pairs. This fails to capture the performance advantage that Vertica gets on certain data warehousing problems due to its column orientation (see the toy sketch after this list).
- I would classify all relational databases as relational, despite implementation optimizations. For example, this approach fails to capture Teradata’s optimizations for large-scale data warehousing, Aster’s optimizations for analytics on big data, or Volt’s optimizations for what Curt Monash calls HVSP.
- I would classify all XML databases as XML databases, despite possible optimization differences for the two basic XML use-cases: (1) XML as message wrapper vs. (2) XML as document markup.
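Here is the toy sketch promised above; it illustrates the general row-versus-column idea only and is emphatically not a description of Vertica's internals. The same relation is laid out both ways, and a warehouse-style scan over a single column touches far less data in the columnar layout.

```python
# The same relation, laid out two ways.
rows = [  # row-oriented: each record stored contiguously
    (1, "Acme", 99.50),
    (2, "Globex", 12.00),
    (3, "Initech", 45.25),
]

columns = {  # column-oriented: each column stored contiguously
    "id":       [1, 2, 3],
    "customer": ["Acme", "Globex", "Initech"],
    "total":    [99.50, 12.00, 45.25],
}

# A warehouse-style scan (sum one column over many rows):
# the row store must touch every field of every record...
total_from_rows = sum(r[2] for r in rows)
# ...while the column store touches only the one column it needs,
# which is also far easier to compress.
total_from_columns = sum(columns["total"])
assert total_from_rows == total_from_columns
```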
Nevertheless, I believe that DBMSs should be classified first by model and then sub-classified by implementation optimization. For example: a relational database optimized for big data analytics (Aster), or an XML database optimized for large amounts of semi-structured information marked up in XML (MarkLogic).
In closing, I’d say that we are seeing increasing numbers of customers coming to Mark Logic saying: “well, I suppose we could have modeled this data relationally, but in our business we think of this information as documents and we’ve decided that it’s easier and more natural to manage it that way, so we decided to give you a call.”
After thinking about this for some time, I have one response: keep calling!
No matter how you want to think about MarkLogic Server — an XML server, an XML database, or an XML document database — dare I say an [XML] [document] server|database — it’s definitely a document-oriented, XML-oriented database management system and a great place to put any information that you think is more naturally modeled as documents.
In the domain of documents, there is another model: information retrieval using a text-search engine. In this model it doesn’t matter whether the documents are in XML or some other format.
Hi Dinesh,
Text search engines typically don’t care about document format, which is a double-edged sword: they are not bothered by structure, but they cannot exploit it either.
Search engines that use “filters” to access hundreds of different document types typically use those filters to strip away everything but the text — e.g., XML metadata about structure or semantics, or even basic formatting information, that might still have been useful in formulating queries.
The other huge issue with text search engines is their inability to express complex database-like queries. From a DBMS perspective, I have always viewed text search engines as “one trick ponies” — they run one query really well: return a list of links to documents that contain a given word or phrase.
But there are an infinite number of *other* queries that people might want to run.
At Mark Logic, we like text search — don’t get me wrong. We just think it should be deeply integrated into a database so you can seamlessly combine text, structural, and semantic metadata constraints in queries.
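As a minimal sketch of what text search deeply integrated into a database can look like, here is a toy example using SQLite's FTS5 full-text extension from Python (assuming your SQLite build includes FTS5; the table names are hypothetical). It illustrates the idea, not MarkLogic's API: one query mixes a word constraint with ordinary structured predicates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Structured metadata lives in an ordinary table...
cur.execute("CREATE TABLE doc_meta (id INTEGER PRIMARY KEY, author TEXT, year INTEGER)")
# ...while the text lives in a full-text index (requires an SQLite build with FTS5).
cur.execute("CREATE VIRTUAL TABLE doc_text USING fts5(body)")

cur.execute("INSERT INTO doc_meta VALUES (1, 'Smith', 2006)")
cur.execute("INSERT INTO doc_text (rowid, body) VALUES (1, 'managing hierarchical data in a relational database')")

# One query combines a text constraint (MATCH) with structural constraints (author, year).
rows = cur.execute("""
    SELECT m.id, m.author, m.year
    FROM doc_text JOIN doc_meta m ON m.id = doc_text.rowid
    WHERE doc_text MATCH 'hierarchical' AND m.year >= 2005
""").fetchall()
print(rows)  # [(1, 'Smith', 2006)]
```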
A key end-user complaint about enterprise search systems is that users want a consumer-like search experience, i.e., a simple search interface to enterprise data (or to the silo of data they need access to) that also satisfies their information need. This implies (as you say) that the search engine should look after all the filtering of different document formats. A new generation of search engines can utilize any available metadata. It is known that the Google Appliance group is making a concerted effort to crack the enterprise search market. That leaves support for database-like queries, which is really what end users want to get away from.
Databases of various types, including XML databases for core transactional systems, will always be required. However, as the volume of unstructured data within enterprises increases inexorably, they will eventually resort to dumping it all into a “search data warehouse” driven by advanced search engines.
DBMS-level support for processing complex queries is definitely what people want: no search engine today (with the possible exception of Mark Logic, if you classified it as such) could possibly answer the kinds of questions people want to ask in a BI environment. So, no, people are not trying to get away from DBMS-style queries. They may wish to get away from the classical BI interfaces (which, by the way, are a $2B/year business) that generate those queries, but trust me, they will still want to ask complex questions, and that means an engine capable of processing them.
So separate the UI from the DBMS as a first step. And I think we agree that the DBMSs that handle those queries will need to look like search engines on the inside.
Typing fast, sorry if that’s not too coherent.