Category Archives: XML server

Classifying Database Management Systems: Regular and NoSQL

Thanks to two major trends — DBMS specialization and the NoSQL movement — the database management systems space is generating more interest and more innovation than any time I can remember since the 1980s.  Ever since around 1990, when the relational database management system (RDBMS) became firmly established, IT has played DBMSroulette:  spin the wheel and use the DBMS on which the needle lands — Oracle, DB2, or SQL Server.  (If you think this trivializes things, not so fast:  a friend who was the lead DBMS analyst at a major analyst firm once quipped to me that this wheel-spinning was his job, circa 1995.)

Obviously, there was always some rational basis for DBMS selection — IBM shops tended to pick DB2, best-of-breed buyers liked Oracle, performance whizzes and finance types often picked Sybase, and frugal shoppers would choose SQL Server, and later MySQL — but there was no differentiation in the model.  All these choices were relational database management systems.

Over time, our minds became dulled to orthogonal dimensions of database differentiation:

  • The database model.  For years, we lived in the database equivalent world of Henry Ford’s Model T:  any model you want as long as it’s relational.
  • The potential for trade-offs in fundamental database-ness.  We became binary and religious about what it meant be a database management system and that attitude blinded us to some fundamental trade-offs that some users might want to make — e.g., trading consistency for scalability, or trading ACID transactions for BASE.

The latter is the domain of Brewer’s CAP theorem which I will not discuss today.  The former, the database model, will be the subject of this post.

Every DBMS has some native modeling element (NME). For example, in an RDBMS that NME is the relation (or table).  Typically that NME is used to store everything in the DBMS.  For example, in an RDBMS:

  • User data is stored in tables.
  • Indexes are implemented as tables which are joined back to the base tables.
  • Administration information is stored in tables.
  • Security is usually handled through tables  and joins.
  • Unusual data types (e.g., XML) are stored in “odd columns” in tables.  (If your only model’s a table, every problem looks like a column.)

In general, the more naturally the data you’re storing maps to the paradigm (or NME) of the database, the better things will work.  For example, you can model XML documents as tables and store them in an RDBMS, or you can model tables in XML and store them as XML documents, but those approaches will tend to be more difficult to implement and less efficient to process than simply storing tables in an RDBMS and XML documents in an XML server (e.g., MarkLogic).

The question is not whether you can model documents as tables or tables as documents.  The answer is almost always yes.  Thus, the better question is should you?  The most famous example of this type of modeling problem is the storage of hierarchical data in an RDBMS.  To quote this article on managing hierarchical data in MySQL:

Most users at one time or another have dealt with hierarchical data in a SQL database and no doubt learned that the management of hierarchical data is not what a relational database is intended for.

(Personally, I blame the failure of Microsoft’s WinFS on this root problem — file systems are inherently hierarchical — but that’s  a story for a different day.)

I believe the best way to classify DBMSs is by their native modeling element.

  • In hierarchical databases, the NME is the hierarchy.  Example:  IMS.
  • In network databases, it’s the (directed, acyclic) graph. Example:  IDMS.
  • In relational databases, it’s the relation (or, table).  Example:  Oracle.
  • In object databases, it’s the (typically C++) object class. Example:  Versant.
  • In multi-dimensional databases, it’s the hypercube. Example:  Essbase.
  • In document databases, it’s the document. Example:  CouchDB.
  • In key/value stores, it’s the key/value pair. Example:  Redis.
  • In XML databases, it’s the XML document. Example:  MarkLogic.

The biggest limitation of this approach is that classifying by model fails to capture implementation differences. Some examples:

  • I would classify columnar DBMSs (e.g., Vertica) as relational if they model data as tables, and key/value stores (e.g., Hbase) as such if they model data in key/value pairs.  This fails to capture the performance advantage that Vertica gets on certain data warehousing problems due to its column orientation.
  • I would classify all relational databases as relational, despite implementation optimizations.  For example, this approach fails to capture Teradata’s optimizations for large-scale data warehousing, Aster’s optimizations for analytics on big data, or Volt’s optimizations for what Curt Monash calls HVSP.
  • I would classify all XML databases as XML databases, despite possible optimization differences for the two basic XML use-cases:  (1) XML as message wrapper vs. (2) XML as document markup.

Nevertheless, I believe that DBMSs should be classified first by model and then sub-classified by implementation optimization.  For example, a relational database optimized for big data analytics (Aster).  An XML database optimized for large amounts of semi-structured information marked in XML (MarkLogic).

In closing, I’d say that we are seeing increasing numbers of customers coming to Mark Logic saying:  “well, I suppose we could have modeled this data relationally, but in our business we think of this information as documents and we’ve decided that it’s easier and more natural to manage it that way, so we decided to give you a call.”

After thinking about this for some time, I have one response:  keep calling!

No matter how you want to think about MarkLogic Server — an XML server, an XML database, or an XML document database — dare I say an [XML] [document] server|database  — it’s definitely a document-oriented, XML-oriented database management system and a great place to put any information that you think is more naturally modeled as documents.

Positioning MarkLogic Server

Here’s a great picture from our VP of engineering, Ron Avnur, on how he positions MarkLogic Server relative to other software categories. It’s an elegant and simple way of explaining where we fit.

The two dimensions are structure and query type. Structure can either be predefined or ad hoc (and often, in the document world, there is a predefined structure that no one actually uses, which is de facto ad hoc). Query types can either be predefined (i.e., known in advance) or ad hoc (i.e., not known in advance).

Let’s look at the quadrants that result:

  • Bottom left is where both structure and queries and predefined. Hierarchical DBMSs, like IMS, live in this quadrant. In these (now legacy) systems, the structure of the data is rigidly defined as are the queries that may be run against them. These databases provide high performance, but their inflexibility became their Achilles’ heel.
  • Bottom right is where structure is predefined but queries are ad hoc. The quadrant defines the relational database, which brought unprecedented flexibility to database querying, eventually enabling the modern BI market. Data structure is predefined through the creation of tables with defined names/columns to hold the data. Queries are ad hoc — in a well designed relational database, the system can provide the results for almost any imaginable query. (And with the right indexes, it can provide those results fairly quickly.)
  • Top left is where queries are predefined but structure is not. This — and this is non-obvious to most people — is the zone of the enterprise search engine. People tend to think of search engines as providing high flexiblity because you can type any word in the search box. In reality, seen from a database viewpoint, search engines provide a small number of parametrized queries. (It’s the parametrization that gives the impression of flexibility.) The small number of queries include (1) return list of documents where document contains word or phrase, (2) return list of documents where field-in-document contains word or phrase, (3) either query (1) or (2) where word or phrase is replaced with the search engine’s basically Boolean primitive query language (i.e., AND, OR, NOT).
  • The top right is the tricky zone where both queries and structure are not defined in advance. This is the zone of the XML Server, like MarkLogic. In these systems, content can be ingested “as is” without adherence to any predefined structure. Queries are ad hoc, and written in XQuery with full-text extensions. Given the proper indexes, these systems can run virtually any query against the content with high performance.

Hopefully this sheds some light on my soundbite that: “at Mark Logic, we are doing for (XML) documents what the relational database did for data.”

Related articles by Zemanta
Reblog this post [with Zemanta]

Mark Logic in EContent Magazine Dynamic Navigation Story

A rather overdue post to highlight that Mark Logic was featured a few months back in an EContent Magazine story entitled Reaping Information: Dynamic Navigation Helps Users (PDF).

Excerpts:

Delivering information in ways that make the most sense to users is a key characteristic of MarkLogic Server, an XML Server that allows users to store, manage, manipulate, and deliver information

Indeed, a key use-case for MarkLogic is as an information delivery platform. More:

Media company ALM uses MarkLogic Server for its enterprise content repository, which holds more than 2 decades worth of news and analysis for and about the legal market.

ALM was acquired by Incisive Media a while back but nevertheless remains a customer. More:

Oxford University Press has organized its reference works on African-Americans into a central repository it calls the African American Studies Center (AASC), which allows researchers the ability to search through images and articles, arranging them in chronological order.

AASC is not only a very cool MarkLogic-based application, but also — perhaps more importantly — it’s just one slice of Oxford’s content.

Once a publisher builds their content application platform, it is relatively easy to take different slices of their content to build new and different information products. For example, Oxford Islamic Studies Online (OISO) is built on the same platform as the AASC, and I’m sure the OISO’s marginal development cost was reduced because it could leverage the fixed costs invested the development of OUP’s (MarkLogic-based) publishing platform.