Category Archives: Michael Stonebraker

Yes, Virginia, MarkLogic is a NoSQL System

The other day I noticed a taxonomy used on one of the NoSQL Database blogs that went like this:

Types of NoSQL systems

  • Core NoSQL Systems
    • Wide column stores
    • Document stores
    • Key-value / tuple stores
    • Eventually consistent key-value stores
    • Graph databases
  • Soft NoSQL Systems (not the original intention …)
    • Object databases
    • Grid database solutions
    • XML databases
    • Other NoSQL-related databases

I, perhaps obviously, take some umbrage at having MarkLogic (acceptably classified as an XML database) being declared “soft NoSQL.”  In this post I’ll explain why.

Who decided that being open source was a requirement to be real NoSQL system?  More importantly, who gets to decide?  NoSQL – like the Tea Party – is a grass-roots, effectively leaderless movement towards relational database alternatives.  Anyone arguing original intent of the founders is misguided because there is no small group of clearly identified founders to ask.  In reality, all you can correctly argue is what you think was the intent of the initial NoSQL developers and early adopters, or — perhaps more customarily — why you were drawn to them yourself, disguised or confused as original founder intent.

As mentioned here, movements often appear homogeneous when they are indeed heterogeneous.  What looks like a long line of demonstrators protesting a single cause is in fact a rugby scrum of different groups pushing in only generally aligned directions.  For example, for each of the following potential motivations, I am certain that I can find some set of NoSQL advocates that are motivated by it:

  • Anger at Oracle’s heavy-handed licensing policies
  • The need to store unstructured or semi-structured data that doesn’t fit well into relations
  • The impedance mismatch with relational databases
  • A need and/or desire to use open source
  • An attempt to reduce total cost
  • A desire to land at a different point in the Brewer CAP Theorem triangle of consistency, availability, and partition tolerance
  • Coolness / wannabe-ism, as in, I want to be like Google or Facebook

(Since this was a source of confusion in prior posts, note that this is not to claim the inverse:  that all NoSQL advocates are motivated by all of the possible motivations.)

I’d like to advocate a simple idea:  that NoSQL means NoSQL.  That a NoSQL system is defined as:

A structured storage system that is not based on relational database technology and does not use SQL as its primary query language

In short, my proposed definition means that NoSQL (broadly) = NoSQL (literally) + NoRelational.  In short:  relational database alternatives.  It does not mean:

  • NoDBMS.  We should not take NoSQL to exclude systems we would traditionally define as DBMSs.  For example, supporting ACID transactions or supporting a non-SQL query language (e.g., XQuery) should not be exclusion criteria for NoSQL.
  • NoCommercialSoftware.  While many of the flagship NoSQL projects (e.g., Hadoop, CouchDB) are open source projects, that should be not a defining criterion.  NoSQL should be a technological, not a delivery- or business-model, classification.  Technology and delivery model are orthogonal dimensions.   We should be able to speak of traditionally licensed, open source licensed, and cloud-hosted NoSQL systems if for no other reason than understanding the nuances of the various business/delivery models is a major task unto itself.  Do you mean open source or open core?  Is it open source or faux-pen source?  Under which open source license?  How should I think of a hosted subscription service that is a based on or a derivative of an open source project?

Recently, I’ve heard a piece of backpeddling that I’ve found rather irritating:  that NoSQL was never intended to mean “no SQL,” it was actually intended to mean “not only SQL.”  Frankly, this strikes me as hogwash:  uh oh, I’m afraid that people are seeing us as disruptors and it’s probably easier to penetrate the enterprise as complementary, not competitive, so let’s turn what was a direct assault into a flanking attack.

To me, it’s simple:  NoSQL means NoSQL.  No SQL query language and no relational database management system.  Yes, it’s disruptive and — by some measures — “crazy talk” but no, we shouldn’t hide because there are lots of perfectly valid (and now socially acceptable) reasons to want to differ from the relational status quo.

In effect, my definition of NoSQL is relational database alternative.  Such options include both alternative databases (e.g., MarkLogic) and database alternatives (e.g., key/value stores).  This, of course, then cuts at your definition of database management system where I (for now at least) still require the support of a query language and the option to have ACID transactions.

By the way, I understand the desire to exclude various bandwagon-jumpers from the NoSQL cause.  Like most, I have no interest in including thrice-reborn object databases in the discussion, but if the cost of excluding them is excluding systems like MarkLogic then I think that cost is too high.  Many people contemplating the top-of-mind NoSQL systems (e.g., Hadoop) could be better served using MarkLogic which addresses many typical NoSQL concerns, including:

  • Vast scale
  • High performance
  • Highly parallel shared-nothing clusters
  • Support for unstructured and semi-structured data

All with all the pros (and cons) of being a commercial software package and without requiring reduced consistency:  losing a few Tweets won’t kill Twitter, but losing a few articles, records, or individuals might well kill a patient, bank, or counter-terrorism agency.  BASE is fine for some; many others still need ACID.  Michael Stonebraker has some further points on this idea in this CACM post.

I’d like to suggest that we should combine the ideas in this post with the ideas in my prior one, Classifying Database Management Systems.  That post says the correct way to classify DBMSs is by their native modeling element (e.g., table, class, hypercube).  This post says that NoSQL is semi-orthogonal – i.e., I can imagine a table-oriented database that doesn’t use SQL as its query language, but I doubt that any exist.  Applying my various rules, the combined posts say that:

  • Aster is a SQL database optimized for analytics on big data
  • MarkLogic is an XML [document] database optimized for large quantities of semi-structured information and a NoSQL system
  • CouchDB is a document database and a NoSQL system
  • Reddis is a key/value store and a NoSQL system
  • VoltDB is a SQL database optimized to solve one of the two core problems that NoSQL systems are built for (i.e., high-volume simple processing)

Finally, I’d conclude that even with these rules I have trouble classifying MarkLogic because of multiple inheritance:  MarkLogic is both a document database and an XML database, it is difficult to pick one over the other, and I there certainly are non-document-oriented XML database systems.   Similar issues exist with classifying the various hybrids of document databases and key/value stores.  So while I may have more work to do on building an overall taxonomy, I am absolutely sure about one thing:  MarkLogic is a NoSQL system.


* The “Yes, Virginia” phrase comes from a 1897 story in the New York Sun.  For more, see here.

The Database Tea Party: The NoSQL Movement

Adam Smith’s invisible hand never rests.  Just five years ago, the database market looked like a static, three-player $10B/year oligopoly where the primary forces were inertia and profit-taking.  Today, we have two major forces disrupting the comfortable stasis that has developed over the past 30 years.

  • One force is DBMS specialization:  while the general-purpose RDBMS is useful for a broad range of applications, it is optimal for few of them.  The RDBMS has slowly become expensive bloatware that is functionally a jack of all trades, master of none.  MIT’s Michael Stonebraker calls the RDBMS a one size fits all solution.
  • The other force is NoSQL, an organic and rapidly-growing industry movement away from relational databases, driven by a number of factors including both technology and cost.

The purpose of this post is to share my thoughts on NoSQL.  Make no mistake, like the Tea Party Movement, NoSQL is a rebellion; just look at the name.  But like most demonstrations, not everyone is marching for the same reasons.  Here are some of the things I think various members of the NoSQL crowd are marching against:

  • Table-oriented, 1960s-era database technology:  RDBMSs were designed for handling data and short-text fields, necessitate mapping programmatic objects to tables (i.e., the impedance mismatch), and require the use of an increasingly stone-age query language, SQL.
  • Scalability:  relational databases were not designed to handle and do not generally cope well with Internet-scale, “big data” applications.  Most of the big Internet companies (e.g., Google, Yahoo, Facebook) do not rely on RDBMS technology for this reason.
  • High prices and the heavy-handed treatment of customers:  both stem from the underlying oligopoly and the lack of credible alternative suppliers
  • Closed source:  the inability to customize the internals of the DBMS engine to meet specific needs
  • Bloatware:  ironically that while RDBMSs are perceived as light in requirements that matter (e.g., scalability), they are  also seen as over-engineered for features that don’t.  (ACID transactions are a favorite target in this department.)
  • DBA supremacy.  For years, corporate DBAs called the shots on where strategic data assets would be stored, and thus how they would be accessed.  This created headaches for the programmers of the world who, in response, have done as much as possible to abstract away the database (e.g., Ruby on Rails).

On the flip side, there are things the NoSQL crowd are fighting for:

  • Open source, implying control.  The ability that open source software provides to customize product functionality.
  • Open source, implying free.  The often-flawed notion that the absence of software license fees results in a reduced lifetime cost of ownership.
  • Coolness, or the “I want to be like Google” effect.  If Google’s got BigTable,  Yahoo’s got Hadoop, and Facebook’s got Cassandra, then we should build our own, too.  Our app is hard; we’re smart guys, too.
  • Vengeance, or the “I’m so mad at Oracle that I’ll do anything” effect.  Yes, some folks are just plain mad enough at Oracle to either go write their own DBMS, or take on the support of a very low-level infrastructure technology.

So, if you’re considering a NoSQL solution — a class in which I include MarkLogic — you need to figure out what you’re marching against, what you’re fighting for, and ultimately what will meet your needs at the lowest total cost of ownership.

My first recommendation to detect and, where applicable, kill off the coolness effect.  Google is swimming in money and PhDs.  They can build anything they want regardless of whether they should and, right or wrong,  for Google it just doesn’t matter.  So unless you have Google’s business model and talent pool, you probably shouldn’t copy their development tendencies.

Heck, I get the coolness attraction.  I think infrastructure software is cool, too.  That’s why I was an OS geek early on and have spent my career around databases.  But I surely don’t think that F1000 companies and government agencies should build their own DBMSs, nor fall into the trap of thinking that open source low-level stores are a free and easy way to avoid Oracle license fees.  Cool shouldn’t be in the equation.  Technology suitability and total cost should be.  Period.

My second recommendation is to orthogonalize the open source question, making it independent of functional requirements.  (This breaks if source customization is a requirement, but remember that requirement is often fictional:  most open source users don’t customize.)  If you’re struggling with an RDBMS on a given application problem you shouldn’t say:  we need an open source, NoSQL type thing.  You should say:  we need to look at relational database alternatives.  Those alternatives include a open source database projects (e.g., MongoDB, CouchDB) and distributed computing frameworks (e.g., Hadoop), but they also include commercial software offerings such as specialized DBMSs like Streambase (for real-time streams), Aster (for analytics on big data), and MarkLogic (for semi-structured data).  Don’t throw out the commercial-software-benefits baby with the RDBMS bathwater.

My personal take on this issue is that:

  • Relational databases, like the mainframe in 1985,  are entering the Autumn of their lives.  They won’t die quickly and mainframe isn’t dead today, but their best days are behind them.
  • Our kids will see SQL the way we see COBOL.  Some people can’t stand when I say this, but I think they’re in denial.  There is no logical reason to assume that the relational database and the SQL language are the endpoints in database evolution.  Yes, Larry Ellison is powerful.  But Adam Smith is more so.
  • Our kids will see no data/document dichotomy.  They will just see digital information.  We need to understand and remember that the data/document dichotomy is an artifact of the limitations of the tools and technologies with which we grew up.
  • Some of the NoSQL hype is an over-reaction to the database oligopoly.  I believe there are organizations out there who should be using alternative commercial databases, but instead are using open source NoSQL-type projects due to coolness, anger, or a mistaken belief that open source always has a lower total cost of ownership.  I believe rationality will return to these people.  One day management will say:  “Holy cow!  Why in the world are we paying programmers to write and support software at this low a level?”  (This is potentially avoidable if you can mentally project yourself into the future now and imagine how you will look back at the coming three years.)
  • Some of the NoSQL hype is a valid reaction to the technological limits of relational databases and the impedance mismatch in programming on them.

In the end, I think it’s great that the NoSQL movement is happening.  It’s awakening people to traditional RDBMS alternatives.  It’s making people understand that they don’t have to write big checks for commodity software.  It’s helping people solve problems that they can’t solve, or solve efficiently, on relational technology.

My axe to grind is simple:  just because you’re throwing out Oracle, don’t throw out all DBMSs and all commercial software with it.  Take a breath.  Look at all your alternatives.  Study total costs and technology applicability.  And make your best decision.

Interesting Writings on NoSQL

XML: YAFF, YADT, or Whole World?

If you have a bunch of XML and are looking for of a place to put it, then I think I may have come up with a simple test that might be helpful.

In talking with prospective vendors of XML repositories (definition: software that lets you store, search, analyze and deliver XML), try to establish what I’ll call “XML vision compatibility.” Quite simply, try to figure out if the vendor’s vision of XML is consistent with your own. To help with that exercise, I’ll define what I see as the three common XML vendor visions:

  • YAFF (yet another file format)
  • YADT (yet another data type)
  • Whole world

YAFF Vendors
Vendors with the YAFF vision view XML as yet another file format. ECM vendors clearly fall into this category (“oh yes, XML is one of the 137 file formats you can manage in our system”). So do enterprise search vendors (“oh yes, we have filters for XML formatted files which clear out all those nasty tags and feed our indexing engine the lovely text.”)

For example, let’s look at how EMC Documentum — one of the more XML-aggressive ECM vendors — handles XML on its website.

Hmm. There’s no XML on that page. But lots of information about records management, digital asset management, document capture, collaboration and document managent (it’s not there either). Gosh, I wonder where it is? SAP integration? Don’t think so. Hey, let’s try Documentum Platform, whatever that is.

Not there, either. Now that’s surprising because I really have no idea where else it might be. Oh, wait a minute. I didn’t scroll the page down. Let’s try that.

There we go. We finally found it. I knew they were committed to XML. What’s going on here is that EMC has a huge, largely vendor consolidation-driven (e.g., Documentum, Captiva, Document Sciences, x-Hive, Kazeon) vision of what content management is. And XML is just one tiny piece of that vision. XML is, well, yet another file format among the scores that they have manage, archive, capture, and provide workflow, compliance, and process management against. The vision isn’t about XML. It’s about content. That’s nice if you have an ECM problem (and a lot of money to solve it); t’s not so nice if you have an XML problem, or more precisely a problem that can be solved with XML.

YADT Vendors
Vendors with the YADT vision view XML as yet another data type. These are the relational database management system vendors (e.g., Oracle) who have decided that the best way to handle XML is to make it a valid datatype for a column in a table.

The roots of this approach go back to the late 1980s and Ingres 6.3 (see this semi-related blast from the past) which was the first commercial DBMS to provide support for user-defined datatypes. All the primitives for datatyping were isolated from the core server code and made extensible through standard APIs. So, for example, if you wanted to store complex numbers of the form (a, bi) all you had to do was to write some primitives so the server would know:

  • What they look like — i.e., (a, bi)
  • Any range constraints (the biggest, the smallest)
  • What operators should be available (e.g., +, -)
  • How to implement those operators — (a, bi) + (c, di) = (a+c, (b+d)i)

It was — far as I remember — yet another clever idea from the biggest visionary in database management systems after Codd himself: Michael Stonebraker then of UC Berkeley and now of MIT. After founding Ingres, Stonebraker went on found Illustra which was all about “datablades” — a sexy new name for user-defined types. Datablades, in turn, became sexy bait for Informix to buy the company with an eye towards leveraging the technology towards unseating Oracle from its leadership position. It didn’t happen.

User-defined datatypes basically didn’t work. There were two key problems:

  • You had user-written code running in the same address space as the database server. This made it nearly impossible to determine fault when the server crashed. Was it a database server bug, or did the customer cause problem in implementing a UDT? While RDBMS customers were well qualified to write applications and SQL, writing server-level was quite another affair. This was a bad idea.
  • Indexing and query processing performance. It’s fairly simple to say that, for example, a text field looks like a string of words and the + operator means concatenate. It’s basically impossible for a end customer to tell the query optimizer how to process queries involving those text fields and how to build indexes that maximize query performance. If getting stuff into UDTs was a level-5 challenge, getting stuff back out quickly was a level-100 one.

So while the notion of end users adding types to a DBMS basically failed, when XML came along the database vendors dusted off this approach, in saying effectively: let use all those hooks we put in to build support for XML types ourselves. And they did. Hence what I call the “XML column” approach to storing XML in a relational database.

After all, if your only data modeling element’s a table, then every problem looks like a column.

Now this approach isn’t necessarily bad. If, for example, you have a bunch of resumes and want to store attribute data in columns (e.g., name, address, phone, birthdate) and keep an XML copy of the resume alongside, then this might be a reasonable way to do things. That is, if you have a lot of data and a touch of XML, this may be the right way to do things.

So again, it comes down to vision alignment. If XML is just another type of data that you want to store in a column, then this might work for you. Bear in mind you’ll:

  • Probably have to setup separate text and pre-defined XML path indexes (a hassle on regular schemas, an impossibility on irregular ones),
  • Face some limitations in how those indexes can be combined and optimized in processing queries,
  • Need to construct frankenqueries that mix SQL and XQuery, whose mixed-language semantics are sometimes so obscure that I’ve seen experts argue for hours about what the “correct” answer for a given queries is,
  • And suffer from potentially crippling performance problems as you scale to large amounts of XML.

But if those aren’t problems, then this approach might work for you.

This is what it looks like when a vendor has a YADT vision. Half the fun in storing XML in an RDBMS is figure out which query language and which store options you want to use. See the table that starts on page 9, spans four pages, and considers nearly a dozen criteria to help you decide which of the three primary storage options you should use:

See this post from IBM for more Oracle-poking on the complexity of storage options available. Excerpt:

Oracle has long claimed that the fact that Oracle Database has multiple different ways to store XML data is an advantage. At last count, I think they have something like seven different options:

  • Unstructured
  • XML-Object-Relational, where you store repeating elements in CLOBs
  • XML-Object-Relational, where you store repeating elements in VARRAY as LOBs
  • XML-Object-Relational, where you store repeating elements in VARRAY as nested tables
  • XML-Object-Relational, where you store repeating elements in VARRAY as XMLType pointers to BLOBs
  • XML-Object-Relational, where you store repeating elements in VARRAY as XMLType pointers to nested tables
  • XML-Binary

Their argument is that XML has diverse use cases and you need different storage methods to handle those diverse use cases. I don’t know about you, but I find this list to be a little bewildering. How do you decide among the options? And what happens if you change your mind and want to change storage method?

Such is life in the land of putting XML in tables because your database management system has columns.

Whole World Vendors
Vendors with the whole world vision view XML as, well, their whole world.

And when I say XML, I don’t mean information that’s already in XML. I mean information that is either already in XML (e.g., documents, information in any horizontal or industry-specific XML standard) or that is best modeled in XML (e.g., sparse data, irregular information, semi-structured information, information in no, multiple, and/or time-varying schemas).

“Whole world” vendors don’t view XML as one format, but as a plethora: docbook, DITA, s1000d, xHMTL, TEI, XBRL, the HL7 standards in healthcare, the Acord standards in insurance, Microsoft’s Open Office XML format, Open Document Format, Adobe’s IDML, chemical markup lanuage, MathML, the DoD’s DDMS metadata standard, semantic web standards like RDF and OWL, and scores of others.

Whole world vendors don’t view XML tags as “something that get in the way of the text” and thus they don’t provide filters for XML files. Nor do they require schema adherence because they know that XML schema compliance, in real life, tends to be more of an aspiration than a reality. So they allow you load and index XML, as is, avoiding the first step’s a doozy problem, and enabling lazy clean-up of XML information.

Whole world vendors don’t try to model XML in tables simple because they have a legacy tabular data model. Instead, their native modeling element (NME) is the XML document. That is:

  • In a hierarchical DBMS the NME is the hierarchy
  • In a network DBMS the NME is the graph
  • In a relational DBMS the NME is the table
  • In an object DBMS the NME is the object class hierarchy
  • In an OLAP, or multi-dimensional, DBMS the NME is the hypercube
  • And in an XML server, or native XML, DBMS the NME is the XML document

Whole world vendors don’t bolt a search engine to a DBMS because they know XML is often document-centric, making search an integral function, and requiring a fundamentally hybrid search/database — as opposed to a bolted-together search/database — approach.

Here is what it looks like when you encounter a whole world vendor:

Reblog this post [with Zemanta]