Category Archives: information infrastructure

Yes, Virginia, MarkLogic is a NoSQL System

The other day I noticed a taxonomy used on one of the NoSQL Database blogs that went like this:

Types of NoSQL systems

  • Core NoSQL Systems
    • Wide column stores
    • Document stores
    • Key-value / tuple stores
    • Eventually consistent key-value stores
    • Graph databases
  • Soft NoSQL Systems (not the original intention …)
    • Object databases
    • Grid database solutions
    • XML databases
    • Other NoSQL-related databases

I, perhaps obviously, take some umbrage at having MarkLogic (acceptably classified as an XML database) being declared “soft NoSQL.”  In this post I’ll explain why.

Who decided that being open source was a requirement to be a real NoSQL system?  More importantly, who gets to decide?  NoSQL, like the Tea Party, is a grass-roots, effectively leaderless movement towards relational database alternatives.  Anyone arguing original intent of the founders is misguided because there is no small group of clearly identified founders to ask.  In reality, all you can correctly argue is what you think was the intent of the initial NoSQL developers and early adopters, or (perhaps more customarily) why you were drawn to them yourself, disguised or confused as original founder intent.

As mentioned here, movements often appear homogeneous when they are indeed heterogeneous.  What looks like a long line of demonstrators protesting a single cause is in fact a rugby scrum of different groups pushing in only generally aligned directions.  For example, for each of the following potential motivations, I am certain that I can find some set of NoSQL advocates that are motivated by it:

  • Anger at Oracle’s heavy-handed licensing policies
  • The need to store unstructured or semi-structured data that doesn’t fit well into relations
  • The impedance mismatch with relational databases
  • A need and/or desire to use open source
  • An attempt to reduce total cost
  • A desire to land at a different point in the Brewer CAP Theorem triangle of consistency, availability, and partition tolerance
  • Coolness / wannabe-ism, as in, I want to be like Google or Facebook

(Since this was a source of confusion in prior posts, note that this is not to claim the inverse:  that all NoSQL advocates are motivated by all of the possible motivations.)

I’d like to advocate a simple idea:  that NoSQL means NoSQL.  That a NoSQL system is defined as:

A structured storage system that is not based on relational database technology and does not use SQL as its primary query language

In short, my proposed definition means that NoSQL (broadly) = NoSQL (literally) + NoRelational.  Put differently:  relational database alternatives.  It does not mean:

  • NoDBMS.  We should not take NoSQL to exclude systems we would traditionally define as DBMSs.  For example, supporting ACID transactions or supporting a non-SQL query language (e.g., XQuery) should not be exclusion criteria for NoSQL.
  • NoCommercialSoftware.  While many of the flagship NoSQL projects (e.g., Hadoop, CouchDB) are open source projects, that should not be a defining criterion.  NoSQL should be a technological, not a delivery- or business-model, classification.  Technology and delivery model are orthogonal dimensions.   We should be able to speak of traditionally licensed, open source licensed, and cloud-hosted NoSQL systems if for no other reason than understanding the nuances of the various business/delivery models is a major task unto itself.  Do you mean open source or open core?  Is it open source or faux-pen source?  Under which open source license?  How should I think of a hosted subscription service that is based on or a derivative of an open source project?

Recently, I’ve heard a piece of backpedaling that I’ve found rather irritating:  that NoSQL was never intended to mean “no SQL,” it was actually intended to mean “not only SQL.”  Frankly, this strikes me as hogwash:  uh oh, I’m afraid that people are seeing us as disruptors and it’s probably easier to penetrate the enterprise as complementary, not competitive, so let’s turn what was a direct assault into a flanking attack.

To me, it’s simple:  NoSQL means NoSQL.  No SQL query language and no relational database management system.  Yes, it’s disruptive and — by some measures — “crazy talk” but no, we shouldn’t hide because there are lots of perfectly valid (and now socially acceptable) reasons to want to differ from the relational status quo.

In effect, my definition of NoSQL is relational database alternative.  Such options include both alternative databases (e.g., MarkLogic) and database alternatives (e.g., key/value stores).  This, of course, then cuts at your definition of database management system where I (for now at least) still require the support of a query language and the option to have ACID transactions.

By the way, I understand the desire to exclude various bandwagon-jumpers from the NoSQL cause.  Like most, I have no interest in including thrice-reborn object databases in the discussion, but if the cost of excluding them is excluding systems like MarkLogic then I think that cost is too high.  Many people contemplating the top-of-mind NoSQL systems (e.g., Hadoop) could be better served using MarkLogic which addresses many typical NoSQL concerns, including:

  • Vast scale
  • High performance
  • Highly parallel shared-nothing clusters
  • Support for unstructured and semi-structured data

All of this comes with the pros (and cons) of being a commercial software package and without requiring reduced consistency:  losing a few Tweets won’t kill Twitter, but losing a few articles, records, or individuals might well kill a patient, bank, or counter-terrorism agency.  BASE is fine for some; many others still need ACID.  Michael Stonebraker has some further points on this idea in this CACM post.

I’d like to suggest that we should combine the ideas in this post with the ideas in my prior one, Classifying Database Management Systems.  That post says the correct way to classify DBMSs is by their native modeling element (e.g., table, class, hypercube).  This post says that NoSQL is semi-orthogonal – i.e., I can imagine a table-oriented database that doesn’t use SQL as its query language, but I doubt that any exist.  Applying my various rules, the combined posts say that:

  • Aster is a SQL database optimized for analytics on big data
  • MarkLogic is an XML [document] database optimized for large quantities of semi-structured information and a NoSQL system
  • CouchDB is a document database and a NoSQL system
  • Redis is a key/value store and a NoSQL system
  • VoltDB is a SQL database optimized to solve one of the two core problems that NoSQL systems are built for (i.e., high-volume simple processing)
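
To make the combined rules concrete, here is a minimal Python sketch of the classification above. It is only an illustration of the proposed definition (not relational, and SQL is not the primary query language) applied alongside the native modeling element. The `System` fields and the attribute values assigned to each product are simplified assumptions for the example, not vendor specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    modeling_element: str  # native modeling element, e.g. "table" or "document"
    relational: bool       # based on relational database technology?
    primary_query: str     # primary query interface (illustrative label only)


def is_nosql(system: System) -> bool:
    """NoSQL (broadly) = NoSQL (literally) + NoRelational."""
    return not system.relational and system.primary_query.upper() != "SQL"


def describe(system: System) -> str:
    label = "NoSQL system" if is_nosql(system) else "SQL database"
    return f"{system.name}: modeling element = {system.modeling_element}; {label}"


# Attribute values below are assumptions made for illustration.
SYSTEMS = [
    System("Aster", "table", True, "SQL"),
    System("MarkLogic", "XML document", False, "XQuery"),
    System("CouchDB", "document", False, "MapReduce views"),
    System("Redis", "key/value", False, "commands"),
    System("VoltDB", "table", True, "SQL"),
]

for s in SYSTEMS:
    print(describe(s))
```

Note that the two questions stay separate in the sketch: `modeling_element` answers "what kind of database is it?" while `is_nosql` answers "is it a relational alternative?", which is the semi-orthogonality described above.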

Finally, I’d conclude that even with these rules I have trouble classifying MarkLogic because of multiple inheritance:  MarkLogic is both a document database and an XML database, it is difficult to pick one over the other, and there certainly are non-document-oriented XML database systems.   Similar issues exist with classifying the various hybrids of document databases and key/value stores.  So while I may have more work to do on building an overall taxonomy, I am absolutely sure about one thing:  MarkLogic is a NoSQL system.


* The “Yes, Virginia” phrase comes from an 1897 editorial in the New York Sun.  For more, see here.

The Oracle of Unstructured Information: A Three-Horse Race

I’m often asked about my business vision for Mark Logic. My answer is simple: I want to be the Oracle of unstructured information.

When I say this, I mean more the Oracle of 1995 than the one of 2009 — i.e., the one focused on DBMS software and applications, not the CA-like financial consolidator of today. I want to be the clear leader in providing DBMS software for the development of applications and business intelligence systems.

  • In Oracle’s case that was relational DBMS software for the development of data applications.
  • In Mark Logic’s case that means XML DBMS software for the development of information applications.

By information, I mean the full information continuum, from structured data to semi-structured data to unstructured data. (It’s a concept that Gartner’s Rita Knox writes about for those interested in further information.)

In many ways, it short sells MarkLogic to call it an XML DBMS. That is why we position it instead as an XML Server. While the one-word difference may seem nuanced, there is an important underlying distinction because MarkLogic Server is really three things rolled into one:

  • A database management system
  • A search engine
  • An application server

So, in one sense, MarkLogic Server is more than a DBMS. But, then again, that’s exactly what you’d expect from a next-generation DBMS — to be both different-from and more-than the prior generation. MarkLogic doesn’t just change the database software market, it changes the way you build database-driven, information-centric web applications.

As such, it’s actually a new piece of information infrastructure software on top of which our customers build a wide range of systems.

Now, I don’t think Mark Logic is the only company that aspires to become the Oracle of unstructured information. I think it’s a three-horse race between Mark Logic, Autonomy, and Endeca. Since I talk a lot about Mark Logic in this blog, let me instead take a moment to discuss the other participants.

The Financial Consolidator: Autonomy
Autonomy started out as an enterprise search company with a semantic and probabilistic flair, a firm that emphasized complex algorithms and theories to help improve search results. After some early success, that strategy resulted in the company getting stuck in the $50M to $60M revenue range for 5 years (2000-2004). To break out of those doldrums, Autonomy made the bold and risky acquisition of Verity, a struggling search rival 2.2x its size, in November, 2005. They then acquired email archiving provider Zantaz for $375M in July, 2007 and web content management system vendor Interwoven for $775M in January, 2009.

My perception is that Autonomy is a well-run M&A shop. The reported financials are excellent and the operating margins are strong at 35%. Personally I’d feel significantly more reassured if Autonomy reported under US GAAP, as most leading software companies do, as opposed to the seemingly looser IFRS rules under which fellow European search provider, Fast Search and Transfer, seemed to get itself into quite some trouble. I was at the France-based Business Objects for 9 years and we reported under both US GAAP and French rules the entire time, so I know first-hand that it can be done.

Autonomy’s challenge is that organic growth typically wanes in such M&A-driven ventures and the company, often despite intentions to the contrary, becomes addicted to inorganic growth. Note that when company A buys company B, for the following four quarters company A benefits from an inorganic growth tailwind: its growth is calculated as (A+B)/A, that is, combined revenue over a prior-year base that contains only A. By the way, this analysis suggests that Autonomy is due for another acquisition since it’s been a bit over a year since the Interwoven announcement.
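
As a rough illustration of that tailwind, the short Python sketch below compares organic growth with the growth a company would report when the current period includes an acquired company B but the comparison period contains only A. The revenue figures are hypothetical round numbers chosen for the arithmetic, not Autonomy’s actual results, and the function names are made up for this example.

```python
def organic_growth(a_prior: float, a_now: float) -> float:
    """Growth of company A on its own: A_now / A_prior - 1."""
    return a_now / a_prior - 1


def reported_growth(a_prior: float, a_now: float, b_now: float) -> float:
    """Growth as reported after the acquisition: (A + B) / A - 1.

    The numerator includes acquired company B; the prior-period base does not.
    """
    return (a_now + b_now) / a_prior - 1


# Hypothetical quarterly revenue in $M: A is flat, B adds 50.
a_prior, a_now, b_now = 100.0, 100.0, 50.0

print(f"organic growth:  {organic_growth(a_prior, a_now):.0%}")
print(f"reported growth: {reported_growth(a_prior, a_now, b_now):.0%}")
```

With these assumed numbers, a business with no organic growth at all still reports 50% growth. Once the four quarters pass, B’s revenue is in both periods and the tailwind disappears unless another acquisition follows.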

Technologically, acquisition-based strategies create hodgepodges, not platforms. While Autonomy is doing well size-wise in the race, my prediction is that the acquisition strategy will continue and result in a de facto shift away from technology and information platforms and towards continued acquisitions, the financial engineering that accompanies them, and the major task of integrating them.

A firm can talk technology all day long, but if its revenues and growth are coming from acquisitions then it is a financial play, not a technology one. We are, after all, what we do, not what we say, nor what we aspire to do.

The Drifter: Endeca
Endeca started out as a search and discovery company with an original problem statement: how can we make it easier to find things on eBay? This led to an early focus on combining both textual and data constraints (e.g., Frank Sinatra memorabilia costing less than $25) as well as the notion of multi-dimensional (or, faceted) navigation for the iterative refinement of queries.

This naturally led to a strong position in e-commerce search, which was a key company focus in the early and mid 2000s. Many online retailers use Endeca to power their e-commerce sites, including one of my personal favorites, K&L Wines, a local wine shop that I have been known to frequent.

Wine, in fact, provides a great demo for this technology because you have plenty of facets (e.g., price, country, region, grape, points) and plenty of the kind of ambiguity that search engines handle well through taxonomy and thesaurus, but that usually stumps databases. For example, searching for “claret” on the K&L site properly returns a collection of Bordeaux wines in which the word “claret” never appears in the name or description, despite the fact that claret is not part of the official French Bordeaux classification. Someone who knows wine has configured the system to understand that some people generally refer to Bordeaux as claret, and thus the system has processed the query appropriately. Wine also provides other merchandising opportunities such as up-sell, cross-sell, and recommendations.
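
For readers who have not seen faceted search up close, here is a minimal Python sketch of the two ideas in play: a thesaurus entry that maps a query term such as “claret” to a facet constraint, and facet counts that drive iterative refinement. This is not how Endeca or K&L actually implement it (I have no visibility into that); the catalog, the `THESAURUS` mapping, and the function names are all invented for illustration.

```python
from collections import Counter

# Hypothetical catalog; names, regions, and prices are made up.
CATALOG = [
    {"name": "Chateau Example Rouge", "region": "Bordeaux", "country": "France", "price": 22},
    {"name": "Domaine Sample Reserve", "region": "Bordeaux", "country": "France", "price": 48},
    {"name": "Valley Test Cabernet", "region": "Napa", "country": "USA", "price": 35},
]

# Thesaurus: query term -> (facet, value) it should be interpreted as.
THESAURUS = {"claret": ("region", "Bordeaux")}


def search(term, max_price=None):
    """Combine a text term (expanded via the thesaurus) with a price constraint."""
    term = term.lower()
    facet = THESAURUS.get(term)
    hits = []
    for wine in CATALOG:
        text_match = term in wine["name"].lower()
        synonym_match = facet is not None and wine[facet[0]] == facet[1]
        if not (text_match or synonym_match):
            continue
        if max_price is not None and wine["price"] >= max_price:
            continue  # keep only wines costing less than max_price
        hits.append(wine)
    return hits


def facet_counts(hits, field):
    """Count results per facet value, to offer as the next refinement step."""
    return Counter(wine[field] for wine in hits)


results = search("claret")
print([w["name"] for w in results])
print(facet_counts(results, "country"))
print([w["name"] for w in search("claret", max_price=25)])
```

In this toy catalog the word “claret” appears in no wine name, yet the query still returns the two Bordeaux wines because the thesaurus translates it into a region constraint. Adding the price constraint then narrows the result to one, which is the iterative refinement that faceted navigation is meant to support.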

In a prior revision of this post, I referred to Endeca as “the bolter,” but I decided to change that characterization because I’m not sufficiently sure how Endeca’s MDEX engine looks on the inside. I’ve always believed it’s a layer on top of an RDBMS and a search engine that bolts them together, but I’m not sure enough to bet my whole characterization on it.

What’s more, I think “drifter” better characterizes the company because I find them — like many before them — to fall into the grass-is-greener trap of getting bored with their core market (i.e., e-commerce) and instead venturing off in many new and different directions.

My prime example for this particular strategic malady is Informatica. When we partnered with them in the mid 1990s at Business Objects, Informatica was focused on data integration. Somewhere along the way, they got bored with data integration and decided to move into the “analytic applications” market, certainly egged on by the industry analysts of the day. One day ETL was all over their website; the next day it was gone.

In terms of marketing execution, the overnight transition was stunning. In terms of strategy, it was a disaster. It turns out that Informatica’s customers liked ETL. They didn’t know what analytic applications were. And if they did know and did want one, they could call either their BI or ERP vendor, both of which were viewed as a more logical supplier.

The other problem with Informatica’s analytic applications charge of the light brigade was that it alienated their partners. After years of successful partnership, Business Objects was forced to work more closely with Ascential and we later acquired Acta in our own ETL counter-strike. All of this because Informatica got bored with its market. (Sometimes I wonder if this phenomenon isn’t better labeled board-dom than boredom.)

If you’ve not heard, Informatica is doing quite well these days; the stock’s almost doubled in the past year with, guess what, a data integration focus. The strategy that was “too limiting” as a $150M company is working just fine as a $500M one.

I tell the story of Informatica because I feel like Endeca’s repeating it, though perhaps not executing as crisply on their way. Endeca’s message remains a bit of a hodgepodge. Here they’re a database. Here, they’re an information access platform. Here, they’re an enterprise search engine. And I believe the company’s go-to-market strategy is no more focused. After a strong historical focus in e-commerce, Endeca launched a big manufacturing push, and they continue to dabble in government and in media/publishing, Mark Logic’s two core markets.

The result: while I believe Endeca is approximately three times Mark Logic in overall size, I believe that Mark Logic is three times Endeca’s size in both publishing and government. That is the power of focus.

The moral is that in your hurry to get beyond e-commerce, you can end up everywhere and nowhere at the same time.

That said, I believe a lot of smart people still work at Endeca and I believe there is a strong and persistent tradition around the concept of uniting structured and unstructured information in an application platform. Bolted search/database or not, I expect Endeca to rediscover focus going forward and to re-emerge as a strong participant in the three-horse race.

Whither Oracle?
One might say it’s presumptuous to say that Oracle won’t be the Oracle of unstructured information, but I think it’s not. Since this post is long enough already, I won’t attempt to make a complete argument here. Instead, I’ll tee up a few points to show my line of reasoning and, hopefully, return to this topic in a future post.

  • Oracle is a financial consolidator
  • The NoSQL movement is as much about the RDBMS oligopoly and associated vendor pricing and practices as it is about technology
  • In my opinion, the Innovator’s Dilemma paradox fully applies to Oracle in this instance

So, who’s in the three-horse race? Mark Logic, Autonomy, and Endeca. May the market-focused, technology disruptor win!