The Information Continuum and the Three Types of Subtly Semi-Structured Information

We generally refer to MarkLogic Server as an XML server, which is a special-purpose database management system (DBMS) for unstructured information.  This often sparks debate about the term “unstructured” and the information continuum in general.  Surprisingly, while both analysts and vendors frequently discuss the concept, the Wikipedia entry for information continuum is weak, and I couldn’t easily find a nice picture of it, so I decided to make my own.

The general idea that information spans a continuum with regard to structure is pretty much undisputed.  The placement of any given type of information on that continuum is more problematic.  While it seems clear that purchase orders are highly structured and that free text is not, the placement of, for example, email is more interesting.  Some might argue that email is unstructured.  In fact, only the body of an email is unstructured and there is plenty of metadata (e.g., from, send-to, date, subject) wrapping an email.  In addition, an email’s body actually does have latent structure — while it may not be explicit, you typically have a salutation followed by numerous paragraphs of text, a sign-off, a signature, and perhaps a legal footer.  Email is unquestionably semi-structured.

In fact, I believe that the vast majority of information is semi-structured.  PowerPoint decks have slides, slides have titles and bullets.  Contracts are typically word documents, but have more-or-less standard sections.  Proposals are usually Word or PowerPoint documents that tend to have similar structures.  Even the humble tweet is semi-structured:  while the contents are ostensibly 140 unstructured characters, the anatomy of a tweet reveals lots of metadata (e.g., location) and even the contents contain some structural information (e.g,. RT indicating re-tweet or #hashtags serving as topical metadata).
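
To make the tweet example concrete, here is a minimal Python sketch; the tweet text is invented, but the RT convention and #hashtags are exactly the structural signals described above:

```python
import re

# A hypothetical tweet: ostensibly free text, yet full of latent structure.
tweet = "RT @alice: Loving the new release! #database #nosql"

is_retweet = tweet.startswith("RT ")     # the re-tweet convention
mentions = re.findall(r"@(\w+)", tweet)  # addressing metadata
hashtags = re.findall(r"#(\w+)", tweet)  # topical metadata

print(is_retweet, mentions, hashtags)  # True ['alice'] ['database', 'nosql']
```

A few lines of pattern matching recover quite a lot of structure from those "unstructured" 140 characters.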

Now let’s consider XML content.  Some would argue that XML is definitionally structured.  But I’d say that an arbitrary set of documents all stored within <document> and </document> tags is only faux structured; it appears structured because it’s XML, but the XML is just used as a container.  A corpus of twenty 2,000-page medical textbooks in 6 different schemas is indeed structured, but not consistently so.  To paraphrase an old saw about standards:  the nice thing about structures is that there are so many to choose from.  I believe that knowing content is marked up in XML reveals nothing about its structure, i.e., that XML-ness and structure are orthogonal.  Put differently, XML is simply a means of representing information.  The information represented may be highly structured (e.g., 100 purchase orders all in perfect adherence to a given schema) or highly unstructured (e.g., 20 documents only vaguely complying with 20 different schemas).
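
A small sketch of that orthogonality, using Python’s standard XML parser on two invented documents: one using XML as a mere container, one genuinely structured:

```python
import xml.etree.ElementTree as ET

# Invented example 1: XML used as a mere container ("faux structured").
faux = "<document>PO 1234: 10 widgets, ship to 5 Main St.</document>"

# Invented example 2: the same information, genuinely structured.
structured = """
<purchase-order id="1234">
  <item sku="W-1" qty="10">widgets</item>
  <ship-to>5 Main St.</ship-to>
</purchase-order>
"""

faux_root = ET.fromstring(faux)
po_root = ET.fromstring(structured)

# The faux document exposes nothing to query: no child elements at all.
print(len(list(faux_root)))             # 0
# The structured one exposes addressable fields.
print(po_root.find("item").get("qty"))  # 10
```

Both parse as perfectly valid XML; only the second actually carries structure you can work with.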

I have two primary beliefs about the information continuum:

  • The vast majority of information is semi-structured. There is relatively little highly structured and relatively little completely unstructured information out there.  Most information lies somewhere in the fat middle.  I overlaid a bell curve on top of the information continuum to reflect volume.
  • Even information that initially appears structured is often semi-structured.  I see three types of this subtly semi-structured information which, hopefully without being too cute, I’ll abbreviate as SSSI.  The three types are (1) schema as aspiration, (2)  time-varying schema, and (3) unknowable schema.

Let’s look at each of the three types more closely.

Schema as Aspiration

The first type of subtly semi-structured information (SSSI) is where a schema exists, but only notionally.  The schema itself is either poorly defined (actual quote:  “it is believed that this element is used for”) or well defined but not followed.  This is frequently the case with publishing and media companies.  Here are two free jokes that work well at any publishing conference:

  • Raise your hand if you have a standard schema.  Keep it up if your content actually adheres to it.
  • Oxymorons aside, how many of you have 3 or more “standard” schemas, 5 or more, … do  I hear 10?

These jokes are funny because of the state of the content.  This state is the result of two primary business trends:  (1) consolidation — most large publishers have been built through M&A thus inheriting numerous different standards, each of which may be only partly implemented — and (2) licensing — publishers frequently license content from numerous other sources, each with its own standard format.

Time-Varying Schema

The second case of SSSI is where you have a well-defined, enforced schema at any moment in time, but it keeps changing over time.  Typically this happens for one of two reasons:

  • The business reality that you’re modeling is changing.  For example, in 2009 Federal Sales was part of Eastern Sales but in 2010 it becomes its own division.  This makes comparison of Eastern results between 2009 and 2010 potentially difficult.  In BI circles, this is known as the slowly changing dimension problem.
  • Standards keep changing.  If you’re modeling information in a corporate- or industry-standard schema and that schema is changing, then your information becomes semi-structured because it is contained within multiple different schemas.  Sometimes you can avoid this by migrating all prior information to the current schema, but sometimes (e.g., massive data volumes, regulatory desire to not change existing records) you will not.

When viewed with a flash camera this information looks well structured.  When you look at the movie, you can clearly see that it’s not.
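
A tiny Python sketch of the Federal/Eastern example (the figures are invented): any query that compares Eastern results across years has to know which schema each record was captured under:

```python
# Invented sales records captured under two successive schemas:
# in 2009 Federal rolled up under Eastern; in 2010 it became its own division.
records = [
    {"schema": "2009", "region": "Eastern", "sub_region": "Federal",    "sales": 10},
    {"schema": "2009", "region": "Eastern", "sub_region": "Commercial", "sales": 20},
    {"schema": "2010", "region": "Federal", "sales": 12},
    {"schema": "2010", "region": "Eastern", "sales": 21},
]

def eastern_sales(year):
    # Any cross-year comparison of "Eastern" must be schema-aware.
    return sum(r["sales"] for r in records
               if r["schema"] == year and r["region"] == "Eastern")

print(eastern_sales("2009"))  # 30 (includes Federal)
print(eastern_sales("2010"))  # 21 (Federal now excluded)
```

The naive query silently compares two different things; only schema-aware logic makes the numbers comparable.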

Unknowable Schema

The last case of SSSI is where you have an unknowable schema.  Consider terrorist tracking.  If you were to make a schema for a terrorist database, here are some of the attributes that spring to mind:  name, alias(es), address, former address(es), height, weight, hair color, eye color, member-of, enemy-of, friend-of, tattoos/markings.

Here are some problems with this:

  • Many of the attributes are multi-valued, such as alias or friend-of.  In a de-normalized approach, this means dealing with repeating group problems and creating N columns (e.g., alias, alias1, alias2, and up to the maximum number of aliases for any terrorist).  Normalization would take care of the repeating group but at the cost of creating a table for each multi-valued attribute and then having to join back to those tables when you run queries.  (One such real system ended up with 500 tables, with the result that no one could find anything.)
  • It is difficult to create a type for the tattoo attribute.  First, it’s multi-valued.  Second, while tattoos are sometimes images, they often contain text (e.g., Mom) and sometimes in a foreign language (e.g., 愛, the Chinese symbol for love).  Since you’re trying to secure the nation against threats you don’t want to throw away any potentially valuable information, but it’s not obvious how to store this.
  • New attributes are coming all the time.  Say you get a shoe print on a suspect as he runs away.  You need to add a shoe-size attribute to the database.  Say a terrorist runs away and leaves a pair of eyeglasses.  Now we need to add eyeglass prescription.  My favorite is what’s called pocket litter.  You find a piece of paper in a person’s pocket and it has a number on it.  It could be a phone number, a lock combination, or maybe map coordinates.  You don’t know what it is — but again, since you don’t want to throw away any potentially valuable information — you have to find a place to store it.
  • Combining an enormous number of potential attributes with the reality that very few are known for most individuals creates two problems:  (1) you end up with a sparse table which is not well handled in most RDBMSs and (2) you end up hitting column limits.
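
The normalization trade-off in the first bullet can be sketched with an in-memory SQLite database (the names and aliases are invented); each multi-valued attribute becomes its own table that every query must join back to:

```python
import sqlite3

# Normalizing one multi-valued attribute (alias) into its own table.
# Multiply this by friend-of, former addresses, tattoos, and so on, and you
# can see how a real system ends up with hundreds of tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE alias  (person_id INTEGER, alias TEXT);
    INSERT INTO person VALUES (1, 'John Doe');
    INSERT INTO alias  VALUES (1, 'JD'), (1, 'Johnny');
""")

# Every query touching aliases now needs a join back to the alias table.
rows = conn.execute("""
    SELECT p.name, a.alias
    FROM person p JOIN alias a ON a.person_id = p.id
    ORDER BY a.alias
""").fetchall()
print(rows)  # [('John Doe', 'JD'), ('John Doe', 'Johnny')]
```

One attribute, one extra table, one extra join; with dozens of multi-valued attributes the schema and the queries both explode.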

Another example of unknowable schemas would be in financial services, modeling derivatives.   Because derivatives are sometimes long-lived instruments (e.g., 30 years) you may face the time-varying schema problem.  In addition, you have the unknowable schema problem because the industry is constantly creating new products.  First we had CDOs and CDSs on banks, then single-tranche CDOs, then CDSs on single-tranche CDOs, and then synthetic CDOs.  If this makes your head hurt in terms of understanding, then think for a minute about data modeling.  How are you going to store these complex products in a database?   And what are you going to do with the never-ending stream of new ones — last I heard they were considering selling derivatives on movies.

(As it turns out XML is a great way to model both these problems as you can easily add new attributes on the fly and only provide values for attributes where you know them.)
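
A minimal illustration of that flexibility, again with invented subjects: each document carries only the attributes you actually know, and a brand-new attribute (shoe-size) appears without any schema migration:

```python
import xml.etree.ElementTree as ET

# Invented subjects: each document records only the attributes actually
# known, and a brand-new attribute (shoe-size) needs no schema change.
docs = [
    "<subject><name>A</name><alias>X</alias><alias>Y</alias></subject>",
    "<subject><name>B</name><shoe-size>11</shoe-size></subject>",
]

for doc in docs:
    root = ET.fromstring(doc)
    print([(child.tag, child.text) for child in root])
```

Multi-valued attributes (two aliases) and sparse attributes (no alias at all for subject B) both fall out naturally, with no repeating groups, no NULL-filled columns, and no column limits.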

To finish the post, I’ll revisit the statement I started with:  we generally refer to MarkLogic Server as an XML server, a special-purpose database management system (DBMS) for unstructured information.  Going forward, I think I’ll keep saying that because it’s simpler, but at the MarkLogic 201 level, the more precise statement is:  a special-purpose DBMS for semi-structured information.

There’s way more semi-structured information out there.  Realizing that information is semi-structured is sometimes subtle.  And semi-structured information is, in fact, the optimization point for our product.  So what’s MarkLogic in three concepts?  Speed, scale, and semi-structured information.

Thoughts on the Qlik Technologies (QlikTech) IPO

I spent an hour or so browsing the QlikTech S-1 and thought I’d share some observations.  (See here for my prior post on the company.)

  • The company has achieved good scale (2009 revenues of $157M) but growth has been decelerating from 82% in 2007 to 47% in 2008 to 33% in 2009.
  • Gross margins are high at 89% due largely to normal margins on license (96%), unusually  high margins on support (96%), normal margins on consulting (27%), and a fairly small consulting business (10% of total revenues) which reduces the pull-down effect on the weighted average.  Wall Street will like this.
  • Sales and marketing expense is high at 59% of sales.  Provided switching costs are high, you can argue this is a good investment, and provided growth is high, you can justify it.   I’m going to assume they make some “lost year” arguments about 2009 in their story and will guide to re-accelerated growth, but I’m not sure.  If not, then they will get pressure about the inefficiency of their sales model.
  • R&D is spectacularly low at 6% of sales.  There is an argument that if you have a largely completed (cheap and cheerful) BI tool, you should simply go sell the heck out of it and not artificially spend money in R&D when you have neither the vision nor the immediate need to either create new products or invest big money in enhancing your existing one.  I’ve just seen few companies try to make it.  I suspect Wall St. will pressure them to increase this number, regardless of whether it’s the strategically right thing to do for the company.
  • Expanded customer base from 1,500 customers in 2005 to over 13,000 in 2009.
  • I like their argument that because it’s easier to use than traditional BI tools, it should get greater penetration than the average 28% of potential BI users cited by IDC.
  • The unique business model (free downloads and 30-day guarantee post purchase) is consistent with the cheap and cheerful product positioning, which is good.  It does beg the question why sales cost so much, however, if you’re primarily upselling downloaders in a low-commitment fashion.
  • I think the claim “analysis tools are not designed for business users” is over-stated.  I can assure you that at BusinessObjects we were designing products for business users.
  • I dislike the small piece of huge pie argument, but I suppose that particular fallacy is so embedded in human nature that it will never go away.  I’d rather hear that QlikTech thinks its 2010 potential market is $400M and it wants 50% than hear – as it says in the prospectus — that they think it’s $8.6B and they presumably want somewhere around 2%.
  • They expect 63M shares outstanding after the offering, implying that if they want a $10-$15 share price that they think the company can justify a market cap in the $750M to $1B range.  If it were generating more than a 4% return on sales and growing faster than 33% that would be easier to assume.
  • 50% of 2009 license and FYM came from indirect channels.  This again begs the question why sales cost so much; indirect channels are, in theory, more cost-effective than direct.
  • They had 124 “dedicated direct sales professionals” as of 12/31/09, which suggests to me that at an average productivity of $1.8M (including all ramping and turnover effects) they could do $223M in revenues in 2010, or growth in the 40% range.  So they seem well teed-up from a sales hiring perspective.
  • If my US readers are wondering why you’ve not heard of them, it’s because they were originally founded in Sweden and do 77% of revenues “internationally” (which now means outside the US given that they moved their headquarters in 2004).   This relative lack of US presence should presumably hurt the stock.
  • They have a pretty traditional enterprise software business model:  perpetual license and maintenance.  They even state potential demand for SaaS BI as a risk factor.
  • They had $35M in deferred revenue on the balance sheet as of 12/31/09.  This strikes me as high; some quick back-of-the-envelope calculations led me to expect ~$25M if it was all the undelivered portion of pre-paid, single-year maintenance contracts.
  • Per IDC, 44% of QlikView customers deploy within a month and 77% deploy within three months.  It sounds impressive and is consistent with the small consulting business.  But it also depends on the definition of deploy.
  • This is no overnight success story; the company was founded in Sweden in 1993.  There was a six-year product development phase (which perhaps explains the low R&D today) from 1993 to 1999.  From 1999 to 2004 they sold almost exclusively in Europe.  From 2004, they added USA sales and relocated the HQ to Pennsylvania.
  • 2009 maintenance renewal rate of 85%
  • They intend to increase R&D expenses both in absolute dollars and as a percent of sales going forward.
  • 73% of revenues are not dollar denominated.  This means that foreign exchange rates should hit them more (both ways) than for a typical software company.
  • This sounds typical:

Our quarterly results reflect seasonality in the sale of our products and services. Historically, a pattern of increased license sales in the fourth quarter has positively impacted sales activity in that period which can make it difficult to achieve sequential revenue growth in the first quarter. Similarly, our gross margins and operating income have been affected by these historical trends because the majority of our expenses are relatively fixed in the near-term.

  • USA revenues grew at 28% in 2009, a bit slower than the company overall.  Fairly surprising, given the late USA start and the presumably huge market opportunity.
  • R&D remains in Lund, Sweden with 54 staff as of 12/31/09.
  • 574 total employees as of 12/31/09 with 148 in the USA and 426 outside.
  • Accel is the biggest shareholder with 26.7% of the stock, pre-offering.
  • The proposed ticker symbol is QLIK
  • My brain started to melt around page 120.  (Somehow the document set I managed to pull down from the SEC site is about 1,000 pages and includes a zillion appendices.  The regular S-1 is here.)

Yes, Virginia, MarkLogic is a NoSQL System

The other day I noticed a taxonomy used on one of the NoSQL Database blogs that went like this:

Types of NoSQL systems

  • Core NoSQL Systems
    • Wide column stores
    • Document stores
    • Key-value / tuple stores
    • Eventually consistent key-value stores
    • Graph databases
  • Soft NoSQL Systems (not the original intention …)
    • Object databases
    • Grid database solutions
    • XML databases
    • Other NoSQL-related databases

I, perhaps obviously, take some umbrage at having MarkLogic (acceptably classified as an XML database) declared “soft NoSQL.”  In this post I’ll explain why.

Who decided that being open source was a requirement to be a real NoSQL system?  More importantly, who gets to decide?  NoSQL – like the Tea Party – is a grass-roots, effectively leaderless movement towards relational database alternatives.  Anyone arguing original intent of the founders is misguided because there is no small group of clearly identified founders to ask.  In reality, all you can correctly argue is what you think was the intent of the initial NoSQL developers and early adopters, or — perhaps more customarily — why you were drawn to them yourself, disguised or confused as original founder intent.

As mentioned here, movements often appear homogeneous when they are indeed heterogeneous.  What looks like a long line of demonstrators protesting a single cause is in fact a rugby scrum of different groups pushing in only generally aligned directions.  For example, for each of the following potential motivations, I am certain that I can find some set of NoSQL advocates that are motivated by it:

  • Anger at Oracle’s heavy-handed licensing policies
  • The need to store unstructured or semi-structured data that doesn’t fit well into relations
  • The impedance mismatch with relational databases
  • A need and/or desire to use open source
  • An attempt to reduce total cost
  • A desire to land at a different point in the Brewer CAP Theorem triangle of consistency, availability, and partition tolerance
  • Coolness / wannabe-ism, as in, I want to be like Google or Facebook

(Since this was a source of confusion in prior posts, note that this is not to claim the inverse:  that all NoSQL advocates are motivated by all of the possible motivations.)

I’d like to advocate a simple idea:  that NoSQL means NoSQL.  That a NoSQL system is defined as:

A structured storage system that is not based on relational database technology and does not use SQL as its primary query language

In short, my proposed definition means that NoSQL (broadly) = NoSQL (literally) + NoRelational; in other words, relational database alternatives.  It does not mean:

  • NoDBMS.  We should not take NoSQL to exclude systems we would traditionally define as DBMSs.  For example, supporting ACID transactions or supporting a non-SQL query language (e.g., XQuery) should not be exclusion criteria for NoSQL.
  • NoCommercialSoftware.  While many of the flagship NoSQL projects (e.g., Hadoop, CouchDB) are open source projects, that should not be a defining criterion.  NoSQL should be a technological, not a delivery- or business-model, classification.  Technology and delivery model are orthogonal dimensions.  We should be able to speak of traditionally licensed, open source licensed, and cloud-hosted NoSQL systems if for no other reason than understanding the nuances of the various business/delivery models is a major task unto itself.  Do you mean open source or open core?  Is it open source or faux-pen source?  Under which open source license?  How should I think of a hosted subscription service that is based on or a derivative of an open source project?

Recently, I’ve heard a piece of backpedaling that I’ve found rather irritating:  that NoSQL was never intended to mean “no SQL,” it was actually intended to mean “not only SQL.”  Frankly, this strikes me as hogwash:  uh oh, I’m afraid that people are seeing us as disruptors and it’s probably easier to penetrate the enterprise as complementary, not competitive, so let’s turn what was a direct assault into a flanking attack.

To me, it’s simple:  NoSQL means NoSQL.  No SQL query language and no relational database management system.  Yes, it’s disruptive and — by some measures — “crazy talk” but no, we shouldn’t hide because there are lots of perfectly valid (and now socially acceptable) reasons to want to differ from the relational status quo.

In effect, my definition of NoSQL is relational database alternative.  Such options include both alternative databases (e.g., MarkLogic) and database alternatives (e.g., key/value stores).  This, of course, then cuts at your definition of database management system where I (for now at least) still require the support of a query language and the option to have ACID transactions.

By the way, I understand the desire to exclude various bandwagon-jumpers from the NoSQL cause.  Like most, I have no interest in including thrice-reborn object databases in the discussion, but if the cost of excluding them is excluding systems like MarkLogic then I think that cost is too high.  Many people contemplating the top-of-mind NoSQL systems (e.g., Hadoop) could be better served using MarkLogic which addresses many typical NoSQL concerns, including:

  • Vast scale
  • High performance
  • Highly parallel shared-nothing clusters
  • Support for unstructured and semi-structured data

All with the pros (and cons) of being a commercial software package and without requiring reduced consistency:  losing a few Tweets won’t kill Twitter, but losing a few articles, records, or individuals might well kill a patient, bank, or counter-terrorism agency.  BASE is fine for some; many others still need ACID.  Michael Stonebraker has some further points on this idea in this CACM post.

I’d like to suggest that we should combine the ideas in this post with the ideas in my prior one, Classifying Database Management Systems.  That post says the correct way to classify DBMSs is by their native modeling element (e.g., table, class, hypercube).  This post says that NoSQL is semi-orthogonal – i.e., I can imagine a table-oriented database that doesn’t use SQL as its query language, but I doubt that any exist.  Applying my various rules, the combined posts say that:

  • Aster is a SQL database optimized for analytics on big data
  • MarkLogic is an XML [document] database optimized for large quantities of semi-structured information and a NoSQL system
  • CouchDB is a document database and a NoSQL system
  • Redis is a key/value store and a NoSQL system
  • VoltDB is a SQL database optimized to solve one of the two core problems that NoSQL systems are built for (i.e., high-volume simple processing)

Finally, I’d conclude that even with these rules I have trouble classifying MarkLogic because of multiple inheritance:  MarkLogic is both a document database and an XML database, it is difficult to pick one over the other, and there certainly are non-document-oriented XML database systems.  Similar issues exist with classifying the various hybrids of document databases and key/value stores.  So while I may have more work to do on building an overall taxonomy, I am absolutely sure about one thing:  MarkLogic is a NoSQL system.


* The “Yes, Virginia” phrase comes from an 1897 editorial in the New York Sun.  For more, see here.

The Fit or Fat Startup

As I sit here at Palantir’s Govcon 5 conference at the lavish Ritz Carlton in Tysons Corner (Virginia), I can’t help but think about the recent “fit or fat” startup debate that hit the blogosphere a few weeks back.  The debate started with a post by VC Ben Horowitz of Andreessen Horowitz entitled The Case for the Fat Startup.  Excerpts:

The [Sequoia RIP Good Times] presentation catalyzed a movement. Startups everywhere adopted a lean, low-burn, low-investment model. To this day, companies seeking funding at our venture firm, Andreessen Horowitz, proudly proclaim in their pitch decks that they are raising tiny amounts of capital so they can run lean.

Here is my central argument. There are only two priorities for a startup:  (1) winning the market and (2) not running out of cash.

Running lean is not an end. For that matter, neither is running fat. Both are tactics that you use to win the market and not run out of cash before you do so. By making “running lean” an end, you may lose your opportunity to win the market, either because you fail to fund the R&D necessary to find product/market fit or you let a competitor out-execute you in taking the market. Sometimes running fat is the right thing to do.

The part of his argument with which I agree is the “sometimes.”  The simple fact is that strategy must be a function of situation and there are indeed some situations (e.g., landgrabs) where the run-fat model is required.  By landgrabs, I mean the early days of new, destined-to-be-large markets with sufficient switching costs so as to realistically justify losing lots of money in the quest to establish market leadership.  Examples include Amazon in online retail and PayPal (which raised $194M) in online payments.  Remember, these strategies do not always end happily:  WebVan consumed $1.2B in venture capital before it went bankrupt in 2001.  Hence my two key criteria:

  • A destined-to-be-large market (WebVan missed here, the market for web-ordered groceries today is still non-existent)
  • Sufficient switching costs to justify the years of  major losses (many online retailers who “sold dollars for ninety cents” were surprised to find their customers disappeared when they tried to sell them for $1.05)

When you “go big or go home,” sometimes you go home.  I’d argue that the media biases us by looking primarily at successes, not failures, artificially reducing the perceived risk in such strategies.  It’s a bit like saying inner-city youth can escape the inner city through athletic scholarships.  Yes, it does happen.  And yes those athletes sometimes become rich and famous.  But simply because it sometimes works, you cannot argue it’s a good strategy.

I was going to use Oracle as example because they played the landgrab game superbly in the early days of the RDBMS market.  But I think they only raised $10M or so before their IPO in 1986. (Vent:  I just wasted 30 minutes trying to find a precise answer).  So unlike the go-big VC burners, Oracle largely self-funded its ten-year journey to $50M.  My prior employer, Business Objects, raised a total of less than $5M in VC.

The debate picked up steam when fellow VC Fred Wilson of Union Square Ventures responded with a post entitled Being Fat Is Not Healthy.  Excerpt:

In short, since I started investing in the web in ’93/’94, I have invested in about 100 software-based web companies. And the success rate of fat companies versus lean companies is stark. I have never, not once, been successful with an investment in a company that raised a boatload of money before it found traction and product market fit with its primary product.

Boatload is a subjective term. So is traction. So is product market fit. And so is successful. So let me try to define them in the way that I think about them. A boatload of cash is more than $20mm of invested capital. A boatload of cash is monthly burn rates of tens of millions of dollars. Traction and product market fit are customers or users buying or using your product in droves. It is the realization that you’ve found the sweet spot of the market you were going for. And successful is an investment that pays out multiples of the dollars we invested in it. Getting our money back is not successful in my book. Getting three times our money back is good. More than that is great.

Let me say it again. I have never been involved in a successful software-based web service that raised and spent boatloads of money before it found it’s sweet spot. But it has happened. The Loudcloud story that Ben lived and tells in the All Things D post is proof that it can happen.

You can also win the lottery. The odds aren’t great that you will. But millions of people play it every day. I don’t.

Basically, I agree with Fred, with the sole exception of those Amazon- and PayPal-like landgrabs that really are one-shot opportunities that someone is going to win.  The problem is that entrepreneurs, being rabid and optimistic, assume they are in that 1-in-1000 situation about 95% of the time.

Back to Palantir, I think they’re pretty clearly playing the “fat” strategy.  That’s logical because the founders are from PayPal and are undoubtedly applying some rewind/play logic from those days and should certainly have some survivor bias because — well — it worked last time.   (Try convincing a lottery winner that buying lottery tickets is, on average, a very bad idea.) While they’ve raised $35M to-date, I suspect they’ll be raising another round soon, especially if they are to grow from 250 to 400 employees by December 31st as CEO Alex Karp said this morning.

My issue with Palantir is that I don’t see a landgrab market opportunity, which (see prior points) they most certainly do.  The technology looks like a set of nice data visualization and graph analysis tools; kind of a nice suite of graph-centered BI tools for tracking entities, relationships, events, and documents across collections of unstructured and structured data tapped from various repositories.  While the front-ends are sexy, and most likely easier to use than what they’re replacing, if you think using traditional BI tools is tough, I think these tools are harder.  Search meets BI this ain’t.

Visualization companies have had a checkered history in enterprise software, with the most successful being vertical and application specific (e.g., Spotfire), so I think Palantir’s vertical focus on government is a good one.  They seem also to make an effort in finance, but my gut feel is that they’re 90% government.  The company is good at PR, has some creative and interesting management philosophies (e.g., 210 of the 250 employees are supposedly “engineers”), and has a professorial and clearly very intelligent CEO.

Operationally, I think they’d be an excellent partner for Mark Logic because we specialize in back-end heavy lifting and (whether they’d freely admit it or not) everything I saw today strikes me as front-end and/or data aggregation, as opposed to data management, technology.  I know we have some partners in common and I believe some customers may have integrated the systems.

Could Palantir be BusinessObjects for unstructured data?  I don’t think so — the technology seems too specialized and too hard for the average user; it’s clearly made for analysts.  On the other hand, could they be MicroStrategy?  Maybe.

Either way, they’re one of very few enterprise software startups these days playing it fat.  If I’m right, they’ll be raising another round in the next few quarters, probably at a nice valuation, and basically playing Horowitz’s playbook.  We’ll see.

Classifying Database Management Systems: Regular and NoSQL

Thanks to two major trends — DBMS specialization and the NoSQL movement — the database management systems space is generating more interest and more innovation than at any time I can remember since the 1980s.  Ever since around 1990, when the relational database management system (RDBMS) became firmly established, IT has played DBMS roulette:  spin the wheel and use the DBMS on which the needle lands — Oracle, DB2, or SQL Server.  (If you think this trivializes things, not so fast:  a friend who was the lead DBMS analyst at a major analyst firm once quipped to me that this wheel-spinning was his job, circa 1995.)

Obviously, there was always some rational basis for DBMS selection — IBM shops tended to pick DB2, best-of-breed buyers liked Oracle, performance whizzes and finance types often picked Sybase, and frugal shoppers would choose SQL Server, and later MySQL — but there was no differentiation in the model.  All these choices were relational database management systems.

Over time, our minds became dulled to orthogonal dimensions of database differentiation:

  • The database model.  For years, we lived in the database equivalent of Henry Ford’s Model T:  any model you want, as long as it’s relational.
  • The potential for trade-offs in fundamental database-ness.  We became binary and religious about what it meant to be a database management system, and that attitude blinded us to some fundamental trade-offs that some users might want to make — e.g., trading consistency for scalability, or trading ACID transactions for BASE.

The latter is the domain of Brewer’s CAP theorem, which I will not discuss today.  The former, the database model, will be the subject of this post.

Every DBMS has some native modeling element (NME). For example, in an RDBMS that NME is the relation (or table).  Typically that NME is used to store everything in the DBMS.  For example, in an RDBMS:

  • User data is stored in tables.
  • Indexes are implemented as tables which are joined back to the base tables.
  • Administration information is stored in tables.
  • Security is usually handled through tables and joins.
  • Unusual data types (e.g., XML) are stored in “odd columns” in tables.  (If your only model’s a table, every problem looks like a column.)
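To make the “everything is a table” point concrete, here’s a minimal sketch using SQLite via Python’s standard library (the table and index names are my own, purely illustrative):

```python
import sqlite3

# In an RDBMS, the table is the native modeling element: user data,
# indexes, and even the system catalog are all exposed as tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchase_orders (id INTEGER PRIMARY KEY, vendor TEXT)")
conn.execute("CREATE INDEX idx_vendor ON purchase_orders (vendor)")
conn.execute("INSERT INTO purchase_orders (vendor) VALUES ('Acme')")

# The schema itself lives in a table (sqlite_master), queried with
# ordinary SQL: both the user table and its index appear as catalog rows.
objects = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY name"
).fetchall()
print(objects)  # [('index', 'idx_vendor'), ('table', 'purchase_orders')]
```

The same SELECT machinery that reads user data also reads the catalog, which is the paradigm at work.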

In general, the more naturally the data you’re storing maps to the paradigm (or NME) of the database, the better things will work.  For example, you can model XML documents as tables and store them in an RDBMS, or you can model tables in XML and store them as XML documents, but those approaches will tend to be more difficult to implement and less efficient to process than simply storing tables in an RDBMS and XML documents in an XML server (e.g., MarkLogic).
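As a small illustration of the mismatch, here’s a sketch (Python standard library only; the schema and element names are hypothetical) that “shreds” an XML order into relational rows.  It works, but any new nesting or optional element on the document side forces a schema change on the table side:

```python
import sqlite3
import xml.etree.ElementTree as ET

doc = "<order id='7'><item sku='A1' qty='2'/><item sku='B2' qty='1'/></order>"

# Shredding: flatten the document into relational rows.  Fine while the
# documents are rigid; hierarchy and document order survive only because
# we hand-code them into columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")
root = ET.fromstring(doc)
for item in root.findall("item"):
    conn.execute(
        "INSERT INTO order_items VALUES (?, ?, ?)",
        (int(root.get("id")), item.get("sku"), int(item.get("qty"))),
    )
rows = conn.execute("SELECT * FROM order_items").fetchall()
print(rows)  # [(7, 'A1', 2), (7, 'B2', 1)]
```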

The question is not whether you can model documents as tables or tables as documents.  The answer is almost always yes.  Thus, the better question is should you?  The most famous example of this type of modeling problem is the storage of hierarchical data in an RDBMS.  To quote this article on managing hierarchical data in MySQL:

Most users at one time or another have dealt with hierarchical data in a SQL database and no doubt learned that the management of hierarchical data is not what a relational database is intended for.

(Personally, I blame the failure of Microsoft’s WinFS on this root problem — file systems are inherently hierarchical — but that’s a story for a different day.)
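To show what that workaround looks like in practice, here’s a sketch (SQLite via Python’s standard library; the table and names are illustrative) that stores a small file-system hierarchy as an adjacency list and then recovers a root-to-leaf path with a recursive query, in effect one join per level of depth:

```python
import sqlite3

# The classic workaround: store a hierarchy (here, a file system) as an
# adjacency list, where each row points at its parent row.  Nothing in
# the model itself is tree-shaped; the tree exists only by convention.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [(1, None, "/"), (2, 1, "home"), (3, 2, "alice"), (4, 3, "notes.txt")],
)

# Rebuilding even one path requires a recursive common table expression,
# walking parent pointers from the leaf back up to the root.
path = conn.execute("""
    WITH RECURSIVE ancestors(id, parent_id, name, depth) AS (
        SELECT id, parent_id, name, 0 FROM nodes WHERE id = 4
        UNION ALL
        SELECT n.id, n.parent_id, n.name, a.depth + 1
        FROM nodes n JOIN ancestors a ON n.id = a.parent_id
    )
    SELECT name FROM ancestors ORDER BY depth DESC
""").fetchall()
print([row[0] for row in path])  # ['/', 'home', 'alice', 'notes.txt']
```

The data round-trips, but the hierarchy lives entirely in application convention; the relational model has no idea these rows form a tree.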

I believe the best way to classify DBMSs is by their native modeling element.

  • In hierarchical databases, the NME is the hierarchy.  Example:  IMS.
  • In network databases, it’s the (directed, acyclic) graph. Example:  IDMS.
  • In relational databases, it’s the relation (or, table).  Example:  Oracle.
  • In object databases, it’s the (typically C++) object class. Example:  Versant.
  • In multi-dimensional databases, it’s the hypercube. Example:  Essbase.
  • In document databases, it’s the document. Example:  CouchDB.
  • In key/value stores, it’s the key/value pair. Example:  Redis.
  • In XML databases, it’s the XML document. Example:  MarkLogic.

The biggest limitation of this approach is that classifying by model fails to capture implementation differences. Some examples:

  • I would classify columnar DBMSs (e.g., Vertica) as relational if they model data as tables, and key/value stores (e.g., HBase) as such if they model data in key/value pairs.  This fails to capture the performance advantage that Vertica gets on certain data warehousing problems due to its column orientation.
  • I would classify all relational databases as relational, despite implementation optimizations.  For example, this approach fails to capture Teradata’s optimizations for large-scale data warehousing, Aster’s optimizations for analytics on big data, or Volt’s optimizations for what Curt Monash calls HVSP.
  • I would classify all XML databases as XML databases, despite possible optimization differences for the two basic XML use-cases:  (1) XML as message wrapper vs. (2) XML as document markup.

Nevertheless, I believe that DBMSs should be classified first by model and then sub-classified by implementation optimization.  For example, a relational database optimized for big data analytics (Aster).  An XML database optimized for large amounts of semi-structured information marked in XML (MarkLogic).

In closing, I’d say that we are seeing increasing numbers of customers coming to Mark Logic saying:  “well, I suppose we could have modeled this data relationally, but in our business we think of this information as documents and we’ve decided that it’s easier and more natural to manage it that way, so we decided to give you a call.”

After thinking about this for some time, I have one response:  keep calling!

No matter how you want to think about MarkLogic Server — an XML server, an XML database, or an XML document database — dare I say an [XML] [document] server|database  — it’s definitely a document-oriented, XML-oriented database management system and a great place to put any information that you think is more naturally modeled as documents.