
Thoughts on MongoDB’s Humongous $150M Round

Two weeks ago MongoDB, formerly known as 10gen, announced a massive $150M funding round, said to be the largest in the history of databases, led by Fidelity, Altimeter, and Salesforce.com, with participation from existing investors Intel, NEA, Red Hat, and Sequoia.  This brings the total capital raised by MongoDB to $231M, making it the best-funded database / big data technology of all time.

What does this mean?

The two winners of the next-generation NoSQL database wars have been decided:  MongoDB and Hadoop.  The faster the runners-up figure that out, the faster they can carve off sensible niches on the periphery of the market instead of running like decapitated chickens in the middle. [1]

The first reason I say this is because of the increasing returns (or, network effects) in platform markets.  These effects are weak to non-existent in applications markets, but in core platform markets like databases, the rich invariably get richer.  Why?

  • The more people that use a database, the easier it is to find people to staff teams, so the more likely you are to choose it.
  • The more people that use a database, the richer the community of people you can leverage to get help.
  • The more people that build applications atop a database, the less perceived risk there is in building a new application atop it.
  • The more people that use a database, the more jobs there are around it, which attracts more people to learn how to use it.
  • The more people that use a database, the cooler it is seen to be, which in turn attracts more people to want to learn it.
  • The more people that use a database, the more likely major universities are to teach it in their computer science departments.

To see just how strong MongoDB has become in this regard, see here.  My favorite analysis is the 451 Group’s LinkedIn NoSQL skills analysis, below.

[Chart: the 451 Group’s LinkedIn NoSQL skills analysis]

This is why betting on horizontal underdogs in core platform markets is rarely a good idea.  At some point, best technology or not, a strong leader becomes the universal safe choice.  Consider 1990 to about 2005, when the relational model was the chosen technology and the market was a comfortable oligopoly ruled by Oracle, IBM, and Microsoft.

It’s taken 30+ years (and numerous prior failed attempts) to create a credible threat to the relational stasis, but the combination of three forces is proving to be a perfect storm:

  • Open source business models, which cut costs by a factor of 10.
  • Increasing amounts of data in unstructured data types that do not map well to the relational model.
  • A change in hardware topology from fewer, bigger computers to vast numbers of smaller ones.

While all technologies die slowly, the best days of relational databases are now clearly behind them.  Kids graduating college today see SQL the way I saw COBOL when I graduated from Berkeley in 1985.  Yes, COBOL was everywhere.  Yes, you could easily get a job programming it.  But it was not cool in any way whatsoever and it certainly was not the future.  It was more of a “trade school” language than interesting computer science.

The second reason I say this is because of my experience at Ingres, one of the original relational database providers which — despite growing from ~$30M to ~$250M during my tenure from 1985 to 1992 — never realized that it had lost the market and needed a plan B strategy.  In Ingres’s case (and with full 20/20 hindsight) there was a very viable plan B available:  as the leader in query optimization, Ingres could have easily focused exclusively on data warehousing at its dawn and become the leader in that segment as opposed to a loser in the overall market.  Yet, executives too often deny market reality, preferring to die in the name of “going big” as opposed to living (and prospering) in what could be seen as “going home.”  Runner-up vendors should think hard about the lessons of Ingres.

The last reason I say this is because of what I see as a change in venture capital. In the 1980s and 1990s VCs used to fund categories and cage-fights.  A new category would be identified, 5-10 companies would get created around it, each might raise $20-$30M in venture capital and then there would be one heck of a cage-fight for market leadership.

Today that seems less true.  VCs seem to prefer funding companies to categories.  (Does anyone know what category Box is in?  Does anyone care about any other vendor in it?)  Today, it seems that VCs fund fewer players, create fewer cage-fights, and prefer to invest much more, much later in a company that appears to be a clear winner.

This so-called “momentum investing” itself helps to anoint winners because if Box can raise $309M, then it doesn’t really matter how smart the folks at WatchDox are or how clever their technology is.

MongoDB is in this enviable position in the next-generation (open source) NoSQL database market.  It has built a huge following, and that huge following is attracting a huge-r (sorry) following.  That cycle is attracting momentum investors who see MongoDB as the clear leader.  Those investors give MongoDB $150M.

By my math, if entirely invested in sales [2], that money could fund hiring some 500 sales teams who could generate maybe $400M a year in incremental revenue.  Which would in turn attract more users.  Which would make the community bigger.  Which would de-risk using the system.  Which would attract more users.
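Here is a minimal sketch of that back-of-envelope math; the per-team cost and productivity figures are my own illustrative assumptions, not numbers MongoDB has disclosed.

```python
# Back-of-envelope sketch of the sales-capacity math above.
# The per-team cost and productivity figures are illustrative
# assumptions, not numbers MongoDB has published.
funding = 150_000_000           # the new round
cost_per_sales_team = 300_000   # assumed fully loaded annual cost per team
revenue_per_team = 800_000      # assumed annual incremental revenue per ramped team

teams = funding // cost_per_sales_team            # 500 teams
incremental_revenue = teams * revenue_per_team    # $400,000,000 per year

print(f"teams funded: {teams}")
print(f"incremental revenue: ${incremental_revenue:,}")
```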

And, quoting Vonnegut, so it goes.

# # #

Disclaimer:  I own shares in several of the companies mentioned herein, as well as in competitors that are not.  See my FAQ for more.

[1] Because I try to avoid writing about MarkLogic, I should be clear that while one can (and I have) argued that MarkLogic is a NoSQL system, my thinking has evolved over time and I now put much more weight on the open-source test as described in the “perfect storm” paragraph above.  Ergo, for the purposes of this post, I exclude MarkLogic entirely from the analysis because they are not in the open-source NoSQL market (despite the 451’s including them in their skills index).  Regarding MarkLogic, I have no public opinion and I do not view MongoDB’s or Hadoop’s success as definitively meaning anything either good or bad for them.

[2] Which, by the way, they have explicitly said they will not do.  They have said, “the company will use these funds to further invest in the core MongoDB project as well as in MongoDB Management Service, a suite of tools and services to operate MongoDB at scale. In addition, MongoDB will extend its efforts in supporting its growing user base throughout the world.”

The Information Continuum and the Three Types of Subtly Semi-Structured Information

We generally refer to MarkLogic Server as an XML server, which is a special-purpose database management system (DBMS) for unstructured information.  This often sparks debate about the term “unstructured” and the information continuum in general.  Surprisingly, while both analysts and vendors frequently discuss the concept, the Wikipedia entry for information continuum is weak, and I couldn’t easily find a nice picture of it, so I decided to make my own.

The general idea that information spans a continuum with regard to structure is pretty much undisputed.  The placement of any given type of information on that continuum is more problematic.  While it seems clear that purchase orders are highly structured and that free text is not, the placement of, for example, email is more interesting.  Some might argue that email is unstructured.  In fact, only the body of an email is unstructured and there is plenty of metadata (e.g., from, send-to, date, subject) wrapping an email.  In addition, an email’s body actually does have latent structure — while it may not be explicit, you typically have a salutation followed by numerous paragraphs of text, a sign-off, a signature, and perhaps a legal footer.  Email is unquestionably semi-structured.

In fact, I believe that the vast majority of information is semi-structured.  PowerPoint decks have slides, and slides have titles and bullets.  Contracts are typically Word documents, but have more-or-less standard sections.  Proposals are usually Word or PowerPoint documents that tend to have similar structures.  Even the humble tweet is semi-structured:  while the contents are ostensibly 140 unstructured characters, the anatomy of a tweet reveals lots of metadata (e.g., location) and even the contents contain some structural information (e.g., RT indicating re-tweet or #hashtags serving as topical metadata).
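As a minimal sketch of that latent structure, here is how much metadata falls out of a single invented tweet with nothing more than a couple of regular expressions (the tweet text is hypothetical):

```python
import re

# A tweet is ostensibly 140 unstructured characters, yet simple
# patterns recover a surprising amount of metadata. The tweet text
# below is invented for illustration.
tweet = "RT @someuser: Loving the new release! #nosql #semistructured"

is_retweet = tweet.startswith("RT ")
mentions = re.findall(r"@(\w+)", tweet)
hashtags = re.findall(r"#(\w+)", tweet)

print(is_retweet)   # True
print(mentions)     # ['someuser']
print(hashtags)     # ['nosql', 'semistructured']
```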

Now let’s consider XML content.  Some would argue that XML is definitionally structured.  But I’d say that an arbitrary set of documents all stored within <document> and </document> tags is only faux structured; it appears structured because it’s XML, but the XML is just used as a container.  A corpus of twenty 2,000-page medical textbooks in 6 different schemas is indeed structured, but not well so.  To paraphrase an old saw about standards:  the nice thing about structures is that there are so many to choose from.  I believe that knowing content is marked up in XML reveals nothing about its structure, i.e., that XML-ness and structure are orthogonal.  Put differently, XML is simply a means of representing information.  The information represented may be highly structured (e.g., 100 purchase orders all in perfect adherence to a given schema) or highly unstructured (e.g., 20 documents only vaguely complying with 20 different schemas).
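To make the orthogonality point concrete, here is a small sketch (both fragments invented) showing that a parser happily accepts either kind of XML, but only one of them yields queryable structure:

```python
import xml.etree.ElementTree as ET

# Both fragments below are well-formed XML, but only the first is
# genuinely structured; the second just uses XML as a container.
# Both examples are invented for illustration.
purchase_order = """<purchaseOrder id="PO-1001">
  <customer>Acme Corp</customer>
  <item sku="X-42" qty="3" unitPrice="19.99"/>
</purchaseOrder>"""

faux_structured = """<document>
  Chapter 3 discusses treatment options ... (2,000 pages of prose)
</document>"""

po = ET.fromstring(purchase_order)
doc = ET.fromstring(faux_structured)

print(po.find("customer").text)        # Acme Corp -- a queryable field
print(po.find("item").attrib["qty"])   # 3
print(len(list(doc)))                  # 0 child elements: just a text blob
```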

I have two primary beliefs about the information continuum:

  • The vast majority of information is semi-structured. There is relatively little highly structured and relatively little completely unstructured information out there.  Most information lies somewhere in the fat middle.  I overlaid a bell curve on top of the information continuum to reflect volume.
  • Even information that initially appears structured is often semi-structured.  I see three types of this subtly semi-structured information which, hopefully without being too cute, I’ll abbreviate as SSSI.  The three types are (1) schema as aspiration, (2)  time-varying schema, and (3) unknowable schema.

Let’s look at each of the three types more closely.

Schema as Aspiration

The first type of subtly semi-structured information (SSSI) is where a schema exists, but only notionally.  The schema itself is either poorly defined (actual quote:  “it is believed that this element is used for”) or well defined but not followed.  This is frequently the case with publishing and media companies.  Here are two free jokes that work well at any publishing conference:

  • Raise your hand if you have a standard schema.  Keep it up if your content actually adheres to it.
  • Oxymorons aside, how many of you have 3 or more “standard” schemas, 5 or more, … do  I hear 10?

These jokes are funny because of the state of the content.  This state is the result of two primary business trends:  (1) consolidation — most large publishers have been built through M&A thus inheriting numerous different standards, each of which may be only partly implemented — and (2) licensing — publishers frequently license content from numerous other sources, each with its own standard format.

Time-Varying Schema

The second case of SSSI is where you have a well-defined, enforced schema at any moment in time, but it keeps changing over time.  Typically this happens for one of two reasons:

  • The business reality that you’re modeling is changing.  For example, in 2009 Federal Sales was part of Eastern Sales, but in 2010 it became its own division.  This makes comparison of Eastern results between 2009 and 2010 potentially difficult.  In BI circles, this is known as the slowly changing dimension problem.
  • Standards keep changing.  If you’re modeling information in a corporate- or industry-standard schema and that schema is changing, then your information becomes semi-structured because it is contained within multiple different schemas.  Sometimes you can avoid this by migrating all prior information to the current schema, but sometimes (e.g., massive data volumes, regulatory desire to not change existing records) you will not.

When viewed with a flash camera, this information looks well structured.  When you look at the movie, you can clearly see that it’s not.
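A minimal sketch of the Federal/Eastern example above, with invented revenue figures, shows why the “movie” view matters:

```python
# Invented revenue figures for the Federal/Eastern example above.
# Each year's record is valid against the schema of its day, but the
# schema itself changed between snapshots.
bookings_2009 = {"Eastern": 41.0, "Western": 38.5}                  # Federal rolled into Eastern
bookings_2010 = {"Eastern": 35.0, "Western": 40.0, "Federal": 9.5}  # Federal now its own division

# The naive year-over-year comparison makes Eastern look like it shrank...
print(bookings_2010["Eastern"] - bookings_2009["Eastern"])   # -6.0

# ...until you restate 2010 in the 2009 structure.
restated_2010 = dict(bookings_2010)
restated_2010["Eastern"] += restated_2010.pop("Federal")
print(restated_2010["Eastern"] - bookings_2009["Eastern"])   # 3.5
```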

Unknowable Schema

The last case of SSSI is where you have an unknowable schema.  Consider terrorist tracking.  If you were to make a schema for a terrorist database, here are some of the attributes that spring to mind:  name, alias(es), address, former address(es), height, weight, hair color, eye color, member-of, enemy-of, friend-of, tattoos/markings.

Here are some problems with this:

  • Many of the attributes are multi-valued, such as alias or friend-of.  In a de-normalized approach, this means dealing with repeating-group problems and creating N columns (e.g., alias, alias1, alias2, and so on up to the maximum number of aliases for any terrorist).  Normalization would take care of the repeating group, but at the cost of creating a table for each multi-valued attribute and then having to join back to those tables when you run queries; see the sketch after this list.  (One such real system ended up with 500 tables, with the result that no one could find anything.)
  • It is difficult to create a type for the tattoo attribute.  First, it’s multi-valued.  Second, while tattoos are sometimes images, they often contain text (e.g., Mom), sometimes in a foreign language (e.g., 愛, the Chinese symbol for love).  Since you’re trying to secure the nation against threats, you don’t want to throw away any potentially valuable information, but it’s not obvious how to store this.
  • New attributes are coming all the time.  Say you get a shoe print from a suspect as he runs away.  You need to add a shoe-size attribute to the database.  Say a terrorist runs away and leaves a pair of eyeglasses.  Now you need to add an eyeglass-prescription attribute.  My favorite is what’s called pocket litter.  You find a piece of paper in a person’s pocket and it has a number on it.  It could be a phone number, a lock combination, or maybe map coordinates.  You don’t know what it is — but again, since you don’t want to throw away any potentially valuable information — you have to find a place to store it.
  • Combining an enormous number of potential attributes with the reality that very few are known for most individuals creates two problems:  (1) you end up with a sparse table which is not well handled in most RDBMSs and (2) you end up hitting column limits.
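As a minimal sketch of the normalization trade-off in the first bullet (all names and values invented), here is what just one multi-valued attribute costs in the relational approach:

```python
import sqlite3

# The relational approach: every multi-valued attribute (alias,
# friend-of, tattoos, ...) gets its own table, and every query joins
# back to it. All names and values here are invented.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE alias  (person_id INTEGER, alias TEXT);
""")
db.execute("INSERT INTO person VALUES (1, 'John Doe')")
db.executemany("INSERT INTO alias VALUES (1, ?)",
               [("J. Smith",), ("The Ghost",)])

# One join per multi-valued attribute; with dozens of such attributes
# the schema balloons toward the 500-table system described above.
rows = db.execute("""
    SELECT p.name, a.alias
    FROM person p JOIN alias a ON a.person_id = p.id
""").fetchall()
print(rows)   # [('John Doe', 'J. Smith'), ('John Doe', 'The Ghost')]
```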

Another example of unknowable schemas would be in financial services, modeling derivatives.   Because derivatives are sometimes long-lived instruments (e.g., 30 years) you may face the time-varying schema problem.  In addition, you have the unknowable schema problem because the industry is constantly creating new products.  First we had CDOs and CDSs on banks, then single-tranche CDOs, then CDSs on single-tranche CDOs, and then synthetic CDOs.  If this makes your head hurt in terms of understanding, then think for a minute about data modeling.  How are you going to store these complex products in a database?   And what are you going to do with the never-ending stream of new ones — last I heard they were considering selling derivatives on movies.

(As it turns out, XML is a great way to model both of these problems, as you can easily add new attributes on the fly and provide values only for the attributes you know.)
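Here is a small illustrative sketch of that point, using invented records: each document carries only the attributes actually known for that person, and new attributes (shoe size, pocket litter) slot in without disturbing any other record.

```python
import xml.etree.ElementTree as ET

# Two invented records in a document model: each carries only the
# attributes actually known for that person, and new attributes
# (shoe size, pocket litter) appear without touching other records.
records = [
    """<person>
         <name>John Doe</name>
         <alias>J. Smith</alias>
         <alias>The Ghost</alias>
         <eyeColor>brown</eyeColor>
       </person>""",
    """<person>
         <name>Richard Roe</name>
         <shoeSize>11</shoeSize>
         <pocketLitter>554-0192 (phone? combination? coordinates?)</pocketLitter>
       </person>""",
]

for r in records:
    person = ET.fromstring(r)
    known = sorted(child.tag for child in person)
    print(person.findtext("name"), "->", known)
# John Doe -> ['alias', 'alias', 'eyeColor', 'name']
# Richard Roe -> ['name', 'pocketLitter', 'shoeSize']
```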

To finish the post, I’ll revisit the statement I started with:  we generally refer to MarkLogic Server as an XML server, a special-purpose database management system (DBMS) for unstructured information.  Going forward, I think I’ll keep saying that because it’s simpler, but at the MarkLogic 201 level, the more precise statement is:  a special-purpose DBMS for semi-structured information.

There’s way more semi-structured information out there than anything else.  Realizing that information is semi-structured is sometimes subtle.  And semi-structured information is, in fact, the optimization point for our product.  So what’s MarkLogic in three concepts?  Speed, scale, and semi-structured information.

Dear CIO: Stop Writing Big Checks for Commodity (Database) Software

Dear CIO,

What’s wrong with this picture?

  • At 50%+, Oracle’s operating margins have never been higher.
  • The differentiation of Oracle’s database technology, however, has never been lower, and the number of both core and specialized alternatives has never been greater.

So what’s going on? You, kind Sir or Madam, are being milked. What’s worse is that you, in an example of collective behavioral dysfunction, have inadvertently played a role in setting up the milking. What happened?

  • Like all smart CIOs you followed a bit of herd mentality when it came to core technology. Pity the poor fools who, back in the day, bet big on Ingres or Sybase. You played it safe and went with Oracle, IBM, or if your requirements weren’t too heavy, Microsoft.
  • The problem is, of course, that everyone executed the same strategy you did. Hence, the market created a system of increasing returns where the strong vendors got stronger and the weak ones died. The result: the RDBMS market is an (order of magnitude) $10B/year market, structured as an oligopoly with 3 players. Most other software markets worked out the same way.
  • You were focused on standardization. You realized that through a combination of decentralized IT decision making and growth-by-acquisition your organization had become a kitchen sink of enterprise software. You had everything. In order to reduce the administrative, training, and license acquisition costs, you fought tooth and nail with your divisions to standardize the environment. You said, “Heck, it’s all the same stuff in the end, folks, so let’s make Oracle our DBMS standard, Business Objects our BI standard, Documentum our ECM standard, and SAP our ERP standard.”
  • And you won. Mostly. There’s still some Cognos in finance. And marketing didn’t totally give up on Interwoven. But, for the most part, you won. You reduced the entropy of your IT environment and drove cost savings for your organization.

The problem is you’ve won the battle but lost the war. Why? Because if, as you say, the “stuff really is all the same” you shouldn’t standardize on the most expensive product. You should standardize on the cheapest.

  • Do you really need to be paying those big fees to Oracle for enterprise licenses? Wouldn’t MySQL do?
  • Are you really using all the functionality of that $1M/year Documentum ECM system? Wouldn’t SharePoint or Alfresco do?
  • For BI, do you need all the bells and whistles of BusinessObjects? Wouldn’t Pentaho or Qlikview do a fine job, at a fraction of the cost?

But these alternatives are obvious. Heck, even “the establishment” (i.e., Gartner) says it’s safe to tread in the open source water. So the question is, what’s holding you back?

  • Switching costs. It’s hard to move off Oracle or Documentum and you don’t want to pay the nut to do so.
  • Organizational inertia. Your whippersnapper DBAs who were in their 30s in the 1980s are now in their 50s. They’re thinking that change devalues their knowledge and experience; some just want to cruise into retirement. But that’s their personal agenda, not your enterprise one.
  • Accounting: you made it free for your divisions to keep using Documentum, Oracle, or BusinessObjects because you bought an enterprise license. While this appeared to “save” you money on a per-license basis, and it helped support your standardization initiative, it squashed innovation in your divisions, reinforced the organizational inertia, and left a lot of people using the wrong tool for the job, resulting in projects that take either more hardware or more expensive hardware than necessary (Oracle is good at this), that take too long to develop, or that simply fail.

So, what do I recommend doing about all this? I suggest that you adopt these policies, which, for full disclosure, are at least partially in the self-interest of this blog’s author:

  • Stop writing big checks for commodity software. Every time a big check comes along, ask yourself: is this software differentiated or commoditized? Be willing to pay a premium for differentiated software, and price shop commodity software. Call a group of your smartest staff together periodically to help you make the commodity versus differentiated call.

  • When you see a big check coming for commodity software, make a migration plan. My hunch is that most of the time, you can create a nice 3-year ROI in the transition from premium to cheaper software. (This reminds me of the time I visited an investment bank’s CIO to ask about their Documentum strategy. The answer: “our Documentum strategy is to get off Documentum,” because they were paying too much and using too little.)

  • Stop doing enterprise agreements that create poor economic incentives within your organization. Don’t pay $XM at the enterprise level, spread that as a “tax” across your divisions, and then make the use of certain software “free.” It distorts project reality, creates false incentives, squashes innovation, and generates lots of hidden costs. If you want to negotiate a master agreement and discount rate, that’s fine. Shoot for centralized discounts without central planning.
  • Don’t worry that the prior policies will create mayhem. While I understand that you don’t want arbitrary taste differences increasing the entropy of your enterprise software portfolio, recognize that with the first policy you’ve solved that problem already. If you deem a category (e.g., core RDBMS, enterprise search) commoditized, then you are going to force people to pick on cost. You’ll get standardization on the commodity categories, just on the least expensive alternatives. The only entropy you’ll need to manage will be on the differentiated software which, having dispatched the commodity majority, you’ll have time to explore, study, and exploit.

Why am I taking the time to write this note to you? Back in the 1980s I was a foot soldier in the relational database revolution, and today I’m the CEO of one specialized DBMS company and on the board of another.

  • Mark Logic makes an XML server which can save great amounts of time and money in creating applications against unstructured information, replacing the combination of an RDBMS, an enterprise search engine, and an application server. Not only can Mark Logic manage 100s of TB of XML, but the system also eliminates the object/relational/hierarchical impedance mismatch between Java, SQL, and XML that hampers developer productivity. Mark Logic was recently named the fourth fastest-growing IT company in Silicon Valley.
  • Aster Data makes a specialized data warehouse DBMS that runs on low-cost commodity hardware with a shared nothing architecture and leverages in-database MapReduce technology for parallelism and high scalability.

And during the past 25 years or so I’ve watched the market evolve. While I fully understand the policies and market forces that have led us to where we are, I feel like we’ve come full circle. Vendor power is now concentrated in the big three. Vendor margins top 50%. Big vendors don’t innovate; they consolidate. Inertia has set in at customer organizations. And there’s a major platform shift in progress; last time it was mainframe to minicomputer, this time it’s cloud.

Things feel a lot to me the way they did in 1985, just past the dawn of the relational revolution. So in one way I’m writing to point out the oft-overlooked obvious: stop paying premium prices for commodity items. And in another way I’m saying, take the money you save in so doing and invest it in innovative technologies that:

  • Drive competitive advantage (which will matter again as we come out of the Great Recession)
  • Enable the Internet-scale applications you’ll need to face the coming information deluge
  • Reform the application development stack in ways that make sense for the coming generation of information applications, not in ways that made sense for the last generation of data-centric ones.

Thank you for reading my note. If you have any questions or comments, please give me a ping at dave-dot-kellogg-at-marklogic-com or comment on this post.

Sincerely,

Dave Kellogg