Category Archives: Database

Joining the Profisee Board of Directors

We’re announcing today that I’m joining the board of directors of Profisee, a leader in master data management (MDM).  I’m doing so for several reasons, mostly reflecting my belief that successful technology companies are about three things:  the people, the space, and the product.

I like the people at both an investor and management level.  I’m old friends with a partner at ParkerGale, the private equity (PE) firm backing Profisee, and I quite like the people at ParkerGale, the culture they’ve created, their approach to working with companies, and of course the lead partner on Profisee, Kristina Heinze.

The management team, led by veteran CEO and SAP alumnus Len Finkle, is stocked with domain experts from larger companies including SAP, Oracle, Hyperion, and Informatica.  What’s more, former Gartner VP and analyst Bill O’Kane recently joined the company.  Bill covered the space at Gartner for over 8 years and has personally led MDM initiatives at companies including MetLife, CA Technologies, Merrill Lynch, and Morgan Stanley.  It’s hard to read Bill’s decision to join the team as anything but a big endorsement of the company, its leadership, and its strategy.

These people are the experts.  And instead of working at a company where MDM is an element of an element of a suite that no one really cares about anymore, they are working at a focused market leader that worries about MDM — and only MDM – all day, every day.  Such focus is powerful.

I like the MDM space for several reasons:

  • It’s a little obscure. Many people can’t remember if MDM stands for metadata management or master data management (it’s the latter).  It’s under-penetrated; relatively few companies who can benefit from MDM use it.  Historically the market has been driven by “reluctant spend” to comply with regulatory requirements.  Megavendors don’t seem to care much about MDM anymore, with IBM losing market share and Oracle effectively exiting the market.  It’s the perfect place for a focused specialist to build a team of people who are passionate about the space and build a market-leading company.
  • It’s substantial. It’s a $1B market today growing at 5%.  You can build a nice company stealing share if you need to, but I think there’s an even bigger opportunity.
  • It’s teed up to grow. On the operational side, I think that single source of truth, digital transformation, and compliance initiatives will drive the market.  On the analytical side, if there’s one thing 20+ years in and around business intelligence (BI) has taught me, it’s GIGO (garbage in, garbage out).  If you think the GIGO rule was important in traditional BI, I’d argue it’s about ten times more important in an artificial intelligence and machine learning (AI/ML) world.  Garbage data in, garbage model and garbage predictions out.  Data quality is the Achilles’ heel of modern analytics.

I like Profisee’s product because:

  • It’s delivering well for today’s customers.
  • It has the breadth to cover a wide swath of MDM domains and use-cases.
  • It provides a scalable platform with a broad range of MDM-related functionality, as opposed to a patchwork solution set built through acquisition.
  • It’s easy to use and makes solving complex problems simple.
  • It’s designed for rapid implementation, so it’s less costly to implement and faster to get into production, which is great for both committed MDM users and — particularly important in an under-penetrated market — those wanting to give MDM a try.

I look forward to working with Len, Kristina, and the team to help take Profisee to the next level, and beyond.

Now, before signing off, let me comment on how I see Profisee relative to my existing board seat at Alation.  Alation defined the catalog space, has an impressive list of enterprise customers, raised a $50M round earlier this year, and has generally been killing it.  If you don’t know the data space well you might see these companies as competitive; in reality, they are complementary and I think it’s synergistic for me to work with both.

  • Data catalogs help you locate data and understand the overall data set. For example, with a data catalog you can find all of the systems and data sets where you have customer data across operational applications (e.g., CRM, ERP, FP&A) and analytical systems (e.g., data warehouses, data lakes).
  • MDM helps you rationalize the data across your operational and analytical systems.  At its core, MDM solves the problem of IBM being entered in your company’s CRM system as “Intl Business Machines,” in your ERP system as “International Business Machines,” and in your planning system as “IBM Corp,” to give a simple example.  Among other approaches, MDM introduces the concept of a golden record, which provides a single source of truth for how, in this example, the customer should be named.

In short, data catalogs help you find the right data and MDM ensures the data is clean when you find it.  You pretty obviously need both.
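To make the golden record idea concrete, here is a minimal sketch of consolidating the “IBM” variants from the example above into a single master record.  The alias table and matching rule are hypothetical and purely illustrative; real MDM matching uses far richer fuzzy-matching and survivorship logic.

```python
# Minimal golden-record sketch: collapse name variants from several systems
# into one canonical customer record.  The alias table is hypothetical.
from collections import defaultdict

source_records = [
    {"system": "CRM", "customer_name": "Intl Business Machines"},
    {"system": "ERP", "customer_name": "International Business Machines"},
    {"system": "Planning", "customer_name": "IBM Corp"},
]

ALIASES = {
    "intl business machines": "IBM",
    "international business machines": "IBM",
    "ibm corp": "IBM",
}

def canonical(name: str) -> str:
    """Map a raw name to its golden-record name, if it is a known variant."""
    return ALIASES.get(name.strip().lower(), name)

golden = defaultdict(list)
for rec in source_records:
    golden[canonical(rec["customer_name"])].append(rec["system"])

print(dict(golden))   # {'IBM': ['CRM', 'ERP', 'Planning']}
```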

It Ain’t Easy Making Money in Open Source:  Thoughts on the Hortonworks S-1

It took me a week or so to get to it, but in this post I’ll take a dive into the Hortonworks S-1 filing in support of a proposed initial public offering (IPO) of their stock.

While Hadoop and big data are unarguably huge trends driving the industry and while the future of Hadoop looks very bright indeed, on reading the Hortonworks S-1, the reader is drawn to the inexorable conclusion that  it’s hard to make money in open source, or more crassly, it’s hard to make money when you give the shit away.

This is a company that,  in the past three quarters, lost $54M on $33M of support/services revenue and threw in $26M in non-recoverable (i.e., donated) R&D atop that for good measure.

Let’s take it top to bottom:

  • They have solid bankers: Goldman Sachs, Credit Suisse, and RBC are leading the underwriting with specialist support from Pacific Crest, Wells Fargo, and Blackstone.
  • They have an awkward, jargon-y, and arguably imprecise marketing slogan: “Enabling the Data-First Enterprise.”  I hate to be negative, but if you’re going to lose $10M a month, the least you can do is to invest in a proper agency to make a good slogan.
  • Their mission is clear: “to establish Hadoop as the foundational technology of the modern enterprise data architecture.”
  • Here’s their solution description: “our solution is an enterprise-grade data management platform built on a unique distribution of Apache Hadoop and powered by YARN, the next generation computing and resource management framework.”
  • They were founded in 2011, making them the youngest company I’ve seen file in quite some years. Back in the day (e.g., the 1990s) you might go public at age 3-5, but these days it’s more like age 10.
  • Their strategic partners include Hewlett-Packard, Microsoft, Rackspace, Red Hat, SAP, Teradata, and Yahoo.
  • Business model:  “consistent with our open source approach, we generally make the Hortonworks Data Platform available free of charge and derive the predominant amount of our revenue from customer fees from support subscription offerings and professional services.”  (Note to self:  if you’re going to do this, perhaps you shouldn’t have -35% services margins, but we’ll get to that later.)
  • Huge market opportunity: “According to Allied Market Research, the global Hadoop market spanning hardware, software and services is expected to grow from $2.0 billion in 2013 to $50.2 billion by 2020, representing a compound annual growth rate, or CAGR, of 58%.”  The vastness of the market opportunity is unquestioned (a quick check of that CAGR arithmetic follows this list).
  • Open source purists: “We are committed to serving the Apache Software Foundation open source ecosystem and to sharing all of our product developments with the open source community.”  This one’s big because while it’s certainly strategic and it certainly earns them points within the Hadoop community, it chucks out one of the better ways to make money in open source:  proprietary versions / extensions.  So, right or wrong, it’s big.
  • Headcount:  The company has increased the number of full-time employees from 171 at December 31, 2012, to 524 at September 30, 2014.
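As promised in the market-opportunity bullet above, the cited CAGR checks out arithmetically:

```python
# Quick sanity check of the Allied Market Research figure: $2.0B (2013)
# growing to $50.2B (2020) over seven years.
start, end, years = 2.0, 50.2, 2020 - 2013
cagr = (end / start) ** (1 / years) - 1
print(f"{cagr:.0%}")   # 58%, matching the quoted CAGR
```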

Before diving into the financials, let me give readers a chance to review open source business models (Wikipedia, Kellblog) if they so desire, before I make the (generally true but probably slightly inaccurate) assertion:  the only open source company that’s ever made money (at scale) is Red Hat.

Sure, there have been a few great exits.  Who can forget MySQL selling to Sun for $1B?  Or VMware buying SpringSource for $420M?  Or Red Hat buying JBoss for $350M+?  (Hortonworks CEO Rob Bearden was involved in both of the latter two deals.)   Or Citrix buying XenSource for $500M?

But after those deals, I can’t name too many others.  And I doubt any of those companies was making money.

In my mind there are two common things that go wrong in open source:

  • The market is too small. In my estimation open source compresses the market size by 10-20x.  So if you want to compress the $30B DBMS market 10x, you can still build several nice companies.  However, if you want to compress the $1B enterprise search market by 10x, there’s not much room to build anything.  That’s why there is no Red Hat of Lucene or Solr, despite their enormous popularity in search.    For open source to work, you need to be in a huge market.
  • People don’t renew. No matter which specific open source business model you’re using, the general play is to sell a subscription to <something> that complements your offering.  It might be a hardened/certified version of the open source product.  It might be additions to it that you keep proprietary forever or, in a hardcover/paperback analogy, roll back into the core open source projects with a 24-month lag.  It might be simply technical support.  Or, it might be “admission to the club” as one open source CEO friend of mine used to say:  you get to use our extensions, our support, our community, etc.  But no matter what you’re selling, the key is to get renewals.  The risk is that the value of your extensions decreases over time and/or customers become self-sufficient.    This was another problem with Lucene.  It was so good that folks just didn’t need much help and if they did, it was only for a year or so.

So Why Does Red Hat Work?

Red Hat uses a professional open source business model applied primarily to two low-level infrastructure categories:  operating systems and, later, middleware.   As general rules:

  • The lower-level the category the more customers want support on it.
  • The more you can commoditize the layers below you, the more the market likes it. Red Hat does this for servers.
  • The lower-level the category the more the market actually “wants” it standardized in order to minimize entropy. This is why low-level infrastructure categories become natural monopolies or oligopolies.

And Red Hat set the right price point and cost structure.  In their most recent 10-Q, you can see they have 85% gross margins and about a 10% return on sales.  Red Hat nailed it.

But, if you believe this excellent post by Andreessen Horowitz partner Peter Levine, There Will Never Be Another Red Hat.  As part of his argument Levine reminds us that while Red Hat may be a giant among open source vendors, among general technology vendors they are relatively small.  See the chart below for the market capitalization compared to some megavendors.

[Chart: Red Hat market capitalization compared to some technology megavendors]

Now this might give pause to the Hadoop crowd with so many firms vying to be the Red Hat of Hadoop.  But that hasn’t stopped the money from flying in.  Per Crunchbase, Cloudera has raised a stunning $1.2B in venture capital, Hortonworks has raised $248M, and MapR has raised $178M.  In the related Cassandra market, DataStax has raised $190M.  MongoDB (with its own open source DBMS) has raised $231M.  That’s about $2B invested in next-generation open source database venture capital.

While I’m all for open source, disruption, and next-generation databases (recall I ran MarkLogic for six years), I do find the raw amount of capital invested pretty crazy.   Yes, it’s a huge market today.  Yes, it’s exploding as data volumes grow and unstructured data increasingly comes into the fold.  But we will be compressing it 10-20x as part of open-source-ization.  And, given all the capital these guys are raising — and presumably burning (after all, why else would you raise it?) — I can assure you that no one’s making money.

Hortonworks certainly isn’t — which serves as a good segue to dive into the financials.  Here’s the P&L, which I’ve cleaned up from the S-1 and color-annotated.

[Table: Hortonworks P&L from the S-1, cleaned up and color-annotated]

  •  $33M in trailing three quarter (T3Q) revenues ($41.5M in TTM, though not on this chart)
  • 109% growth in T3Q revenues
  • 85% gross margins on support
  • Horrific -35% gross margins on services which, given the large relative size of the services business (43% of revenues), crush overall gross margins down to 34% (see the quick blend check just after this list)
  • More scarily this calls into question the veracity of the 85% subscription gross margins — I recall reading in the S-1 that they currently lack VSOE for subscription support, which means that they’ve not yet clearly demonstrated what is really support revenue vs. professional services revenue.  [See footnote 1]
  • $26M in T3Q R&D expense.  Per their policy all that value is going straight back to the open source project, which raises the question:  will they ever see a return on it?
  • Net loss of $86.7M in T3Q, or nearly $10M per month
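As noted in the margin bullet above, the blended gross margin is easy to reproduce from the two segment margins and the revenue mix:

```python
# Blended gross margin from the segment figures above (T3Q revenue mix).
support_share, services_share = 0.57, 0.43     # share of T3Q revenue
support_gm, services_gm = 0.85, -0.35          # segment gross margins
blended = support_share * support_gm + services_share * services_gm
print(f"{blended:.0%}")   # 33%, in line with the ~34% overall gross margin above
```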

Here are some other interesting tidbits from the S-1:

  • Of the 524 full-time employees as of 9/30/14, there are 56 who are non-USA-based
  • CEO makes $250K/year in base salary cash compensation with no bonus in FY13 (maybe they missed plan despite strong growth?)
  • Prior to the offering CEO owns 6.8% of the stock, a pretty nice percentage, but he was kind of a founder
  • Benchmark owns 18.7%
  • Yahoo owns 19.6%
  • Index owns 9.5%
  • $54.9M cash burn from operations in T3Q, $6.1M per month
  • Number of support subscription customers has grown from 54 to 233 over the year from 9/30/13 to 9/30/14
  • A single customer went from representing 47% of revenues for the T3Q ending 9/30/13 down to 22% for the T3Q ending 9/30/14.  That’s a lot of revenue concentration in one customer (who is identified as “Customer A,” but who I believe is Microsoft based on some text in the risk factors.)

Here’s a chart I made of the increase in value in the preferred stock.  A ten-bagger in 3 years.

[Chart: increase in value of the Hortonworks preferred stock]

One interesting thing about the prospectus is that they show “gross billings,” a derived metric that financial analysts use to try to approximate bookings at a subscription company.  Here’s what they present:

[Table: Hortonworks gross billings as presented in the S-1]

While gross billings is not a bad stab at bookings, the two metrics can diverge — primarily when the duration of prepaid contracts changes.  Deferred revenue can shoot up when sales sells longer prepaid contracts to a given number of customers, as opposed to same-length contracts to more of them.  Conversely, if happy customers reduce prepaid contract duration to save cash in a downturn, that can actually help the vendor’s financial performance (they will still get the renewals because the customer is happy, and they won’t have discounted in return for a multi-year commitment), but deferred revenue will drop, as will gross billings.  In some ways, unless prepaid contract duration is held equal, gross billings is more of a dangerous metric than anything else.  Nevertheless, Hortonworks is showing it as an implied metric of bookings or orders, and the growth is quite impressive.
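For reference, gross billings is typically derived as revenue plus the change in deferred revenue over the period.  A minimal sketch, with made-up numbers:

```python
# Gross billings as analysts typically derive it: revenue plus the change in
# deferred revenue over the period.  All numbers below are illustrative only.
def gross_billings(revenue, deferred_start, deferred_end):
    return revenue + (deferred_end - deferred_start)

print(gross_billings(revenue=12.0, deferred_start=40.0, deferred_end=48.0))   # 20.0 ($M)
```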

Sales and Marketing Efficiency

Let’s now look at sales and marketing efficiency, not using CAC (customer acquisition cost), which is too hard to calculate for public companies, but using JMP’s sales and marketing efficiency metric = (gross profit [current] – gross profit [prior]) / S&M expense [prior].

On this metric Hortonworks scores a 41% for the T3Q ended 9/30/14 compared to the same period in 2013.  JMP considers anything above 50% efficient, so they are coming in low on this metric.  However, JMP also makes a nice chart that correlates S&M efficiency to growth and I’ve roughly hacked Hortonworks onto it here:

[Chart: JMP’s correlation of S&M efficiency to growth, with Hortonworks roughly added]
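Here is the formula in code form; the inputs are purely illustrative approximations, not exact S-1 figures, chosen to land near the 41% noted above:

```python
# JMP's S&M efficiency metric, as defined above:
# (current-period gross profit - prior-period gross profit) / prior-period S&M expense.
def sm_efficiency(gp_current, gp_prior, sm_prior):
    return (gp_current - gp_prior) / sm_prior

# Illustrative inputs in $M, not the actual S-1 figures:
print(f"{sm_efficiency(gp_current=11.0, gp_prior=4.0, sm_prior=17.0):.0%}")   # 41%
```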

I’ll conclude the main body of the post by looking at their dollar-based expansion rate.  Here’s a long quote from the S-1:

Dollar-Based Net Expansion Rate.    We believe that our ability to retain our customers and expand their support subscription revenue over time will be an indicator of the stability of our revenue base and the long-term value of our customer relationships. Maintaining customer relationships allows us to sustain and increase revenue to the extent customers maintain or increase the number of nodes, data under management and/or the scope of the support subscription agreements. To date, only a small percentage of our customer agreements has reached the end of their original terms and, as a result, we have not observed a large enough sample of renewals to derive meaningful conclusions. Based on our limited experience, we observed a dollar-based net expansion rate of 125% as of September 30, 2014. We calculate dollar-based net expansion rate as of a given date as the aggregate annualized subscription contract value as of that date from those customers that were also customers as of the date 12 months prior, divided by the aggregate annualized subscription contract value from all customers as of the date 12 months prior. We calculate annualized support subscription contract value for each support subscription customer as the total subscription contract value as of the reporting date divided by the number of years for which the support subscription customer is under contract as of such date.

This is probably the most critical section of the prospectus.  We know Hortonworks can grow.  We know they have a huge market.  We know that market is huge enough to be compressed 10-20x and still have room to create a great company.  What we don’t know is:  will people renew?   As we discussed above, we know it’s one of the great risks of open source.

Hortonworks pretty clearly answers the question with “we don’t know” in the above quote.  There is simply not enough data; not enough contracts have come up for renewal to get a meaningful renewal rate.  I view the early 125% calculation as a very good sign.  And intuition suggests that — if their offering is quality — people will renew, because we are talking about low-level, critical infrastructure and we know that enterprises are willing to pay to have that supported.
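To make the S-1’s definition concrete, here is a minimal sketch of the calculation; the customer names and contract values are made up:

```python
# Dollar-based net expansion rate per the S-1's definition: annualized contract
# value today from customers who were also customers 12 months ago, divided by
# annualized contract value from all customers 12 months ago.  Values are hypothetical.
acv_year_ago = {"cust_a": 1.0, "cust_b": 0.5, "cust_c": 0.3}                  # $M
acv_today    = {"cust_a": 1.4, "cust_b": 0.5, "cust_c": 0.35, "cust_d": 2.0}  # $M

retained = sum(v for k, v in acv_today.items() if k in acv_year_ago)   # cust_d excluded (new customer)
expansion_rate = retained / sum(acv_year_ago.values())
print(f"{expansion_rate:.0%}")   # 125% with these numbers
```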

# # #

Appendix

In the appendix below, I’ll include a few interesting sections of the S-1 without any editorial comments.

A significant portion of our revenue has been concentrated among a relatively small number of large customers. For example, Microsoft Corporation historically accounted for 55.3% of our total revenue for the year ended April 30, 2013, 37.8% of our total revenue for the eight months ended December 31, 2013 and 22.4% of our total revenue for the nine months ended September 30, 2014. The revenue from our three largest customers as a group accounted for 71.0% of our total revenue for the year ended April 30, 2013, 50.5% of our total revenue for the eight months ended December 31, 2013 and 37.4% of our total revenue for the nine months ended September 30, 2014. While we expect that the revenue from our largest customers will decrease over time as a percentage of our total revenue as we generate more revenue from other customers, we expect that revenue from a relatively small group of customers will continue to account for a significant portion of our revenue, at least in the near term. Our customer agreements generally do not contain long-term commitments from our customers, and our customers may be able to terminate their agreements with us prior to expiration of the term. For example, the current term of our agreement with Microsoft expires in July 2015, and automatically renews thereafter for two successive twelve-month periods unless terminated earlier. The agreement may be terminated by Microsoft prior to the end of its term. Accordingly, the agreement with Microsoft may not continue for any specific period of time.

# # #

We do not currently have vendor-specific objective evidence of fair value for support subscription offerings, and we may offer certain contractual provisions to our customers that result in delayed recognition of revenue under GAAP, which could cause our results of operations to fluctuate significantly from period-to-period in ways that do not correlate with our underlying business performance.

In the course of our selling efforts, we typically enter into sales arrangements pursuant to which we provide support subscription offerings and professional services. We refer to each individual product or service as an “element” of the overall sales arrangement. These arrangements typically require us to deliver particular elements in a future period. We apply software revenue recognition rules under U.S. generally accepted accounting principles, or GAAP. In certain cases, when we enter into more than one contract with a single customer, the group of contracts may be so closely related that they are viewed under GAAP as one multiple-element arrangement for purposes of determining the appropriate amount and timing of revenue recognition. As we discuss further in “Management’s Discussion and Analysis of Financial Condition and Results of Operations—Critical Accounting Policies and Estimates—Revenue Recognition,” because we do not have VSOE for our support subscription offerings, and because we may offer certain contractual provisions to our customers, such as delivery of support subscription offerings and professional services, or specified functionality, or because multiple contracts signed in different periods may be viewed as giving rise to multiple elements of a single arrangement, we may be required under GAAP to defer revenue to future periods. Typically, for arrangements providing for support subscription offerings and professional services, we have recognized as revenue the entire arrangement fee ratably over the subscription period, although the appropriate timing of revenue recognition must be evaluated on an arrangement-by-arrangement basis and may differ from arrangement to arrangement. If we are unexpectedly required to defer revenue to future periods for a significant portion of our sales, our revenue for a particular period could fall below  our expectations or those of securities analysts and investors, resulting in a decline in our stock price

 # # #

We generate revenue by selling support subscription offerings and professional services. Our support subscription agreements are typically annual arrangements. We price our support subscription offerings based on the number of servers in a cluster, or nodes, data under management and/or the scope of support provided. Accordingly, our support subscription revenue varies depending on the scale of our customers’ deployments and the scope of the support agreement.

 Our early growth strategy has been aimed at acquiring customers for our support subscription offerings via a direct sales force and delivering consulting services. As we grow our business, our longer-term strategy will be to expand our partner network and leverage our partners to deliver a larger proportion of professional services to our customers on our behalf. The implementation of this strategy is expected to result in an increase in upfront costs in order to establish and further cultivate such strategic partnerships, but we expect that it will increase gross margins in the long term as the percentage of our revenue derived from professional services, which has a lower gross margin than our support subscriptions, decreases.

 # # #

Deferred Revenue and Backlog

Our deferred revenue, which consists of billed but unrecognized revenue, was $47.7 million as of September 30, 2014.

Our total backlog, which we define as including both cancellable and non-cancellable portions of our customer agreements that we have not yet billed, was $17.3 million as of September 30, 2014. The timing of our invoices to our customers is a negotiated term and thus varies among our support subscription agreements. For multiple-year agreements, it is common for us to invoice an initial amount at contract signing followed by subsequent annual invoices. At any point in the contract term, there can be amounts that we have not yet been contractually able to invoice. Until such time as these amounts are invoiced, we do not recognize them as revenue, deferred revenue or elsewhere in our consolidated financial statements. The change in backlog that results from changes in the average non-cancelable term of our support subscription arrangements may not be an indicator of the likelihood of renewal or expected future revenue, and therefore we do not utilize backlog as a key management metric internally and do not believe that it is a meaningful measurement of our future revenue.

 # # #

We employ a differentiated approach in that we are committed to serving the Apache Software Foundation open source ecosystem and to sharing all of our product developments with the open source community. We support the community for open source Hadoop, and employ a large number of core committers to the various Enterprise Grade Hadoop projects. We believe that keeping our business model free from architecture design conflicts that could limit the ultimate success of our customers in leveraging the benefits of Hadoop at scale is a significant competitive advantage.

 # # #

International Data Corporation, or IDC, estimates that data will grow exponentially in the next decade, from 2.8 zettabytes, or ZB, of data in 2012 to 40 ZBs by 2020. This increase in data volume is forcing enterprises to upgrade their data center architecture and better equip themselves both to store and to extract value from vast amounts of data. According to IDG Enterprise’s Big Data Survey, by late 2014, 31% of enterprises with annual revenues of $1 billion or more expect to manage more than one PB of data. In comparison, as of March 2014 the Library of Congress had collected only 525 TBs of web archive data, equal to approximately half a petabyte and two million times smaller than a zettabyte.

# # #

Footnotes:

[1]  Thinking more about this, while I’m not an accountant, I think the lack of VSOE has the following P&L impact:  it means that in contracts that mix professional services and support, they must recognize all the revenue ratably over the contract.  That’s fine for the support revenue, but it should have the effect of pushing out services revenue, artificially depressing services gross margins.  Say, for example, you did a $240K contract that was $120K of each.  The support should be recognized at $30K/quarter.  However, if the consulting is delivered in the first six months, it should be recognized at $60K/quarter for the first and second quarters and $0 in the third and fourth.  Since accountants will normally take the services costs up-front, this should have the effect of hurting services margins by taking the costs as delivered but recognizing the revenue over a longer period.
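To spell out that example, here is a minimal sketch of the two recognition schedules and the quarterly margin effect.  The $40K/quarter services cost is my assumption, not a figure from the S-1.

```python
# Hypothetical $240K contract: $120K support plus $120K consulting delivered in
# the first two quarters.  Support is ratable either way; the question is consulting.
support_rev              = [30, 30, 30, 30]   # $K/quarter, ratable over the year
consulting_rev_delivered = [60, 60, 0, 0]     # $K, recognized as delivered (with VSOE)
consulting_rev_ratable   = [30, 30, 30, 30]   # $K, forced ratable (no VSOE)
consulting_cost          = [40, 40, 0, 0]     # $K, assumed cost, taken as delivered

for q in range(4):
    gm_delivered = (consulting_rev_delivered[q] - consulting_cost[q]) / max(consulting_rev_delivered[q], 1)
    gm_ratable = (consulting_rev_ratable[q] - consulting_cost[q]) / consulting_rev_ratable[q]
    print(f"Q{q + 1}: delivered-basis GM {gm_delivered:.0%}, ratable GM {gm_ratable:.0%}")
# With VSOE, Q1 and Q2 show +33% consulting margins; without it they show -33%,
# because costs land as delivered while revenue is spread over the full year.
# Full-year margins converge, but a partial reporting window catches the ugly quarters.
```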

[2] See here for generic disclaimers and please note that in the past I have served as an advisor to MongoDB.

Thoughts on MongoDB’s Humongous $150M Round

Two weeks ago MongoDB, formerly known as 10gen, announced a massive $150M funding round, said to be the largest in the history of databases, led by Fidelity, Altimeter, and Salesforce.com with participation from existing investors Intel, NEA, Red Hat, and Sequoia.  This brings the total capital raised by MongoDB to $231M, making it the best-funded database / big data technology of all time.

What does this mean?

The two winners of the next-generation NoSQL database wars have been decided:  MongoDB and Hadoop.  The faster the runners-up figure that out, the faster they can carve off sensible niches on the periphery of the market instead of running like decapitated chickens in the middle. [1]

The first reason I say this is because of the increasing returns (or, network effects) in platform markets.  These effects are weak to non-existent in applications markets, but in core platform markets like databases, the rich invariably get richer.  Why?

  • The more people that use a database, the easier it is to find people to staff teams so the more likely you are to use it.
  • The more people that use a database, the richer the community of people you can leverage to get help.
  • The more people that build applications atop a database, the less perceived risk there is in building a new application atop it.
  • The more people that use a database, the more jobs there are around it, which attracts more people to learn how to use it.
  • The more people that use a database, the cooler it is seen to be which in turn attracts more people to want to learn it.
  • The more people that use a database, the more likely major universities are to teach how to use it in their computer science departments.

To see just how strong MongoDB has become in this regard, see here.  My favorite analysis is the 451 Group’s LinkedIn NoSQL skills analysis, below.

[Chart: the 451 Group’s LinkedIn NoSQL skills analysis]

This is why betting on horizontal underdogs in core platform markets is rarely a good idea.  At some point, best technology or not, a strong leader becomes the universal safe choice.  Consider 1990 to about 2005, when the relational model was the chosen technology and the market a comfortable oligopoly ruled by Oracle, IBM, and Microsoft.

It’s taken 30+ years (and numerous prior failed attempts) to create a credible threat to the relational stasis, but the combination of three forces is proving to be a perfect storm:

  • Open source business models which cut costs by a factor of 10
  • Increasing amounts of data in unstructured data types which do not map well to the relational model.
  • A change in hardware topology from fewer/bigger computers to vast numbers of smaller ones.

While all technologies die slowly, the best days of relational databases are now clearly behind them.  Kids graduating college today see SQL the way I saw COBOL when I graduated from Berkeley in 1985.  Yes, COBOL was everywhere.  Yes, you could easily get a job programming it.  But it was not cool in any way whatsoever and it certainly was not the future.  It was more of a “trade school” language than interesting computer science.

The second reason I say this is because of my experience at Ingres, one of the original relational database providers which — despite growing from ~$30M to ~$250M during my tenure from 1985 to 1992 — never realized that it had lost the market and needed a plan B strategy.  In Ingres’s case (and with full 20/20 hindsight) there was a very viable plan B available:  as the leader in query optimization, Ingres could have easily focused exclusively on data warehousing at its dawn and become the leader in that segment as opposed to a loser in the overall market.  Yet, executives too often deny market reality, preferring to die in the name of “going big” as opposed to living (and prospering) in what could be seen as “going home.”  Runner-up vendors should think hard about the lessons of Ingres.

The last reason I say this is because of what I see as a change in venture capital. In the 1980s and 1990s VCs used to fund categories and cage-fights.  A new category would be identified, 5-10 companies would get created around it, each might raise $20-$30M in venture capital and then there would be one heck of a cage-fight for market leadership.

Today that seems less true.  VCs seem to prefer funding companies to categories.  (Does anyone know what category Box is in?  Does anyone care about any other vendor in it?)  Today, it seems that VCs fund fewer players, create fewer cage-fights, and prefer to invest much more, much later in a company that appears to be a clear winner.

This so-called “momentum investing” itself helps to anoint winners because if Box can raise $309M, then it doesn’t really matter how smart the folks at WatchDox are or how clever their technology.

MongoDB is in this enviable position in the next-generation (open source) NoSQL database market.  It has built a huge following, and that huge following is attracting a huge-r (sorry) following.  That cycle is attracting momentum investors who see MongoDB as the clear leader.  Those investors give MongoDB $150M.

By my math, if entirely invested in sales [2], that money could fund hiring some 500 sales teams who could generate maybe $400M a year in incremental revenue.  Which would in turn attract more users.  Which would make the community bigger.  Which would de-risk using the system.  Which would attract more users.
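For what it’s worth, here is one way that back-of-envelope could pencil out; both per-team figures are my assumptions, not anything MongoDB has disclosed:

```python
# Back-of-envelope only; the per-team cost and productivity numbers are assumptions.
raise_amount = 150_000_000            # the round
cost_per_sales_team = 300_000         # assumed annual fully-loaded cost per team
revenue_per_sales_team = 800_000      # assumed annual incremental revenue per team

teams = raise_amount // cost_per_sales_team
print(teams, f"${teams * revenue_per_sales_team / 1e6:.0f}M/year")   # 500 $400M/year
```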

And, quoting Vonnegut, so it goes.

# # #

Disclaimer:  I own shares in several of the companies mentioned herein as well as competitors who are not.  See my FAQ for more.

[1] Because I try to avoid writing about MarkLogic, I should be clear that while one can (and I have) argued that MarkLogic is a NoSQL system, my thinking has evolved over time and I now put much more weight on the open-source test as described in the “perfect storm” paragraph above.  Ergo, for the purposes of this post, I exclude MarkLogic entirely from the analysis because they are not in the open-source NoSQL market (despite the 451’s including them in their skills index).  Regarding MarkLogic, I have no public opinion and I do not view MongoDB’s or Hadoop’s success as definitively meaning anything either good or bad for them.

[2] Which, by the way, they have explicitly said they will not do.  They have said, “the company will use these funds to further invest in the core MongoDB project as well as in MongoDB Management Service, a suite of tools and services to operate MongoDB at scale. In addition, MongoDB will extend its efforts in supporting its growing user base throughout the world.”


My Slides from the MarkLogic Government Summit: “Relationertia”

Below please find an embedded copy of the slides I presented a few weeks back at the MarkLogic Government Summit at the Ritz-Carlton in Tyson’s Corner.

I had three fun quotes/concepts from this session.

First, I created a new word to describe all the reasons organizations use relational databases to try and solve problems for which they were never designed and at which they are suboptimal:  relationertia.  You know those reasons:

  • It’s safe
  • We have it already
  • It’s what we know
  • It’s free at the project level (if expensive at the agency one)

The fact is relational databases are about 40 years old and were never designed to solve some of the problems that government agencies are throwing at them.  To drive home the age point, I made a list of “other things” that happened in 1970, the year that Codd’s seminal paper was published.

  • Janis Joplin died
  • The Beatles broke up, after releasing Let It Be
  • The first 747 entered service
  • The first episode of All My Children aired

It was a long time ago.  (And that was the second fun thing.)

The third fun thing was to dust off one of my favorite old saws:  if your only tool’s a hammer, then every problem looks like a nail.  Or, as I more colorfully saw on Twitter today:  if your only tool’s a chainsaw, then every problem looks like a Zombie.

Applying this idea to relational databases, we come up with:

If your only data modeling element’s a table, then every problem looks like a column.

The slides are embedded below.


The Information Continuum and the Three Types of Subtly Semi-Structured Information

We generally refer to MarkLogic Server as an XML server, which is a special-purpose database management system (DBMS) for unstructured information.  This often sparks debate about the term “unstructured” and the information continuum in general.  Surprisingly, while both analysts and vendors frequently discuss the concept, the Wikipedia entry for information continuum is weak, and I couldn’t easily find a nice picture of it, so I decided to make my own.

The general idea that information spans a continuum with regard to structure is pretty much undisputed.  The placement of any given type of information on that continuum is more problematic.  While it seems clear that purchase orders are highly structured and that free text is not, the placement of, for example, email is more interesting.  Some might argue that email is unstructured.  In fact, only the body of an email is unstructured and there is plenty of metadata (e.g., from, send-to, date, subject) wrapping an email.  In addition, an email’s body actually does have latent structure — while it may not be explicit, you typically have a salutation followed by numerous paragraphs of text, a sign-off, a signature, and perhaps a legal footer.  Email is unquestionably semi-structured.

In fact, I believe that the vast majority of information is semi-structured.  PowerPoint decks have slides, slides have titles and bullets.  Contracts are typically Word documents, but have more-or-less standard sections.  Proposals are usually Word or PowerPoint documents that tend to have similar structures.  Even the humble tweet is semi-structured:  while the contents are ostensibly 140 unstructured characters, the anatomy of a tweet reveals lots of metadata (e.g., location) and even the contents contain some structural information (e.g., RT indicating re-tweet or #hashtags serving as topical metadata).

Now let’s consider XML content.  Some would argue that XML is definitionally structured.  But I’d say that an arbitrary set of documents all stored within <document> and </document> tags is only faux structured; it appears structured because it’s XML, but the XML is just used as a container.  A corpus of twenty 2,000-page medical textbooks in 6 different schemas is indeed structured, but not well structured.  To paraphrase an old saw about standards:  the nice thing about structures is that there are so many to choose from.  I believe that knowing content is marked up in XML reveals nothing about its structure, i.e., that XML-ness and structure are orthogonal.  Put differently, XML is simply a means of representing information.  The information represented may be highly structured (e.g., 100 purchase orders all in perfect adherence to a given schema) or highly unstructured (e.g., 20 documents only vaguely complying with 20 different schemas).

I have two primary beliefs about the information continuum:

  • The vast majority of information is semi-structured. There is relatively little highly structured and relatively little completely unstructured information out there.  Most information lies somewhere in the fat middle.  I overlaid a bell curve on top of the information continuum to reflect volume.
  • Even information that initially appears structured is often semi-structured.  I see three types of this subtly semi-structured information which, hopefully without being too cute, I’ll abbreviate as SSSI.  The three types are (1) schema as aspiration, (2)  time-varying schema, and (3) unknowable schema.

Let’s look at each of the three types more closely.

Schema as Aspiration

The first type of subtly semi-structured information (SSSI) is where a schema exists, but only notionally.  The schema itself is either poorly defined (actual quote:  “it is believed that this element is used for”) or well defined but not followed.  This is frequently the case with publishing and media companies.  Here are two free jokes that work well at any publishing conference:

  • Raise your hand if you have a standard schema.  Keep it up if your content actually adheres to it.
  • Oxymorons aside, how many of you have 3 or more “standard” schemas, 5 or more, … do  I hear 10?

These jokes are funny because of the state of the content.  This state is the result of two primary business trends:  (1) consolidation — most large publishers have been built through M&A thus inheriting numerous different standards, each of which may be only partly implemented — and (2) licensing — publishers frequently license content from numerous other sources, each with its own standard format.

Time-Varying Schema

The second case of SSSI is where you have a well-defined, enforced schema at any moment in time, but it keeps changing over time.  Typically this happens for one of two reasons:

  • The business reality that you’re modeling is changing.  For example, in 2009 Federal Sales was part of Eastern Sales but in 2010 it becomes its own division.  This makes comparison of Eastern results between 2009 and 2010 potentially difficult.  In BI circles, this is known as the slowly changing dimension problem.
  • Standards keep changing.  If you’re modeling information in a corporate- or industry-standard schema and that schema is changing, then your information becomes semi-structured because it is contained within multiple different schemas.  Sometimes you can avoid this by migrating all prior information to the current schema, but sometimes (e.g., massive data volumes, regulatory desire to not change existing records) you will not.

When viewed with a flash camera this information looks well structured.  When you look at the movie, you can clearly see that it’s not.

Unknowable Schema

The last case of SSSI is where you have an unknowable schema.  Consider terrorist tracking.  If you were to make a schema for a terrorist database, here are some of the attributes that spring to mind:  name, alias(es), address, former address(es), height, weight, hair color, eye color, member-of, enemy-of, friend-of, tattoos/markings.

Here are some problems with this:

  • Many of the attributes are multi-valued, such as alias or friend-of.  In a de-normalized approach, this means dealing with repeating group problems and creating N columns (e.g., alias, alias1, alias2, and up to the maximum number of aliases for any terrorist).  Normalization would take care of the repeating group but at the cost of creating a table for each multi-valued attribute and then having to join back to those tables when you run queries.  (One such real system ended up with 500 tables, with the result that no one could find anything.)
  • It is difficult to create a type for the tattoo attribute.  First, it’s multi-valued.  Second, while tattoos are sometimes images, they often contain text (e.g., Mom) and sometimes in a foreign language (e.g., 愛, the Chinese symbol for love).  Since you’re trying to secure the nation against threats, you don’t want to throw away any potentially valuable information, but it’s not obvious how to store this.
  • New attributes are coming all the time.  Say you get a shoe print on a suspect as he runs away.  You need to add a shoe-size attribute to the database.  Say a terrorist runs away and leaves a pair of eyeglasses.  Now we need to add eyeglass prescription.  My favorite is what’s called pocket litter.  You find a piece of paper in a person’s pocket and it has a number on it.  It could be a phone number, a lock combination, or maybe map coordinates.  You don’t know what it is — but again, since you don’t want to throw away any potentially valuable information — you have to find a place to store it.
  • Combining an enormous number of potential attributes with the reality that very few are known for most individuals creates two problems:  (1) you end up with a sparse table which is not well handled in most RDBMSs and (2) you end up hitting column limits.

Another example of unknowable schemas would be in financial services, modeling derivatives.   Because derivatives are sometimes long-lived instruments (e.g., 30 years) you may face the time-varying schema problem.  In addition, you have the unknowable schema problem because the industry is constantly creating new products.  First we had CDOs and CDSs on banks, then single-tranche CDOs, then CDSs on single-tranche CDOs, and then synthetic CDOs.  If this makes your head hurt in terms of understanding, then think for a minute about data modeling.  How are you going to store these complex products in a database?   And what are you going to do with the never-ending stream of new ones — last I heard they were considering selling derivatives on movies.

(As it turns out, XML is a great way to model both of these problems, as you can easily add new attributes on the fly and provide values only for the attributes you know.)
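As a minimal sketch of that point (the element names are hypothetical, not an actual MarkLogic schema), a document per suspect simply carries whatever attributes happen to be known, including repeating ones:

```python
# Sketch of modeling sparse, multi-valued, evolving attributes as an XML document.
# Each document carries only the attributes actually known for that suspect, and
# new element types can be added at any time without a schema migration.
import xml.etree.ElementTree as ET

suspect = ET.Element("suspect")
ET.SubElement(suspect, "name").text = "John Doe"
for alias in ["J. Doe", "JD"]:                       # multi-valued: just repeat the element
    ET.SubElement(suspect, "alias").text = alias
ET.SubElement(suspect, "tattoo", lang="zh").text = "愛"    # text-bearing tattoo, foreign language
ET.SubElement(suspect, "shoe-size").text = "11"            # attribute discovered later
ET.SubElement(suspect, "pocket-litter").text = "8675309"   # meaning unknown, stored anyway

print(ET.tostring(suspect, encoding="unicode"))
```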

To finish the post, I’ll revisit the statement I started with:  we generally refer to MarkLogic Server as an XML server, a special-purpose database management system (DBMS) for unstructured information.  Going forward, I think I’ll keep saying that because it’s simpler, but at the MarkLogic 201 level, the more precise statement is:  a special-purpose DBMS for semi-structured information.

There’s way more semi-structured information out there.  Realizing that information is semi-structured is sometimes subtle.  And semi-structured information is, in fact, the optimization point for our product.  So what’s MarkLogic in three concepts?  Speed, scale, and semi-structured information.