Category Archives: NoSQL

It Ain’t Easy Making Money in Open Source:  Thoughts on the Hortonworks S-1

It took me a week or so to get to it, but in this post I’ll take a dive into the Hortonworks S-1 filing in support of a proposed initial public offering (IPO) of their stock.

While Hadoop and big data are unarguably huge trends driving the industry and while the future of Hadoop looks very bright indeed, on reading the Hortonworks S-1, the reader is inexorably drawn to the conclusion that it’s hard to make money in open source, or more crassly, it’s hard to make money when you give the shit away.

This is a company that, in the past three quarters, burned $54.9M in cash from operations (on an $86.7M net loss) against $33M of support/services revenue, with $26M of the spending going to non-recoverable (i.e., donated) R&D.

Let’s take it top to bottom:

  • They have solid bankers: Goldman Sachs, Credit Suisse, and RBC are leading the underwriting with specialist support from Pacific Crest, Wells Fargo, and Blackstone.
  • They have an awkward, jargon-y, and arguably imprecise marketing slogan: “Enabling the Data-First Enterprise.”  I hate to be negative, but if you’re going to lose $10M a month, the least you can do is to invest in a proper agency to make a good slogan.
  • Their mission is clear: “to establish Hadoop as the foundational technology of the modern enterprise data architecture.”
  • Here’s their solution description: “our solution is an enterprise-grade data management platform built on a unique distribution of Apache Hadoop and powered by YARN, the next generation computing and resource management framework.”
  • They were founded in 2011, making them the youngest company I’ve seen file in quite some years. Back in the day (e.g., the 1990s) you might go public at age 3-5, but these days it’s more like age 10.
  • Their strategic partners include Hewlett-Packard, Microsoft, Rackspace, Red Hat, SAP, Teradata, and Yahoo.
  • Business model:  “consistent with our open source approach, we generally make the Hortonworks Data Platform available free of charge and derive the predominant amount of our revenue from customer fees from support subscription offerings and professional services.”  (Note to self:  if you’re going to do this, perhaps you shouldn’t have -35% services margins, but we’ll get to that later.)
  • Huge market opportunity: “According to Allied Market Research, the global Hadoop market spanning hardware, software and services is expected to grow from $2.0 billion in 2013 to $50.2 billion by 2020, representing a compound annual growth rate, or CAGR, of 58%.”  The vastness of the market opportunity is unquestioned.
  • Open source purists: “We are committed to serving the Apache Software Foundation open source ecosystem and to sharing all of our product developments with the open source community.”  This one’s big because while it’s certainly strategic and it certainly earns them points within the Hadoop community, it chucks out one of the better ways to make money in open source:  proprietary versions / extensions.  So, right or wrong, it’s big.
  • Headcount:  The company has increased the number of full-time employees from 171 at December 31, 2012 to 524 at September 30, 2014.
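As a quick sanity check on the 58% CAGR in the Allied Market Research quote above, here is a minimal sketch using the standard compound-growth formula. The function name `cagr` is my own; the dollar figures and years are the ones quoted from the S-1.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values over `years` years."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the S-1: $2.0B in 2013 growing to $50.2B by 2020.
hadoop_cagr = cagr(2.0, 50.2, 2020 - 2013)
print(f"Implied CAGR: {hadoop_cagr:.1%}")
```

The result lands at roughly 58%, so the quoted growth rate is consistent with the quoted endpoints.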

Before diving into the financials, let me give readers a chance to review open source business models (Wikipedia, Kellblog) if they so desire, before making the (generally true but probably slightly inaccurate) assertion:  the only open source company that’s ever made money (at scale) is Red Hat.

Sure, there have been a few great exits.  Who can forget MySQL selling to Sun for $1B?  Or VMware buying SpringSource for $420M?  Or Red Hat buying JBoss for $350M+?  (Hortonworks CEO Rob Bearden was involved in both of the latter two deals.)   Or Citrix buying XenSource for $500M?

But after those deals, I can’t name too many others.  And I doubt any of those companies was making money.

In my mind there are two common things that go wrong in open source:

  • The market is too small. In my estimation open source compresses the market size by 10-20x.  So if you want to compress the $30B DBMS market 10x, you can still build several nice companies.  However, if you want to compress the $1B enterprise search market by 10x, there’s not much room to build anything.  That’s why there is no Red Hat of Lucene or Solr, despite their enormous popularity in search.    For open source to work, you need to be in a huge market.
  • People don’t renew. No matter which specific open source business model you’re using, the general play is to sell a subscription to <something> that complements your offering.  It might be a hardened/certified version of the open source product.  It might be additions to it that you keep proprietary forever or, in a hardcover/paperback analogy, roll back into the core open source projects with a 24 month lag.  It might be simply technical support.  Or, it might be “admission to the club” as one open source CEO friend of mine used to say:  you get to use our extensions, our support, our community, etc.  But no matter what you’re selling, the key is to get renewals.  The risk is that the value of your extensions decreases over time and/or customers become self-sufficient.    This was another problem with Lucene.  It was so good that folks just didn’t need much help and if they did, it was only for a year or so.

So Why Does Red Hat work?

Red Hat uses a professional open source business model applied primarily to two low-level infrastructure categories:  operating systems and later middleware.   As general rules:

  • The lower-level the category the more customers want support on it.
  • The more you can commoditize the layers below you, the more the market likes it. Red Hat does this for servers.
  • The lower-level the category the more the market actually “wants” it standardized in order to minimize entropy. This is why low-level infrastructure categories become natural monopolies or oligopolies.

And Red Hat set the right price point and cost structure.  In their most recent 10-Q, you can see they have 85% gross margins and about a 10% return on sales.  Red Hat nailed it.

But, if you believe this excellent post by Andreessen Horowitz partner Peter Levine, There Will Never Be Another Red Hat.  As part of his argument Levine reminds us that while Red Hat may be a giant among open source vendors, among general technology vendors they are relatively small.  See the chart below for the market capitalization compared to some megavendors.

[Figure: Red Hat’s market capitalization compared to megavendors]

Now this might give pause to the Hadoop crowd with so many firms vying to be the Red Hat of Hadoop.  But that hasn’t stopped the money from flying in.  Per Crunchbase, Cloudera has raised a stunning $1.2B in venture capital, Hortonworks has raised $248M, and MapR has raised $178M.  In the related Cassandra market, DataStax has raised $190M.  MongoDB (with its own open source DBMS) has raised $231M.  That’s about $2B of venture capital invested in next-generation open source databases.

While I’m all for open source, disruption, and next-generation databases (recall I ran MarkLogic for six years), I do find the raw amount of capital invested pretty crazy.   Yes, it’s a huge market today.  Yes, it’s exploding along with data volumes and the new incorporation of unstructured data.  But we will be compressing it 10-20x as part of open-source-ization.  And, given all the capital these guys are raising – and presumably burning (after all, why else would you raise it?) – I can assure you that no one’s making money.

Hortonworks certainly isn’t — which serves as a good segue to dive into the financials.  Here’s the P&L, which I’ve cleaned up from the S-1 and color-annotated.

[Figure: Hortonworks P&L, cleaned up from the S-1 and color-annotated]

  •  $33M in trailing three quarter (T3Q) revenues ($41.5M in TTM, though not on this chart)
  • 109% growth in T3Q revenues
  • 85% gross margins on support
  • Horrific -35% gross margins on services which, given the large relative size of the services business (43% of revenues), crush overall gross margins down to 34%
  • More scarily, this calls into question the veracity of the 85% subscription gross margins. I recall reading in the S-1 that they currently lack VSOE for subscription support, which means that they’ve not yet clearly demonstrated what is really support revenue vs. professional services revenue.  [See footnote 1]
  • $26M in T3Q R&D expense.  Per their policy all that value is going straight back to the open source project, which raises the question:  will they ever see a return on it?
  • Net loss of $86.7M in T3Q, or nearly $10M per month
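To see how a money-losing services line drags down the overall figure, here is a minimal sketch of a revenue-weighted (blended) gross margin. The 43% services share and the 85% / -35% margins come from the bullets above; the 57% support share is my assumption that support is simply the remainder of revenue, and the function name is illustrative.

```python
def blended_gross_margin(lines):
    """Revenue-weighted gross margin from (revenue_share, gross_margin) pairs."""
    total_share = sum(share for share, _ in lines)
    return sum(share * margin for share, margin in lines) / total_share


support = (0.57, 0.85)    # assumed: the 57% of revenue that is not services, at 85% margin
services = (0.43, -0.35)  # 43% of revenue at -35% margin

overall = blended_gross_margin([support, services])
print(f"Blended gross margin: {overall:.1%}")
```

With these rounded inputs the blend comes out within a point of the 34% shown on the P&L; the small gap is presumably rounding in the percentages used here.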

Here are some other interesting tidbits from the S-1:

  • Of the 524 full-time employees as of 9/30/14, there are 56 who are non-USA-based
  • CEO makes $250K/year in base salary cash compensation with no bonus in FY13 (maybe they missed plan despite strong growth?)
  • Prior to the offering CEO owns 6.8% of the stock, a pretty nice percentage, but he was kind of a founder
  • Benchmark owns 18.7%
  • Yahoo owns 19.6%
  • Index owns 9.5%
  • $54.9M cash burn from operations in T3Q, $6.1M per month
  • Number of support subscription customers has grown from 54 to 233 over the year from 9/30/13 to 9/30/14
  • A single customer went from representing 47% of revenues for the T3Q ending 9/30/13 down to 22% for the T3Q ending 9/30/14.  That’s a lot of revenue concentration in one customer (who is identified as “Customer A,” but who I believe is Microsoft based on some text in the risk factors).

Here’s a chart I made of the increase in value in the preferred stock.  A ten-bagger in 3 years.

[Figure: increase in value of Hortonworks preferred stock]

One notable thing about the prospectus is that they show “gross billings,” a derived metric that financial analysts use to try to determine bookings in a subscription company.  Here’s what they present:

[Figure: gross billings as presented in the S-1]

While gross billings is not a bad stab at bookings, the two metrics can diverge — primarily when the duration of prepaid contracts changes.  Deferred revenue can shoot up when sales sells longer prepaid contracts to a given number of customers as opposed to the same-length contract to more of them.  Conversely, if happy customers reduce prepaid contract duration to save cash in a downturn, it can actually help the vendor’s financial performance (they will get the renewals because the customer is happy and not discount in return for multi-year), but deferred revenue will drop as will gross billings.  In some ways, unless prepaid contract duration is held equal, gross billings is more of a dangerous metric than anything else.  Nevertheless Hortonworks is showing it as an implied metric of bookings or orders and the growth is quite impressive.
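The divergence described above can be sketched with a toy model. It assumes the usual analyst derivation (gross billings = revenue + change in deferred revenue), that every contract is signed and fully prepaid on the first day of the quarter, and that "bookings" means annualized contract value. All names and dollar figures are hypothetical, not taken from the S-1.

```python
from dataclasses import dataclass


@dataclass
class Quarter:
    new_customers: int
    annual_contract_value: float  # per customer, in dollars (hypothetical)
    prepaid_years: int            # years each customer pays for up front


def gross_billings(q: Quarter) -> float:
    """Revenue plus the change in deferred revenue, starting from a zero balance."""
    invoiced = q.new_customers * q.annual_contract_value * q.prepaid_years
    revenue = q.new_customers * q.annual_contract_value / 4  # one quarter recognized
    deferred_change = invoiced - revenue                      # billed but not yet recognized
    return revenue + deferred_change


def annualized_bookings(q: Quarter) -> float:
    """New annualized contract value signed in the quarter."""
    return q.new_customers * q.annual_contract_value


more_customers = Quarter(new_customers=10, annual_contract_value=100_000, prepaid_years=1)
longer_prepay = Quarter(new_customers=5, annual_contract_value=100_000, prepaid_years=2)

for name, q in [("more customers", more_customers), ("longer prepay", longer_prepay)]:
    print(f"{name}: billings ${gross_billings(q):,.0f}, bookings ${annualized_bookings(q):,.0f}")
```

In this toy case both quarters show identical gross billings, yet the second signed only half as much new annualized business, which is why the metric is only meaningful when prepaid contract duration is held constant.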

Sales and Marketing Efficiency

Let’s now look at sales and marketing efficiency, not using the CAC which is too hard to calculate for public companies but using JMP’s sales and marketing efficiency metric = (gross profit [current] – gross profit [prior]) / S&M expense [prior].

On this metric Hortonworks scores a 41% for the T3Q ended 9/30/14 compared to the same period in 2013.  JMP considers anything above 50% efficient, so they are coming in low on this metric.  However, JMP also makes a nice chart that correlates S&M efficiency to growth and I’ve roughly hacked Hortonworks onto it here:

[Figure: JMP’s chart correlating S&M efficiency to growth, with Hortonworks roughly added]
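For readers who want to run the JMP metric on other companies, here is a minimal sketch of the formula as stated above. The input figures are invented round numbers for illustration only (they are not Hortonworks figures); the 50% threshold is the one JMP uses per the text.

```python
JMP_EFFICIENT_THRESHOLD = 0.50  # per the text, JMP considers anything above 50% efficient


def sm_efficiency(gross_profit_current: float,
                  gross_profit_prior: float,
                  sm_expense_prior: float) -> float:
    """Incremental gross profit divided by the prior period's S&M expense."""
    return (gross_profit_current - gross_profit_prior) / sm_expense_prior


# Hypothetical company, $M: gross profit grew from 20 to 30 on 25 of prior-period S&M spend.
score = sm_efficiency(gross_profit_current=30.0, gross_profit_prior=20.0, sm_expense_prior=25.0)
print(f"S&M efficiency: {score:.0%}, efficient: {score > JMP_EFFICIENT_THRESHOLD}")
```

Note that the subtraction happens before the division; dropping the parentheses would divide only the prior-period gross profit by S&M expense and give a meaningless result.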

I’ll conclude the main body of the post by looking at their dollar-based expansion rate.  Here’s a long quote from the S-1:

Dollar-Based Net Expansion Rate.    We believe that our ability to retain our customers and expand their support subscription revenue over time will be an indicator of the stability of our revenue base and the long-term value of our customer relationships. Maintaining customer relationships allows us to sustain and increase revenue to the extent customers maintain or increase the number of nodes, data under management and/or the scope of the support subscription agreements. To date, only a small percentage of our customer agreements has reached the end of their original terms and, as a result, we have not observed a large enough sample of renewals to derive meaningful conclusions. Based on our limited experience, we observed a dollar-based net expansion rate of 125% as of September 30, 2014. We calculate dollar-based net expansion rate as of a given date as the aggregate annualized subscription contract value as of that date from those customers that were also customers as of the date 12 months prior, divided by the aggregate annualized subscription contract value from all customers as of the date 12 months prior. We calculate annualized support subscription contract value for each support subscription customer as the total subscription contract value as of the reporting date divided by the number of years for which the support subscription customer is under contract as of such date.

This is probably the most critical section of the prospectus.  We know Hortonworks can grow.  We know they have a huge market.  We know that market is huge enough to be compressed 10-20x and still have room to create a great company.  What we don’t know is:  will people renew?   As we discussed above, we know it’s one of the great risks of open source.

Hortonworks pretty clearly answers the question with “we don’t know” in the above quote.  There is simply not enough data; not enough contracts have come up for renewal to get a meaningful renewal rate.  I view the early 125% calculation as a very good sign.  And intuition suggests that, if their offering is quality, people will renew because we are talking low-level, critical infrastructure and we know that enterprises are willing to pay to have that supported.
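The dollar-based net expansion rate defined in the S-1 quote above can be sketched as follows. This is my reading of the definition, not Hortonworks’ actual calculation; the customer names and contract values are hypothetical.

```python
def annualized_value(total_contract_value: float, years_under_contract: float) -> float:
    """Total subscription contract value divided by the years under contract."""
    return total_contract_value / years_under_contract


def net_expansion_rate(prior: dict, current: dict) -> float:
    """Annualized value today from customers who were also customers 12 months ago,
    divided by the annualized value of all customers 12 months ago."""
    retained = sum(value for customer, value in current.items() if customer in prior)
    return retained / sum(prior.values())


# Hypothetical book of business, annualized dollars per customer.
twelve_months_ago = {"A": 100_000, "B": 60_000, "C": 40_000}
today = {
    "A": annualized_value(300_000, 2),  # expanded into a two-year, $300K contract
    "B": 60_000,                        # renewed flat
    "D": 80_000,                        # new customer: excluded from the numerator
}                                       # customer C did not renew

rate = net_expansion_rate(twelve_months_ago, today)
print(f"Dollar-based net expansion rate: {rate:.0%}")
```

Because churned customers stay in the denominator while new customers are excluded from the numerator, the metric nets expansion against churn within the year-ago cohort, which is why it serves as a proxy for renewals.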

# # #

Appendix

In the appendix below, I’ll include a few interesting sections of the S-1 without any editorial comments.

A significant portion of our revenue has been concentrated among a relatively small number of large customers. For example, Microsoft Corporation historically accounted for 55.3% of our total revenue for the year ended April 30, 2013, 37.8% of our total revenue for the eight months ended December 31, 2013 and 22.4% of our total revenue for the nine months ended September 30, 2014. The revenue from our three largest customers as a group accounted for 71.0% of our total revenue for the year ended April 30, 2013, 50.5% of our total revenue for the eight months ended December 31, 2013 and 37.4% of our total revenue for the nine months ended September 30, 2014. While we expect that the revenue from our largest customers will decrease over time as a percentage of our total revenue as we generate more revenue from other customers, we expect that revenue from a relatively small group of customers will continue to account for a significant portion of our revenue, at least in the near term. Our customer agreements generally do not contain long-term commitments from our customers, and our customers may be able to terminate their agreements with us prior to expiration of the term. For example, the current term of our agreement with Microsoft expires in July 2015, and automatically renews thereafter for two successive twelve-month periods unless terminated earlier. The agreement may be terminated by Microsoft prior to the end of its term. Accordingly, the agreement with Microsoft may not continue for any specific period of time.

# # #

We do not currently have vendor-specific objective evidence of fair value for support subscription offerings, and we may offer certain contractual provisions to our customers that result in delayed recognition of revenue under GAAP, which could cause our results of operations to fluctuate significantly from period-to-period in ways that do not correlate with our underlying business performance.

In the course of our selling efforts, we typically enter into sales arrangements pursuant to which we provide support subscription offerings and professional services. We refer to each individual product or service as an “element” of the overall sales arrangement. These arrangements typically require us to deliver particular elements in a future period. We apply software revenue recognition rules under U.S. generally accepted accounting principles, or GAAP. In certain cases, when we enter into more than one contract with a single customer, the group of contracts may be so closely related that they are viewed under GAAP as one multiple-element arrangement for purposes of determining the appropriate amount and timing of revenue recognition. As we discuss further in “Management’s Discussion and Analysis of Financial Condition and Results of Operations—Critical Accounting Policies and Estimates—Revenue Recognition,” because we do not have VSOE for our support subscription offerings, and because we may offer certain contractual provisions to our customers, such as delivery of support subscription offerings and professional services, or specified functionality, or because multiple contracts signed in different periods may be viewed as giving rise to multiple elements of a single arrangement, we may be required under GAAP to defer revenue to future periods. Typically, for arrangements providing for support subscription offerings and professional services, we have recognized as revenue the entire arrangement fee ratably over the subscription period, although the appropriate timing of revenue recognition must be evaluated on an arrangement-by-arrangement basis and may differ from arrangement to arrangement. If we are unexpectedly required to defer revenue to future periods for a significant portion of our sales, our revenue for a particular period could fall below our expectations or those of securities analysts and investors, resulting in a decline in our stock price.

 # # #

We generate revenue by selling support subscription offerings and professional services. Our support subscription agreements are typically annual arrangements. We price our support subscription offerings based on the number of servers in a cluster, or nodes, data under management and/or the scope of support provided. Accordingly, our support subscription revenue varies depending on the scale of our customers’ deployments and the scope of the support agreement.

 Our early growth strategy has been aimed at acquiring customers for our support subscription offerings via a direct sales force and delivering consulting services. As we grow our business, our longer-term strategy will be to expand our partner network and leverage our partners to deliver a larger proportion of professional services to our customers on our behalf. The implementation of this strategy is expected to result in an increase in upfront costs in order to establish and further cultivate such strategic partnerships, but we expect that it will increase gross margins in the long term as the percentage of our revenue derived from professional services, which has a lower gross margin than our support subscriptions, decreases.

 # # #

Deferred Revenue and Backlog

Our deferred revenue, which consists of billed but unrecognized revenue, was $47.7 million as of September 30, 2014.

Our total backlog, which we define as including both cancellable and non-cancellable portions of our customer agreements that we have not yet billed, was $17.3 million as of September 30, 2014. The timing of our invoices to our customers is a negotiated term and thus varies among our support subscription agreements. For multiple-year agreements, it is common for us to invoice an initial amount at contract signing followed by subsequent annual invoices. At any point in the contract term, there can be amounts that we have not yet been contractually able to invoice. Until such time as these amounts are invoiced, we do not recognize them as revenue, deferred revenue or elsewhere in our consolidated financial statements. The change in backlog that results from changes in the average non-cancelable term of our support subscription arrangements may not be an indicator of the likelihood of renewal or expected future revenue, and therefore we do not utilize backlog as a key management metric internally and do not believe that it is a meaningful measurement of our future revenue.

 # # #

We employ a differentiated approach in that we are committed to serving the Apache Software Foundation open source ecosystem and to sharing all of our product developments with the open source community. We support the community for open source Hadoop, and employ a large number of core committers to the various Enterprise Grade Hadoop projects. We believe that keeping our business model free from architecture design conflicts that could limit the ultimate success of our customers in leveraging the benefits of Hadoop at scale is a significant competitive advantage.

 # # #

International Data Corporation, or IDC, estimates that data will grow exponentially in the next decade, from 2.8 zettabytes, or ZB, of data in 2012 to 40 ZBs by 2020. This increase in data volume is forcing enterprises to upgrade their data center architecture and better equip themselves both to store and to extract value from vast amounts of data. According to IDG Enterprise’s Big Data Survey, by late 2014, 31% of enterprises with annual revenues of $1 billion or more expect to manage more than one PB of data. In comparison, as of March 2014 the Library of Congress had collected only 525 TBs of web archive data, equal to approximately half a petabyte and two million times smaller than a zettabyte.

# # #

Footnotes:

[1]  Thinking more about this, while I’m not an accountant, I think the lack of VSOE has the following P&L impact:  it means that in contracts that mix professional services and support they must recognize all the revenue ratably over the contract.  That’s fine for the support revenue, but it should have the effect of pushing out services revenue, artificially depressing services gross margins.  Say, for example, you did a $240K deal that was $120K of each.  The support should be recognized at $30K/quarter.  However, if the consulting is delivered in the first six months, it should be recognized at $60K/quarter for the first and second quarters and $0 in the third and fourth.  Since, normally, accountants will take the services costs up-front, this should have the effect of hurting services margins by taking the costs as delivered but the revenue over a longer period.

[2] See here for generic disclaimers and please note that in the past I have served as an advisor to MongoDB.
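The timing effect described in footnote [1] can be sketched with the footnote’s own $240K example. The $120K services fee and the quarterly revenue patterns come from the footnote; the $100K of services cost, spread over the first two quarters, is a hypothetical number I added so that a margin can be computed, and the one-year contract term is implied by the $30K/quarter support figure.

```python
QUARTERS = 4
SERVICES_FEE = 120_000  # from the footnote: half of the $240K deal

# Hypothetical: $100K of consulting cost, all incurred during delivery in Q1 and Q2.
services_cost = [50_000, 50_000, 0, 0]

# With VSOE, services revenue follows delivery: $60K in each of the first two quarters.
revenue_as_delivered = [60_000, 60_000, 0, 0]
# Without VSOE, the fee is recognized ratably over the contract: $30K per quarter.
revenue_ratable = [SERVICES_FEE / QUARTERS] * QUARTERS


def gross_margin(revenue, cost, quarters):
    """Cumulative gross margin over the first `quarters` quarters."""
    total_revenue = sum(revenue[:quarters])
    total_cost = sum(cost[:quarters])
    return (total_revenue - total_cost) / total_revenue


for n in (1, 3, 4):
    delivered = gross_margin(revenue_as_delivered, services_cost, n)
    ratable = gross_margin(revenue_ratable, services_cost, n)
    print(f"after {n} quarter(s): as delivered {delivered:.0%}, ratable {ratable:.0%}")
```

Under these assumptions the two methods agree once the full year is recognized, but any window shorter than the contract (such as a three-quarter view) shows a negative services margin under ratable recognition even though the engagement is profitable.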

Thoughts on MongoDB’s Humongous $150M Round

Two weeks ago MongoDB, formerly known as 10gen, announced a massive $150M funding round, said to be the largest in the history of databases, led by Fidelity, Altimeter, and Salesforce.com with participation from existing investors Intel, NEA, Red Hat, and Sequoia.  This brings the total capital raised by MongoDB to $231M, making it the best-funded database / big data technology of all time.

What does this mean?

The two winners of the next-generation NoSQL database wars have been decided:  MongoDB and Hadoop.  The faster the runners-up figure that out, the faster they can carve off sensible niches on the periphery of the market instead of running like decapitated chickens in the middle. [1]

The first reason I say this is because of the increasing returns (or, network effects) in platform markets.  These effects are weak to non-existent in applications markets, but in core platform markets like databases, the rich invariably get richer.  Why?

  • The more people that use a database, the easier it is to find people to staff teams so the more likely you are to use it.
  • The more people that use a database, the richer the community of people you can leverage to get help.
  • The more people that build applications atop a database, the less perceived risk there is in building a new application atop it.
  • The more people that use a database, the more jobs there are around it, which attracts more people to learn how to use it.
  • The more people that use a database, the cooler it is seen to be which in turn attracts more people to want to learn it.
  • The more people that use a database, the more likely major universities are to teach how to use it in their computer science departments.

To see just how strong MongoDB has become in this regard, see here.  My favorite analysis is the 451 Group’s LinkedIn NoSQL skills analysis, below.

[Figure: the 451 Group’s LinkedIn NoSQL skills analysis]

This is why betting on horizontal underdogs in core platform markets is rarely a good idea.  At some point, best technology or not, a strong leader becomes the universal safe choice.  Consider 1990 to about 2005 where the relational model was the chosen technology and the market a comfortable oligopoly ruled by Oracle, IBM, and Microsoft.

It’s taken 30+ years (and numerous prior failed attempts) to create a credible threat to the relational stasis, but the combination of three forces is proving to be a perfect storm:

  • Open source business models which cut costs by a factor of 10.
  • Increasing amounts of data in unstructured data types which do not map well to the relational model.
  • A change in hardware topology from fewer/bigger computers to vast numbers of smaller ones.

While all technologies die slowly, the best days of relational databases are now clearly behind them.  Kids graduating college today see SQL the way I saw COBOL when I graduated from Berkeley in 1985.  Yes, COBOL was everywhere.  Yes, you could easily get a job programming it.  But it was not cool in any way whatsoever and it certainly was not the future.  It was more of a “trade school” language than interesting computer science.

The second reason I say this is because of my experience at Ingres, one of the original relational database providers which — despite growing from ~$30M to ~$250M during my tenure from 1985 to 1992 — never realized that it had lost the market and needed a plan B strategy.  In Ingres’s case (and with full 20/20 hindsight) there was a very viable plan B available:  as the leader in query optimization, Ingres could have easily focused exclusively on data warehousing at its dawn and become the leader in that segment as opposed to a loser in the overall market.  Yet, executives too often deny market reality, preferring to die in the name of “going big” as opposed to living (and prospering) in what could be seen as “going home.”  Runner-up vendors should think hard about the lessons of Ingres.

The last reason I say this is because of what I see as a change in venture capital. In the 1980s and 1990s VCs used to fund categories and cage-fights.  A new category would be identified, 5-10 companies would get created around it, each might raise $20-$30M in venture capital and then there would be one heck of a cage-fight for market leadership.

Today that seems less true.  VCs seem to prefer funding companies to categories.  (Does anyone know what category Box is in?  Does anyone care about any other vendor in it?)  Today, it seems that VCs fund fewer players, create fewer cage-fights, and prefer to invest much more, much later in a company that appears to be a clear winner.

This so-called “momentum investing” itself helps to anoint winners because if Box can raise $309M, then it doesn’t really matter how smart the folks at WatchDox are or how clever their technology.

MongoDB is in this enviable position in the next-generation (open source) NoSQL database market.  It has built a huge following, and that huge following is attracting a huge-r (sorry) following.  That cycle is attracting momentum investors who see MongoDB as the clear leader.  Those investors give MongoDB $150M.

By my math, if entirely invested in sales [2], that money could fund hiring some 500 sales teams who could generate maybe $400M a year in incremental revenue.  Which would in turn attract more users.  Which would make the community bigger.  Which would de-risk using the system.  Which would attract more users.

And, quoting Vonnegut, so it goes.

# # #

Disclaimer:  I own shares in several of the companies mentioned herein as well as competitors who are not.  See my FAQ for more.

[1] Because I try to avoid writing about MarkLogic, I should be clear that while one can (and I have) argued that MarkLogic is a NoSQL system, my thinking has evolved over time and I now put much more weight on the open-source test as described in the “perfect storm” paragraph above.  Ergo, for the purposes of this post, I exclude MarkLogic entirely from the analysis because they are not in the open-source NoSQL market (despite the 451’s including them in their skills index).  Regarding MarkLogic, I have no public opinion and I do not view MongoDB’s or Hadoop’s success as definitively meaning anything either good or bad for them.

[2] Which, by the way, they have explicitly said they will not do.  They have said, “the company will use these funds to further invest in the core MongoDB project as well as in MongoDB Management Service, a suite of tools and services to operate MongoDB at scale. In addition, MongoDB will extend its efforts in supporting its growing user base throughout the world.”

Interview by SandHill.com on Big Data, Cloud Computing, and the Future of IT

[This is a re-post of a recent interview with me, authored by Darren Cunningham of Informatica.  The post originally appeared on SandHill.com where Darren writes a column on Cloud Computing.]

---

The Cloud in Action

Big Data, Cloud Computing and Industry Perspectives with Dave Kellogg

BY Darren Cunningham

I had the pleasure of working with Dave Kellogg early in my marketing career and continue to learn from him as a regular subscriber to his popular blog, Kellblog. A seasoned Silicon Valley executive, Dave has been a board member (Aster Data), CEO (MarkLogic), CMO (Business Objects) and VP of Marketing (Versant and Ingres). I recently sat down with Dave to discuss industry trends. As always, he didn’t hold back.

Dave, you’ve written a lot about “Big Data” on your blog. Why is it such a hot topic in the world of data management?

First, I think Big Data is a hot topic because it represents the first time in about 30 years that people are rethinking databases. Literally, since about 1980 people haven’t had to think much about databases. If you were an SMB, you went SQL Server; if you were enterprise, you’d go Oracle or IBM depending on your enterprise preferences. But in terms of technology, to paraphrase Henry Ford: any color you want, as long as it’s relational.

Overall, I think Big Data is hot for three reasons:

  • Major new innovation is finally happening with databases for the first time in three decades.
  • Hardware architectures have changed — people want to scale horizontally like Google.
  • We are experiencing a serious explosion in the amount of data people are analyzing and managing. Machine-generated data, the exhaust of the Web, is driving a lot of it.

I think Big Data is challenging on many fronts from the cool (e.g., analytics and query optimization), to the practical (e.g., horizontal scaling), to the mundane (e.g., backup and recovery).

What’s the intersection with Cloud Computing?

I think when people say cloud computing, they mean one of several things:

  • SaaS: The use of software applications or platforms as services.
  • Dynamic scaling: My favorite example of this is Britain’s Got Talent, which uses Cassandra. Most of the time they have nothing to do. Then one night half the country is trying to vote for their favorite contestants.
  • Service orientation: The ability to weave together applications by calling various cloud services — in effect using a series of cloud services as a platform on which to build applications.

I think Big Data intersects with cloud in several ways. First, the people running cloud services are dealing with Big Data problems. They are hosting thousands of customers’ databases and generating log records from hundreds of thousands of users. I also think Big Data analytics are very dynamic loads. One minute you want nothing, then suddenly you need to throw 100 servers at a complex problem for several hours.

How do you see these trends changing the role of IT?

I think corporate IT is constantly evolving because smart corporations want their internal resources focused on activities that they can’t buy elsewhere and that generate competitive advantage for the business.

IT used to buy and run computers. Then they used to build and run applications. Then they focused on weaving together packaged applications. Going forward, they will focus on tightly integrating cloud-based services. They will also continue to focus on company-proprietary analytics used to gain competitive advantage.

The other trend driving IT is consumerization. The Web sets expectations for functionality, user interface and quality that corporate IT must meet with internal systems. The bar has gone way up – people won’t tolerate old-school ERP-style interfaces at work when they’re used to Facebook or Yelp.

What does that mean for technology sales and marketing?

If Mr. McGuire in The Graduate were dishing out advice today, instead of saying “plastics,” he’d say “data science.” More and more companies will use data scientists to analyze their business and drive tactical operations. First you need to gather a whole bunch of data about your operations and customers. Then you need to throw world-class data analysts at it to get business value and to be sure you don’t draw false conclusions – e.g., mixing causality with correlation.

Today, most companies have their sales departments on salesforce.com. Leading marketing departments are on Marketo or Eloqua, but most marketers still don’t have much technology backing them. Going forward you will see a whole class of analytics applications vendors providing advanced analytics for Salesforce (e.g., Cloud9, Good Data) and the marketing automation vendors will move beyond lead incubation into providing overall marketing suites. I expect Marketo or Eloqua to try to do for the chief marketing officer what SuccessFactors did for the chief people officer – and if they don’t, then there’s a real opportunity for someone else.

Speaking of all things cloud, you often write about Silicon Valley trends. How would you characterize what’s going on in the market right now?

From my perception, the Silicon Valley innovation engine is running full out. Top VCs are raising new funds. I meet a few new startups every day. Of late, I’ve met fascinating companies in next-generation business intelligence, analytics, Big Data, social media monitoring and exploitation and Web application development. One of the more interesting things I’ve found is a VC fund dedicated to big data - IA Ventures (in New York). When I heard about them, I thought: oh, lots of Big Data infrastructure and platform technologies. Then I spent some time and realized that most of their portfolio is about exploiting new Big Data infrastructure technologies via vertical applications. That was really interesting.

People will debate whether we’re in a mini tech bubble or a social networking-specific bubble. Who knows? I just read an article in The Wall Street Journal that argues a $140B valuation for Facebook is realistic, and it was fairly convincing. So you can debate the bubble issue but you can’t debate that the IPO market has been closed for a long time. Now it is starting to open, and that’s a huge change in Silicon Valley.

Entrepreneurs have historically dreamed of creating $1B independent companies. I’d say for most of the last decade they’ve dreamed of getting bought for 5-10x revenues. Michael Arrington had a great quote a while back saying that “an entire generation of entrepreneurs [has been lost] building dipshit companies that sell to Google for $25M.” I think those days are over. When the IPO window opens, people dream of building stand-alone companies.

What advice do you have for both entrepreneurs and IT veterans?

Don’t build or run things that you can buy or rent. If you follow that mantra, you will follow market trends, and always stay at the right stack-layer to ensure that you are adding value as opposed to leveraging old skill sets. While you may know how to run a Big Data center, you can now rent time in one more cost-effectively. So either go work for a company that runs data centers (e.g., Equinix) if that’s your pleasure, or go leverage the people who do. Put differently, don’t be static. If you’re still using skills you learned 10 years ago, make sure that you’re not teeing yourself up to get left behind.

As always, great advice, Dave! Thank you.

Darren Cunningham is VP of Marketing for Informatica Cloud.

[Notes:  Minor changes made from the SandHill post.  I added emphasis via bolding and I corrected the attribution of the famous lines “plastics” from The Graduate.  It was not Mr. Robinson, but Mr. McGuire, who said it.]

Open Source Business Models, Revisited

I had breakfast the other day with Mike Olson, CEO of Hadoop ecosystem leader, Cloudera.  We met because we run in similar circles in data management land and because Mike had some quibbles with my post, The Open Source Software Paradox.

My premise was that open source presents a fundamental paradox:   the larger the community, the better the software, and the less people need to buy support for it.  Thus, open source market opportunities were inherently flawed / paradoxical because you could only sell services for projects that were not terribly successful.  Simply put,

You can have a large community who doesn’t need to buy from you or a small community who does.

I think Mike’s overall take on my post was “1990s thinking” because things have evolved over the past decade and businesses now try to monetize open source opportunities in more sophisticated ways.  This approach doesn’t actually contradict the paradox I observed, but instead looks  for more creative ways around it.

Another key point Mike made was that open source is not a business model.  I agree.  Open source is a way of developing software.  There are many different possible business models for monetizing open source projects.

Rather than attempt to replay the back-and-forth of our discussion, I will simply list my revised take on the 4 basic open source business models.

  • Professional services.  The most basic way to make money around an open source project is to offer related consulting (and training) services.  For example, ThinkBigAnalytics seems to be building a consulting business around Hadoop and NoSQL databases (most of which are also open source).
  • Dual licensing.  A vendor offers (1) a free version under the GPL license which freely enables internal use but contaminates on redistribution and (2) a paid version under a different license that doesn’t include GPL’s copyleft provisions.  This model reeks of the vig:  you pressure people, under threat of having to open source their own system, into moving to the non-GPL version.  In addition, since SaaS or cloud services use but don’t redistribute software, this approach loses its teeth in the SaaS / cloud world.
  • Open core.  A vendor promotes an open source version of a system and makes money by extending it with proprietary additions.  In this model, the vendor “has some IP” and is not totally dependent on support subscriptions which may or may not be renewed.  Cloudera is executing this strategy by offering both (1) the Cloudera Distribution on an Apache license as well as (2) Cloudera Enterprise which is built on the Cloudera Distribution but also includes production support and management applications.

The open core model clearly sidesteps the paradox I’d outlined because open core vendors offer more than support.  Open core is a freemium business model and possesses all the strengths and suffers from all the weaknesses of other freemium models.

  • First, can you build a large community on the free version or service?
  • Second, through what mechanism and at what cost do you monetize members of that community to a higher-level service?
  • Third, once monetized, at what rate can you keep premium members renewing the premium service or move them up to an even higher service level?
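
The three questions above multiply together, which a few lines of arithmetic make visible. The following is only a rough sketch: the function name and every number in it (community size, conversion rate, price, renewal rate) are invented for illustration and do not come from any vendor's financials.

```python
def cohort_revenue(free_users, conversion_rate, annual_price, renewal_rate, years):
    """Yearly revenue from a single cohort of free users (hypothetical model).

    free_users      -- question 1: how large is the free community?
    conversion_rate -- question 2: what fraction ever pays?
    renewal_rate    -- question 3: what fraction of payers renew each year?
    """
    paying = free_users * conversion_rate
    revenue = []
    for _ in range(years):
        revenue.append(paying * annual_price)
        paying *= renewal_rate  # only renewing customers pay next year
    return revenue


# All inputs below are made-up illustration values.
by_year = cohort_revenue(
    free_users=100_000,
    conversion_rate=0.02,
    annual_price=1_000,
    renewal_rate=0.8,
    years=3,
)
print([round(r) for r in by_year])
```

Because the terms multiply, a weak answer to any one of the three questions drags down the whole result, no matter how strong the other two are.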

LinkedIn has done freemium spectacularly well.  I’ve never paid them a dime (as a free service user) but somebody paid them the ~$250M they made in the first 9 months of the year.  (Turns out it’s about 33% each of premium subscriptions, hiring solutions, and marketing solutions.)

The newspapers still haven’t figured out freemium though FT and The New York Times are making headway.

How will open core play out for open source vendors?  I don’t know.  I do know the freemium code is hard to crack.  I do know that freemium models are constantly evolving.  I do believe that freemium is a better business model than simply offering support or services.  And with the  IPO window opening, I do believe we may get a chance to see the financials of a few open core companies in the coming years.

Max Schireson Appointed President of MongoDB Company, 10gen

This is a quick post to congratulate Max Schireson on his appointment to President of 10gen, the company behind red-hot NoSQL database MongoDB.

Quoting their press release:

“Max brings to 10gen a strong understanding of the database market, both from the perspective of an established market leader and an upstart alternative technology,” said Dwight Merriman, co-founder and CEO of 10gen. “Usage of MongoDB has been growing explosively; adding Max to the team will help us scale the company to keep up with the interest in our technology.”

Like MarkLogic, MongoDB is a highly-scalable, document-oriented, and schema-free database system.  Unlike MarkLogic, MongoDB is open source, JSON-oriented, and in use at many web 2.0 startups like foursquare, Etsy, bit.ly, github, bump, Disqus, EventBrite, and others.  While MarkLogic uses a query language (XQuery) to access the DBMS, MongoDB is what I call a “revenge of the programmers” database that, like object databases back in the day, is preferred by programmers for its interfaces to popular languages.

I had the pleasure of working with Max for 6+ years at MarkLogic, think he’s one of the smartest people I know, and wish him well in this new endeavor.  One way I judge my own success as a manager is to judge the career success of those who worked for me.  I think this is a great move for Max and a great move for 10gen.

Six Thoughts on The NoSQL Movement

We are in the middle of one of our periodic analyst tours at MarkLogic, where we meet about 50 top software industry analysts focused in areas like enterprise search, enterprise content management, and database management systems.  The NoSQL movement is one of four key topics we are covering, and while I’d expected some lively discussions about it, most of the time we have found ourselves educating people about NoSQL.

In this post, I’ll share the six key points we’re making about NoSQL on the tour.

Our first point is that NoSQL systems come in many flavors and it’s not just about key/value stores.  These flavors include:

  • Key/value stores (e.g., Hadoop)
  • Document databases (e.g., MarkLogic, CouchDB)
  • Graph databases (e.g., AllegroGraph)
  • Distributed caching systems (e.g., Memcached)

Our second point is that NoSQL is part of a broader trend in database systems:  specialization.  The jack-of-all-trades relational database (e.g., Oracle, DB2) works reasonably well for a broad range of applications — but it is a master of none.  For any specific application, you can design a specialized DBMS that will outperform Oracle by 10 to 1000 times.  Specialization represents, in aggregate, the biggest threat to the big-three DBMS oligopolists.  Examples of specialized DBMSs include:

  • Streambase, Skyler:  real-time stream processing
  • MarkLogic:  semi-structured data
  • Vertica, Greenplum:  mid-range data warehousing
  • Aster:  large-scale (aka “big data”) analytic data warehousing
  • VoltDB:  high volume transaction processing
  • MATLAB:  scientific data management

Our third point is that NoSQL is largely orthogonal to specialization.  There are specialized NoSQL databases (e.g., MarkLogic) and there are specialized SQL databases (e.g., Aster, Volt).  The only case where I think there are zero examples is general-purpose NoSQL systems.  While I’m sure many of the NoSQL crowd would argue that their systems can do everything, is anyone *really* going to run general ledger or opportunity management on Hadoop?   I don’t think so.

Our fourth point is that NoSQL isn’t about open source.  The software-wants-to-be-free crowd wants to build open source into the definition of NoSQL and I believe that is both incorrect and a mistake.  It’s incorrect because systems like MarkLogic (which uses an XML data model and XQuery) are indisputably NoSQL.  And it’s a mistake because technology movements should be about technology, not business models.  (The open source NoSQL gang can solve its problem simply by affiliating with both the NoSQL technology movement and the open source business model movements.)

Our fifth point, made as a company that’s invested a lot of energy in supporting standards, is that, rather ironically, most open source NoSQL systems have proprietary interfaces.  People shouldn’t confuse “can access the source code” with “can write applications that call standard interfaces” and ergo can swap components easily.   If you take offense at the word proprietary, that’s fine.  You can call them unique instead.  But the point is that an application written on Cassandra cannot practically be moved to Couch, regardless of whether you can access the source code of both Couch and Cassandra.

Our sixth point is that we think MarkLogic provides a best-of-both-worlds option between open source NoSQL systems and traditional DBMSs.  Like open source NoSQL systems, MarkLogic provides shared-nothing clustering on inexpensive hardware, superior support for unstructured data, document-orientation, and high-performance.  But like traditional databases, MarkLogic speaks a high-level query language, implements industry standards, and is commercial-grade, supported software.  This means that customers can scale applications on inexpensive computers and storage, avoid the pains of normalization and joins, have systems that run fast, can be implemented by normal database programmers, and feel safe that their applications are built via a standard query language (XQuery) that is supported by scores of vendors.

Yes, Virginia, MarkLogic is a NoSQL System

The other day I noticed a taxonomy used on one of the NoSQL Database blogs that went like this:

Types of NoSQL systems

  • Core NoSQL Systems
    • Wide column stores
    • Document stores
    • Key-value / tuple stores
    • Eventually consistent key-value stores
    • Graph databases
  • Soft NoSQL Systems (not the original intention …)
    • Object databases
    • Grid database solutions
    • XML databases
    • Other NoSQL-related databases

I, perhaps obviously, take some umbrage at having MarkLogic (acceptably classified as an XML database) being declared “soft NoSQL.”  In this post I’ll explain why.

Who decided that being open source was a requirement to be a real NoSQL system?  More importantly, who gets to decide?  NoSQL – like the Tea Party – is a grass-roots, effectively leaderless movement towards relational database alternatives.  Anyone arguing original intent of the founders is misguided because there is no small group of clearly identified founders to ask.  In reality, all you can correctly argue is what you think was the intent of the initial NoSQL developers and early adopters, or — perhaps more customarily — why you were drawn to them yourself, disguised or confused as original founder intent.

As mentioned here, movements often appear homogeneous when they are indeed heterogeneous.  What looks like a long line of demonstrators protesting a single cause is in fact a rugby scrum of different groups pushing in only generally aligned directions.  For example, for each of the following potential motivations, I am certain that I can find some set of NoSQL advocates that are motivated by it:

  • Anger at Oracle’s heavy-handed licensing policies
  • The need to store unstructured or semi-structured data that doesn’t fit well into relations
  • The impedance mismatch with relational databases
  • A need and/or desire to use open source
  • An attempt to reduce total cost
  • A desire to land at a different point in the Brewer CAP Theorem triangle of consistency, availability, and partition tolerance
  • Coolness / wannabe-ism, as in, I want to be like Google or Facebook

(Since this was a source of confusion in prior posts, note that this is not to claim the inverse:  that all NoSQL advocates are motivated by all of the possible motivations.)

I’d like to advocate a simple idea:  that NoSQL means NoSQL.  That a NoSQL system is defined as:

A structured storage system that is not based on relational database technology and does not use SQL as its primary query language

In short, my proposed definition means that NoSQL (broadly) = NoSQL (literally) + NoRelational.  In short:  relational database alternatives.  It does not mean:

  • NoDBMS.  We should not take NoSQL to exclude systems we would traditionally define as DBMSs.  For example, supporting ACID transactions or supporting a non-SQL query language (e.g., XQuery) should not be exclusion criteria for NoSQL.
  • NoCommercialSoftware.  While many of the flagship NoSQL projects (e.g., Hadoop, CouchDB) are open source projects, that should not be a defining criterion.  NoSQL should be a technological, not a delivery- or business-model, classification.  Technology and delivery model are orthogonal dimensions.   We should be able to speak of traditionally licensed, open source licensed, and cloud-hosted NoSQL systems if for no other reason than understanding the nuances of the various business/delivery models is a major task unto itself.  Do you mean open source or open core?  Is it open source or faux-pen source?  Under which open source license?  How should I think of a hosted subscription service that is based on or a derivative of an open source project?

Recently, I’ve heard a piece of backpedaling that I’ve found rather irritating:  that NoSQL was never intended to mean “no SQL,” it was actually intended to mean “not only SQL.”  Frankly, this strikes me as hogwash:  uh oh, I’m afraid that people are seeing us as disruptors and it’s probably easier to penetrate the enterprise as complementary, not competitive, so let’s turn what was a direct assault into a flanking attack.

To me, it’s simple:  NoSQL means NoSQL.  No SQL query language and no relational database management system.  Yes, it’s disruptive and — by some measures — “crazy talk” but no, we shouldn’t hide because there are lots of perfectly valid (and now socially acceptable) reasons to want to differ from the relational status quo.

In effect, my definition of NoSQL is relational database alternative.  Such options include both alternative databases (e.g., MarkLogic) and database alternatives (e.g., key/value stores).  This, of course, then cuts at your definition of database management system where I (for now at least) still require the support of a query language and the option to have ACID transactions.

By the way, I understand the desire to exclude various bandwagon-jumpers from the NoSQL cause.  Like most, I have no interest in including thrice-reborn object databases in the discussion, but if the cost of excluding them is excluding systems like MarkLogic then I think that cost is too high.  Many people contemplating the top-of-mind NoSQL systems (e.g., Hadoop) could be better served using MarkLogic which addresses many typical NoSQL concerns, including:

  • Vast scale
  • High performance
  • Highly parallel shared-nothing clusters
  • Support for unstructured and semi-structured data

All with all the pros (and cons) of being a commercial software package and without requiring reduced consistency:  losing a few Tweets won’t kill Twitter, but losing a few articles, records, or individuals might well kill a patient, bank, or counter-terrorism agency.  BASE is fine for some; many others still need ACID.  Michael Stonebraker has some further points on this idea in this CACM post.

I’d like to suggest that we should combine the ideas in this post with the ideas in my prior one, Classifying Database Management Systems.  That post says the correct way to classify DBMSs is by their native modeling element (e.g., table, class, hypercube).  This post says that NoSQL is semi-orthogonal – i.e., I can imagine a table-oriented database that doesn’t use SQL as its query language, but I doubt that any exist.  Applying my various rules, the combined posts say that:

  • Aster is a SQL database optimized for analytics on big data
  • MarkLogic is an XML [document] database optimized for large quantities of semi-structured information and a NoSQL system
  • CouchDB is a document database and a NoSQL system
  • Redis is a key/value store and a NoSQL system
  • VoltDB is a SQL database optimized to solve one of the two core problems that NoSQL systems are built for (i.e., high-volume simple processing)
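
The combined rule (classify by native modeling element, then apply the NoSQL definition given earlier: not relational and not SQL as the primary query language) is mechanical enough to write down. The sketch below is only an illustration: the `Dbms` record, its field names, and the `is_nosql` helper are invented here, and the attribute values simply restate the classifications in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dbms:
    name: str
    native_model: str   # native modeling element, e.g. "table", "document"
    uses_sql: bool      # is SQL the primary query language?


def is_nosql(dbms: Dbms) -> bool:
    """NoSQL (broadly) = NoSQL (literally) + NoRelational."""
    return dbms.native_model != "table" and not dbms.uses_sql


# Attribute values restate the classifications in the list above.
systems = [
    Dbms("Aster", "table", uses_sql=True),
    Dbms("MarkLogic", "XML document", uses_sql=False),
    Dbms("CouchDB", "document", uses_sql=False),
    Dbms("Redis", "key/value pair", uses_sql=False),
    Dbms("VoltDB", "table", uses_sql=True),
]

labels = {s.name: ("NoSQL" if is_nosql(s) else "SQL") for s in systems}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Note that nothing in the test asks about license or delivery model, which is the point of keeping technology and business model as separate dimensions.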

Finally, I’d conclude that even with these rules I have trouble classifying MarkLogic because of multiple inheritance:  MarkLogic is both a document database and an XML database, it is difficult to pick one over the other, and there certainly are non-document-oriented XML database systems.   Similar issues exist with classifying the various hybrids of document databases and key/value stores.  So while I may have more work to do on building an overall taxonomy, I am absolutely sure about one thing:  MarkLogic is a NoSQL system.


* The “Yes, Virginia” phrase comes from an 1897 story in the New York Sun.  For more, see here.

Classifying Database Management Systems: Regular and NoSQL

Thanks to two major trends — DBMS specialization and the NoSQL movement — the database management systems space is generating more interest and more innovation than any time I can remember since the 1980s.  Ever since around 1990, when the relational database management system (RDBMS) became firmly established, IT has played DBMS roulette:  spin the wheel and use the DBMS on which the needle lands — Oracle, DB2, or SQL Server.  (If you think this trivializes things, not so fast:  a friend who was the lead DBMS analyst at a major analyst firm once quipped to me that this wheel-spinning was his job, circa 1995.)

Obviously, there was always some rational basis for DBMS selection — IBM shops tended to pick DB2, best-of-breed buyers liked Oracle, performance whizzes and finance types often picked Sybase, and frugal shoppers would choose SQL Server, and later MySQL — but there was no differentiation in the model.  All these choices were relational database management systems.

Over time, our minds became dulled to orthogonal dimensions of database differentiation:

  • The database model.  For years, we lived in the database equivalent world of Henry Ford’s Model T:  any model you want as long as it’s relational.
  • The potential for trade-offs in fundamental database-ness.  We became binary and religious about what it meant to be a database management system and that attitude blinded us to some fundamental trade-offs that some users might want to make — e.g., trading consistency for scalability, or trading ACID transactions for BASE.

The latter is the domain of Brewer’s CAP theorem which I will not discuss today.  The former, the database model, will be the subject of this post.

Every DBMS has some native modeling element (NME). For example, in an RDBMS that NME is the relation (or table).  Typically that NME is used to store everything in the DBMS.  For example, in an RDBMS:

  • User data is stored in tables.
  • Indexes are implemented as tables which are joined back to the base tables.
  • Administration information is stored in tables.
  • Security is usually handled through tables  and joins.
  • Unusual data types (e.g., XML) are stored in “odd columns” in tables.  (If your only model’s a table, every problem looks like a column.)
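
One of these bullets is easy to see for yourself: the system catalog is itself exposed as a table. The sketch below uses SQLite through Python's standard `sqlite3` module purely as a convenient stand-in (SQLite is my choice for illustration, not a system discussed in this post), and the `customer` table and index names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX customer_name_idx ON customer (name)")

# The catalog is just another table, queried with ordinary SQL.
catalog = conn.execute(
    "SELECT type, name, tbl_name FROM sqlite_master ORDER BY name"
).fetchall()

for row in catalog:
    print(row)
```

This only illustrates the administration-information bullet; how indexes and security are physically implemented differs from one RDBMS to the next.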

In general, the more naturally the data you’re storing maps to the paradigm (or NME) of the database, the better things will work.  For example, you can model XML documents as tables and store them in an RDBMS, or you can model tables in XML and store them as XML documents, but those approaches will tend to be more difficult to implement and less efficient to process than simply storing tables in an RDBMS and XML documents in an XML server (e.g., MarkLogic).

The question is not whether you can model documents as tables or tables as documents.  The answer is almost always yes.  Thus, the better question is should you?  The most famous example of this type of modeling problem is the storage of hierarchical data in an RDBMS.  To quote this article on managing hierarchical data in MySQL:

Most users at one time or another have dealt with hierarchical data in a SQL database and no doubt learned that the management of hierarchical data is not what a relational database is intended for.

(Personally, I blame the failure of Microsoft’s WinFS on this root problem — file systems are inherently hierarchical — but that’s  a story for a different day.)
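
To make the hierarchical-data point concrete, here is a minimal sketch that stores the same small tree two ways: as an adjacency-list table, which needs a recursive query to walk, and as a nested document, which is walked directly. It uses SQLite via Python's `sqlite3` module as a stand-in RDBMS (an assumption for illustration; the article quoted above is about MySQL), and the category names are invented.

```python
import sqlite3

# Relational form: one row per node, with a pointer to its parent.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?)",
    [
        (1, None, "Electronics"),
        (2, 1, "Televisions"),
        (3, 2, "LCD"),
        (4, 1, "Portable"),
    ],
)

# Walking the tree takes a recursive common table expression.
rows = conn.execute(
    """
    WITH RECURSIVE subtree(id, name, depth) AS (
        SELECT id, name, 0 FROM category WHERE parent_id IS NULL
        UNION ALL
        SELECT c.id, c.name, s.depth + 1
        FROM category c JOIN subtree s ON c.parent_id = s.id
    )
    SELECT name, depth FROM subtree ORDER BY id
    """
).fetchall()

# Document form: the hierarchy is the shape of the data itself.
tree = {
    "name": "Electronics",
    "children": [
        {"name": "Televisions", "children": [{"name": "LCD", "children": []}]},
        {"name": "Portable", "children": []},
    ],
}


def walk(node, depth=0):
    """Yield (name, depth) for a node and all of its descendants."""
    yield node["name"], depth
    for child in node["children"]:
        yield from walk(child, depth + 1)


print(rows)
print(list(walk(tree)))
```

Both forms can answer the question, which is the "can you" half; the difference in how much machinery each needs is the "should you" half.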

I believe the best way to classify DBMSs is by their native modeling element.

  • In hierarchical databases, the NME is the hierarchy.  Example:  IMS.
  • In network databases, it’s the (directed, acyclic) graph. Example:  IDMS.
  • In relational databases, it’s the relation (or, table).  Example:  Oracle.
  • In object databases, it’s the (typically C++) object class. Example:  Versant.
  • In multi-dimensional databases, it’s the hypercube. Example:  Essbase.
  • In document databases, it’s the document. Example:  CouchDB.
  • In key/value stores, it’s the key/value pair. Example:  Redis.
  • In XML databases, it’s the XML document. Example:  MarkLogic.

The biggest limitation of this approach is that classifying by model fails to capture implementation differences. Some examples:

  • I would classify columnar DBMSs (e.g., Vertica) as relational if they model data as tables, and key/value stores (e.g., HBase) as such if they model data in key/value pairs.  This fails to capture the performance advantage that Vertica gets on certain data warehousing problems due to its column orientation.
  • I would classify all relational databases as relational, despite implementation optimizations.  For example, this approach fails to capture Teradata’s optimizations for large-scale data warehousing, Aster’s optimizations for analytics on big data, or Volt’s optimizations for what Curt Monash calls HVSP.
  • I would classify all XML databases as XML databases, despite possible optimization differences for the two basic XML use-cases:  (1) XML as message wrapper vs. (2) XML as document markup.

Nevertheless, I believe that DBMSs should be classified first by model and then sub-classified by implementation optimization.  For example, a relational database optimized for big data analytics (Aster).  An XML database optimized for large amounts of semi-structured information marked in XML (MarkLogic).

In closing, I’d say that we are seeing increasing numbers of customers coming to Mark Logic saying:  “well, I suppose we could have modeled this data relationally, but in our business we think of this information as documents and we’ve decided that it’s easier and more natural to manage it that way, so we decided to give you a call.”

After thinking about this for some time, I have one response:  keep calling!

No matter how you want to think about MarkLogic Server — an XML server, an XML database, or an XML document database — dare I say an [XML] [document] server|database  — it’s definitely a document-oriented, XML-oriented database management system and a great place to put any information that you think is more naturally modeled as documents.

My Thoughts on the NoSQL Database "Tea Party" Post

Without a doubt, the most controversial post I’ve written on this blog was last month’s The Database Tea Party: The NoSQL Movement.  I know this both from the comment stream and from the volume and tenor of emails I received from friends and colleagues over the past few weeks.

The first thing I learned is that standing in the middle is a great way to get attacked from all sides.

  • My database buddies blasted me, treating me like a reckless turncoat:   how dare you endorse these BASE people?  (Humorous double entendre intended.)
  • The NoSQL folks blasted me, generally misunderstanding the protest march metaphor I was using (see below).
  • The above-it-all crowd blasted me for oversimplifying the issue, suggesting that I was endorsing simplistic views such as it’s about batch vs. online or it’s about scaling vs. not scaling.

All of which, of course, only confirmed the religious nature of the movement and that it was indeed a movement, regardless of whether any given participant identified himself as such.

(Several folks also blasted me for using the Tea Party Movement as a metaphor.  Let me clarify that this is not a political blog so I won’t debate politics.  I intended the metaphor to cover only concepts like “rebellion” and “grass roots,” both of which I do believe apply to NoSQL.)

First, I’ll clarify the protest march metaphor which was easily the most misunderstood aspect of the post.   Let me share my first rule of protests:  not everyone is there for the same reason.  Some people are there for the stated cause, which I’ll call cause A.  But others participate on behalf of a group; their signs will say “Group 1’s for Cause A.”  Others are there for cause B which lacks enough support to generate its own march, so they tag along with signs saying “Cause A and Cause B.”  If you’ve ever been at a rally, you’ve invariably winced when speakers attempted to hijack the agenda, turning to some personal cause, all while pretending to speak on behalf of the group.

It was this sense of chaos and disorder that I was trying to portray.  When I made the list of reasons why I thought people were on the NoSQL march, it was to say neither that I agreed with them nor that all people were there for all reasons.  I was doing the equivalent of asking protesters on the  UC Berkeley Anti-Grenada March why they were there.  To which the replies might have been:

  • To protest the Grenada invasion (cause A)
  • To remind people about [insert group here] rights, all while protesting against Grenada
  • To protest about UC budget cuts, all while protesting about something the Government did, which I’m pretty sure was bad.
  • Because I go to every march; what’s this one about?
  • Because I was at Top Dog when it went by and had nothing better to do

So hopefully, the intent of my NoSQL-reasons-list is now clear.   Some folks are there because they don’t want ACID transactions.  Some folks are there because they are dealing with Internet scale.  Some folks are there because they hate the SQL impedance mismatch.  Some folks are there because they’re tired of paying oligopoly prices to Oracle.   (I particularly liked the comment that said Oracle was free because they had an enterprise license that most certainly wasn’t, ignoring the possibility that recent enterprise/agency directives to look at open source could result directly from the size of the last Oracle check.)

And yes, some folks are there because it’s cool and they want to be like Twitter, Google, and Facebook, but getting them to admit that is a virtual impossibility.

One irony is that I actually agree with one of the fiercest commenters:

A very interesting write-up with one little oversight: you’re wrong.

I am part of a large program to write a NoSQL database for military applications. [It's not about ...] It’s [about]  the fact that RDBMSs are built in a different space in the CAP trades.

Google, Amazon, Facebook, and DARPA all recognized that when you scale systems large enough, you can never put enough iron in one place to get the job done (and you wouldn’t want to, to prevent a single point of failure). Once you accept that you have a distributed system, you need to give up consistency or availability, which the fundamental transactionality of traditional RDBMSs cannot abide. Based on the realization that something fundamentally different needed to be built, a lot of very smart people tackled the problem in a variety of different ways, making different trades along the way. [...]

So – the NoSQL databases are a pragmatic response to growing scale of databases and the falling prices of commodity hardware. It’s not a noble counterculture movement (although it does attract the sort that have a great deal of mental flexibility), it’s just a way to get business done cheaper.

To respond to the commenter:

  • Thank you for the clear definition of why you moved to NoSQL.
  • Your comment was picked up by the Otaku blog in a post called NoSQL Explained Correctly (Finally), so congrats and I’m glad I could help facilitate “the conversation.”
  • Disorder and chaos are what I was trying to portray in the protest march metaphor, not hippies or nobility.
  • You are clearly using NoSQL for two of the reasons on my list:  scalability and bloatware (i.e., perhaps not the best word choice, but the idea was undesired, included functionality — e.g., ACID transactions)
  • You did exactly what I said to do:  consider all alternatives and do what’s right for your business
  • So, why are we disagreeing again?

I think some people didn’t like my putting “coolness” on the table as a factor and the notion of a “movement.”  I believe those are both very real and ironically those who disagreed with me loudest were effectively screaming:  it’s not a movement and I’m not doing it to be cool; I’m doing it because it’s right for my business.  If so, great.  But why does it hit such a nerve?

In the end, when it comes to NoSQL I am trying to:

  • Provide an overview of why I think people are considering and/or using NoSQL solutions
  • Provide good background references and readings (see bottom of my first post)
  • Remind managers to keep an eye out for the “bad reasons” to go NoSQL (i.e., coolness and Google wannabeism)
  • Remind people not to confuse NoSQL with NoDatabase.  Special-purpose databases (e.g., MarkLogic) are optimized for specific applications (e.g., semi-structured data) and handle them far better than a general-purpose RDBMS.  So in your haste to move off Oracle, don’t advance directly to an open source key-value store; there might be alternative DBMSs that meet your needs more effectively.
  • Remind people not to confuse NoSQL with NoCommercialSoftware.  While people seem to dislike when I say it, the RDBMS market is an oligopoly and the big vendors’ pricing, margins, and heavy-handed customer relationships are all consistent with that market structure.  But you can find other classes of commercial software where the vendors are hungrier and more customer centric.

The Database Tea Party: The NoSQL Movement

Adam Smith’s invisible hand never rests.  Just five years ago, the database market looked like a static, three-player $10B/year oligopoly where the primary forces were inertia and profit-taking.  Today, we have two major forces disrupting the comfortable stasis that has developed over the past 30 years.

  • One force is DBMS specialization:  while the general-purpose RDBMS is useful for a broad range of applications, it is optimal for few of them.  The RDBMS has slowly become expensive bloatware that is functionally a jack of all trades, master of none.  MIT’s Michael Stonebraker calls the RDBMS a one size fits all solution.
  • The other force is NoSQL, an organic and rapidly-growing industry movement away from relational databases, driven by a number of factors including both technology and cost.

The purpose of this post is to share my thoughts on NoSQL.  Make no mistake, like the Tea Party Movement, NoSQL is a rebellion; just look at the name.  But like most demonstrations, not everyone is marching for the same reasons.  Here are some of the things I think various members of the NoSQL crowd are marching against:

  • Table-oriented, 1970s-era database technology:  RDBMSs were designed for handling data and short-text fields, necessitate mapping programmatic objects to tables (i.e., the impedance mismatch), and require the use of an increasingly stone-age query language, SQL.
  • Scalability:  relational databases were not designed to handle and do not generally cope well with Internet-scale, “big data” applications.  Most of the big Internet companies (e.g., Google, Yahoo, Facebook) do not rely on RDBMS technology for this reason.
  • High prices and the heavy-handed treatment of customers:  both stem from the underlying oligopoly and the lack of credible alternative suppliers
  • Closed source:  the inability to customize the internals of the DBMS engine to meet specific needs
  • Bloatware:  ironically, while RDBMSs are perceived as light in requirements that matter (e.g., scalability), they are also seen as over-engineered for features that don’t.  (ACID transactions are a favorite target in this department.)
  • DBA supremacy.  For years, corporate DBAs called the shots on where strategic data assets would be stored, and thus how they would be accessed.  This created headaches for the programmers of the world who, in response, have done as much as possible to abstract away the database (e.g., Ruby on Rails).
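
To make the impedance mismatch from the first bullet concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `order` object, the table names, and the use of SQLite as a stand-in for both a relational store and a simple key-value document store. It is not how any particular NoSQL product works; it only shows the shape of the problem.

```python
import json
import sqlite3

# A hypothetical application object: one order with nested line items.
order = {
    "id": 1,
    "customer": "Acme",
    "items": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}

conn = sqlite3.connect(":memory:")

# Relational mapping: the single object has to be split across two tables.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")
conn.execute("INSERT INTO orders VALUES (?, ?)", (order["id"], order["customer"]))
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)",
    [(order["id"], item["sku"], item["qty"]) for item in order["items"]],
)


def load_order_relational(conn, order_id):
    """Reassemble the nested object from rows in two tables."""
    row = conn.execute(
        "SELECT id, customer FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    items = conn.execute(
        "SELECT sku, qty FROM order_items WHERE order_id = ? ORDER BY rowid",
        (order_id,),
    ).fetchall()
    return {
        "id": row[0],
        "customer": row[1],
        "items": [{"sku": sku, "qty": qty} for sku, qty in items],
    }


# Document-style mapping: the object is stored whole under a single key.
conn.execute("CREATE TABLE documents (key TEXT PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO documents VALUES (?, ?)", ("order:1", json.dumps(order)))


def load_order_document(conn, key):
    """Fetch the serialized object and decode it in one step."""
    body = conn.execute(
        "SELECT body FROM documents WHERE key = ?", (key,)
    ).fetchone()[0]
    return json.loads(body)
```

Both loaders return the same dictionary. The difference is the hand-written mapping code the relational version needs, which is the work programmers have tried to abstract away. The relational version, in exchange, can answer questions such as "which orders contain SKU A-100" without reading every document.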

On the flip side, there are things the NoSQL crowd are fighting for:

  • Open source, implying control.  The ability that open source software provides to customize product functionality.
  • Open source, implying free.  The often-flawed notion that the absence of software license fees results in a reduced lifetime cost of ownership.
  • Coolness, or the “I want to be like Google” effect.  If Google’s got BigTable,  Yahoo’s got Hadoop, and Facebook’s got Cassandra, then we should build our own, too.  Our app is hard; we’re smart guys, too.
  • Vengeance, or the “I’m so mad at Oracle that I’ll do anything” effect.  Yes, some folks are just plain mad enough at Oracle to either go write their own DBMS, or take on the support of a very low-level infrastructure technology.

So, if you’re considering a NoSQL solution — a class in which I include MarkLogic — you need to figure out what you’re marching against, what you’re fighting for, and ultimately what will meet your needs at the lowest total cost of ownership.

My first recommendation is to detect and, where applicable, kill off the coolness effect.  Google is swimming in money and PhDs.  They can build anything they want regardless of whether they should and, right or wrong, for Google it just doesn’t matter.  So unless you have Google’s business model and talent pool, you probably shouldn’t copy their development tendencies.

Heck, I get the coolness attraction.  I think infrastructure software is cool, too.  That’s why I was an OS geek early on and have spent my career around databases.  But I surely don’t think that F1000 companies and government agencies should build their own DBMSs, nor fall into the trap of thinking that open source low-level stores are a free and easy way to avoid Oracle license fees.  Cool shouldn’t be in the equation.  Technology suitability and total cost should be.  Period.
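
The total-cost point can be sketched as simple arithmetic. The function below and every figure passed to it are invented purely for illustration; they describe no real vendor, product, or salary. The only claim is structural: removing the license line does not remove the engineering line.

```python
def total_cost_of_ownership(license_per_year, support_per_year,
                            engineers, cost_per_engineer, years):
    """Sum the yearly license, support, and staffing costs over the period."""
    yearly = license_per_year + support_per_year + engineers * cost_per_engineer
    return years * yearly


# Hypothetical scenario A: a commercial DBMS with license and support fees,
# needing one engineer to run it.
commercial = total_cost_of_ownership(
    license_per_year=200_000, support_per_year=40_000,
    engineers=1, cost_per_engineer=150_000, years=3,
)

# Hypothetical scenario B: a free low-level store with no fees,
# needing three engineers to build and support the missing functionality.
open_source = total_cost_of_ownership(
    license_per_year=0, support_per_year=0,
    engineers=3, cost_per_engineer=150_000, years=3,
)

print(f"commercial:  ${commercial:,}")
print(f"open source: ${open_source:,}")
```

With these made-up inputs the license-free option costs more over three years. Different inputs can easily flip the result, which is exactly why the comparison should be run with your own numbers rather than assumed.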

My second recommendation is to orthogonalize the open source question, making it independent of functional requirements.  (This breaks if source customization is a requirement, but remember that requirement is often fictional:  most open source users don’t customize.)  If you’re struggling with an RDBMS on a given application problem you shouldn’t say:  we need an open source, NoSQL type thing.  You should say:  we need to look at relational database alternatives.  Those alternatives include open source database projects (e.g., MongoDB, CouchDB) and distributed computing frameworks (e.g., Hadoop), but they also include commercial software offerings such as specialized DBMSs like Streambase (for real-time streams), Aster (for analytics on big data), and MarkLogic (for semi-structured data).  Don’t throw out the commercial-software-benefits baby with the RDBMS bathwater.

My personal take on this issue is that:

  • Relational databases, like the mainframe in 1985, are entering the autumn of their lives.  They won’t die quickly, and the mainframe isn’t dead today, but their best days are behind them.
  • Our kids will see SQL the way we see COBOL.  Some people can’t stand when I say this, but I think they’re in denial.  There is no logical reason to assume that the relational database and the SQL language are the endpoints in database evolution.  Yes, Larry Ellison is powerful.  But Adam Smith is more so.
  • Our kids will see no data/document dichotomy.  They will just see digital information.  We need to understand and remember that the data/document dichotomy is an artifact of the limitations of the tools and technologies with which we grew up.
  • Some of the NoSQL hype is an over-reaction to the database oligopoly.  I believe there are organizations out there who should be using alternative commercial databases, but instead are using open source NoSQL-type projects due to coolness, anger, or a mistaken belief that open source always has a lower total cost of ownership.  I believe rationality will return to these people.  One day management will say:  “Holy cow!  Why in the world are we paying programmers to write and support software at this low a level?”  (This is potentially avoidable if you can mentally project yourself into the future now and imagine how you will look back at the coming three years.)
  • Some of the NoSQL hype is a valid reaction to the technological limits of relational databases and the impedance mismatch in programming on them.

In the end, I think it’s great that the NoSQL movement is happening.  It’s awakening people to traditional RDBMS alternatives.  It’s making people understand that they don’t have to write big checks for commodity software.  It’s helping people solve problems that they can’t solve, or solve efficiently, on relational technology.

My axe to grind is simple:  just because you’re throwing out Oracle, don’t throw out all DBMSs and all commercial software with it.  Take a breath.  Look at all your alternatives.  Study total costs and technology applicability.  And make your best decision.

Interesting Writings on NoSQL