Open Source Business Models, Revisited

I had breakfast the other day with Mike Olson, CEO of Hadoop ecosystem leader, Cloudera.  We met because we run in similar circles in data management land and because Mike had some quibbles with my post, The Open Source Software Paradox.

My premise was that open source presents a fundamental paradox:  the larger the community, the better the software, and the less people need to buy support for it.  Thus, open source market opportunities are inherently flawed / paradoxical because you can only sell services for projects that are not terribly successful.  Simply put,

You can have a large community who doesn’t need to buy from you or a small community who does.

I think Mike’s overall take on my post was “1990s thinking” because things have evolved over the past decade and businesses now try to monetize open source opportunities in more sophisticated ways.  This approach doesn’t actually contradict the paradox I observed, but instead looks  for more creative ways around it.

Another key point Mike made was that open source is not a business model.  I agree.  Open source is a way of developing software.  There are many different possible business models for monetizing open source projects.

Rather than attempt to replay the back-and-forth of our discussion, I will simply list my revised take on the 4 basic open source business models.

  • Professional services.  The most basic way to make money around an open source project is to offer related consulting (and training) services.  For example, ThinkBigAnalytics seems to be building a consulting business around Hadoop and NoSQL databases (most of which are also open source).
  • Dual licensing.  A vendor offers (1) a free version under the GPL license which freely enables internal use but contaminates on redistribution and (2) a paid version under a different license that doesn’t include GPL’s copyleft provisions.  This model reeks of the vig:  you push people to the non-GPL version under threat of having to open source their own system.  In addition, since SaaS or cloud services use but don’t redistribute software, this approach loses its teeth in the SaaS / cloud world.
  • Open core.  A vendor promotes an open source version of a system and makes money by extending it with proprietary additions.  In this model, the vendor “has some IP” and is not totally dependent on support subscriptions which may or may not be renewed.  Cloudera is executing this strategy by offering both (1) the Cloudera Distribution on an Apache license as well as (2) Cloudera Enterprise which is built on the Cloudera Distribution but also includes production support and management applications.

The open core model clearly sidesteps the paradox I’d outlined because open core vendors offer more than support.  Open core is a freemium business model and possesses all the strengths and suffers from all the weaknesses of other freemium models.

  • First, can you build a large community on the free version or service?
  • Second, through what mechanism and at what cost can you convert members of that community to a higher-level service?
  • Third, once they are converted, at what rate can you keep premium members renewing the premium service or move them up to an even higher service level?
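
The three questions above can be turned into rough arithmetic.  What follows is a minimal sketch, not anyone's actual financial model:  the function name freemium_cohort and every number passed to it are hypothetical, chosen only to show how community size, conversion rate and cost, and renewal rate interact.  It ignores upgrades to higher service levels.

```python
def freemium_cohort(free_users, conversion_rate, cost_per_conversion,
                    annual_price, renewal_rate, years):
    """Project one cohort of free users through the three freemium questions.

    All parameters are hypothetical inputs, not figures from any real vendor.
    """
    # Question 1 is the input itself: how large is the free community?
    paying = free_users * conversion_rate            # question 2: how many convert
    acquisition_cost = paying * cost_per_conversion  # question 2: at what cost
    revenue_by_year = []
    for _ in range(years):
        revenue_by_year.append(paying * annual_price)
        paying *= renewal_rate                       # question 3: who renews
    return acquisition_cost, revenue_by_year


# Hypothetical inputs for illustration only.
cost, revenue = freemium_cohort(
    free_users=100_000,
    conversion_rate=0.02,
    cost_per_conversion=200.0,
    annual_price=1_000.0,
    renewal_rate=0.8,
    years=3,
)
net = sum(revenue) - cost
print(f"acquisition cost: {cost:,.0f}")
print(f"revenue by year:  {[round(r) for r in revenue]}")
print(f"net over period:  {net:,.0f}")
```

Changing any one input (say, halving the conversion rate) shows quickly how sensitive the model is to each of the three questions.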

LinkedIn has done freemium spectacularly well.  I’ve never paid them a dime (as a free service user) but somebody paid them the ~$250M they made in the first 9 months of the year.  (Turns out it’s about 33% each of premium subscriptions, hiring solutions, and marketing solutions.)

The newspapers still haven’t figured out freemium though FT and The New York Times are making headway.

How will open core play out for open source vendors?  I don’t know.  I do know the freemium code is hard to crack.  I do know that freemium models are constantly evolving.  I do believe that freemium is a better business model than simply offering support or services.  And with the  IPO window opening, I do believe we may get a chance to see the financials of a few open core companies in the coming years.

Yes, Virginia, MarkLogic is a NoSQL System

The other day I noticed a taxonomy used on one of the NoSQL Database blogs that went like this:

Types of NoSQL systems

  • Core NoSQL Systems
    • Wide column stores
    • Document stores
    • Key-value / tuple stores
    • Eventually consistent key-value stores
    • Graph databases
  • Soft NoSQL Systems (not the original intention …)
    • Object databases
    • Grid database solutions
    • XML databases
    • Other NoSQL-related databases

I, perhaps obviously, take some umbrage at having MarkLogic (acceptably classified as an XML database) being declared “soft NoSQL.”  In this post I’ll explain why.

Who decided that being open source was a requirement to be a real NoSQL system?  More importantly, who gets to decide?  NoSQL, like the Tea Party, is a grass-roots, effectively leaderless movement towards relational database alternatives.  Anyone arguing original intent of the founders is misguided because there is no small group of clearly identified founders to ask.  In reality, all you can correctly argue is what you think was the intent of the initial NoSQL developers and early adopters or, perhaps more customarily, why you were drawn to them yourself, disguised or confused as original founder intent.

As mentioned here, movements often appear homogeneous when they are indeed heterogeneous.  What looks like a long line of demonstrators protesting a single cause is in fact a rugby scrum of different groups pushing in only generally aligned directions.  For example, for each of the following potential motivations, I am certain that I can find some set of NoSQL advocates that are motivated by it:

  • Anger at Oracle’s heavy-handed licensing policies
  • The need to store unstructured or semi-structured data that doesn’t fit well into relations
  • The impedance mismatch with relational databases
  • A need and/or desire to use open source
  • An attempt to reduce total cost
  • A desire to land at a different point in the Brewer CAP Theorem triangle of consistency, availability, and partition tolerance
  • Coolness / wannabe-ism, as in, I want to be like Google or Facebook

(Since this was a source of confusion in prior posts, note that this is not to claim the inverse:  that all NoSQL advocates are motivated by all of the possible motivations.)

I’d like to advocate a simple idea:  that NoSQL means NoSQL.  That a NoSQL system is defined as:

A structured storage system that is not based on relational database technology and does not use SQL as its primary query language

In short, my proposed definition means that NoSQL (broadly) = NoSQL (literally) + NoRelational.  In short:  relational database alternatives.  It does not mean:

  • NoDBMS.  We should not take NoSQL to exclude systems we would traditionally define as DBMSs.  For example, supporting ACID transactions or supporting a non-SQL query language (e.g., XQuery) should not be exclusion criteria for NoSQL.
  • NoCommercialSoftware.  While many of the flagship NoSQL projects (e.g., Hadoop, CouchDB) are open source projects, that should not be a defining criterion.  NoSQL should be a technological, not a delivery- or business-model, classification.  Technology and delivery model are orthogonal dimensions.  We should be able to speak of traditionally licensed, open source licensed, and cloud-hosted NoSQL systems if for no other reason than understanding the nuances of the various business/delivery models is a major task unto itself.  Do you mean open source or open core?  Is it open source or faux-pen source?  Under which open source license?  How should I think of a hosted subscription service that is based on or derived from an open source project?

Recently, I’ve heard a piece of backpedaling that I’ve found rather irritating:  that NoSQL was never intended to mean “no SQL,” it was actually intended to mean “not only SQL.”  Frankly, this strikes me as hogwash:  uh oh, I’m afraid that people are seeing us as disruptors and it’s probably easier to penetrate the enterprise as complementary, not competitive, so let’s turn what was a direct assault into a flanking attack.

To me, it’s simple:  NoSQL means NoSQL.  No SQL query language and no relational database management system.  Yes, it’s disruptive and — by some measures — “crazy talk” but no, we shouldn’t hide because there are lots of perfectly valid (and now socially acceptable) reasons to want to differ from the relational status quo.

In effect, my definition of NoSQL is relational database alternative.  Such options include both alternative databases (e.g., MarkLogic) and database alternatives (e.g., key/value stores).  This, of course, then cuts at your definition of database management system where I (for now at least) still require the support of a query language and the option to have ACID transactions.

By the way, I understand the desire to exclude various bandwagon-jumpers from the NoSQL cause.  Like most, I have no interest in including thrice-reborn object databases in the discussion, but if the cost of excluding them is excluding systems like MarkLogic then I think that cost is too high.  Many people contemplating the top-of-mind NoSQL systems (e.g., Hadoop) could be better served using MarkLogic which addresses many typical NoSQL concerns, including:

  • Vast scale
  • High performance
  • Highly parallel shared-nothing clusters
  • Support for unstructured and semi-structured data

All with all the pros (and cons) of being a commercial software package and without requiring reduced consistency:  losing a few Tweets won’t kill Twitter, but losing a few articles, records, or individuals might well kill a patient, bank, or counter-terrorism agency.  BASE is fine for some; many others still need ACID.  Michael Stonebraker has some further points on this idea in this CACM post.

I’d like to suggest that we should combine the ideas in this post with the ideas in my prior one, Classifying Database Management Systems.  That post says the correct way to classify DBMSs is by their native modeling element (e.g., table, class, hypercube).  This post says that NoSQL is semi-orthogonal – i.e., I can imagine a table-oriented database that doesn’t use SQL as its query language, but I doubt that any exist.  Applying my various rules, the combined posts say that:

  • Aster is a SQL database optimized for analytics on big data
  • MarkLogic is an XML [document] database optimized for large quantities of semi-structured information and a NoSQL system
  • CouchDB is a document database and a NoSQL system
  • Redis is a key/value store and a NoSQL system
  • VoltDB is a SQL database optimized to solve one of the two core problems that NoSQL systems are built for (i.e., high-volume simple processing)
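
The two rules combined above (classify by native modeling element, then ask separately whether the system is relational and SQL-based) can be written down as a small predicate.  This is a sketch of my proposed definition only:  the System class, its field names, and the True/False values assigned to each product are my own encoding of the bullets above, not anything published by the vendors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    modeling_element: str  # native modeling element, e.g. "table" or "document"
    relational: bool       # built on relational database technology?
    sql_primary: bool      # is SQL the primary query language?


def is_nosql(system: System) -> bool:
    """Proposed definition: not relational AND not SQL as primary query language."""
    return not system.relational and not system.sql_primary


# Encoding of the five examples in the list above (assumed field values).
SYSTEMS = [
    System("Aster", "table", relational=True, sql_primary=True),
    System("MarkLogic", "XML document", relational=False, sql_primary=False),
    System("CouchDB", "document", relational=False, sql_primary=False),
    System("Redis", "key/value pair", relational=False, sql_primary=False),
    System("VoltDB", "table", relational=True, sql_primary=True),
]

labels = {s.name: is_nosql(s) for s in SYSTEMS}
for s in SYSTEMS:
    kind = "NoSQL system" if labels[s.name] else "SQL database"
    print(f"{s.name}: {s.modeling_element} store, {kind}")
```

Note that open source versus commercial licensing does not appear anywhere in the predicate, which is the point of the argument.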

Finally, I’d conclude that even with these rules I have trouble classifying MarkLogic because of multiple inheritance:  MarkLogic is both a document database and an XML database, it is difficult to pick one over the other, and there certainly are non-document-oriented XML database systems.  Similar issues exist with classifying the various hybrids of document databases and key/value stores.  So while I may have more work to do on building an overall taxonomy, I am absolutely sure about one thing:  MarkLogic is a NoSQL system.


* The “Yes, Virginia” phrase comes from an 1897 story in the New York Sun.  For more, see here.

Amazon Elastic MapReduce: Power to Burn, On Demand

Amazon Web Services today announced Amazon Elastic MapReduce, a new member of the Amazon web services family designed to help users process vast amounts of data using the divide-and-conquer parallel processing approach made famous by Google’s MapReduce and as implemented in the Apache Hadoop project.

Background on Hadoop (from the project site):

Here’s what makes Hadoop especially useful–
  • Scalable: Hadoop can reliably store and process petabytes.
  • Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
  • Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely rapid.
  • Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located. Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000 node clusters.
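
To make the block-and-replica idea concrete, here is a toy sketch.  It is not HDFS code and does not reproduce HDFS's actual replica placement policy:  the round-robin placement, the replication factor of 3, the tiny block size, and the node names are all assumptions made for illustration.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the input into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int = 3) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement: dict, failed: set) -> list:
    """Blocks that still have at least one replica on a live node."""
    return [b for b, holders in placement.items()
            if any(node not in failed for node in holders)]


# Hypothetical four-node cluster and a 10-byte "file" with 4-byte blocks.
nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_replicas(len(blocks), nodes)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)

# Losing any two nodes still leaves every block readable with three replicas.
print(readable_blocks(placement, failed={"node-a", "node-b"}))
```

Because each block lives on several nodes, a scheduler can also choose to run the work for that block on whichever of those nodes is free, which is the "process the data where it is located" point in the passage above.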

Here’s some background on MapReduce (from Google Labs):

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.
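
For readers who have not seen the programming model, here is a minimal single-process sketch of the map and reduce functions described above, using the customary word-count example.  It shows the model only, not Google's or Hadoop's implementation:  there is no parallelism, partitioning, or fault handling, and the names run_mapreduce, map_words, and reduce_counts are mine.

```python
from collections import defaultdict


def map_words(key, value):
    """Map: take one (document id, text) pair, emit intermediate (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce: merge all intermediate values that share one intermediate key."""
    return sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run map, group by intermediate key, then reduce, all in one process."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)   # the "shuffle": group by key
    return {key: reducer(key, values) for key, values in groups.items()}


# Two tiny made-up input records.
records = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
counts = run_mapreduce(records, map_words, reduce_counts)
print(counts)
```

In a real system the mapper calls run on many machines at once and the grouping step moves data between them; the user still writes only the two small functions.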

So Amazon Elastic MapReduce is a cloud-based service that enables you to perform highly parallel operations against large amounts of data, all in an on-demand model. This strikes me as a great offering, particularly for organizations that have an intermittent need for large Hadoop clusters.

From the Amazon press release:

It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Using Amazon Elastic MapReduce, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for distributed applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. As with all AWS services, Amazon Elastic MapReduce customers will still only pay for what they use, with no up-front payments or commitments.

Amazon says they made the offering in response to users who were already deploying Hadoop clusters on their lower-level EC2 framework — i.e., that this was an organic evolution:

“Some researchers and developers already run Hadoop on Amazon EC2, and many of them have asked for even simpler tools for large-scale data analysis,” said Adam Selipsky, Vice President of Product Management and Developer Relations for Amazon Web Services. “Amazon Elastic MapReduce makes crunching in the cloud much easier as it dramatically reduces the time, effort, complexity and cost of performing data-intensive tasks.”

I suspect this was a bad day at Cloudera, an Accel-backed startup that wants to be the Red Hat of Hadoop. Perhaps, like SugarCRM in competing against Salesforce, Cloudera will soon offer an on-demand Hadoop as well. But that means supporting two business models at once and buying a lot of hardware to boot. And, I suspect, a lot more hardware than SugarCRM needs to buy to support sales automation as a service.