Amazon Web Services today announced Amazon Elastic MapReduce, a new member of the AWS family designed to help users process vast amounts of data using the divide-and-conquer parallel processing approach made famous by Google’s MapReduce and implemented in the Apache Hadoop project.
Background on Hadoop (from the project site):
Here’s what makes Hadoop especially useful:
- Scalable: Hadoop can reliably store and process petabytes.
- Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
- Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely rapid.
- Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located. Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000 node clusters.
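The split-map-reduce flow described above can be sketched in miniature. The following is a hypothetical illustration, not actual Hadoop code: each mapper sees only its own block of data (as it would with HDFS data locality), the emitted pairs are grouped by key, and a reducer merges them.

```python
# Illustrative sketch (not Hadoop itself) of the word-count pattern:
# mappers run independently over local data blocks, then the framework
# sorts/groups the emitted (word, count) pairs before reducing.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit a (word, 1) pair for every word in this node's data block."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    """Sum the counts for each word; pairs must arrive sorted by key."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

# Simulate two data blocks processed independently, then merged:
block_a = ["the quick brown fox", "the lazy dog"]
block_b = ["the end"]
pairs = list(mapper(block_a)) + list(mapper(block_b))
counts = dict(reducer(pairs))
```

In a real cluster the two mapper calls would run on different nodes, each next to its replica of the data, and the sort-and-group step would be the shuffle phase between map and reduce.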
Here’s some background on MapReduce (from Google Labs):
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.
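The programming model in the quote above (user-supplied map and reduce functions, with the runtime handling partitioning and grouping) can be boiled down to a few lines. This is a single-process sketch with illustrative names, not Google's actual API:

```python
# Minimal sketch of the MapReduce programming model: the user supplies
# map_fn and reduce_fn; the "runtime" below handles grouping intermediate
# values by key. A real implementation would also partition inputs across
# machines, schedule tasks, and recover from failures.
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: each (key, value) input yields intermediate pairs.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    # Reduce phase: merge all values sharing an intermediate key.
    return {ikey: reduce_fn(ikey, ivalues)
            for ikey, ivalues in intermediate.items()}

# Example: count words by length across two documents.
docs = [("doc1", "hadoop scales"), ("doc2", "mapreduce scales too")]

def map_fn(_doc_id, text):
    for word in text.split():
        yield (len(word), 1)

def reduce_fn(_length, counts):
    return sum(counts)

result = map_reduce(docs, map_fn, reduce_fn)
```

The point of the model is that everything inside `map_reduce` is the framework's problem; the programmer writes only the two small functions, which is why the paper claims the approach is accessible to developers with no distributed-systems experience.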
So Amazon Elastic MapReduce is a cloud-based service that enables you to perform highly parallel operations against large amounts of data, all in an on-demand model. This strikes me as a great offering, particularly for those organizations that have only an intermittent need for large Hadoop clusters.
From the Amazon press release:
It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Using Amazon Elastic MapReduce, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for distributed applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. As with all AWS services, Amazon Elastic MapReduce customers will still only pay for what they use, with no up-front payments or commitments.
Amazon says they made the offering in response to users who were already deploying Hadoop clusters on their lower-level EC2 framework — i.e., that this was an organic evolution:
“Some researchers and developers already run Hadoop on Amazon EC2, and many of them have asked for even simpler tools for large-scale data analysis,” said Adam Selipsky, Vice President of Product Management and Developer Relations for Amazon Web Services. “Amazon Elastic MapReduce makes crunching in the cloud much easier as it dramatically reduces the time, effort, complexity and cost of performing data-intensive tasks.”
I suspect this was a bad day at Cloudera, an Accel-backed startup that wants to be the Red Hat of Hadoop. Perhaps, like SugarCRM in competing against Salesforce, Cloudera will soon offer an on-demand Hadoop as well. But that means supporting two business models at once and buying a lot of hardware to boot. And, I suspect, a lot more hardware than SugarCRM needs to buy to support sales automation as a service.