The Scoop On Hadoop: 11 Big Data Start-Ups You Need To Know
4:00 PM EST Fri. Apr. 13, 2012
It's been six years since Yahoo developer Doug Cutting created a platform for managing, storing and analyzing large volumes of data, naming it for his son's toy elephant and turning it over to the Apache Software Foundation. Given how quickly an entire industry has built up around Hadoop, it will surprise some to learn that the Apache Software Foundation only recently debuted Apache Hadoop 1.0 -- the first release of the software considered stable enough to be "enterprise-ready."
But that hasn't stopped companies, both start-ups and established vendors, from jumping on the Hadoop bandwagon. With big data such a hot area, solution providers need to keep up with the key players. Here are 11 start-ups building businesses around Hadoop.
Cloudera, founded in 2008, is perhaps the most established of the young Hadoop-focused companies. The Palo Alto, Calif.-based company offers a commercial distribution of the Apache Hadoop software, as well as Cloudera Enterprise, a subscription service that includes support and a portfolio of software called the Cloudera Management Suite.
In an example of how Cloudera's technology is finding its way into broad use, Oracle said in January that it had integrated Cloudera's distribution of Hadoop and Cloudera Manager with the Oracle Big Data Appliance.
Cloudera scored serious bragging rights in 2009 when it hired Doug Cutting as "architect." Cutting is the co-founder of the original Apache Software Foundation Hadoop project and he sits on the Apache Software Foundation board.
The amount of data generated and stored by businesses is doubling every three years. Throw in the fact that data is a mix of structured and unstructured information, often scattered across disparate IT systems, and you have a serious challenge for any company with a business intelligence project.
The Datameer Analytics Solution from start-up Datameer (founded in 2009 and based in San Mateo, Calif.) combines Apache Hadoop with a spreadsheet interface that helps business users run analytics against very large data sets -- structured and unstructured data from multiple sources -- with no programming.
While analyzing big data used to be a big-company problem, Datameer makes a compelling point: Small and midsize companies now face similar challenges given that the low cost of commodity storage makes collecting huge volumes of data economically feasible.
Hadapt touts its Hadapt Adaptive Analytic Platform as combining the benefits of Hadoop and relational database management software into a single data platform. The result is a high-performance analytics system that's capable of working with both structured and unstructured data.
Founded in July 2010, the company raised $9.5 million in first-round funding in October and debuted Hadapt 1.0 in November under an "early access" program for potential customers to try out. The software, according to the company, offers "enormous performance improvements" over Hadoop and its Hive data warehousing technology. The software is available in cloud and enterprise versions and, shortly, a free community edition. They run on all major distributions of Hadoop including Amazon EMR, Apache, Cloudera, EMC, Hortonworks, IBM and MapR.
Launched in July 2011, Hortonworks is a spin-off of Yahoo's Hadoop engineering team that offers its own distribution of Hadoop called the Hortonworks Data Platform. The relatively young company, a contributor to the Apache project, is widely seen as the chief competitor to the more-established Cloudera.
In January the company launched Hortonworks Data Platform version 2, offering better performance and availability through a next-generation MapReduce architecture, enhanced scalability with Hadoop Distributed File System (HDFS) federation, and improved data integrity provided by HDFS NameNode high availability.
And yes, the Sunnyvale, Calif., company's name comes from the Dr. Seuss book "Horton Hears A Who," in keeping with the Hadoop elephant theme.
While Hadoop may be the de facto engine for processing large amounts of data, it's largely used for batch processing. Analyzing data in real time takes the value of Hadoop to a whole new level. And that's where HStreaming comes in.
Founded in 2010, Chicago-based HStreaming offers a scalable, continuous data analysis system built on Hadoop. It allows organizations to analyze, visualize and act upon massive amounts of continuous data -- such as a financial trading system might generate -- in real time.
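The difference between batch and continuous analysis can be sketched in a few lines of plain Python. This is only an illustration of the general idea, not HStreaming's or Hadoop's actual API; the function names and the sample trade prices are hypothetical.

```python
from typing import Iterable, Iterator, List


def batch_average(prices: List[float]) -> float:
    """Batch style: wait for the full data set, then compute once."""
    return sum(prices) / len(prices)


def running_average(prices: Iterable[float]) -> Iterator[float]:
    """Continuous style: emit an updated result as each record arrives."""
    total = 0.0
    count = 0
    for price in prices:
        total += price
        count += 1
        yield total / count


# Hypothetical trade prices, standing in for a continuous feed.
trades = [10.0, 12.0, 11.0, 13.0]

batch_result = batch_average(trades)            # one answer, at the end
stream_results = list(running_average(trades))  # one answer per record
```

The batch function produces nothing until all the data is in hand, while the continuous version has a usable answer after every record, which is the property a trading desk would care about.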
While most Hadoop-related companies are independent start-ups, Hyve Solutions is a division of IT distributor Synnex. Founded last year, Hyve Solutions offers turnkey appliances called the Big D Series 8 that the company said make it possible to implement a Hadoop-based big data analysis system in days rather than months.
The Hyve Solutions platforms incorporate such big data technology as Zettaset's Hadoop-based fault-tolerant system, cloud networking gear from Arista Networks, network interface hardware and software from Solarflare Communications, and Fusion-io's Flash memory data storage technology.
Cupertino, Calif.-based Karmasphere calls itself the leader in "big data intelligence" with its software tools for extracting and analyzing data from Hadoop.
Karmasphere Analyst gives information analysts access to structured and unstructured data in Hadoop, allowing them to make ad hoc queries, and visualize and interact with the results. Karmasphere Studio provides tools for developing custom algorithms that run on Hadoop. And the Karmasphere Analytics Engine is the foundation for the company's software.
Karmasphere, launched in March 2010, has partnered with virtually all the vendors and organizations with Hadoop distributions, including the Apache Software Foundation, IBM, Cloudera, Amazon Web Services and Hortonworks. In February the company debuted Karmasphere Analyst 1.8 with new parallel query capabilities.
MapR Technologies offers a distribution of Apache Hadoop, putting it in competition with Cloudera and Hortonworks, among others. But the company, founded in June 2009, has some key advantages including a strategic alliance with EMC and $20 million raised in second-round funding in August.
MapR, based in San Jose, Calif., unveiled version 1.2 of its MapR Hadoop distribution in December with new virtual machine capabilities, a high-performance native access library, Mac and Windows clients, and the ability to take advantage of MapReduce 2.0 technology.
Mortar Data bills itself as "Hadoop, without the complexity." The New York-based company offers cloud-based Hadoop services to customers who are "sitting on a pile of underutilized data" and says it can have customers up and running in less than one hour.
Mortar Data, founded in 2010, creates a private, on-demand Hadoop cluster for clients' big data projects and creates "optimized jobs for execution" using Pig and Python. Amazon's S3 cloud storage is used for data read-write. Customers pay only for the time needed to run their jobs, without all the expense associated with IT infrastructure and with hiring and training engineers.
Tidemark Systems, Redwood City, Calif., develops what it calls the first enterprise performance management platform and applications built for cloud computing. Big data comes into the picture because the Tidemark EPM application system is built on Cloudera's distribution of Hadoop, allowing it to extract value from massive volumes of complex data.
Founded in 2010, Tidemark is targeting applications in manufacturing, consumer products, retail and high-tech companies. In January the company secured $24 million in third-round funding from venture capitalists and PeopleSoft founder Dave Duffield.
Originally launched in 2009 with the name GOTO Metrics, Zettaset has developed a fault-tolerant system built on Hadoop and other open-source technologies for aggregating and analyzing massive amounts of data. The technology, according to the company, helps manage the health, security and administration of the entire enterprise Hadoop system.
Zettaset, based in Mountain View, Calif., launched version 4 of its software in December with new service management features and a unique visual user interface. In July, after closing a $3 million round of financing, the company renamed itself for the zettabyte -- a unit equal to 1 million petabytes or 1 billion terabytes of data.
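For readers who want to check the zettabyte arithmetic, here is a short sketch. It assumes decimal (SI) units, where each step up is a factor of 1,000; the constant names are chosen for illustration.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption;
# binary units based on powers of 1,024 would give different figures).
TERABYTE = 10**12
PETABYTE = 10**15
ZETTABYTE = 10**21

# How many of each smaller unit fit in one zettabyte.
petabytes_per_zettabyte = ZETTABYTE // PETABYTE
terabytes_per_zettabyte = ZETTABYTE // TERABYTE

print(f"{petabytes_per_zettabyte:,} petabytes")
print(f"{terabytes_per_zettabyte:,} terabytes")
```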