EMC, NetApp Go Open Source With Hadoop

EMC plans to provide full open-source support for Hadoop by releasing software, appliance, and eventually virtual appliance versions of Hadoop, built in connection with technology it acquired with Greenplum last year.

NetApp unveiled a new Hadoop storage appliance based on the E2600 storage system it received with its acquisition of Engenio, which closed this week.

The Apache Hadoop project is a framework for running applications on large clusters built using commodity hardware. Hadoop works by breaking an application into multiple small fragments of work, each of which may be executed or re-executed on any node in the cluster.

It includes the Hadoop Distributed File System (HDFS) for reliably storing very large files across machines in a large cluster.
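
The idea of splitting an application into small fragments that can be executed, or re-executed, on any node can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: it does not use Hadoop's API, and the function names, node names, and the `failing` parameter (a set of simulated failures) are all hypothetical.

```python
from collections import Counter


def split_into_fragments(lines, fragment_size):
    """Break the input into small, independent fragments of work."""
    return [lines[i:i + fragment_size] for i in range(0, len(lines), fragment_size)]


def count_words(fragment):
    """The work done on one fragment: count the words it contains."""
    counts = Counter()
    for line in fragment:
        counts.update(line.lower().split())
    return counts


def run_job(lines, fragment_size=1, nodes=("node-a", "node-b", "node-c"),
            failing=frozenset()):
    """Run every fragment, re-executing it on another node if an attempt fails.

    `failing` is a hypothetical set of (fragment_index, node) pairs used
    only to simulate a node failing while it processes a fragment.
    """
    total = Counter()
    log = []
    fragments = split_into_fragments(lines, fragment_size)
    for index, fragment in enumerate(fragments):
        for attempt in range(len(nodes)):
            # Any node may run any fragment; rotate through them on retry.
            node = nodes[(index + attempt) % len(nodes)]
            if (index, node) in failing:
                log.append((index, node, "failed"))
                continue  # re-execute the same fragment elsewhere
            total += count_words(fragment)
            log.append((index, node, "ok"))
            break
        else:
            raise RuntimeError(f"fragment {index} failed on every node")
    return total, log


if __name__ == "__main__":
    data = ["big data big", "hadoop data", "big cluster"]
    totals, log = run_job(data, failing={(1, "node-b")})
    print(dict(totals))
    for entry in log:
        print(entry)
```

Because each fragment is independent, a failed attempt costs only that fragment's work, and the final totals are the same whether or not a retry was needed.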

EMC this week introduced the Greenplum HD Data Computing Appliance, a purpose-built, high-performance Hadoop appliance for co-processing both structured and unstructured data within a single solution.

EMC also introduced two Hadoop-based software applications.

The EMC Greenplum HD Enterprise Edition software is a 100 percent interface-compatible implementation of the Apache Hadoop stack. EMC said it will provide data management features such as snapshots and wide area replication, simple data loading and access using a native network file system (NFS) interface, and end-to-end management of such items as cluster deployments and automatic failure detection.

The EMC Greenplum HD Community Edition is a 100 percent open-source certified and supported version of the Apache Hadoop stack. Luke Lonergan, CTO and vice president of EMC's Data Computing Product division and co-founder of Greenplum, said Hadoop is a movement to get to the core of the deep insight hidden in the massive amounts of data with which customers are struggling.

EMC is bringing its experience in performance and workload management to Hadoop to help customers solve the problem of getting that insight from their data, Lonergan said.

EMC is also working with 12 technology partners that offer business intelligence, data transfer, and similar technologies to build an ecosystem around its Greenplum Hadoop solutions.

One of those partners, Orlando, Fla.-based business intelligence software developer Pentaho, introduced native adaptor support for Greenplum GPLoad (bulk loader). Pentaho said this enables Greenplum and Pentaho customers to leverage Pentaho’s data integration capabilities to capture, prepare, and manage the movement of large amounts of data into and out of Greenplum.

NetApp this week said its Hadoop storage solution is a preconfigured and modular appliance based on its NetApp E2600 storage system, which it received with the Engenio acquisition. The solution has a base configuration of 16 to 32 nodes, the company said.

Hadoop is still more of a science project, but it is already attracting a lot of interest from customers, said Val Bercovici, NetApp's "cloud czar" in the office of the CTO.

By building an appliance version of Hadoop, NetApp will help customers deploy Hadoop clusters in hours versus weeks, Bercovici said.

"There are still a lot of variables," he said. "We will do all the validations and take care of the configuration."

NetApp has a slightly different approach to Hadoop than EMC, Bercovici said. NetApp is using the core open-source distribution of Hadoop as developed by Apache, while EMC is basing its offerings on a proprietary branch of the Hadoop project, he said. In addition, EMC is offering multiple versions of Hadoop, which he said can lead to potential confusion.

The EMC Greenplum HD Community Edition, EMC Greenplum HD Enterprise Edition, and EMC Greenplum HD Data Computing Appliance are slated to ship in the third quarter of calendar 2011.

NetApp expects its Hadoop solution to ship in the second quarter of the company's fiscal year 2012.