Big Data Buzzwords From A to Z
4:00 PM EST Wed. Nov. 28, 2012
Big data is one of the, well, biggest trends in IT today, and it has spawned a whole new generation of technology to handle it. And, with new technologies come new buzzwords: acronyms, technical terms, product names, etc.
Even the phrase "big data" itself can be confusing. Many think of "lots of data" when they hear it, but big data is much more than just data volume.
Here, in alphabetical order, are some of the buzzwords we think you need to be familiar with.
An acronym for Atomicity, Consistency, Isolation and Durability, ACID is a set of requirements or properties that, when adhered to, ensure the data integrity of database transactions during processing. While ACID has been around for a while, the explosion in transaction data volumes has focused more attention on the need for meeting ACID provisions when working with big data.
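To make the "atomicity" part concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names in it and the helper functions are assumptions made up for illustration, not part of any product mentioned here. A transfer credits one account and then debits another; if the debit would push a balance below zero, the whole transaction is rolled back, including the credit that had already gone through.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This debit violates the CHECK constraint when funds are short.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a plain dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

The credit is deliberately issued first so that a failed debit leaves something to undo; after a failed transfer the balances should look exactly as they did before it started.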
IT systems today pump out data that's "big" on volume, velocity and variety.
Volume: IDC estimates that the volume of world information will reach 2.7 zettabytes this year (that's 2.7 billion terabytes), and that volume is doubling every two years.
Velocity: It's not just the amount of data that's causing headaches for IT managers, but the increasingly rapid speed at which data is flowing from financial systems, retail systems, websites, sensors, RFID chips and social networks like Facebook, Twitter, etc.
Variety: Going back five, maybe 10 years, IT mostly dealt with alphanumeric data that was easy to store in neat rows and columns in relational databases. No longer. Today, unstructured data, such as Tweets and Facebook posts, documents, Web content and so on, is all part of the big data mix.
Some new-generation databases (such as the open-source Cassandra and HP's Vertica) are designed to store data by column rather than by row as traditional SQL databases do. Their design provides faster disk access, improving their performance when handling big data. Columnar databases are especially popular for data-intensive business analytics applications.
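The difference between row and column storage can be sketched in a few lines of Python. This is only an in-memory illustration with made-up order data; real columnar databases add compression, on-disk layouts and much more. The point is that an analytic query such as "total the amount field" has to touch every whole record in the row layout but only one list in the column layout.

```python
# Hypothetical sample data: each dict is one row, as a row store would keep it.
ROWS = [
    {"order_id": 1, "region": "east", "amount": 120},
    {"order_id": 2, "region": "west", "amount": 80},
    {"order_id": 3, "region": "east", "amount": 200},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def total_from_rows(rows, field):
    """Row layout: every whole record is visited to read a single field."""
    return sum(row[field] for row in rows)


def total_from_columns(columns, field):
    """Column layout: only the one column the query needs is read."""
    return sum(columns[field])
```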
The concept of data warehousing, copying data from multiple operational IT systems into a secondary, off-line database for business analytics applications, has been around for about 25 years.
But as data volumes explode, data warehouse systems are rapidly changing. They need to store more data -- and more kinds of data -- making their management a challenge. And where 10 or 20 years ago data might have been copied into a data warehouse system on a weekly or monthly basis, data warehouses today are refreshed far more frequently with some even updated in real time.
Extract, transform and load (ETL) software is used when moving data from one database, such as one supporting a banking application's transaction processing system, to another, such as a data warehouse used for business analytics. Data often needs to be reformatted and cleaned up during the transfer.
The performance demands on ETL tools have increased as data volumes have grown exponentially and data processing speeds have accelerated.
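The three ETL steps can be sketched with nothing more than Python's standard library. Everything specific here is an assumption for illustration: the CSV sample stands in for an export from an operational system, the sales table stands in for a warehouse, and the cleanup rules (trim names, drop rows with no amount, store money as cents) are just examples of the reformatting such tools perform.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system, with messy values.
RAW = """customer,amount,date
 alice ,100.50,2012-11-01
bob,,2012-11-02
Carol,75,2012-11-03
"""


def extract(text):
    """Extract: read the source records as dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: tidy names, drop incomplete rows, convert dollars to cents."""
    cleaned = []
    for rec in records:
        if not rec["amount"].strip():
            continue  # skip rows with a missing amount
        cleaned.append(
            (
                rec["customer"].strip().title(),
                round(float(rec["amount"]) * 100),
                rec["date"],
            )
        )
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer TEXT, amount_cents INTEGER, sale_date TEXT)"
    )
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return len(rows)
```

A run would chain the steps, for example load(sqlite3.connect(":memory:"), transform(extract(RAW))).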
Flume, a technology in the Apache Hadoop family (others include HBase, Hive, Oozie, Pig and Whirr), is a framework for populating Hadoop with data. The technology uses agents scattered across application servers, Web servers, mobile devices and other systems to collect data and transfer it to a Hadoop system.
A business, for example, could use Apache Flume running on a Web server to collect data from Twitter posts for analysis.
One trend fueling big data is the increasing volume of geospatial data being generated and collected by IT systems today. A picture may be worth 1,000 words, so it's no surprise the growing number of maps, charts, photographs and other geographic-based content is a major driver of today's big data explosion.
Geospatial analysis is a specific form of data visualization (see "V" for visualization) that overlays data on geographical maps to help users better understand the results of big data analysis.
Hadoop is an open-source platform for developing distributed, data-intensive applications. It's controlled by the Apache Software Foundation.
Hadoop was created by Yahoo developer Doug Cutting, who based it on Google Labs' MapReduce concept and named it after his infant son's toy elephant.
Bonus "H" entries: HBase is a non-relational database developed as part of the Hadoop project. The Hadoop Distributed File System (HDFS) is a key component of Hadoop. And Hive is a data warehouse system built on Hadoop.
Computers generally retrieve data from disk drives as they process transactions or perform queries. But, that can be too slow when IT systems are working with big data.
In-memory database systems utilize a computer's main memory to store frequently used data, greatly reducing processing times. In-memory database products include SAP HANA and the Oracle TimesTen In-Memory Database.
Java is a programming language developed at Sun Microsystems and released in 1995. Hadoop and a number of other big data technologies were built using Java, and it remains a dominant development technology in the big data world.
Kafka is a high-throughput, distributed messaging system originally developed at LinkedIn to manage the service's activity stream (data about a Website's usage) and operational data processing pipeline (about the performance of server components).
Kafka is effective for processing large volumes of streaming data -- a key issue in many big data computing environments. Storm, developed by Twitter, is another stream-processing technology that's catching on.
The Apache Software Foundation has taken Kafka on as an open-source project. No jokes about buggy software, please ...
Latency is the delay in delivering data from one point to another, or the time a system, such as an application, takes to respond to a request from another.
While the term isn't new, you're hearing it more often today as data volumes grow and IT systems struggle to keep up. "Low latency" is good; "high latency" is bad.
Map/reduce is a way of breaking up a complex problem into smaller chunks, distributing them across many computers and then reassembling them into a single answer.
Google's search system utilizes map/reduce concepts and the company has a framework with the brand name MapReduce.
In 2004, Google released a white paper describing its use of map/reduce. Doug Cutting recognized its potential and developed the first release of Hadoop, which also incorporates map/reduce concepts.
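The classic way to illustrate map/reduce is a word count. The sketch below runs on a single machine and is not Hadoop's or Google's API; the function names are assumptions chosen for the example. Each chunk of text is mapped to (word, 1) pairs, the pairs are grouped by word, and each group is reduced to a total. On a real cluster, the map and reduce calls would be spread across many computers.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group the values emitted by all mappers by their key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all the values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run the three phases in sequence and reassemble a single answer."""
    # In a distributed setting each chunk would be mapped on a different node.
    mapped = [map_phase(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```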
Most mainstream databases (such as the Oracle Database and Microsoft SQL Server) are based on a relational architecture and use Structured Query Language (SQL) for development and data management.
But a new generation of database systems dubbed "NoSQL" (which some now say stands for "Not only SQL") is based on architectures that proponents argue are better for handling big data.
Some NoSQL databases are designed for scalability and flexibility whereas others are more efficient at handling documents and other unstructured data. Examples include Hadoop/HBase, Cassandra, MongoDB and CouchDB. Some big vendors, such as Oracle, have also launched their own NoSQL products.
Apache Oozie is an open-source workflow engine that's used to help manage processing jobs for Hadoop. Using Oozie, a series of jobs can be defined in multiple languages, such as Pig and MapReduce, and then linked to each other. That allows a programmer to launch a data analysis query once a job to collect data from an operational application has finished, for example.
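The idea of linking jobs so that one starts only after another has finished can be sketched in Python. To be clear, this is not how Oozie workflows are written (Oozie has its own workflow definitions); it is a stand-in that uses the standard library's graphlib module (Python 3.9 or later), and the job names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(jobs, dependencies):
    """Run each job only after every job it depends on has finished.

    jobs maps a job name to a zero-argument callable.
    dependencies maps a job name to the names that must run first.
    Returns a list of (job name, job result) in the order executed.
    """
    graph = {name: dependencies.get(name, ()) for name in jobs}
    log = []
    for name in TopologicalSorter(graph).static_order():
        log.append((name, jobs[name]()))
    return log


# Hypothetical three-step pipeline: collect, then clean, then analyze.
JOBS = {
    "analyze": lambda: "report ready",
    "clean": lambda: "rows cleaned",
    "collect": lambda: "data collected",
}
DEPENDENCIES = {"clean": ["collect"], "analyze": ["clean"]}
```

Even though the analysis job is listed first, it runs last because it depends on the cleanup job, which in turn depends on the collection job.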
Pig, another Apache Software Foundation project, is a platform for analyzing huge data sets. At its core, it's a programming language for developing parallel computation queries that run on Hadoop.
Quantitative data analysis is the use of complex mathematical or statistical modeling to explain financial and business behavior or even predict future behavior.
With the exploding volumes of data being collected today, quantitative data analysis has become more complex. But more data also holds the promise of more data analysis opportunities for companies that know how to use it to gain better visibility and insights into their businesses and spot market trends.
One problem: There's a serious shortage of people with these kinds of analytical skills. Consulting firm McKinsey says there is a need for 1.5 million additional analysts and managers with big data analysis skills in the U.S.
Relational database management systems, including IBM's DB2, Microsoft's SQL Server and the Oracle Database, are the most widely used type of database today. Most corporate transaction processing systems run on RDBMSes, from banking applications to retail point-of-sale systems to inventory management applications.
But, some argue that relational databases may be unable to keep up with today's exploding volume and variety of data. RDBMSes, for example, were designed with alphanumeric data in mind and aren't as effective when working with unstructured data.
As databases become ever larger, they become more difficult to work with. Sharding is a form of database partitioning that breaks a database up into smaller, more easily managed parts. Specifically, a database table is partitioned horizontally, so that different groups of rows are stored and managed separately.
Sharding allows segments of a huge database to be distributed across multiple servers, improving the overall speed and performance of the database.
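One common way to decide which shard a row belongs to is to hash its key. The sketch below is a simplified assumption, not the scheme of any particular database: the ShardedTable class is hypothetical, and each "shard" is just a Python dict standing in for a separate server. A stable hash from hashlib is used because Python's built-in hash() for strings varies between runs.

```python
import hashlib


class ShardedTable:
    """Toy horizontally partitioned table; each shard stands in for a server."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def shard_for(self, key):
        """Pick a shard by hashing the row key, so the choice is repeatable."""
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, row):
        """Store a row on the one shard that owns its key."""
        self.shards[self.shard_for(key)][key] = row

    def get(self, key):
        """Look a row up by asking only the shard that owns its key."""
        return self.shards[self.shard_for(key)].get(key)
```

A lookup touches a single shard rather than the whole data set, which is where the speed benefit comes from. One design caveat: with simple modulo hashing, changing the number of shards moves most keys, which is why production systems often use more elaborate schemes.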
Bonus "S" entry: Sqoop is an open-source tool for moving data from non-Hadoop sources, such as relational databases, into Hadoop.
One of the contributors to the big data problem is the increasing amount of text being collected from social media sites like Twitter and Facebook, external news feeds and even within a company for analysis. Because text is unstructured (unlike structured data typically stored in relational databases), mainstream business analytics tools often falter when faced with text.
Text analytics uses a range of techniques -- from keyword search to statistical analysis to linguistic approaches -- to derive insight from text-based data.
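The two simplest techniques on that list, keyword search and basic statistics, can be sketched in Python. The stop-word list, the sample comments and the function names are all assumptions for illustration; commercial text analytics tools go far beyond this with linguistic models.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real tools use much longer ones.
STOPWORDS = {"the", "a", "is", "and", "of", "to", "in"}

# Made-up customer comments standing in for social media text.
DOCS = [
    "The service is slow today",
    "Great service and great prices",
    "Slow shipping, poor service",
]


def tokenize(text):
    """Lowercase the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word not in STOPWORDS]


def keyword_search(docs, keyword):
    """Return the positions of the documents that mention the keyword."""
    target = keyword.lower()
    return [i for i, doc in enumerate(docs) if target in tokenize(doc)]


def top_terms(docs, n):
    """Count how often each term appears across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts.most_common(n)
```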
Until recent years, most data was structured, the kind of alphanumeric information (such as financial data from sales transactions) that could be easily stored in a relational database and analyzed by business intelligence tools.
But, a big chunk of the 2.7 zettabytes of stored data today is unstructured, such as text-based documents, tweets, photos posted on Flickr, videos posted on YouTube and so on. (Fun fact: Thirty-five hours of content are uploaded to YouTube every minute.)
Processing, storing and analyzing all that messy unstructured stuff are often challenges for today's IT systems.
As the volume of data grows, it becomes increasingly difficult for people to understand it using static charts and graphs. That's led to the development of a new generation of data visualization and analysis tools that present data in new ways to help people make sense of huge amounts of information.
These tools include color-coded heat maps, three-dimensional graphs, animated visualizations that show changes over time and geospatial representations that overlay data on geographical maps. Today's advanced data visualization tools are also more interactive, such as allowing a user to zoom in on a data subset for closer inspection.
Apache Whirr is a set of libraries for running big data cloud services. More specifically, it speeds up the deployment of Hadoop clusters on virtual infrastructure such as Amazon EC2 and Rackspace.
Extensible Markup Language is used to transport and store data (not to be confused with HTML, which is used to display data). With XML, programmers can create common data formats and share both the information and the format through the Web.
Because XML documents can be very large and complex, they are often seen as contributing to IT organizations' big data challenges.
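As a small illustration of XML carrying data rather than displaying it, the sketch below parses a made-up order document with Python's standard xml.etree.ElementTree module. The element and attribute names are assumptions invented for the example, not a standard format.

```python
import xml.etree.ElementTree as ET

# A hypothetical data format that two systems have agreed to share.
DOC = """<orders>
  <order id="1"><customer>Alice</customer><total currency="USD">100.50</total></order>
  <order id="2"><customer>Bob</customer><total currency="USD">75.00</total></order>
</orders>"""


def parse_orders(xml_text):
    """Turn the XML document into a list of plain Python dicts."""
    root = ET.fromstring(xml_text)
    orders = []
    for order in root.findall("order"):
        orders.append(
            {
                "id": int(order.get("id")),
                "customer": order.findtext("customer"),
                "total": float(order.findtext("total")),
                "currency": order.find("total").get("currency"),
            }
        )
    return orders
```

Note that this approach loads the whole document into memory, which is one reason very large XML files are a headache; streaming parsers exist for that case.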
A yottabyte is a data storage benchmark that's equal to 1,000 zettabytes. The total amount of data stored worldwide is expected to reach 2.7 zettabytes this year, up 48 percent from 2011, according to an IDC calculation. So we're a long way from reaching the yottabyte threshold -- although with the rate of big data growth, it might come sooner than we think.
Just to review, a zettabyte is one sextillion bytes of data. It's equal to 1,000 exabytes, 1 million petabytes and 1 billion terabytes.
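For anyone who wants to check the arithmetic, here is a short Python sketch of the decimal storage units involved. The convert helper is a hypothetical name; it uses exact fractions so that a figure like 2.7 zettabytes is not distorted by floating-point rounding (pass non-integer amounts as strings for that reason).

```python
from fractions import Fraction

# Decimal (power-of-ten) storage units, expressed in bytes.
UNITS = {
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
    "yottabyte": 10**24,
}


def convert(amount, src, dst):
    """Convert an amount from one unit to another using exact arithmetic."""
    return Fraction(amount) * UNITS[src] / UNITS[dst]
```

With it, convert("2.7", "zettabyte", "terabyte") works out to 2.7 billion, and the same 2.7 zettabytes is only 27 ten-thousandths of a yottabyte.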
ZooKeeper was created by the Apache Software Foundation to help Hadoop users manage and coordinate Hadoop nodes across a distributed network.
Closely integrated with HBase, the database associated with Hadoop, ZooKeeper is a centralized service for maintaining configuration information, naming services, distributed synchronization and other group services. IT managers use it to implement reliable messaging, synchronize process execution and implement redundant services.