5 Security Requirements For Big Data Hadoop Implementations

Hadoop Prompts Data Control, Security Challenges

Using advanced analytics platforms such as Hadoop to mine big data in the enterprise has raised concerns about how to secure and control the data repositories. By distributing data storage and data management across a large number of nodes, organizations can attack a large problem with parallelism and get results from large bodies of data much faster. Hadoop also depends less on a strict data schema, so a greater volume and variety of data can be ingested, giving analysts more flexibility and speed. That same approach, however, can make it harder to identify and secure sensitive data, according to Fremont, Calif.-based Dataguise, which sells a data governance and encryption platform.
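The schema flexibility is part of what makes discovery hard. Here is a minimal sketch in plain Python (not Hadoop APIs, and with made-up records) of how the same sensitive value can arrive under different field names, or buried in free text, so that no single column can simply be locked down:

```python
# Illustrative, schema-less records as they might land in a Hadoop directory.
records = [
    {"customer": "Jane Roe", "ssn": "078-05-1120"},             # structured field
    {"user_id": 4821, "notes": "caller gave SSN 078-05-1120"},  # embedded in a log line
    {"tax_id": "078051120", "region": "CA"},                     # same value, different name and format
]

# Without a fixed schema there is no single column to protect; every field of
# every record has to be inspected before controls can be applied.
for record in records:
    for field, value in record.items():
        print(f"inspect {field!r}: {value!r}")
```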

Thorough Risk Assessment

Organizations should address data privacy and security issues early in the planning phase of a deployment, according to Dataguise. A recent survey conducted by the firm found that the data in most Apache Hadoop projects consists of log files, followed by structured DBMS data and mixed data types. Identify which data elements the organization defines as sensitive, and consider the privacy policies and compliance mandates it must heed.
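One way to make that planning concrete is to write the sensitivity definitions down as data. The sketch below is a hypothetical classification policy in Python; the element names, regular expressions and mandate labels are illustrative assumptions, not a standard:

```python
import re

# Hypothetical classification policy: which data elements are treated as
# sensitive, and which mandate drives each one.
SENSITIVE_ELEMENTS = {
    "ssn":         {"pattern": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),       "mandate": "privacy policy"},
    "credit_card": {"pattern": re.compile(r"\b(?:\d[ -]?){13,16}\b"),      "mandate": "PCI DSS"},
    "email":       {"pattern": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "mandate": "privacy policy"},
}

def classify(text: str) -> list[str]:
    """Return the sensitive element types found in a piece of text."""
    return [name for name, rule in SENSITIVE_ELEMENTS.items() if rule["pattern"].search(text)]

print(classify("contact jane@example.com, SSN 078-05-1120"))  # ['ssn', 'email']
```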

Do You Know Where Your Sensitive Data Resides?

Discover whether sensitive data is embedded in the environment, has already been assembled in Hadoop or will be assembled there, according to Dataguise. Forward-thinking organizations will have already gone through data-classification projects to understand where sensitive data is stored and to put security controls and policies in place to protect it. A Dataguise survey found that marketing, sales and customer support are the main divisions using Hadoop.
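A discovery pass can then apply those definitions to the data already landed in the cluster. The sketch below walks a local directory as a stand-in for an HDFS path (a real scanner would read HDFS directly); the patterns and the /data/hadoop-landing path are assumptions for illustration:

```python
import pathlib
import re

PII_PATTERNS = {
    "ssn":         re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_directory(root: str) -> dict[str, list[str]]:
    """Map each PII type to the files in which it was found."""
    findings: dict[str, list[str]] = {name: [] for name in PII_PATTERNS}
    for path in pathlib.Path(root).rglob("*"):
        if not path.is_file():
            continue
        text = path.read_text(errors="ignore")
        for name, pattern in PII_PATTERNS.items():
            if pattern.search(text):
                findings[name].append(str(path))
    return findings

if __name__ == "__main__":
    print(scan_directory("/data/hadoop-landing"))  # hypothetical landing directory
```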

Maintain Compliance With Regulatory Mandates

Determine the compliance exposure risk based on the information collected, Dataguise said. The data may include personally identifiable information in the form of names, addresses and Social Security numbers. The firm's recent survey found a small number of organizations storing sensitive data in Hadoop, including Social Security numbers, credit card numbers and addresses. Under the Payment Card Industry Data Security Standard (PCI DSS), organizations must keep stored cardholder data to a minimum and render stored card numbers unreadable, with encryption among the accepted methods.
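Because simple pattern matches over-report card numbers, compliance scans typically confirm candidates with a Luhn checksum before flagging them for review. A minimal, self-contained check:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum used by card numbers."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:
        return False
    checksum = 0
    # Double every second digit from the right, subtracting 9 when the result exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True  (a well-known test number)
print(luhn_valid("4111 1111 1111 1112"))  # False (fails the checksum)
```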

Data Masking And Data Encryption

Data-protection products should support both data-masking and encryption technologies, Dataguise said. Organizations can also choose point products for each remediation technique, but that becomes a challenge if the implementation keeps both masked and unmasked versions of sensitive data in separate Hadoop directories. Data masking replaces sensitive data elements within data stores with false but realistic values, so the data remains meaningful to application logic. For analytic accuracy, masking must be consistent across all data files. Determine whether business analytic needs require access to real data or whether desensitized data will suffice, Dataguise said.
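Consistency is the tricky part of masking: the same input must produce the same masked value in every file so that joins and aggregations still line up. One common way to get that is deterministic, keyed masking; the sketch below uses an HMAC for illustration, with a hard-coded key that a real deployment would keep in a key-management system:

```python
import hmac
import hashlib

MASKING_KEY = b"example-masking-key"  # hypothetical key; store in a KMS in practice

def mask_ssn(ssn: str) -> str:
    """Replace an SSN with a consistent, non-reversible token that keeps its format."""
    digest = hmac.new(MASKING_KEY, ssn.encode(), hashlib.sha256).hexdigest()
    # Derive nine pseudo-digits from the digest so the masked value still looks like an SSN.
    digits = "".join(str(int(digest[i:i + 2], 16) % 10) for i in range(0, 18, 2))
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"

# The same input masks to the same output in every file, so cross-file joins still work.
print(mask_ssn("078-05-1120"))
print(mask_ssn("078-05-1120"))
```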

Encryption Must Work With Access Control Technology

The encryption and access-control technology must allow users with different credentials the appropriate selective access to data in the Hadoop cluster, Dataguise said. Deploy encryption carefully: security experts say poorly implemented encryption is a serious weakness that can render the protection meaningless.
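A simple way to picture selective access is field-level encryption with per-role keys: fields are encrypted once, and only roles holding the matching key can read the plaintext. The sketch below uses the third-party cryptography package's Fernet primitive; the role names and in-memory key map are assumptions, and in a real cluster the keys would live in a KMS tied to Kerberos or role credentials:

```python
from cryptography.fernet import Fernet  # third-party: pip install cryptography

# Hypothetical key map: each sensitive field class has its own key, and a role
# is granted only the keys it needs.
FIELD_KEYS = {"ssn": Fernet.generate_key()}
ROLE_KEYS = {
    "fraud_analyst":     {"ssn": FIELD_KEYS["ssn"]},  # may decrypt SSNs
    "marketing_analyst": {},                          # sees only ciphertext
}

def encrypt_field(field: str, value: str) -> bytes:
    """Encrypt one field value with that field's key."""
    return Fernet(FIELD_KEYS[field]).encrypt(value.encode())

def read_field(role: str, field: str, ciphertext: bytes) -> str:
    """Decrypt a field only if the role holds the matching key."""
    keys = ROLE_KEYS[role]
    if field not in keys:
        return "<redacted>"  # no key for this role, so no access
    return Fernet(keys[field]).decrypt(ciphertext).decode()

token = encrypt_field("ssn", "078-05-1120")
print(read_field("fraud_analyst", "ssn", token))      # 078-05-1120
print(read_field("marketing_analyst", "ssn", token))  # <redacted>
```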