
Harden Hadoop environment security

Last Updated: May 08, 2018

About Hadoop

Hadoop is an open-source, highly reliable, and extensible distributed computing framework developed by the Apache Software Foundation. Its two core components are HDFS and MapReduce.

HDFS provides storage for massive amounts of data, and MapReduce provides computation over that data.

  • HDFS is an open-source implementation of Google File System (GFS).
  • MapReduce is a programming model for parallel operations on large-scale datasets (greater than 1 TB).

Hadoop security issues

1. WebUI sensitive information leak

By default, Hadoop opens many ports to serve its WebUI. The following table lists the default WebUI port of each node type:

Module      Node                      Default port
HDFS        NameNode                  50070
HDFS        SecondaryNameNode         50090
HDFS        DataNode                  50075
HDFS        Backup/Checkpoint node    50105
MapReduce   JobTracker                50030
MapReduce   TaskTracker               50060

Anyone who can reach port 50070, the default port of the NameNode WebUI management interface, can browse and download any file stored in HDFS. Additionally, if the DataNode’s default port 50075 is open, attackers can manipulate data stored in HDFS through the RESTful API that HDFS provides.
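To see which of these WebUI ports a host actually exposes, a simple TCP connection check is often enough. The following is a minimal sketch, not a full scanner: the function name `find_open_ports` and the placeholder host `127.0.0.1` are illustrative assumptions, and the port list is copied from the table above. Run it only against hosts that you are authorized to test.

```python
import socket

# Default WebUI ports taken from the table above.
HADOOP_WEBUI_PORTS = {
    50070: "HDFS NameNode",
    50090: "HDFS SecondaryNameNode",
    50075: "HDFS DataNode",
    50105: "HDFS Backup/Checkpoint node",
    50030: "MapReduce JobTracker",
    50060: "MapReduce TaskTracker",
}


def find_open_ports(host, ports, timeout=1.0):
    """Return the sorted list of ports on `host` that accept a TCP connection."""
    open_ports = []
    for port in sorted(ports):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            # Refused, filtered, or timed out: treat the port as not reachable.
            pass
    return open_ports


if __name__ == "__main__":
    # "127.0.0.1" is a placeholder; replace it with the host to check.
    for port in find_open_ports("127.0.0.1", HADOOP_WEBUI_PORTS):
        print(f"{port} ({HADOOP_WEBUI_PORTS[port]}) is reachable")
```

A port that is reachable from outside the intranet is a candidate for the network access control measures described later in this topic.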

2. MapReduce code execution vulnerability

3. Vulnerabilities of third-party plug-ins in Hadoop

4. Hive arbitrary command/code execution vulnerability

Hive is a data warehouse infrastructure built on Hadoop. It provides a set of tools for data Extract-Transform-Load (ETL), and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive uses a simple SQL-like query language called HQL, which allows users who are familiar with SQL to query data. The language also allows developers who are familiar with MapReduce to plug in custom mappers and reducers for complex analysis tasks that the built-in mappers and reducers cannot handle.

HQL provides the transform command, which replaces the Map/Reduce scripts used by Hive with custom shell or Python scripts. Because these scripts run on the cluster, an attacker who can reach the Hive interface can use this command to execute arbitrary commands and obtain server privileges.
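One way to spot use of the transform command is to scan submitted HQL statements for a `TRANSFORM ... USING '<command>'` clause. The sketch below is a heuristic under stated assumptions: the function name `find_transform_commands` and the sample table `users` are hypothetical, and the regular expression is not a real HQL parser, so it can miss obfuscated statements or match text inside comments and string literals.

```python
import re

# Matches TRANSFORM ... USING '<command>' (or "<command>"), case-insensitively.
# This is a heuristic pattern, not an HQL grammar.
_TRANSFORM_USING = re.compile(
    r"\bTRANSFORM\b.*?\bUSING\s+(['\"])(.*?)\1",
    re.IGNORECASE | re.DOTALL,
)


def find_transform_commands(hql):
    """Return the external commands referenced by TRANSFORM clauses in `hql`."""
    return [match.group(2) for match in _TRANSFORM_USING.finditer(hql)]


if __name__ == "__main__":
    # The table name "users" is hypothetical.
    query = "SELECT TRANSFORM(name) USING '/bin/cat' AS (n) FROM users"
    print(find_transform_commands(query))
```

Statements flagged this way still need manual review, because a legitimate job may also use custom scripts.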

The information above shows that exposed service ports harbor severe security risks.

Security hardening solutions

As the issues above show, exposing Hadoop service ports creates severe security risks. To harden the Hadoop environment, apply the following configurations.

1. Network access control

Use the Security Group firewall or the local operating system's firewall to restrict which IP addresses can access Hadoop services. If your application serves only intranet servers, we recommend that you do not expose any Hadoop service port to the Internet.
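As an illustration of the local-firewall option, the following sketch generates `iptables` rules that accept traffic to the Hadoop ports only from one intranet range and drop everything else. It assumes that `iptables` is the firewall in use, and the function name `build_iptables_rules` and the CIDR block `10.0.0.0/8` are placeholders. The script only prints the commands so that they can be reviewed before anyone runs them.

```python
import ipaddress


def build_iptables_rules(ports, allowed_cidr):
    """Build iptables commands that limit `ports` to sources in `allowed_cidr`."""
    # ip_network() raises ValueError if the CIDR block is malformed.
    network = ipaddress.ip_network(allowed_cidr)
    rules = []
    for port in sorted(ports):
        # Accept the intranet range first, then drop all other sources.
        rules.append(
            f"iptables -A INPUT -p tcp -s {network} --dport {port} -j ACCEPT"
        )
        rules.append(f"iptables -A INPUT -p tcp --dport {port} -j DROP")
    return rules


if __name__ == "__main__":
    # 10.0.0.0/8 is a placeholder for your own intranet range.
    for rule in build_iptables_rules([50070, 50075], "10.0.0.0/8"):
        print(rule)
```

The ACCEPT rule must come before the DROP rule for the same port, because iptables evaluates rules in order.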

2. Enable authentication

Enable the Kerberos authentication protocol in the Hadoop environment.
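In Hadoop, Kerberos authentication is switched on through properties in `core-site.xml`. The sketch below generates the two core properties, `hadoop.security.authentication` and `hadoop.security.authorization`. The function name `build_core_site_security` is illustrative, and the fragment is only a starting point: a working Kerberos deployment also needs a KDC, service principals, keytab files, and per-service settings that are not shown here.

```python
import xml.etree.ElementTree as ET


def build_core_site_security():
    """Return a core-site.xml fragment that enables Kerberos authentication."""
    configuration = ET.Element("configuration")
    settings = {
        # The default value is "simple", which performs no real authentication.
        "hadoop.security.authentication": "kerberos",
        # Enables service-level authorization checks.
        "hadoop.security.authorization": "true",
    }
    for name, value in settings.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")


if __name__ == "__main__":
    print(build_core_site_security())
```

Merge the generated properties into the existing `core-site.xml` on every node rather than overwriting the file.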

3. Apply updates

We recommend that you follow the official Hadoop releases and apply updates as soon as they are available.

More information

  • Information about all ports in Hadoop

    Port     Description
    9000     fs.defaultFS (for example, hdfs://172.25.40.171:9000)
    9001     dfs.namenode.rpc-address (DataNodes connect to this port)
    50070    dfs.namenode.http-address
    50470    dfs.namenode.https-address
    50100    dfs.namenode.backup.address
    50105    dfs.namenode.backup.http-address
    50090    dfs.namenode.secondary.http-address (for example, 172.25.39.166:50090)
    50091    dfs.namenode.secondary.https-address (for example, 172.25.39.166:50091)
    50020    dfs.datanode.ipc.address
    50075    dfs.datanode.http.address
    50475    dfs.datanode.https.address
    50010    dfs.datanode.address (the data transmission port of the DataNode)
    8480     dfs.journalnode.http-address
    8481     dfs.journalnode.https-address
    8032     yarn.resourcemanager.address
    8088     yarn.resourcemanager.webapp.address (the HTTP port of YARN)
    8090     yarn.resourcemanager.webapp.https.address
    8030     yarn.resourcemanager.scheduler.address
    8031     yarn.resourcemanager.resource-tracker.address
    8033     yarn.resourcemanager.admin.address
    8042     yarn.nodemanager.webapp.address
    8040     yarn.nodemanager.localizer.address
    8188     yarn.timeline-service.webapp.address
    10020    mapreduce.jobhistory.address
    19888    mapreduce.jobhistory.webapp.address
    2888     ZooKeeper (used by the Leader to listen for Follower connections)
    3888     ZooKeeper (used for Leader election)
    2181     ZooKeeper (used to listen for client connections)
    60010    hbase.master.info.port (the HTTP port of HMaster)
    60000    hbase.master.port (the RPC port of HMaster)
    60030    hbase.regionserver.info.port (the HTTP port of HRegionServer)
    60020    hbase.regionserver.port (the RPC port of HRegionServer)
    8080     hbase.rest.port (the port of the HBase REST server)
    10000    hive.server2.thrift.port
    9083     hive.metastore.uris
  • Hadoop safari : Hunting for vulnerabilities
  • Hadoop Default Ports Quick Reference