Hadoop Distributed File System (HDFS) is a distributed file system designed to run on general-purpose hardware.

Background information

HDFS consists of a NameNode, DataNodes, and Clients. The NameNode is deployed on the master node of an E-MapReduce (EMR) cluster and is used to manage metadata and data blocks of files. A DataNode is deployed on each core node of an EMR cluster and is used to manage data blocks on the node. A client is deployed on all the master, core, and task nodes of an EMR cluster. It is also deployed on the gateway cluster that is associated with the EMR cluster.

The following figure shows the HDFS architecture. HDFS

Benefits

HDFS in an EMR cluster has the following benefits:
  • By default, HDFS is deployed in high-availability (HA) mode in an HA EMR cluster.
  • O&M is convenient. For example, you can add nodes, decommission DataNodes, balance data among DataNodes, and perform a rolling restart on DataNodes in the EMR console.
  • Application performance management (APM) is supported. HDFS can monitor various metrics and report alerts based on alert rules.
  • HDFS Federation is supported.

Use an HDFS client

  1. Connect to the master node of your cluster in SSH mode.

    For more information about how to connect to the master node, see Connect to the master node of an EMR cluster in SSH mode.

  2. Run an HDFS shell command. Example:
    hdfs dfs -ls /
    Information similar to the following output is returned:
    Found 7 items
    drwxr-xr-x   - hadoop    hadoop          0 2021-03-13 07:54 /apps
    drwxrwxrwx   - flowagent hadoop          0 2021-03-13 07:53 /emr-flow
    drwxr-x--x   - root      hadoop          0 2021-03-13 07:53 /emr-sparksql-udf
    drwxr-x--x   - hadoop    hadoop          0 2021-03-14 07:34 /flink
    drwxr-x--x   - hadoop    hadoop          0 2021-03-13 07:55 /spark-history
    drwxrwxrwt   - root      hadoop          0 2021-03-17 11:42 /tmp
    drwxr-x--t   - hadoop    hadoop          0 2021-03-13 07:55 /user

    For a non-HA cluster, you can access an HDFS client by using hdfs://emr-header-1.cluster-xxxx:9000/. For an HA cluster, you can access an HDFS client by using hdfs://emr-cluster/.