Hadoop Distributed File System (HDFS) is a distributed file system designed to run on general-purpose hardware.
HDFS consists of a NameNode, DataNodes, and clients. The NameNode is deployed on the master node of an E-MapReduce (EMR) cluster and manages file metadata and the mapping of files to data blocks. A DataNode is deployed on each core node of an EMR cluster and manages the data blocks stored on that node. A client is deployed on every master, core, and task node of an EMR cluster, and on the gateway cluster that is associated with the EMR cluster.
- By default, HDFS is deployed in high-availability (HA) mode in an HA EMR cluster.
- O&M is convenient. For example, you can add nodes, decommission DataNodes, balance data among DataNodes, and perform a rolling restart on DataNodes in the EMR console.
- Application performance management (APM) is supported. HDFS can monitor various metrics and report alerts based on alert rules.
- HDFS Federation is supported.
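The roles described above can be inspected from any node that has the HDFS client installed. The following is a minimal sketch, assuming the `hdfs` command is on the `PATH` and the current user is allowed to run `dfsadmin`. The variable `HDFS_CLIENT_FOUND` is a hypothetical helper introduced only for this example.

```shell
#!/usr/bin/env bash
# Sketch: inspect HDFS roles from a node where the HDFS client is installed.
if command -v hdfs >/dev/null 2>&1; then
  # Print the NameNode host(s) found in the client configuration.
  hdfs getconf -namenodes
  # Print the configured nameservice(s). This key is typically set in
  # HA or Federation deployments and may be missing on a non-HA cluster.
  hdfs getconf -confKey dfs.nameservices
  # Summarize capacity and list DataNodes (may require superuser privileges).
  hdfs dfsadmin -report
  HDFS_CLIENT_FOUND=yes
else
  echo "hdfs command not found; run this on a cluster or gateway node" >&2
  HDFS_CLIENT_FOUND=no
fi
```

The output depends on the cluster, so none is shown here. The console operations listed above (adding nodes, decommissioning, balancing, rolling restart) are not covered by this sketch.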
Use an HDFS client
- Connect to the master node of your cluster by using SSH.
For more information about how to connect to the master node, see Connect to the master node of an EMR cluster in SSH mode.
- Run an HDFS shell command. Example:
hdfs dfs -ls /
Information similar to the following output is returned:
Found 7 items
drwxr-xr-x - hadoop hadoop 0 2021-03-13 07:54 /apps
drwxrwxrwx - flowagent hadoop 0 2021-03-13 07:53 /emr-flow
drwxr-x--x - root hadoop 0 2021-03-13 07:53 /emr-sparksql-udf
drwxr-x--x - hadoop hadoop 0 2021-03-14 07:34 /flink
drwxr-x--x - hadoop hadoop 0 2021-03-13 07:55 /spark-history
drwxrwxrwt - root hadoop 0 2021-03-17 11:42 /tmp
drwxr-x--t - hadoop hadoop 0 2021-03-13 07:55 /user
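Beyond listing a directory, a few other HDFS shell commands cover most day-to-day file operations. The following is a minimal sketch, assuming the `hdfs` command is available and the current user can write to `/tmp` in HDFS. The paths `/tmp/hdfs-demo.txt` and `/tmp/hdfs-demo` are hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical paths for illustration only.
LOCAL_FILE="/tmp/hdfs-demo.txt"
HDFS_DIR="/tmp/hdfs-demo"

# Create a small local file to upload.
echo "hello hdfs" > "${LOCAL_FILE}"

if command -v hdfs >/dev/null 2>&1; then
  # Create the target directory (and any missing parents) in HDFS.
  hdfs dfs -mkdir -p "${HDFS_DIR}"
  # Upload the local file, overwriting an existing copy if present.
  hdfs dfs -put -f "${LOCAL_FILE}" "${HDFS_DIR}/"
  # List the directory and print the uploaded file.
  hdfs dfs -ls "${HDFS_DIR}"
  hdfs dfs -cat "${HDFS_DIR}/hdfs-demo.txt"
  # Remove the demo directory, bypassing the trash.
  hdfs dfs -rm -r -skipTrash "${HDFS_DIR}"
else
  echo "hdfs command not found; skipping HDFS steps" >&2
fi
```

The sketch cleans up after itself, so it should leave no demo data in HDFS when it completes.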
For a non-HA cluster, you can access HDFS by using the URI hdfs://emr-header-1.cluster-xxxx:9000/. For an HA cluster, you can access HDFS by using