This topic describes data storage in E-MapReduce (EMR) clusters, including the supported disk roles and disk types, and Object Storage Service (OSS).

Background information

For more information about storage types, storage performance, and limits on storage, see Elastic Block Storage devices.

Disk roles

Each node in an EMR cluster has two disk roles: system disk and data disk. The disks can differ in configuration, type, and capacity.

  • System disk: A disk on which the operating system is installed. By default, each node has one system disk, and the nodes of an EMR cluster use enhanced SSDs as system disks.
  • Data disk: A disk that is used to store data. By default, the master node of an EMR cluster uses one cloud disk as its data disk, and each core node uses four cloud disks as data disks.

Cloud disks and local disks

EMR clusters allow you to use cloud disks and local disks to store data. The two disk types are described below.

  • Cloud disk
    Description: Cloud disks include standard SSDs, ultra disks, and enhanced SSDs. Cloud disks are not directly attached to compute nodes. Instead, the nodes access them on remote storage nodes over the network. Each piece of data has two real-time replicas at the backend. If data is corrupted due to disk damage, EMR automatically uses a replica to restore the data.
    Scenario: Cloud disks have lower IOPS and throughput than local disks. If the volume of your business data is below the terabyte level, we recommend that you use cloud disks.
    Note: If the throughput of cloud disks is insufficient, you can create a new cluster that uses local disks.
  • Local disk
    Description: Local disks include local SATA disks and local SSDs of the big data type. Local disks are directly attached to compute nodes and have better performance than cloud disks. You cannot change the number of local disks. Local disks have no data backup mechanism at the backend, so upper-layer software must ensure data reliability.
    Scenario: In an EMR cluster, EMR ensures the data reliability of local disks. If the volume of your business data is at the terabyte level or higher, we recommend that you use local disks.
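
The sizing guidance above can be expressed as a small decision helper. This is only an illustrative sketch: the function name `recommend_disk_type` is hypothetical, and reading "terabyte level" as exactly 1 TiB is an assumption made here, not a threshold defined by EMR.

```python
# Assumption: "terabyte level" is interpreted as 1 TiB (1024**4 bytes).
TERABYTE = 1024 ** 4


def recommend_disk_type(data_volume_bytes: int) -> str:
    """Return the disk type suggested by the guidance for a given data volume.

    Hypothetical helper for illustration; it is not part of any EMR API.
    """
    if data_volume_bytes < 0:
        raise ValueError("data volume must not be negative")
    # Below the terabyte level: cloud disks. At or above it: local disks.
    if data_volume_bytes < TERABYTE:
        return "cloud disk"
    return "local disk"


if __name__ == "__main__":
    for volume in (200 * 1024 ** 3, 5 * TERABYTE):
        print(volume, "->", recommend_disk_type(volume))
```

Data volume is only one input. Throughput requirements and the fact that the number of local disks cannot be changed may also influence the choice.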

When nodes in an EMR cluster are released, data on all of their cloud disks and local disks is cleared. The disks cannot be retained separately or reused. Hadoop HDFS uses all data disks for data storage. Hadoop YARN uses all data disks as temporary storage for computing.

OSS

In an EMR cluster, OSS can be used in the same way as HDFS. To read data from or write data to OSS, change the URI scheme in the code that originally accessed HDFS from hdfs:// to oss://. Examples:
  • Read data from HDFS:
    sc.textFile("hdfs://user/path")
  • Read the same kind of data from OSS instead:
    sc.textFile("oss://user/path")
  • In MapReduce or Hive jobs, you can run HDFS commands to manage data in OSS. Example:
    hadoop fs -ls oss://bucket/path
    hadoop fs -cp hdfs://user/path oss://bucket/path

    When you run the commands, you do not need to enter your AccessKey pair or the endpoint of OSS. EMR automatically fills in this information by using the cluster owner's information.

OSS is not suitable for scenarios that require high IOPS, such as Spark Streaming or HBase scenarios.
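
When many job paths need to move from HDFS to OSS, the scheme change described above can be applied programmatically. The following is a minimal sketch using only the Python standard library. The helper name `hdfs_to_oss` and the bucket name `my-bucket` are hypothetical, and the sketch assumes that the HDFS path should be kept unchanged under the bucket root.

```python
from urllib.parse import urlsplit


def hdfs_to_oss(hdfs_uri: str, bucket: str) -> str:
    """Map an hdfs:// URI to an oss:// URI in the given bucket.

    Hypothetical helper for illustration; it is not part of EMR or Hadoop.
    """
    parts = urlsplit(hdfs_uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got: {hdfs_uri!r}")
    if not bucket:
        raise ValueError("bucket name must not be empty")
    # The authority of an hdfs:// URI (for example, a NameNode address) has no
    # OSS equivalent, so only the path is carried over under the bucket.
    path = parts.path.lstrip("/")
    return f"oss://{bucket}/{path}"


if __name__ == "__main__":
    print(hdfs_to_oss("hdfs:///user/path", "my-bucket"))
```

The resulting string can then be passed wherever the HDFS URI was used, for example as the argument of `sc.textFile`.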