This topic describes data storage in E-MapReduce (EMR) clusters, including the supported disk roles, the supported disk types, and Object Storage Service (OSS).

Background information

For more information about storage types, storage performance, and limits on storage, see Elastic Block Storage devices.

Storage prices:
  • Local disk storage: USD 0.003/GB/month
  • OSS Standard storage: USD 0.02/GB/month
  • OSS Archive storage: USD 0.045/GB/month
  • OSS Cold Archive storage: USD 0.002/GB/month
  • Ultra disk storage: USD 0.05/GB/month
  • Standard SSD storage: USD 0.143/GB/month
Note The actual price is displayed in the console when you purchase a cluster.

For information about the pricing of ESSDs at different PLs, see the Pricing tab on the Elastic Compute Service page.

Disk roles

Each node in an EMR cluster uses disks in two roles: system disk and data disk. The disks can vary in configuration, type, and capacity.
  • System disk: A disk on which the operating system is installed. By default, the nodes of an EMR cluster use enhanced SSDs as system disks, and each node has one system disk.
  • Data disk: A disk that is used to store data. By default, the master node of an EMR cluster uses one cloud disk as a data disk, and each core node uses four cloud disks as data disks.
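
To check the disk layout on a specific node, you can log on to the node and use standard Linux tools. The following is a minimal sketch; device names and mount points vary by cluster configuration and are shown here only as assumptions.
    # List block devices; the system disk and data disks appear as separate devices.
    lsblk
    # Show mounted file systems; data disk mount points such as /mnt/disk1 are hypothetical.
    df -h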

Cloud disks and local disks

EMR clusters allow you to use cloud disks and local disks to store data. The two disk types are described as follows:

  • Cloud disk
    Description: Cloud disks include standard SSDs, ultra disks, and enhanced SSDs. Cloud disks are not directly attached to local compute nodes. Instead, they access a remote storage node over the network. Each piece of data has two real-time replicas at the backend. If data is corrupted due to disk damage, EMR automatically uses a replica to restore the data.
    Scenario: Cloud disks have lower IOPS and throughput than local disks. If the volume of your business data is below the terabyte level, we recommend that you use cloud disks.
    Note If the throughput of cloud disks is insufficient, you can create a new cluster and use local disks.
  • Local disk
    Description: Local disks are directly attached to compute nodes and provide better performance than cloud disks. You cannot change the number of local disks. No data backup mechanism is deployed at the backend, so upper-layer software is required to ensure data reliability.
    Scenario: Local disks are suitable when part of the data needs to be cached, temporary tests are run, or terabytes of data need to be stored based on the three-replica mechanism. Local disks increase O&M costs, so we recommend that you use OSS or OSS-HDFS to store data. For more information about how to enable OSS-HDFS, see Enable OSS-HDFS and grant access permissions.

When nodes in an EMR cluster are released, data on all cloud disks and local disks is cleared. The disks cannot be retained independently or reused. Hadoop HDFS uses all data disks for data storage. Hadoop YARN uses all data disks as temporary storage for computing.
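
If you want to confirm which directories HDFS and YARN use on a node, you can query the cluster configuration. The following is a hedged sketch: hdfs getconf is standard Hadoop tooling, and the yarn-site.xml path is resolved through the HADOOP_CONF_DIR environment variable, which is an assumption about your environment.
    # Print the directories that HDFS DataNodes use to store blocks.
    hdfs getconf -confKey dfs.datanode.data.dir
    # The directories that YARN uses for temporary storage are set in yarn-site.xml.
    grep -A 1 "yarn.nodemanager.local-dirs" "$HADOOP_CONF_DIR/yarn-site.xml"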

OSS

OSS can be used in place of HDFS in an EMR cluster. You can read data from or write data to OSS with only minor changes to the code that is originally used to access HDFS. Examples:
  • Read data from HDFS.
    sc.textfile("hdfs://bucket/path")
  • Change the storage type from HDFS to OSS.
    sc.textfile("oss://bucket/path")
  • In MapReduce or Hive jobs, you can run HDFS commands to manage data in OSS. Example:
    hadoop fs -ls oss://bucket/path
    hadoop fs -cp hdfs://bucket/path oss://bucket/path

    When you run these commands, you do not need to specify your AccessKey pair or the OSS endpoint. EMR automatically completes this information by using the credentials of the cluster owner. However, OSS is not suitable for scenarios that require high IOPS, such as Spark Streaming or HBase workloads.
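
As a sketch of how a Hive job can manage data in OSS, the following command creates an external table whose data resides in an OSS bucket. The table name, schema, and bucket path are hypothetical, and the example assumes that the cluster can resolve oss:// paths as shown in the commands above.
    # Create a Hive external table backed by OSS (names and paths are hypothetical).
    hive -e "CREATE EXTERNAL TABLE logs (line STRING) LOCATION 'oss://bucket/path';"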