This topic describes data storage in E-MapReduce (EMR) clusters, including the supported disk roles and disk types, and Object Storage Service (OSS).
For more information about storage types, storage performance, and limits on storage, see Elastic Block Storage devices.
|Disk role|Description|
|System disk|A disk on which the operating system is installed. By default, each node of an EMR cluster has one system disk, which is an enhanced SSD.|
|Data disk|A disk that is used to store data. By default, the master node of an EMR cluster uses one cloud disk as its data disk, and each core node uses four cloud disks as data disks.|
Cloud disks and local disks
EMR clusters allow you to use cloud disks and local disks to store data. The following table describes the different types of disks.
|Disk type|Description|Scenario|
|Cloud disk|Cloud disks include standard SSDs, ultra disks, and enhanced SSDs. Cloud disks are not directly attached to local compute nodes. Instead, data on these disks is stored on remote storage nodes and accessed over the network. Each piece of data has two real-time replicas at the backend. If the data is corrupted due to disk damage, EMR automatically uses a replica to restore the data.|Cloud disks have lower IOPS and throughput than local disks. If the volume of your business data is below the terabyte level, we recommend that you use cloud disks. Note: If the throughput of cloud disks is insufficient, you can create a new cluster and use local disks.|
|Local disk|Local disks are directly attached to compute nodes and have better performance than cloud disks. You cannot change the number of local disks. No data backup mechanism is deployed at the backend, so upper-layer software is required to ensure data reliability.|Local disks are used in the following scenarios: part of the data needs to be cached, temporary testing is required, or terabytes of data need to be stored based on the three-replica mechanism. Local disks increase O&M costs. We recommend that you use OSS or OSS-HDFS to store data. For more information about how to enable OSS-HDFS, see Enable OSS-HDFS and grant access permissions.|
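As a rough illustration of what the three-replica mechanism means for local-disk sizing, the following sketch estimates the raw capacity and the node count needed for a given data volume. The function names and the per-node capacity are hypothetical values chosen for illustration, and the calculation ignores space reserved for temporary data and the operating system.

```shell
# Sketch: estimate raw storage needed when every block is stored REPLICAS times.
# ASSUMPTION: sizes are whole terabytes; 3 replicas is the default used here.

# raw_capacity_tb DATA_TB [REPLICAS]
raw_capacity_tb() {
  data_tb="$1"
  replicas="${2:-3}"
  echo $((data_tb * replicas))
}

# nodes_needed DATA_TB NODE_CAPACITY_TB [REPLICAS]
# Rounds up, because a partly used node is still a whole node.
nodes_needed() {
  raw_tb="$(raw_capacity_tb "$1" "${3:-3}")"
  node_tb="$2"
  echo $(( (raw_tb + node_tb - 1) / node_tb ))
}

# Example (hypothetical numbers): 10 TB of data on nodes with 8 TB of local disk each.
#   raw_capacity_tb 10    -> 30
#   nodes_needed 10 8     -> 4
```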
When nodes in an EMR cluster are released, data on all the cloud disks and local disks is cleared. The disks cannot be kept independently and used again. Hadoop HDFS uses all data disks for data storage. Hadoop YARN uses all data disks as temporary storage for computing.
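To make "uses all data disks" concrete: Hadoop takes a comma-separated list of directories in properties such as `dfs.datanode.data.dir` (HDFS) and `yarn.nodemanager.local-dirs` (YARN), typically one directory per data disk. The sketch below builds such a list. The `/mnt/diskN` mount points, the subdirectory names, and the `data_dirs` helper are assumptions for illustration, so check the actual mount points and configuration on your nodes.

```shell
# Sketch: build a comma-separated directory list, one entry per data disk.
# ASSUMPTION: data disks are mounted at /mnt/disk1 ... /mnt/diskN.

# data_dirs DISK_COUNT SUBDIR
data_dirs() {
  count="$1"
  subdir="$2"
  i=1
  dirs=""
  while [ "$i" -le "$count" ]; do
    # Add a comma only when the list already has an entry.
    dirs="${dirs:+$dirs,}/mnt/disk$i/$subdir"
    i=$((i + 1))
  done
  printf '%s\n' "$dirs"
}

# Example for a core node with four data disks:
#   data_dirs 4 hdfs   # candidate value for dfs.datanode.data.dir
#   data_dirs 4 yarn   # candidate value for yarn.nodemanager.local-dirs
```

Because both lists span the same disks, HDFS data and YARN temporary files compete for the same disk space and I/O.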
- Read data from HDFS.
- Change the storage type from HDFS to OSS.
- In MapReduce or Hive jobs, you can run HDFS commands to manage data in OSS. Example:
hadoop fs -ls oss://bucket/path
hadoop fs -cp hdfs://user/path oss://bucket/path
When you run the commands, you do not need to enter your AccessKey pair or the endpoint of OSS. EMR completes the information by using the data of the cluster owner. However, OSS is not suitable for scenarios that require high IOPS, such as Spark Streaming or HBase scenarios.
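The two commands above can be wrapped in a small script, for example to copy a directory from HDFS to OSS and then list the result. This is a minimal sketch: the helper names `oss_uri` and `copy_to_oss` and the `DRY_RUN` switch are hypothetical, and the bucket and paths are the placeholders used in this topic. It relies only on `hadoop fs -cp` and `hadoop fs -ls`, and passes no AccessKey pair or endpoint, as described above.

```shell
# Sketch: copy data from HDFS to OSS, then list the destination.
# ASSUMPTION: runs on an EMR node where the hadoop command is available.

# oss_uri BUCKET PATH -> oss://BUCKET/PATH (one leading slash in PATH is dropped)
oss_uri() {
  bucket="$1"
  path="${2#/}"
  printf 'oss://%s/%s\n' "$bucket" "$path"
}

# copy_to_oss SRC DST
# With DRY_RUN=1 the commands are printed instead of executed.
copy_to_oss() {
  src="$1"
  dst="$2"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "hadoop fs -cp $src $dst"
    echo "hadoop fs -ls $dst"
  else
    # List the destination only if the copy succeeded.
    hadoop fs -cp "$src" "$dst" && hadoop fs -ls "$dst"
  fi
}

# Example (placeholders from this topic):
#   DRY_RUN=1 copy_to_oss hdfs://user/path "$(oss_uri bucket path)"
```

Running with `DRY_RUN=1` first is a cheap way to confirm the source and destination URIs before any data is copied.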