This topic describes the block storage mode of JindoFS and its use scenarios.
Block storage is the most efficient mode for reading and writing data and for querying metadata. It also supports Hadoop Distributed File System (HDFS) semantics related to data locality. In addition, JindoFS provides an external client so that you can access JindoFS from outside an E-MapReduce (EMR) cluster.
JindoFS uses Object Storage Service (OSS) as the storage backend. In block storage mode, JindoFS stores data as blocks in OSS and uses Namespace Service to maintain metadata. This ensures high performance when you read and write data or query metadata.
EMR has three storage systems: EMR OssFileSystem, EMR HDFS, and EMR JindoFS. Among them, OssFileSystem and JindoFS store data in the cloud. The following table compares the three EMR storage systems with Hadoop support for Alibaba Cloud OSS.
|Feature||Hadoop support for Alibaba Cloud OSS||E-MapReduce OssFileSystem||E-MapReduce HDFS||E-MapReduce JindoFS|
|Storage capacity||Tremendous||Tremendous||Depending on the EMR cluster scale||Tremendous|
|Factor that affects throughput||Server||I/O performance of caches on disks in the EMR cluster||I/O performance of disks in the EMR cluster||I/O performance of disks in the EMR cluster|
|Metadata query efficiency||Low||Medium||High||High|
|Scale-in operation||Easy||Easy||Node decommission required||Easy|
The block storage mode of JindoFS has the following features:
- JindoFS offers tremendous and scalable storage capacity by using OSS as the storage backend. The storage capacity is independent of the EMR cluster scale. The local cluster can be scaled in or out as required.
- JindoFS keeps backup copies of some data in the local cluster to accelerate read operations. This improves throughput while using only limited local storage capacity, especially in Write Once Read Many (WORM) scenarios.
- JindoFS provides efficient metadata queries similar to HDFS. Compared with OssFileSystem, JindoFS spends much less time on metadata queries. In addition, JindoFS avoids system instability when data and metadata are frequently accessed.
- JindoFS ensures maximal data locality when jobs are executed in the EMR cluster. This reduces the load on network transmission and improves the read performance.
You can set all parameters related to JindoFS in Bigboot, as shown in the following figures.
- The parameters framed in red in the preceding figures are required.
- JindoFS supports multiple namespaces. A namespace named test is used in this topic.
|Parameter||Description||Example|
|jfs.namespaces||The namespaces supported by JindoFS. Separate multiple namespaces with commas (,).||test|
|jfs.namespaces.test.uri||The storage backend of the test namespace.||oss://oss-bucket/oss-dir Note: You can set the value to a directory in an OSS bucket. In this case, the directory serves as the root directory in which the test namespace reads and writes data.|
|jfs.namespaces.test.mode||The storage mode of the test namespace.||block|
|jfs.namespaces.test.oss.access.key||The AccessKey ID used to access the OSS bucket that serves as the storage backend.||xxxx Note: We recommend that you store data in an OSS bucket that is in the same region and under the same account as your EMR cluster. This ensures high performance and stability. In this case, you do not need to configure the AccessKey ID and AccessKey secret because the OSS bucket allows password-free access from the EMR cluster.|
|jfs.namespaces.test.oss.access.secret||The AccessKey secret used to access the OSS bucket that serves as the storage backend.|
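The per-namespace parameter names in the table all embed the namespace name. The following sketch shows how the names could be derived for more than one namespace. It assumes that every namespace follows the same `jfs.namespaces.<name>.*` pattern shown for test. The `logs` namespace, its OSS directory, and the helper function `print_namespace_conf` are hypothetical, and the key=value rendering is only for illustration: the actual values are entered in the Bigboot configuration.

```shell
#!/bin/sh
# Sketch: derive per-namespace parameter names from a namespace name.
# print_namespace_conf is a hypothetical helper, not part of JindoFS.
# The key=value output is illustrative only.

# Usage: print_namespace_conf NAME URI MODE
print_namespace_conf() {
    name="$1"; uri="$2"; mode="$3"
    printf 'jfs.namespaces.%s.uri=%s\n' "$name" "$uri"
    printf 'jfs.namespaces.%s.mode=%s\n' "$name" "$mode"
}

# Two namespaces: "test" from this topic and a hypothetical second one, "logs".
echo 'jfs.namespaces=test,logs'
print_namespace_conf test oss://oss-bucket/oss-dir block
print_namespace_conf logs oss://oss-bucket/logs-dir block
```

The AccessKey parameters are left out on purpose, which matches the recommended setup of an OSS bucket in the same region and under the same account as the EMR cluster.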
Save and deploy the JindoFS configuration. Restart Namespace Service in SmartData to use JindoFS.
Configure the storage policy
JindoFS provides multiple storage policies to meet different storage needs. The following table lists four available storage policies for a directory.
|Policy||Description|
|COLD||Data has a backup in OSS but no backups in the local cluster. This policy is suitable for storing cold data.|
|WARM||The default storage policy. Data has a backup in OSS and a backup in the local cluster. The local backup can accelerate read operations.|
|HOT||Data has a backup in OSS and multiple backups in the local cluster. Local backups can accelerate read operations on hot data.|
|TEMP||Data has a backup in the local cluster but no backups in OSS. This policy is suitable for storing temporary data. The local backup can accelerate read and write operations on the temporary data. However, this may lower data reliability.|
JindoFS provides a command-line tool, Admin, to configure the storage policy of a directory. The default storage policy is WARM. New files are stored based on the storage policy configured for their parent directory. Run the following command to configure the storage policy:
jindo dfsadmin -R -setStoragePolicy [path] [policy]
Run the following command to obtain the storage policy configured for a directory:
jindo dfsadmin -getStoragePolicy [path]
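As an illustration of the -setStoragePolicy syntax above, the following sketch checks a policy name against the four documented policies before it builds the command line. The helper `build_set_policy_cmd` is hypothetical, and the path `jfs://test/warehouse/tmp` is an assumed example of a directory in the test namespace. The function only prints the command, so it can be reviewed before it is run.

```shell
#!/bin/sh
# Sketch: validate a storage policy name, then print the matching command.
# build_set_policy_cmd is a hypothetical helper, not part of the Admin tool.

# Usage: build_set_policy_cmd PATH POLICY
build_set_policy_cmd() {
    path="$1"
    policy="$2"
    case "$policy" in
        COLD|WARM|HOT|TEMP) ;;  # the four policies listed in this topic
        *) echo "unknown storage policy: $policy" >&2; return 1 ;;
    esac
    printf 'jindo dfsadmin -R -setStoragePolicy %s %s\n' "$path" "$policy"
}

# Assumed example path: a scratch directory that only needs local copies.
build_set_policy_cmd jfs://test/warehouse/tmp TEMP
```

Rejecting unknown names up front avoids sending a misspelled policy to the cluster. Whether the Admin tool itself accepts other spellings is not covered in this topic.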
The Admin tool provides the archive command to archive cold data.
This command allows you to explicitly evict local blocks. Assume that Hive partitions a table by day. If the data generated a week ago in partitioned tables is infrequently accessed, you can run the archive command on the directory that stores such data on a regular basis. Then, the backups stored in the local cluster are evicted, whereas the backups in OSS are retained.
Run the following archive command:
jindo dfsadmin -archive [path]
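The regular archiving of daily Hive partitions described above could be scripted along the following lines. This is a minimal sketch under stated assumptions: GNU date is available, and partitions are laid out as `<table_dir>/dt=YYYY-MM-DD`, which is a hypothetical layout. The table directory `jfs://test/warehouse/events` and the helper `print_archive_cmds` are also hypothetical. The function prints one archive command per partition in the given age range instead of running it.

```shell
#!/bin/sh
# Sketch: print archive commands for daily partitions in an age window.
# Assumptions (not from this topic): GNU date, and a hypothetical
# partition layout of <table_dir>/dt=YYYY-MM-DD.

# Usage: print_archive_cmds TABLE_DIR TODAY MIN_AGE_DAYS MAX_AGE_DAYS
print_archive_cmds() {
    table_dir="$1"; today="$2"; age="$3"; max_age="$4"
    while [ "$age" -le "$max_age" ]; do
        # -u keeps the day arithmetic independent of local DST changes.
        day=$(date -u -d "$today $age days ago" +%Y-%m-%d) || return 1
        printf 'jindo dfsadmin -archive %s/dt=%s\n' "$table_dir" "$day"
        age=$((age + 1))
    done
}

# Dry run: partitions that are 7 to 9 days old relative to 2024-03-10.
print_archive_cmds jfs://test/warehouse/events 2024-03-10 7 9
```

To execute the commands instead of printing them, the output can be piped into `sh`, for example from a daily scheduled job.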