This topic describes the block storage mode of JindoFileSystem (JindoFS) and its scenarios.
Block storage is the most efficient mode for reading and writing data and for querying metadata. It also supports the Hadoop Distributed File System (HDFS) semantics related to data locality. In addition, JindoFS provides an external client so that you can access JindoFS from outside an E-MapReduce cluster.
JindoFS uses Object Storage Service (OSS) as the storage back end. In block storage mode, JindoFS stores data as blocks in OSS and uses Namespace Service to maintain metadata. This guarantees high performance when you read and write data or query metadata.
E-MapReduce has three storage systems: E-MapReduce OssFileSystem, E-MapReduce HDFS, and E-MapReduce JindoFS. Of these, OssFileSystem and JindoFS store data in the cloud. The following table compares the features of the three E-MapReduce storage systems with those of Hadoop support for Alibaba Cloud OSS.
| Feature | Hadoop support for Alibaba Cloud OSS | E-MapReduce OssFileSystem | E-MapReduce HDFS | E-MapReduce JindoFS |
| --- | --- | --- | --- | --- |
| Storage capacity | Tremendous | Tremendous | Depends on the E-MapReduce cluster scale | Tremendous |
| Factor that affects throughput | Server | I/O performance of caches on disks in the E-MapReduce cluster | I/O performance of disks in the E-MapReduce cluster | I/O performance of disks in the E-MapReduce cluster |
| Metadata query efficiency | Low | Medium | High | High |
| Scale-in operation | Easy | Easy | Requires node decommission | Easy |
The block storage mode of JindoFS has the following features:
- JindoFS offers tremendous and scalable storage capacity by using OSS as the storage back end. The storage capacity is independent of the E-MapReduce cluster scale. The local cluster can be scaled in or out as required.
- JindoFS keeps backups of part of the data in the local cluster to accelerate read operations. This improves throughput while using only limited local storage capacity, especially for Write Once Read Many (WORM) solutions.
- JindoFS provides efficient metadata queries similar to those of HDFS. Compared with OssFileSystem, JindoFS takes much less time to query metadata. In addition, JindoFS avoids system instability when data and metadata are frequently accessed.
- JindoFS moves computation as close as possible to data. This reduces the load on network transmission and improves the read performance.
You can set all JindoFS-related parameters in Bigboot, as shown in the following figure.
- The parameters framed in red in the preceding figure are required.
- JindoFS supports multiple namespaces. A namespace named test is used in this topic.
| Parameter | Description | Example |
| --- | --- | --- |
| jfs.namespaces | The namespaces supported by JindoFS. Separate multiple namespaces with commas (,). | test |
| jfs.namespaces.test.uri | The storage back end of the test namespace. | oss://oss-bucket/oss-dir |
| jfs.namespaces.test.mode | The storage mode of the test namespace. | block |
| jfs.namespaces.test.oss.access.key | The AccessKey ID used to access the OSS bucket that serves as the storage back end. | xxxx |
| jfs.namespaces.test.oss.access.secret | The AccessKey secret used to access the OSS bucket that serves as the storage back end. | |

Note (jfs.namespaces.test.uri): You can set the value to a directory in an OSS bucket. In this case, this directory serves as the root directory in which the test namespace reads and writes data.

Note (jfs.namespaces.test.oss.access.key): For better performance and stability, we recommend that you use as the storage back end an OSS bucket that is in the same region and under the same account as the E-MapReduce cluster. In this case, the E-MapReduce cluster can access the OSS bucket without the AccessKey ID and AccessKey secret.
Save and deploy the JindoFS configuration. Restart Namespace Service in SmartData to use JindoFS.
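After the restart, a quick smoke test can confirm that the namespace is reachable. The following is a minimal sketch, not an official procedure. It assumes that the test namespace is addressed as `jfs://test/` (the URI scheme is not stated in this topic) and that the standard `hadoop fs` shell is available on the node. The `run` helper and the `smoke-test` directory name are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Assumption: the "test" namespace is addressed as jfs://test/ (not stated in this topic).
NAMESPACE="test"
JFS_ROOT="jfs://${NAMESPACE}/"

# Print each command by default; set DRY_RUN=0 on a cluster node to run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create a scratch directory (hypothetical name), then list the namespace root.
run hadoop fs -mkdir -p "${JFS_ROOT}smoke-test"
run hadoop fs -ls "${JFS_ROOT}"
```

By default the script only prints the commands, so you can review them before running them with `DRY_RUN=0`.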
Configure the storage policy
JindoFS provides multiple storage policies to meet different storage needs. The following table lists the four storage policies available for a directory.
| Policy | Description |
| --- | --- |
| COLD | Data has a backup in OSS only, with no backups in the local cluster. This policy is suitable for storing cold data. |
| WARM | The default storage policy. Data has a backup in OSS and a backup in the local cluster. The local backup can accelerate read operations. |
| HOT | Data has a backup in OSS and multiple backups in the local cluster. Local backups can accelerate read operations on hot data. |
| TEMP | Data has a backup in the local cluster only. This policy is suitable for storing temporary data. The local backup can accelerate read and write operations on the temporary data. However, this may lower data reliability. |
JindoFS provides a command-line tool, Admin, to configure the storage policy of a directory. The default storage policy is WARM. New files are stored according to the storage policy configured for their parent directory. Run the following command to configure the storage policy:
jindo dfsadmin -R -setStoragePolicy [path] [policy]
Run the following command to obtain the storage policy configured for a directory:
jindo dfsadmin -getStoragePolicy [path]
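As an illustration, the two commands above can be wrapped in a small script that rejects policy names other than the four listed in this topic. This is a sketch under stated assumptions: the `jfs://test/...` paths are hypothetical, and `run` and `set_policy` are helper functions introduced here, not part of the Admin tool.

```shell
#!/usr/bin/env bash
# Print each command by default; set DRY_RUN=0 on a cluster node to run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: set the storage policy of a directory,
# accepting only the four policies listed in this topic.
set_policy() {
  local path="$1" policy="$2"
  case "$policy" in
    COLD|WARM|HOT|TEMP) ;;
    *) echo "unknown storage policy: $policy" >&2; return 1 ;;
  esac
  run jindo dfsadmin -R -setStoragePolicy "$path" "$policy"
}

# Hypothetical paths in the test namespace.
set_policy "jfs://test/warehouse/scratch" TEMP
run jindo dfsadmin -getStoragePolicy "jfs://test/warehouse/scratch"
```

Validating the policy name before calling the tool keeps a typo from reaching the cluster.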
The Admin tool provides the archive command to archive cold data.
This command allows you to explicitly evict local blocks. Assume that a Hive table is partitioned by day. If the data generated more than a week ago in the partitioned table is infrequently accessed, you can regularly run the archive command on the directories that store such data. The backups stored in the local cluster are then evicted, whereas the backups in OSS are retained.
Run the following archive command:
jindo dfsadmin -archive [path]
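The Hive scenario described above could be scheduled as a daily job along the following lines. This is a minimal sketch, not a definitive implementation. It assumes GNU `date`, a hypothetical table directory `jfs://test/warehouse/logs`, and a hypothetical partition layout of `dt=YYYY-MM-DD`. `run` and `partition_path` are helper functions introduced here for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical table directory; partitions are assumed to be named dt=YYYY-MM-DD.
TABLE_DIR="jfs://test/warehouse/logs"

# Print each command by default; set DRY_RUN=0 on a cluster node to run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Build the partition directory for the day that lies N days in the past.
# $1 = table directory, $2 = number of days ago (requires GNU date).
partition_path() {
  local day
  day="$(date -d "$2 days ago" +%F)"
  echo "${1}/dt=${day}"
}

# Evict the local blocks of the partition that is now seven days old.
run jindo dfsadmin -archive "$(partition_path "$TABLE_DIR" 7)"
```

If the script runs once a day, for example from cron, each partition is archived on the day it becomes a week old.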