This topic describes the block storage mode of JindoFileSystem (JindoFS) and its scenarios.

Overview

Block storage mode is the most efficient JindoFS mode for reading and writing data and for querying metadata. It also supports the Hadoop Distributed File System (HDFS) semantics related to data locality. JindoFS also provides an external client so that you can access JindoFS from outside an E-MapReduce cluster.

JindoFS uses Object Storage Service (OSS) as the storage back end. In block storage mode, JindoFS stores data as blocks in OSS and uses Namespace Service to maintain metadata. This guarantees high performance when you read and write data or query metadata.

Scenarios

E-MapReduce has three storage systems: E-MapReduce OssFileSystem, E-MapReduce HDFS, and E-MapReduce JindoFS. Of these, OssFileSystem and JindoFS store data in the cloud. The following table compares the features of the three E-MapReduce storage systems with those of Hadoop support for Alibaba Cloud OSS.

| Feature | Hadoop support for Alibaba Cloud OSS | E-MapReduce OssFileSystem | E-MapReduce HDFS | E-MapReduce JindoFS |
|---|---|---|---|---|
| Storage capacity | Tremendous | Tremendous | Depends on the E-MapReduce cluster scale | Tremendous |
| Reliability | High | High | High | High |
| Factor that affects throughput | Server | I/O performance of caches on disks in the E-MapReduce cluster | I/O performance of disks in the E-MapReduce cluster | I/O performance of disks in the E-MapReduce cluster |
| Metadata query efficiency | Low | Medium | High | High |
| Scale-out operation | Easy | Easy | Easy | Easy |
| Scale-in operation | Easy | Easy | Requires node decommission | Easy |
| Data locality | None | Weak | Strong | Medium |

The block storage mode of JindoFS has the following features:

  • JindoFS offers tremendous and scalable storage capacity by using OSS as the storage back end. The storage capacity is independent of the E-MapReduce cluster scale. The local cluster can be scaled in or out as required.
  • JindoFS stores a certain amount of backup data in the local cluster to accelerate read operations. This improves the throughput by using limited local storage capacity, especially for Write Once Read Many (WORM) solutions.
  • JindoFS provides efficient metadata query similar to HDFS. Compared with OssFileSystem, JindoFS saves much time in metadata query. In addition, JindoFS avoids system instability when data and metadata are frequently accessed.
  • JindoFS moves computation as close as possible to data. This reduces the load on network transmission and improves the read performance.

Configure JindoFS

You can set all JindoFS-related parameters in Bigboot, as shown in the following figure.

Note
  • The parameters framed in red in the preceding figure are required.
  • JindoFS supports multiple namespaces. A namespace named test is used in this topic.
| Parameter | Description | Example |
|---|---|---|
| jfs.namespaces | The namespace supported by JindoFS. Separate multiple namespaces with commas (,). | test |
| jfs.namespaces.test.uri | The storage back end of the test namespace. Note: You can set the value to a directory in an OSS bucket. In this case, this directory serves as the root directory, in which the test namespace reads and writes data. | oss://oss-bucket/oss-dir |
| jfs.namespaces.test.mode | The storage mode of the test namespace. | block |
| jfs.namespaces.test.oss.access.key | The AccessKey ID used to access the OSS bucket that serves as the storage back end. Note: We recommend that you select an OSS bucket in the same region and under the same account as the storage back end of the E-MapReduce cluster for better performance and stability. In this case, the E-MapReduce cluster can access the OSS bucket without using the AccessKey ID and AccessKey secret. | xxxx |
| jfs.namespaces.test.oss.access.secret | The AccessKey secret used to access the OSS bucket that serves as the storage back end. | |

Save and deploy the JindoFS configuration. Restart Namespace Service in SmartData to use JindoFS.
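The parameter names in the table above embed the namespace name (test in this topic). The following sketch shows how the names would presumably look for another namespace. The function name jfs_namespace_params is hypothetical, and the key=value layout is for illustration only, because the values are entered in Bigboot rather than in a file. The AccessKey parameters are left out, on the assumption that the OSS bucket is in the same region and under the same account as the cluster.

```shell
#!/bin/sh
# Print the JindoFS parameters for one namespace in block storage mode.
# Hypothetical helper: the key=value layout is illustrative only; the
# parameter names follow the pattern shown in the table for "test".
jfs_namespace_params() (
    ns=$1     # namespace name, for example "test"
    uri=$2    # OSS directory that serves as the storage back end
    echo "jfs.namespaces=${ns}"
    echo "jfs.namespaces.${ns}.uri=${uri}"
    echo "jfs.namespaces.${ns}.mode=block"
)

# Values taken from the Example column of the table.
jfs_namespace_params test oss://oss-bucket/oss-dir
```

The helper prints the settings for a single namespace. If several namespaces are configured, jfs.namespaces holds all of their names separated by commas.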

Configure the storage policy

JindoFS provides multiple storage policies to meet different storage needs. The following table lists four available storage policies for a directory.

| Policy | Description |
|---|---|
| COLD | Data has only a backup in OSS but no backups in the local cluster. This policy is suitable for storing cold data. |
| WARM | The default storage policy. Data has a backup in OSS and a backup in the local cluster. The local backup can accelerate read operations. |
| HOT | Data has a backup in OSS and multiple backups in the local cluster. Local backups can accelerate read operations on hot data. |
| TEMP | Data has only a backup in the local cluster. This policy is suitable for storing temporary data. The local backup can accelerate read and write operations on the temporary data. However, this may lower data reliability. |

JindoFS provides the Admin command-line tool to configure the storage policy of a directory. The default storage policy is WARM. New files are stored according to the storage policy configured for their parent directory. Run the following command to configure the storage policy:

jindo dfsadmin -R -setStoragePolicy [path] [policy]

Run the following command to obtain the storage policy configured for a directory:

jindo dfsadmin -getStoragePolicy [path]
Note The [path] parameter specifies the directory. The -R option specifies that a recursive operation is performed to configure the same storage policy for all subdirectories of the directory.
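As a minimal sketch of how the command above might be used, the following helper checks a policy name against the four policies listed in the table and prints the resulting command without running it. The function name storage_policy_cmd and the example paths are hypothetical. The jfs://test/ prefix assumes that the test namespace is addressed with a jfs:// URI, and the check assumes that policy names are written in uppercase as in the table.

```shell
#!/bin/sh
# Build (but do not run) a storage policy command after checking the policy name.
# Hypothetical helper; the command syntax is copied from this topic.
storage_policy_cmd() (
    path=$1
    policy=$2
    case "$policy" in
        COLD|WARM|HOT|TEMP) ;;   # the four policies listed in the table
        *) echo "unknown storage policy: $policy" >&2; exit 1 ;;
    esac
    echo "jindo dfsadmin -R -setStoragePolicy $path $policy"
)

# Hypothetical directories in the test namespace.
storage_policy_cmd jfs://test/warehouse/archive COLD
storage_policy_cmd jfs://test/tmp/shuffle TEMP
```

Printing the command first makes it possible to review it before it changes the policy of a whole directory tree, because the -R option applies the policy to all subdirectories.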

The Admin tool provides the archive command to archive cold data.

This command allows you to explicitly evict local blocks. For example, assume that a Hive table is partitioned by day. If the partitions that were generated more than a week ago are infrequently accessed, you can regularly run the archive command on the directories that store such data. The backups stored in the local cluster are then evicted, whereas the backups in OSS are retained.

Run the following archive command:

jindo dfsadmin -archive [path]
Note The [path] parameter specifies the directory in which the data is to be archived.
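The Hive scenario described above could be scripted roughly as follows. This is a sketch under stated assumptions: the function name archive_cmds_before, the dt=YYYY-MM-DD partition directory layout, and the jfs://test/warehouse/logs path are all hypothetical. The function only prints the archive commands for partitions older than a cutoff date so that they can be reviewed before they are run.

```shell
#!/bin/sh
# Print an archive command for every daily partition older than a cutoff date.
# Dates use the YYYY-MM-DD format; removing the hyphens lets them be
# compared as plain integers (for example, 20240307 < 20240308).
archive_cmds_before() (
    cutoff=$(echo "$1" | tr -d '-')   # partitions before this date are archived
    base=$2                           # directory that contains the partitions
    shift 2
    for day in "$@"; do
        if [ "$(echo "$day" | tr -d '-')" -lt "$cutoff" ]; then
            echo "jindo dfsadmin -archive ${base}/dt=${day}"
        fi
    done
)

# Hypothetical table directory and partition dates.
archive_cmds_before 2024-03-08 jfs://test/warehouse/logs \
    2024-03-01 2024-03-07 2024-03-08 2024-03-10
```

With these arguments, only the partitions dated 2024-03-01 and 2024-03-07 produce a command, because the cutoff date itself is excluded. On systems with GNU date, a cutoff of one week ago can be computed with date -d '7 days ago' +%Y-%m-%d.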