This topic describes how to configure and use JindoFileSystem (JindoFS) and the scenarios to which it applies.

Overview

JindoFS is a cloud-native file system that combines the advantages of Object Storage Service (OSS) and local storage. JindoFS is also the next-generation storage system that provides efficient and reliable storage services for cloud computing in E-MapReduce.

JindoFS supports the block storage mode and cache mode.

JindoFS adopts a heterogeneous multi-backup mechanism. The Storage Service provides data storage capabilities: data is stored in OSS to guarantee high reliability, and redundant backups are stored in the local cluster to accelerate read operations. The Namespace Service manages the metadata of JindoFS. Metadata is queried from the Namespace Service instead of OSS, which improves query performance. This query model is the same as that of Hadoop Distributed File System (HDFS).

Note
  • E-MapReduce V3.20.0 and later support JindoFS. To use JindoFS, select the related services when you create an E-MapReduce cluster.
  • This topic describes how to use JindoFS in E-MapReduce V3.20.0 to V3.22.0 (V3.22.0 excluded). For more information about how to use JindoFS in E-MapReduce V3.22.0 or later, see Use JindoFS in E-MapReduce V3.22.0 or later.

Scenarios

E-MapReduce has three storage systems: E-MapReduce OssFileSystem, E-MapReduce HDFS, and E-MapReduce JindoFS. Among them, OssFileSystem and JindoFS store data in the cloud. The following table compares the features of the three E-MapReduce storage systems and of Hadoop support for Alibaba Cloud OSS.

| Feature | Hadoop support for Alibaba Cloud OSS | E-MapReduce OssFileSystem | E-MapReduce HDFS | E-MapReduce JindoFS |
| --- | --- | --- | --- | --- |
| Storage capacity | Tremendous | Tremendous | Depends on the E-MapReduce cluster scale | Tremendous |
| Reliability | High | High | High | High |
| Factor that affects throughput | Server | I/O performance of caches on disks in the E-MapReduce cluster | I/O performance of disks in the E-MapReduce cluster | I/O performance of disks in the E-MapReduce cluster |
| Metadata query efficiency | Low | Medium | High | High |
| Scale-out operation | Easy | Easy | Easy | Easy |
| Scale-in operation | Easy | Easy | Requires node decommission | Easy |
| Data locality | None | Weak | Strong | Medium |

The block storage mode of JindoFS has the following features:

  • JindoFS offers tremendous and scalable storage capacity by using OSS as the storage back end. The storage capacity is independent of the E-MapReduce cluster scale. The local cluster can be scaled in or out as required.
  • JindoFS stores a certain amount of backup data in the local cluster to accelerate read operations. This improves the throughput by using limited local storage capacity, especially for Write Once Read Many (WORM) solutions.
  • JindoFS provides efficient metadata query similar to HDFS. Compared with OssFileSystem, JindoFS saves much time in metadata query. In addition, JindoFS avoids system instability when data and metadata are frequently accessed.
  • JindoFS moves computation as close as possible to data. This reduces the load on network transmission and improves the read performance.

Prepare the environment

  • Create an E-MapReduce cluster

    Select a version from E-MapReduce V3.20.0 to V3.22.0 (V3.22.0 excluded). Select SmartData and Bigboot for Optional Services. For more information about how to create an E-MapReduce cluster, see Create an E-MapReduce cluster. Bigboot provides distributed data management and component management services in E-MapReduce. Based on Bigboot, SmartData provides JindoFS for the application layer.

  • Configure JindoFS
    JindoFS provided by SmartData uses OSS as the storage back end. Therefore, you must set OSS-related parameters before you use JindoFS. E-MapReduce provides two configuration methods. In the first method, you create the E-MapReduce cluster first, modify the Bigboot-related parameters, and then restart SmartData for the configuration to take effect. In the second method, you add custom configuration when you create the E-MapReduce cluster. In this case, the related services start with the custom parameters after the cluster is created.
    • Initialize parameters after the E-MapReduce cluster is created

      You can set all JindoFS-related parameters in Bigboot. The following parameters are required:

      • oss.access.bucket: the name of the OSS bucket.
      • oss.data-dir: the directory in the OSS bucket that JindoFS uses as its storage back end. JindoFS automatically creates this directory when it writes data, so you do not need to create it in advance. Do not modify or delete the data that JindoFS generates in this directory.
      • oss.access.endpoint: the endpoint of the region in which the OSS bucket resides.
      • oss.access.key and oss.access.secret: the AccessKey ID and AccessKey secret used to access the OSS bucket.

      We recommend that you select an OSS bucket in the same region and under the same account as the E-MapReduce cluster for better performance and stability. In this case, the cluster can access the OSS bucket without the AccessKey ID and AccessKey secret.

      Note
      • JindoFS supports multiple namespaces. A namespace named test is used in this topic.

      Save and deploy the JindoFS configuration. Restart all components in SmartData to use JindoFS.

    • Add custom configuration when creating an E-MapReduce cluster
      You can add custom configuration when you create an E-MapReduce cluster. For example, assume that you want to create an E-MapReduce cluster in the same region as an OSS bucket so that the cluster can access OSS without an AccessKey. In the Advanced Settings section, turn on Custom Software Settings and add the following configuration, which includes oss.data-dir and oss.access.bucket:
      [
          {
              "ServiceName": "BIGBOOT",
              "FileName": "bigboot",
              "ConfigKey": "oss.data-dir",
              "ConfigValue": "jindoFS-1"
          },
          {
              "ServiceName": "BIGBOOT",
              "FileName": "bigboot",
              "ConfigKey": "oss.access.bucket",
              "ConfigValue": "oss-bucket-name"
          }
      ]

Use JindoFS

The use of JindoFS is similar to that of HDFS. JindoFS also provides a prefix. To use JindoFS, you only need to replace the hdfs prefix with the jfs prefix. Example:
hadoop fs -ls jfs:///
hadoop fs -mkdir jfs:///test-dir
hadoop fs -put test.log jfs:///test-dir/
JindoFS can be accessed only by jobs, such as Hadoop, Hive, and Spark jobs, that run in the E-MapReduce cluster.
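The prefix swap described above can be illustrated with plain shell string substitution. The path below is a made-up example:

```shell
# An existing HDFS path (hypothetical example).
hdfs_path="hdfs:///user/hive/warehouse/test.db"

# Replace the hdfs prefix with the jfs prefix to address the same
# layout through JindoFS.
jfs_path="${hdfs_path/hdfs:/jfs:}"
echo "$jfs_path"   # → jfs:///user/hive/warehouse/test.db
```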

Control disk space usage

The back end of JindoFS is based on OSS, which can store large amounts of data. However, the storage capacity of local disks is limited. Therefore, JindoFS automatically releases data backups that are less frequently accessed. By default, JindoFS uses the total storage capacity of all data disks. Two parameters adjust the space usage of local disks. The values of both parameters are in the range of 0 to 1 and indicate a percentage of space usage:

  • node.data-dirs.watermark.high.ratio: the upper limit of space usage on each disk. When the space used by JindoFS on a disk reaches this limit, less frequently accessed data stored on the disk is released.
  • node.data-dirs.watermark.low.ratio: the lower limit of space usage on each disk. After the space usage of a disk reaches the upper limit, less frequently accessed data is released until the space usage drops to this lower limit.

You can set the upper and lower limits to adjust how much disk space is assigned to JindoFS. Make sure that the upper limit is greater than the lower limit.
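As a worked illustration of how the two watermarks interact, assume a single 1000 GB data disk with node.data-dirs.watermark.high.ratio set to 0.8 and node.data-dirs.watermark.low.ratio set to 0.6. The disk size and ratio values are made up for this sketch:

```shell
# Ratios are expressed as percentages so the shell can use integer math.
disk_gb=1000
high_pct=80   # node.data-dirs.watermark.high.ratio = 0.8
low_pct=60    # node.data-dirs.watermark.low.ratio  = 0.6

# Eviction of less frequently accessed backups starts at the high
# watermark and continues until usage drops to the low watermark.
high_gb=$(( disk_gb * high_pct / 100 ))
low_gb=$(( disk_gb * low_pct / 100 ))
echo "Eviction starts at ${high_gb} GB and frees space down to ${low_gb} GB"
```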

Configure the storage policy

JindoFS provides multiple storage policies to meet different storage needs. The following table lists four available storage policies for a directory.

| Policy | Description |
| --- | --- |
| COLD | Data has only one backup in OSS and no backups in the local cluster. This policy is suitable for storing cold data. |
| WARM | The default storage policy. Data has one backup in OSS and one backup in the local cluster. The local backup can accelerate read operations. |
| HOT | Data has one backup in OSS and multiple backups in the local cluster. Local backups can accelerate read operations on hot data. |
| TEMP | Data has only one backup in the local cluster. This policy is suitable for storing temporary data. The local backup can accelerate read and write operations on the temporary data. However, this may lower data reliability. |

JindoFS provides a command-line tool, Admin, to configure the storage policy of a directory. The default storage policy is WARM. New files are stored according to the storage policy configured for their parent directory. Run the following command to configure a storage policy:

jindo dfsadmin -R -setStoragePolicy [path] [policy]

Run the following command to obtain the storage policy configured for a directory:

jindo dfsadmin -getStoragePolicy [path]
Note The [path] parameter specifies the directory. The -R option specifies that a recursive operation is performed to configure the same storage policy for all subdirectories of the directory.

The Admin tool provides the archive command to archive cold data.

This command allows you to explicitly evict local backups. Assume that Hive partitions a table by day, and the data generated more than a week ago is infrequently accessed. You can regularly run the archive command on the directories that store such data. The backups stored in the local cluster are then evicted, whereas the backups in OSS are retained.

Run the following archive command:

jindo dfsadmin -archive [path]
Note The [path] parameter specifies the directory in which the data is to be archived.
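The weekly archiving scenario above can be sketched as a small dry-run script. The partition names and the jfs:/// layout are made-up examples; on a real cluster, you would list partitions with `hadoop fs -ls` and execute the printed `jindo dfsadmin -archive` command instead of echoing it:

```shell
# Hypothetical cutoff and Hive-style date partitions.
cutoff="2024-05-01"
partitions="2024-04-20 2024-04-28 2024-05-03"

to_archive=""
for dt in $partitions; do
    # Plain string comparison works because zero-padded ISO 8601 dates
    # sort lexicographically.
    if [[ "$dt" < "$cutoff" ]]; then
        to_archive="${to_archive}${dt} "
        # Dry run: print the archive command that would be issued.
        echo "jindo dfsadmin -archive jfs:///warehouse/logs/dt=${dt}"
    fi
done
```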