JindoFileSystem (JindoFS) is a cloud-native file system that combines the advantages of Object Storage Service (OSS) and local storage. It is also the next-generation storage system that provides efficient and reliable storage services for cloud computing in E-MapReduce. This topic describes how to configure and use JindoFS and the scenarios to which it applies.

Overview

JindoFS supports the block storage mode and cache mode.

JindoFS adopts a heterogeneous multi-backup mechanism. Its Storage Service provides the data storage capability: data is stored in OSS to guarantee high reliability, and redundant backups are kept in the local cluster to accelerate read operations. Its Namespace Service manages the metadata of JindoFS, so metadata is queried from Namespace Service instead of OSS, which improves query performance. This is the same way in which Hadoop Distributed File System (HDFS) queries metadata.

Note
  • E-MapReduce V3.20.0 and later support JindoFS. To use JindoFS, select the related services when you create an E-MapReduce cluster.
  • This topic describes how to use JindoFS in E-MapReduce V3.22.0 or later. For more information about how to use JindoFS in E-MapReduce V3.20.0 to V3.22.0 (V3.22.0 excluded), see Use JindoFS in E-MapReduce V3.20.0 to V3.22.0 (V3.22.0 excluded).

Prepare the environment

  • Create an E-MapReduce cluster

    Select E-MapReduce V3.22.0 or later. Select SmartData for Optional Services. For more information about how to create an E-MapReduce cluster, see Create a cluster.

  • Configure JindoFS

    JindoFS provided by SmartData uses OSS as its storage back end. Therefore, you must set OSS-related parameters before you use JindoFS. E-MapReduce provides two configuration methods. With the first method, you create an E-MapReduce cluster, modify the Bigboot-related parameters, and then restart SmartData for the configuration to take effect. With the second method, you add custom configuration when you create the E-MapReduce cluster. In this case, the related services are started with the custom parameters after the cluster is created.

    • Initialize parameters after the E-MapReduce cluster is created

      You can set all JindoFS-related parameters in Bigboot, as shown in the following figures.

      1. In the Service Configuration section, click bigboot.

      2. Click Custom Configuration.
      Note
      • The parameters framed in red in the preceding figures are required.
      • JindoFS supports multiple namespaces. A namespace named test is used in this topic.
      • jfs.namespaces: the namespaces supported by JindoFS. Separate multiple namespaces with commas (,). Example: test.
      • jfs.namespaces.test.uri: the storage back end of the test namespace. Example: oss://oss-bucket/oss-dir.
        Note You can set the value to a directory in an OSS bucket. In this case, the directory serves as the root directory in which the test namespace reads and writes data.
      • jfs.namespaces.test.mode: the storage mode of the test namespace. Example: block.
        Note JindoFS supports the block storage mode and cache mode.
      • jfs.namespaces.test.oss.access.key: the AccessKey ID used to access the OSS bucket that serves as the storage back end. Example: xxxx.
      • jfs.namespaces.test.oss.access.secret: the AccessKey secret used to access the OSS bucket that serves as the storage back end.
        Note We recommend that you use an OSS bucket in the same region and under the same account as the E-MapReduce cluster for better performance and stability. In this case, the E-MapReduce cluster can access the OSS bucket without the AccessKey ID and AccessKey secret.

      Save and deploy the configuration, and then restart all components of SmartData for JindoFS to become available.
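
      After SmartData restarts, you can verify that the namespace is reachable. The following command is a minimal check that assumes the namespace is named test, as in this topic:

      hadoop fs -ls jfs://test/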

    • Add custom configuration when creating an E-MapReduce cluster

      You can add custom configuration when creating an E-MapReduce cluster. For example, assume that you want to create an E-MapReduce cluster in the same region as an OSS bucket so that the cluster can access OSS without an AccessKey. As shown in the following figure, turn on Custom Software Settings and add the following configuration to the field in the Advanced Settings section to customize parameters for the test namespace:

      [
        {
          "ServiceName": "BIGBOOT",
          "FileName": "bigboot",
          "ConfigKey": "jfs.namespaces",
          "ConfigValue": "test"
        },
        {
          "ServiceName": "BIGBOOT",
          "FileName": "bigboot",
          "ConfigKey": "jfs.namespaces.test.uri",
          "ConfigValue": "oss://oss-bucket/oss-dir"
        },
        {
          "ServiceName": "BIGBOOT",
          "FileName": "bigboot",
          "ConfigKey": "jfs.namespaces.test.mode",
          "ConfigValue": "block"
        }
      ]

Use JindoFS

The use of JindoFS is similar to that of HDFS. JindoFS provides the jfs:// prefix. To use JindoFS, you only need to replace the hdfs:// prefix with the jfs:// prefix.

Currently, JindoFS supports most of the computing components in the E-MapReduce cluster, including Hadoop, Hive, Spark, Flink, Presto, and Impala.

Examples:

  • Run shell commands
    hadoop fs -ls jfs://your-namespace/ 
    hadoop fs -mkdir jfs://your-namespace/test-dir
    hadoop fs -put test.log jfs://your-namespace/test-dir/
    hadoop fs -get jfs://your-namespace/test-dir/test.log .
  • Run a MapReduce job
    hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar teragen -Dmapred.map.tasks=1000 10737418240 jfs://your-namespace/terasort/input
    hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar terasort -Dmapred.reduce.tasks=1000 jfs://your-namespace/terasort/input jfs://your-namespace/terasort/output
  • Run a Spark SQL test
    CREATE EXTERNAL TABLE IF NOT EXISTS src_jfs (key INT, value STRING) LOCATION 'jfs://your-namespace/Spark_sql_test/';
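
    You can then query the table to confirm that Spark SQL reads data through JindoFS. The following sketch uses the spark-sql command-line tool; the COUNT query is only an arbitrary sanity check against the src_jfs table created above:

    spark-sql -e "SELECT COUNT(*) FROM src_jfs"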

Control disk space usage

The back end of JindoFS is OSS, which is capable of storing large amounts of data. However, the storage capacity of local disks is limited, so JindoFS automatically releases the local backups of data that is less frequently accessed. By default, JindoFS uses the total storage capacity of all data disks. Two parameters, whose values range from 0 to 1 and indicate a percentage of disk space, control the space usage of local disks:

  • node.data-dirs.watermark.high.ratio: the upper limit of space that JindoFS can use on each disk. When the space used by JindoFS on a disk reaches this limit, less frequently accessed data on the disk is released.
  • node.data-dirs.watermark.low.ratio: the lower limit of space usage on each disk. After the upper limit is reached, less frequently accessed data is released until the space usage of the disk drops to this limit.

You can set the upper limit and lower limit to adjust and assign disk space to JindoFS. Make sure that the upper limit is greater than the lower limit.
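
For example, you can set the two watermarks in Bigboot by using the same custom configuration format as in the preceding section. This is a minimal sketch: the values 0.6 and 0.4 are illustrative assumptions, not recommended settings, and only satisfy the rule that the upper limit must be greater than the lower limit.

[
  {
    "ServiceName": "BIGBOOT",
    "FileName": "bigboot",
    "ConfigKey": "node.data-dirs.watermark.high.ratio",
    "ConfigValue": "0.6"
  },
  {
    "ServiceName": "BIGBOOT",
    "FileName": "bigboot",
    "ConfigKey": "node.data-dirs.watermark.low.ratio",
    "ConfigValue": "0.4"
  }
]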

Configure the storage policy

JindoFS provides multiple storage policies to meet different storage needs. The following four storage policies are available for a directory:

  • COLD: Data has only one backup in OSS and no backup in the local cluster. This policy is suitable for storing cold data.
  • WARM: The default storage policy. Data has one backup in OSS and one backup in the local cluster. The local backup can accelerate read operations.
  • HOT: Data has one backup in OSS and multiple backups in the local cluster. The local backups can accelerate read operations on hot data.
  • TEMP: Data has only one backup in the local cluster. This policy is suitable for storing temporary data. The local backup can accelerate read and write operations on the temporary data, but data reliability is lower.

JindoFS provides a command-line tool, Admin, to configure the storage policy of a directory. The default storage policy is WARM, and new files are stored according to the storage policy configured for their parent directory. Run the following command to configure the storage policy:

jindo dfsadmin -R -setStoragePolicy [path] [policy]

Run the following command to obtain the storage policy configured for a directory:

jindo dfsadmin -getStoragePolicy [path]
Note The [path] parameter specifies the directory. The -R option specifies that a recursive operation is performed to configure the same storage policy for all subdirectories of the directory.
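
For example, the following commands configure the TEMP policy for a directory of temporary data and then verify the setting. The path jfs://test/tmp-data is a hypothetical directory used only for illustration:

jindo dfsadmin -R -setStoragePolicy jfs://test/tmp-data TEMP
jindo dfsadmin -getStoragePolicy jfs://test/tmp-data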

Use the Admin tool

The Admin tool of JindoFS supports the archive command and the jindo dfsadmin commands.
  • The Admin tool provides the archive command to archive cold data.

    This command allows you to explicitly evict local blocks. For example, assume that a Hive table is partitioned by day. If the partitions generated more than a week ago are infrequently accessed, you can regularly run the archive command on the directories that store them. The backups stored in the local cluster are then evicted, whereas the backups in OSS are retained.

    Run the following archive command:

    jindo dfsadmin -archive [path]
    Note The [path] parameter specifies the directory in which the data is to be archived.
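    For example, to evict the local backups of a week-old Hive partition, you can run a command similar to the following one. The partition path is hypothetical and assumes the test namespace used in this topic:

    jindo dfsadmin -archive jfs://test/user/hive/warehouse/logs/ds=2019-06-01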
  • The Admin tool provides the jindo dfsadmin commands to manage the metadata that Namespace Service maintains for JindoFS.

    jindo dfsadmin [-options]
    Note You can run the jindo dfsadmin --help command to obtain help information.
The Admin tool also provides the diff and sync commands for the cache mode.
  • The diff command is used to display the differences between the data stored in the local cluster and the data stored in OSS.
    jindo dfsadmin -R -diff [path]
    Note By default, the diff command compares the metadata of the directory specified by the [path] parameter in the local cluster with that in OSS. The -R option specifies that a recursive comparison is performed, which also covers all subdirectories of the specified directory.
  • The sync command is used to synchronize metadata between the local cluster and OSS.
    jindo dfsadmin -R -sync [path]
    Note The [path] parameter specifies the directory whose metadata is to be synchronized. By default, the sync command synchronizes the metadata of the specified directory to the local cluster. The -R option specifies that a recursive synchronization is performed, which also covers all subdirectories of the specified directory.
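    For example, in the cache mode you can first inspect the differences for a directory and then reconcile them. The path jfs://test/oss-dir is a hypothetical directory that assumes the test namespace used in this topic:

    jindo dfsadmin -R -diff jfs://test/oss-dir
    jindo dfsadmin -R -sync jfs://test/oss-dir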
    Note The [path] parameter specifies the directory in which the metadata is to be synchronized. By default, you can use the sync command to synchronize the metadata in the subdirectories of the directory specified by the [path] parameter to the local cluster. The -R option specifies that a recursive operation is performed to synchronize the metadata in all subdirectories of the directory specified by the [path] parameter.