This topic describes the block storage mode of JindoFS and its use scenarios.

Overview

Block storage is the most efficient mode for reading and writing data and for querying metadata. It also supports the Hadoop Distributed File System (HDFS) semantics related to data locality. JindoFS provides an external client so that you can access JindoFS from outside an E-MapReduce (EMR) cluster.

JindoFS uses Object Storage Service (OSS) as the storage backend. In block storage mode, JindoFS stores data as blocks in OSS and uses Namespace Service to maintain metadata. This ensures high performance when you read and write data or query metadata.

Scenarios

EMR has three storage systems: EMR OssFileSystem, EMR HDFS, and EMR JindoFS. Of these, OssFileSystem and JindoFS store data in the cloud. The following table compares the three EMR storage systems with Hadoop support for Alibaba Cloud OSS.

Feature | Hadoop support for Alibaba Cloud OSS | E-MapReduce OssFileSystem | E-MapReduce HDFS | E-MapReduce JindoFS
Storage capacity | Tremendous | Tremendous | Depending on the EMR cluster scale | Tremendous
Reliability | High | High | High | High
Factor that affects throughput | Server | I/O performance of caches on disks in the EMR cluster | I/O performance of disks in the EMR cluster | I/O performance of disks in the EMR cluster
Metadata query efficiency | Low | Medium | High | High
Scale-out operation | Easy | Easy | Easy | Easy
Scale-in operation | Easy | Easy | Node decommission required | Easy
Data locality | None | Weak | Strong | Medium

The block storage mode of JindoFS has the following features:

  • JindoFS offers tremendous and scalable storage capacity by using OSS as the storage backend. The storage capacity is independent of the EMR cluster scale. The local cluster can be scaled in or out as required.
  • JindoFS keeps backups of some data in the local cluster to accelerate read operations. This improves throughput while using only limited local storage capacity, especially in Write Once Read Many (WORM) scenarios.
  • JindoFS provides metadata queries as efficient as those of HDFS. Compared with OssFileSystem, JindoFS spends much less time on metadata queries. JindoFS also avoids system instability when data and metadata are frequently accessed.
  • JindoFS maximizes data locality when jobs are executed in the EMR cluster. This reduces network traffic and improves read performance.

Configure JindoFS

You can set all parameters related to JindoFS in Bigboot, as shown in the following figures.

Figure 1. Modify a parameter
Figure 2. Add parameters
Note
  • The parameters framed in red in the preceding figures are required.
  • JindoFS supports multiple namespaces. A namespace named test is used in this topic.
Parameter | Description | Example
jfs.namespaces | The namespace supported by JindoFS. Separate multiple namespaces with commas (,). | test
jfs.namespaces.test.uri | The storage backend of the test namespace. Note: You can set the value to a directory in an OSS bucket. In this case, this directory serves as the root directory, in which the test namespace reads and writes data. | oss://oss-bucket/oss-dir
jfs.namespaces.test.mode | The storage mode of the test namespace. | block
jfs.namespaces.test.oss.access.key | The AccessKey ID used to access the OSS bucket that serves as the storage backend. Note: We recommend that you store data in an OSS bucket that is in the same region and under the same account as your EMR cluster. This ensures high performance and stability. In this case, you do not need to configure the AccessKey ID and AccessKey secret because the OSS bucket allows password-free access from the EMR cluster. | xxxx
jfs.namespaces.test.oss.access.secret | The AccessKey secret used to access the OSS bucket that serves as the storage backend. |
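
As a checklist for the table above, the following sketch prints the parameters needed for one block-mode namespace. The function name jfs_namespace_params is hypothetical, and the key pattern jfs.namespaces.<name>.* for namespaces other than test is inferred from the test example rather than stated in this topic. The output is only a list of values to enter in the console, not a configuration file format.

```shell
#!/usr/bin/env bash
# Print the parameters for one block-mode namespace as "key value" lines.
# Assumption: keys for other namespaces follow the jfs.namespaces.<name>.*
# pattern shown for the "test" namespace.
# The AccessKey parameters are left out, which matches the recommended setup
# of an OSS bucket in the same region and account as the EMR cluster.
jfs_namespace_params() {
  local ns="$1" uri="$2"
  printf 'jfs.namespaces %s\n' "$ns"
  printf 'jfs.namespaces.%s.uri %s\n' "$ns" "$uri"
  printf 'jfs.namespaces.%s.mode block\n' "$ns"
}

# Values taken from the Example column of the table.
jfs_namespace_params test oss://oss-bucket/oss-dir
```

If you define several namespaces, the value of jfs.namespaces is the comma-separated list of all their names, so the first printed line needs to be merged by hand.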

Save and deploy the JindoFS configuration. Restart Namespace Service in SmartData to use JindoFS.


Configure the storage policy

JindoFS provides multiple storage policies to meet different storage needs. The following table lists four available storage policies for a directory.

Policy | Description
COLD | Data has a backup in OSS but no backups in the local cluster. This policy is suitable for storing cold data.
WARM | The default storage policy. Data has a backup in OSS and a backup in the local cluster. The local backup can accelerate read operations.
HOT | Data has a backup in OSS and multiple backups in the local cluster. Local backups can accelerate read operations on hot data.
TEMP | Data has a backup in the local cluster but no backups in OSS. This policy is suitable for storing temporary data. The local backup can accelerate read and write operations on the temporary data. However, this may lower data reliability.

JindoFS provides a command-line tool named Admin to configure the storage policy of a directory. The default storage policy is WARM. New files are stored based on the storage policy of their parent directory. Run the following command to configure the storage policy:

jindo dfsadmin -R -setStoragePolicy [path] [policy]

Run the following command to obtain the storage policy configured for a directory:

jindo dfsadmin -getStoragePolicy [path]
Note The [path] parameter specifies the directory, and the [policy] parameter specifies one of the storage policies listed in the preceding table (COLD, WARM, HOT, or TEMP). The -R option applies the storage policy recursively to all subdirectories of the directory.
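
The two commands above can be wrapped as in the following sketch, which rejects policy names other than the four described in this topic. The helper names run, set_policy, and get_policy, the DRY_RUN switch, and the example directory jfs://test/warehouse/logs are all hypothetical. In particular, the path form that the Admin tool accepts is not stated here, so substitute a path that is valid in your cluster.

```shell
#!/usr/bin/env bash
# DRY_RUN=1 (default) only prints the commands. Set DRY_RUN=0 on a cluster
# node where the jindo CLI is installed to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Apply a policy to a directory and all of its subdirectories (-R), after
# checking that the policy is one of the four described in this topic.
set_policy() {
  local path="$1" policy="$2"
  case "$policy" in
    COLD|WARM|HOT|TEMP) ;;
    *) echo "unknown storage policy: $policy" >&2; return 1 ;;
  esac
  run jindo dfsadmin -R -setStoragePolicy "$path" "$policy"
}

# Show the policy currently configured for a directory.
get_policy() {
  run jindo dfsadmin -getStoragePolicy "$1"
}

# Hypothetical directory in the test namespace.
DIR="jfs://test/warehouse/logs"
set_policy "$DIR" COLD
get_policy "$DIR"
```

Validating the policy name before calling the tool keeps a typo from reaching the cluster. The dry-run default makes it possible to review the exact commands first.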

The Admin tool provides the archive command to archive cold data.

This command explicitly evicts local blocks. For example, assume that a Hive table is partitioned by day and that the partitions generated more than a week ago are infrequently accessed. You can run the archive command on the directories that store such data on a regular basis. The backups stored in the local cluster are then evicted, whereas the backups in OSS are retained.

Run the following archive command:

jindo dfsadmin -archive [path]
Note The [path] parameter specifies the directory in which the data is to be archived.
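
The Hive scenario described above could be scripted roughly as follows. This is a sketch under several assumptions: the table directory jfs://test/warehouse/events and the dt=YYYY-MM-DD partition layout are hypothetical, the helper names and the DRY_RUN switch are illustrative, and the date arithmetic relies on GNU date.

```shell
#!/usr/bin/env bash
# DRY_RUN=1 (default) only prints the command. Set DRY_RUN=0 on a cluster
# node where the jindo CLI is installed to execute it.
DRY_RUN="${DRY_RUN:-1}"

# Build the directory of one daily partition. The dt=YYYY-MM-DD layout is a
# hypothetical convention; adjust it to match your Hive table.
partition_dir() {
  echo "$1/dt=$2"
}

# Archive one daily partition so that its local blocks are evicted.
archive_partition() {
  local dir
  dir="$(partition_dir "$1" "$2")"
  if [ "$DRY_RUN" = "1" ]; then
    echo "jindo dfsadmin -archive $dir"
  else
    jindo dfsadmin -archive "$dir"
  fi
}

# Hypothetical table directory in the test namespace.
TABLE_DIR="jfs://test/warehouse/events"
# The partition that is exactly one week old today (GNU date syntax).
WEEK_OLD="$(date -d '7 days ago' +%Y-%m-%d)"
archive_partition "$TABLE_DIR" "$WEEK_OLD"
```

Run once a day, the script archives each partition as it passes the one-week mark, so only the most recent week of data keeps its local backups.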