This topic describes how to use the transparent caching feature of JindoCache to accelerate access to OSS-HDFS. Storage resources of E-MapReduce (EMR) clusters are used to cache OSS-HDFS objects.
Prerequisites
An EMR cluster is created and JindoCache is selected when you create the cluster. For more information, see Create a cluster.
OSS-HDFS is enabled and access permissions on OSS-HDFS are granted. For more information, see Enable OSS-HDFS and grant access permissions.
Procedure
Select caching policies.
JindoCache uses CacheSets to manage different caching policies. You can select different caching policies for data that is stored in different paths based on your business requirements. JindoCache supports one or more CacheSets.
Log on to your cluster. For more information, see Log on to a cluster.
Add the cacheset.xml file. In this example, the cacheset.xml file is stored in the /path directory.
<?xml version="1.0" encoding="UTF-8"?>
<cachesets>
  <cacheset>
    <name>name1</name>
    <path>oss://emr-test/dir1</path>
    <cacheStrategy>DISTRIBUTED</cacheStrategy>
    <metaPolicy>
      <type>ALWAYS</type>
    </metaPolicy>
    <readPolicy>CACHE_ASIDE</readPolicy>
    <writePolicy>WRITE_AROUND</writePolicy>
  </cacheset>
  <cacheset>
    <name>name2</name>
    <path>oss://emr-test/dir2</path>
    <cacheStrategy>DHT</cacheStrategy>
    <metaPolicy>
      <type>ONCE</type>
    </metaPolicy>
    <readPolicy>CACHE_ASIDE</readPolicy>
    <writePolicy>WRITE_AROUND</writePolicy>
  </cacheset>
</cachesets>
| Parameter | Description | Example |
| --- | --- | --- |
| name | The name of the CacheSet. The name must be unique. If a CacheSet with the same name already exists, its configuration is overwritten. | name1 |
| path | The parent path that the CacheSet manages. The policies of the CacheSet apply to data in all child paths under this parent path. | oss://emr-test/dir1 |
| cacheStrategy | The caching policy. Valid values: DISTRIBUTED and DHT. DHT is short for distributed hash table. The DHT caching policy is used to accelerate access to small, read-only files. Select a policy based on your business requirements. | DISTRIBUTED |
| metaPolicy | The metadata caching policy. Valid values: ALWAYS (metadata is not cached and is always read from the remote storage) and ONCE (metadata is cached: after the metadata is read from the remote storage for the first time, it is always read from the local storage). Select a policy based on your business requirements. Note: If the cacheStrategy parameter is set to DHT, this parameter must be set to ONCE. | ALWAYS |
| readPolicy | The policy for reading files. This parameter can be set only to CACHE_ASIDE, which indicates that files are preferentially read from the cache. | CACHE_ASIDE |
| writePolicy | The policy for writing files. Valid values: WRITE_AROUND (writes files to the remote storage), CACHE_ONLY (writes files to the cache), and WRITE_THROUGH (writes files to both the cache and the remote storage). Note: If you set this parameter to CACHE_ONLY, the metaPolicy parameter must be set to ONCE. | WRITE_AROUND |
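The constraints in the parameter descriptions above can be checked before the file is loaded into JindoCache. The following is a minimal sketch in Python that uses only the standard library. The function name check_cachesets is hypothetical and is not part of JindoCache. The sketch enforces only the rules stated in this topic and treats a name that appears twice in the same file as a problem, which is an assumption drawn from the uniqueness requirement.

```python
import xml.etree.ElementTree as ET

# Allowed values, taken from the parameter descriptions in this topic.
CACHE_STRATEGIES = {"DISTRIBUTED", "DHT"}
META_POLICIES = {"ALWAYS", "ONCE"}
READ_POLICIES = {"CACHE_ASIDE"}
WRITE_POLICIES = {"WRITE_AROUND", "CACHE_ONLY", "WRITE_THROUGH"}


def check_cachesets(xml_text):
    """Return a list of problems found in the text of a cacheset.xml file.

    Hypothetical helper: it is not shipped with JindoCache.
    """
    problems = []
    seen_names = set()
    root = ET.fromstring(xml_text)
    for cs in root.findall("cacheset"):
        name = (cs.findtext("name") or "").strip()
        path = (cs.findtext("path") or "").strip()
        strategy = (cs.findtext("cacheStrategy") or "").strip()
        meta = (cs.findtext("metaPolicy/type") or "").strip()
        read = (cs.findtext("readPolicy") or "").strip()
        write = (cs.findtext("writePolicy") or "").strip()
        label = name or "<unnamed>"

        if not name:
            problems.append("a cacheset has no name")
        elif name in seen_names:
            # Assumption: duplicates inside one file are unintended.
            problems.append(f"{label}: name is used more than once")
        seen_names.add(name)

        if not path:
            problems.append(f"{label}: path is empty")
        if strategy not in CACHE_STRATEGIES:
            problems.append(f"{label}: unknown cacheStrategy {strategy!r}")
        if meta not in META_POLICIES:
            problems.append(f"{label}: unknown metaPolicy type {meta!r}")
        if read not in READ_POLICIES:
            problems.append(f"{label}: unknown readPolicy {read!r}")
        if write not in WRITE_POLICIES:
            problems.append(f"{label}: unknown writePolicy {write!r}")

        # Cross-parameter rules from the two notes in the parameter table.
        if strategy == "DHT" and meta in META_POLICIES and meta != "ONCE":
            problems.append(f"{label}: DHT requires metaPolicy ONCE")
        if write == "CACHE_ONLY" and meta in META_POLICIES and meta != "ONCE":
            problems.append(f"{label}: CACHE_ONLY requires metaPolicy ONCE")
    return problems
```

An empty list means that the file satisfies the rules described in this topic. It does not guarantee that JindoCache accepts the file.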
Run the following command to refresh CacheSets in JindoCache:
jindocache -refreshCacheSet -path /path/cacheset.xml
If the command is successful, the output contains the following message:
Successfully refresh cacheset !!!
For information about JindoCache commands, see JindoCache CLI usage notes.
Run the following command to view information about CacheSets in the system:
jindocache -listCacheSet
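If you script this step, the success message can be checked automatically. The following Python sketch wraps the two commands shown above. The function names are hypothetical. Whether jindocache prints the message to standard output or to standard error is not stated in this topic, so the sketch checks both streams.

```python
import subprocess

# The message that this topic says appears when the refresh succeeds.
SUCCESS_MARKER = "Successfully refresh cacheset"


def refresh_succeeded(output):
    """Return True if the command output contains the success message."""
    return SUCCESS_MARKER in output


def refresh_cachesets(xml_path):
    """Run the refresh command from this topic and report whether it succeeded.

    Hypothetical wrapper: it requires the jindocache command on the PATH.
    """
    result = subprocess.run(
        ["jindocache", "-refreshCacheSet", "-path", xml_path],
        capture_output=True,
        text=True,
        check=False,
    )
    # Assumption: the message may be written to either stream.
    return refresh_succeeded(result.stdout + result.stderr)


def list_cachesets():
    """Run the list command from this topic and return its raw output."""
    result = subprocess.run(
        ["jindocache", "-listCacheSet"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

For example, refresh_cachesets("/path/cacheset.xml") can be called on a cluster node where JindoCache is installed.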
Configure JindoSDK.
Configure the OSS-HDFS implementation class of JindoCache in Hadoop-Common. Go to the core-site.xml tab on the Configure tab of the Hadoop-Common service page in the EMR console and modify the configuration item that is described in the following table. For more information, see Manage configuration items.
| Configuration item | Description |
| --- | --- |
| fs.xengine | The value is fixed as jindocache. If you leave this configuration item empty, the client will no longer cache data, but directly communicate with the backend. |
Note: In this step, the configuration item is configured on the client. The configuration takes effect without the need to restart JindoCache.
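To confirm that a client picks up this setting, you can inspect the core-site.xml file that the client reads. The following Python sketch assumes the standard Hadoop configuration layout, in which a configuration element contains property elements that each have a name and a value. The function names are hypothetical, and the location of the file on your cluster is not covered here.

```python
import xml.etree.ElementTree as ET


def get_property(core_site_xml, key):
    """Return the value of a Hadoop property, or None if it is not set."""
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == key:
            return (prop.findtext("value") or "").strip()
    return None


def caching_enabled(core_site_xml):
    """Return True if fs.xengine is set to jindocache.

    According to this topic, an empty value means that the client
    communicates directly with the backend without caching.
    """
    return get_property(core_site_xml, "fs.xengine") == "jindocache"
```

Pass the text of the file to caching_enabled, for example the result of reading core-site.xml with open(...).read().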
After the preceding configuration is complete, you can run jobs that access OSS-HDFS through the cache. Because the caching feature of JindoCache is transparent, you do not need to modify job configurations. The first time a job reads data from OSS-HDFS, the data is automatically cached in JindoCache. Subsequent reads of the same data hit the cache, which improves read performance.
FAQ
How do I configure an AccessKey pair that is used to access OSS-HDFS?
JindoCache allows you to access OSS-HDFS in password-free mode. If you want to access OSS-HDFS across accounts, you must configure information, such as the AccessKey ID, AccessKey secret, and endpoint for authorization.
Go to the common tab on the Configure tab of the JindoCache service page.
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
On the EMR on ECS page, find the desired cluster and click Services in the Actions column.
On the Services tab, find the JindoCache service and click Configure.
On the Configure tab, click the common tab.
Add configuration items and make the configurations take effect.
On the common tab, click Add Configuration Item.
In the Add Configuration Item dialog box, add the configuration items that are described in the following table.
For more information about how to add configuration items and make configurations take effect, see Manage configuration items.
Note: YYY specifies the name of a bucket for which OSS-HDFS is enabled.
| Configuration item | Description |
| --- | --- |
| jindocache.oss.bucket.YYY.accessKeyId | The AccessKey ID that is used to access OSS-HDFS. |
| jindocache.oss.bucket.YYY.accessKeySecret | The AccessKey secret that is used to access OSS-HDFS. |
| jindocache.oss.bucket.YYY.endpoint | The endpoint of OSS-HDFS. Example: cn-hangzhou.oss-dls.aliyuncs.com. |
| jindocache.oss.bucket.YYY.data.lake.storage.enable | The value is fixed as true. |
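Because the bucket name is embedded in each configuration item name, it is easy to mistype one of the four keys. The following Python sketch builds the key-value pairs for a given bucket so that they can be copied into the Add Configuration Item dialog box. The function name and the bucket name emr-test are illustrative assumptions. Do not hard-code a real AccessKey secret in scripts.

```python
def bucket_auth_items(bucket, access_key_id, access_key_secret, endpoint):
    """Return the four JindoCache configuration items for one bucket.

    Hypothetical helper: the key names follow the table in this topic,
    with YYY replaced by the bucket name.
    """
    prefix = f"jindocache.oss.bucket.{bucket}"
    return {
        f"{prefix}.accessKeyId": access_key_id,
        f"{prefix}.accessKeySecret": access_key_secret,
        f"{prefix}.endpoint": endpoint,
        # This topic states that the value is fixed as true.
        f"{prefix}.data.lake.storage.enable": "true",
    }


if __name__ == "__main__":
    items = bucket_auth_items(
        "emr-test",                       # placeholder bucket name
        "<your-access-key-id>",           # placeholder, not a real key
        "<your-access-key-secret>",       # placeholder, not a real secret
        "cn-hangzhou.oss-dls.aliyuncs.com",
    )
    for key, value in items.items():
        print(f"{key}={value}")
```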