Repeated reads from OSS-HDFS incur remote I/O on every access, which increases job latency. JindoCache stores OSS-HDFS objects on EMR cluster storage so that subsequent reads are served from local cache. Data is cached automatically on the first read, and Hadoop jobs access OSS-HDFS without changes to the jobs themselves.
Prerequisites
Before you begin, make sure that you have:
An EMR cluster with JindoCache selected at cluster creation. For more information, see Create a cluster.
OSS-HDFS enabled with access permissions granted. For more information, see Enable OSS-HDFS and grant access permissions.
How caching works
JindoCache uses CacheSets to apply caching policies to specific OSS paths. Each CacheSet targets a parent path and controls how metadata, reads, and writes are handled for all data under that path. You can define one or more CacheSets, each with a different policy, to match the access patterns of different datasets.
Once you define your CacheSets and configure JindoSDK, caching is fully transparent: Hadoop jobs access OSS-HDFS as usual, and JindoCache handles cache population and lookup automatically.
Step 1: Define caching policies
Create cacheset.xml
Define your CacheSets in a cacheset.xml file. The following example configures two CacheSets for different OSS paths:
<?xml version="1.0" encoding="UTF-8"?>
<cachesets>
    <cacheset>
        <name>name1</name>
        <path>oss://emr-test/dir1</path>
        <cacheStrategy>DISTRIBUTED</cacheStrategy>
        <metaPolicy>
            <type>ALWAYS</type>
        </metaPolicy>
        <readPolicy>CACHE_ASIDE</readPolicy>
        <writePolicy>WRITE_AROUND</writePolicy>
    </cacheset>
    <cacheset>
        <name>name2</name>
        <path>oss://emr-test/dir2</path>
        <cacheStrategy>DHT</cacheStrategy>
        <metaPolicy>
            <type>ONCE</type>
        </metaPolicy>
        <readPolicy>CACHE_ASIDE</readPolicy>
        <writePolicy>WRITE_AROUND</writePolicy>
    </cacheset>
</cachesets>
Each CacheSet maps a parent path to a set of policies. The following table describes all parameters.
| Parameter | Description | Example |
|---|---|---|
| name | Unique name for the CacheSet. If a CacheSet with the same name already exists, its configuration is overwritten. | name1 |
| path | Parent OSS path. The policies apply to all data under this path. | oss://emr-test/dir1 |
| cacheStrategy | The caching strategy. Valid values: DISTRIBUTED and DHT (distributed hash table). DHT is intended to accelerate access to small, read-only files. | DISTRIBUTED |
| metaPolicy | Whether to cache file metadata locally. ALWAYS: metadata is not cached and is always read from remote storage. ONCE: metadata is cached after the first remote read and is served from local storage afterward. If cacheStrategy is DHT, set this to ONCE. | ALWAYS |
| readPolicy | How files are read. Must be CACHE_ASIDE: files are preferentially read from the cache. | CACHE_ASIDE |
| writePolicy | Where writes go. WRITE_AROUND: writes files to remote storage only. WRITE_THROUGH: writes files to both the cache and remote storage. CACHE_ONLY: writes files to the cache only; requires metaPolicy set to ONCE. | WRITE_AROUND |
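The constraints in the table (DHT requires ONCE, CACHE_ONLY requires ONCE, readPolicy must be CACHE_ASIDE) can be checked before the file is loaded. The following is a minimal sketch using only the Python standard library. The function name `validate_cachesets` is hypothetical and is not part of JindoCache; it only encodes the rules stated above, and JindoCache itself may enforce additional rules.

```python
import xml.etree.ElementTree as ET

# Allowed values, taken from the parameter table above.
STRATEGIES = {"DISTRIBUTED", "DHT"}
META_POLICIES = {"ALWAYS", "ONCE"}
WRITE_POLICIES = {"WRITE_AROUND", "WRITE_THROUGH", "CACHE_ONLY"}


def _text(node, path):
    """Return the stripped text of a child element, or '' if it is missing."""
    return (node.findtext(path) or "").strip()


def validate_cachesets(xml_text):
    """Hypothetical helper: return a list of problems found in cacheset.xml."""
    problems = []
    seen_names = set()
    root = ET.fromstring(xml_text)
    for cs in root.findall("cacheset"):
        name = _text(cs, "name")
        strategy = _text(cs, "cacheStrategy")
        meta = _text(cs, "metaPolicy/type")
        read = _text(cs, "readPolicy")
        write = _text(cs, "writePolicy")
        label = name or "<unnamed>"

        if not name:
            problems.append("a cacheset has no name")
        elif name in seen_names:
            problems.append(f"{label}: duplicate name, the later entry overwrites the earlier one")
        seen_names.add(name)

        if strategy not in STRATEGIES:
            problems.append(f"{label}: unknown cacheStrategy {strategy!r}")
        if meta not in META_POLICIES:
            problems.append(f"{label}: unknown metaPolicy type {meta!r}")
        if read != "CACHE_ASIDE":
            problems.append(f"{label}: readPolicy must be CACHE_ASIDE")
        if write not in WRITE_POLICIES:
            problems.append(f"{label}: unknown writePolicy {write!r}")

        # Cross-field rules from the table.
        if strategy == "DHT" and meta == "ALWAYS":
            problems.append(f"{label}: DHT requires metaPolicy ONCE")
        if write == "CACHE_ONLY" and meta == "ALWAYS":
            problems.append(f"{label}: CACHE_ONLY requires metaPolicy ONCE")
    return problems
```

An empty list means the file satisfies the documented rules. For example, `validate_cachesets(open("/path/cacheset.xml").read())` could be run before the file is loaded into JindoCache.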
Apply the CacheSet configuration
Log on to your cluster. For more information, see Log on to a cluster.
Save cacheset.xml to a directory on the cluster. In this example, the file is at /path/cacheset.xml.
Run the following command to load the CacheSet configuration into JindoCache:
jindocache -refreshCacheSet -path /path/cacheset.xml
A successful run prints:
Successfully refresh cacheset !!!
For all available JindoCache CLI commands, see JindoCache CLI usage notes.
Run the following command to verify that the CacheSets are loaded:
jindocache -listCacheSet
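If the refresh step is scripted, the success message shown above can serve as the check. The following is a rough sketch, assuming the `jindocache` binary is on the PATH of the node where the script runs. The function names are hypothetical, and the sketch relies only on the printed message because the command's exit code behavior is not documented here.

```python
import subprocess

# Message printed by a successful refresh, as shown in the step above.
SUCCESS_MARKER = "Successfully refresh cacheset"


def refresh_succeeded(output):
    """Return True if the command output contains the success message."""
    return SUCCESS_MARKER in output


def refresh_cacheset(xml_path):
    """Hypothetical wrapper: load a cacheset.xml file and report success."""
    result = subprocess.run(
        ["jindocache", "-refreshCacheSet", "-path", xml_path],
        capture_output=True,
        text=True,
        check=False,  # exit code semantics are an unknown, so do not raise on it
    )
    # Check both streams, since which one carries the message is an assumption.
    return refresh_succeeded(result.stdout + result.stderr)
```

A call such as `refresh_cacheset("/path/cacheset.xml")` would then return True only when the success message is printed.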
Step 2: Configure JindoSDK
Set the OSS-HDFS implementation class in Hadoop-Common so that the client routes traffic through JindoCache. In the EMR console, go to the core-site.xml tab on the Configure tab of the Hadoop-Common service page and add the following configuration item. For more information, see Manage configuration items.
| Configuration item | Value | Default behavior |
|---|---|---|
| fs.xengine | jindocache | If left blank, the client communicates directly with OSS-HDFS without caching. |
This configuration applies at the client level and takes effect without restarting JindoCache.
After you save the configuration, jobs that read from OSS-HDFS automatically populate the cache on the first read. Subsequent reads of the same data are served from the cache.
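To confirm that a client actually picked up the setting, the deployed core-site.xml can be inspected. The following is a minimal sketch that parses the standard Hadoop `<configuration>`/`<property>` layout. The function names are hypothetical, and the location of core-site.xml on a given node is not specified here and must be supplied by the caller.

```python
import xml.etree.ElementTree as ET


def get_property(core_site_xml, key):
    """Return the value of a Hadoop configuration property, or None if absent."""
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == key:
            return (prop.findtext("value") or "").strip()
    return None


def uses_jindocache(core_site_xml):
    """Hypothetical check: True if fs.xengine is set to jindocache."""
    return get_property(core_site_xml, "fs.xengine") == "jindocache"
```

If `uses_jindocache` returns False, the client reads OSS-HDFS directly and no cache is populated.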
FAQ
How do I configure cross-account access to OSS-HDFS?
By default, JindoCache accesses OSS-HDFS without explicit credentials (passwordless mode). For cross-account access, add the following configuration items to the common tab of the JindoCache service page.
Open the configuration tab:
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
In the top navigation bar, select the region where your cluster resides and select a resource group.
On the EMR on ECS page, find your cluster and click Services in the Actions column.
Find JindoCache and click Configure.
Click the common tab.
Add the configuration items:
Click Add Configuration Item and add the following items. Replace YYY with the name of the OSS bucket for which OSS-HDFS is enabled. For more information on applying configuration changes, see Manage configuration items.
| Configuration item | Description |
|---|---|
| jindocache.oss.bucket.YYY.accessKeyId | The AccessKey ID used to access OSS-HDFS. |
| jindocache.oss.bucket.YYY.accessKeySecret | The AccessKey secret used to access OSS-HDFS. |
| jindocache.oss.bucket.YYY.endpoint | The OSS-HDFS endpoint. For example: cn-hangzhou.oss-dls.aliyuncs.com. |
| jindocache.oss.bucket.YYY.data.lake.storage.enable | Set to true. |
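Because all four item names embed the bucket name, generating them programmatically can help avoid typos when several buckets are involved. The following sketch builds the item names from the table above. The function name and its parameters are hypothetical, and the resulting items still need to be added through the EMR console as described.

```python
def cross_account_items(bucket, access_key_id, access_key_secret, endpoint):
    """Hypothetical helper: build the JindoCache items for one OSS-HDFS bucket."""
    prefix = f"jindocache.oss.bucket.{bucket}"
    return {
        f"{prefix}.accessKeyId": access_key_id,
        f"{prefix}.accessKeySecret": access_key_secret,
        f"{prefix}.endpoint": endpoint,
        # The table above specifies that this item is always set to true.
        f"{prefix}.data.lake.storage.enable": "true",
    }


if __name__ == "__main__":
    # Placeholder values for illustration only; do not hard-code real secrets.
    items = cross_account_items(
        "emr-test", "<AccessKey ID>", "<AccessKey secret>",
        "cn-hangzhou.oss-dls.aliyuncs.com",
    )
    for name, value in items.items():
        print(f"{name} = {value}")
```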