E-MapReduce: Use the transparent caching feature of JindoCache to accelerate access to OSS

Last Updated: Mar 26, 2026

JindoCache transparently caches Object Storage Service (OSS) data on the local storage of E-MapReduce (EMR) cluster nodes. The first time a job reads data from OSS, JindoCache automatically stores a copy in the cluster. Subsequent reads of the same data are served from the local cache—no job configuration changes required.
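The read path described above can be pictured with a small toy model. This is only a sketch of the general cache-aside idea, assuming a plain dictionary stands in for OSS and another for node-local storage; the class name `ToyReadCache` is hypothetical and none of this reflects JindoCache internals.

```python
class ToyReadCache:
    """Toy cache-aside reader; an illustration, not JindoCache's implementation."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # dict standing in for OSS
        self.local = {}                     # dict standing in for node-local storage
        self.hits = 0
        self.misses = 0

    def read(self, key):
        # Serve from the local copy when one exists.
        if key in self.local:
            self.hits += 1
            return self.local[key]
        # Otherwise fetch from the backing store and keep a copy for later reads.
        self.misses += 1
        data = self.backing_store[key]
        self.local[key] = data
        return data


oss = {"oss://emr-test/dir1/part-0": b"rows"}
cache = ToyReadCache(oss)
for _ in range(3):
    cache.read("oss://emr-test/dir1/part-0")
print(cache.hits, cache.misses)  # prints: 2 1
```

Only the first of the three reads reaches the backing store; the other two are served locally, which is the behavior jobs get without any code changes.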

How it works

JindoCache organizes caching behavior into CacheSets. Each CacheSet binds a caching policy to an OSS path prefix: any data read from that prefix is cached according to the policy you define. You can configure multiple CacheSets to apply different policies to different data sets.

Once you configure a CacheSet and set fs.xengine to jindocache in the Hadoop-Common core-site.xml file, all jobs that access the covered OSS paths use the cache automatically. Jobs do not need to call any cache API, which is why the caching is described as transparent.
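To make the prefix binding concrete, here is a minimal sketch of how an object path could be matched to a CacheSet. The `CacheSet` dataclass and `find_cacheset` function are hypothetical names, and the longest-prefix rule for overlapping CacheSets is an assumption of this sketch; this page does not say how JindoCache resolves overlaps.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CacheSet:
    name: str
    path: str         # OSS path prefix covered by this CacheSet
    meta_policy: str  # "ALWAYS" or "ONCE"


def find_cacheset(object_path: str, cachesets: List[CacheSet]) -> Optional[CacheSet]:
    """Return the CacheSet whose prefix covers object_path, or None.

    Assumption: when several prefixes match, the longest one wins.
    """
    best = None
    for cs in cachesets:
        prefix = cs.path.rstrip("/")
        # Match whole path segments so that dir1 does not cover dir10.
        if object_path == prefix or object_path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best.path.rstrip("/")):
                best = cs
    return best


cachesets = [
    CacheSet("name1", "oss://emr-test/dir1", "ALWAYS"),
    CacheSet("name2", "oss://emr-test/dir2", "ONCE"),
]
match = find_cacheset("oss://emr-test/dir1/part-0000.parquet", cachesets)
print(match.name if match else "not covered")  # prints: name1
```

Paths outside every configured prefix return None in this sketch, which corresponds to data that no policy applies to.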

Prerequisites

Before you begin, ensure that you have:

  • An EMR cluster with JindoCache enabled (selected at cluster creation time). For details, see Create a cluster.

  • OSS activated for your account. For details, see Activate OSS.

Usage notes

  • OSS stores files as objects. In this topic, "file" and "object" are used interchangeably for data stored in OSS.

  • If metaPolicy is set to ONCE, JindoCache reads metadata from OSS once and caches it locally. If the underlying OSS data changes after the initial read, jobs may read stale metadata until the cache is refreshed. Use ALWAYS if your data changes frequently and strong consistency is required.

  • Transparent caching provides the most benefit for read-heavy workloads where the same data is accessed multiple times. Write-intensive workloads that rarely re-read the same data gain little from caching.
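The stale-metadata note for ONCE can be illustrated with a toy model. This is a sketch under stated assumptions: a dictionary of object sizes stands in for OSS metadata, and the `ToyMetaCache` class is hypothetical, not part of JindoCache.

```python
class ToyMetaCache:
    """Toy model of the ALWAYS and ONCE metadata policies; not JindoCache code."""

    def __init__(self, oss_meta, policy):
        if policy not in ("ALWAYS", "ONCE"):
            raise ValueError(f"unknown metaPolicy type: {policy}")
        self.oss_meta = oss_meta  # dict of path -> size, standing in for OSS
        self.policy = policy
        self.cached = {}

    def get_size(self, path):
        if self.policy == "ALWAYS":
            # Every lookup goes to the source, so results are never stale.
            return self.oss_meta[path]
        # ONCE: the first lookup is remembered and reused afterwards.
        if path not in self.cached:
            self.cached[path] = self.oss_meta[path]
        return self.cached[path]


oss_meta = {"oss://emr-test/dir2/a.csv": 100}
once = ToyMetaCache(oss_meta, "ONCE")
always = ToyMetaCache(oss_meta, "ALWAYS")
once.get_size("oss://emr-test/dir2/a.csv")       # first read populates the cache

oss_meta["oss://emr-test/dir2/a.csv"] = 200      # the object is overwritten in OSS
print(once.get_size("oss://emr-test/dir2/a.csv"))    # prints: 100 (stale)
print(always.get_size("oss://emr-test/dir2/a.csv"))  # prints: 200
```

The ONCE reader keeps returning the old size after the object changes, while the ALWAYS reader pays for a lookup on every call but sees the new value.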

Configure caching policies

  1. Log on to your cluster. For details, see Log on to a cluster.

  2. Create a cacheset.xml file. The following example stores the file in the /path directory.

    <?xml version="1.0" encoding="UTF-8"?>
    <cachesets>
        <cacheset>
            <name>name1</name>
            <path>oss://emr-test/dir1</path>
            <cacheStrategy>DISTRIBUTED</cacheStrategy>
            <metaPolicy>
                <type>ALWAYS</type>
            </metaPolicy>
            <readPolicy>CACHE_ASIDE</readPolicy>
            <writePolicy>WRITE_AROUND</writePolicy>
        </cacheset>
        <cacheset>
            <name>name2</name>
            <path>oss://emr-test/dir2</path>
            <cacheStrategy>DHT</cacheStrategy>
            <metaPolicy>
                <type>ONCE</type>
            </metaPolicy>
            <readPolicy>CACHE_ASIDE</readPolicy>
            <writePolicy>WRITE_AROUND</writePolicy>
        </cacheset>
    </cachesets>

    The following table describes the parameters.

    Parameter | Description | Example
    name | A unique name for the CacheSet. If a CacheSet with the same name already exists, its configuration is overwritten. | name1
    path | The OSS path prefix covered by this CacheSet. The policy applies to all objects under this path. | oss://emr-test/dir1
    cacheStrategy | The cache distribution strategy. DISTRIBUTED: general-purpose, suitable for most workloads. DHT (distributed hash table): optimized for small-sized, read-only files. | DISTRIBUTED
    metaPolicy type | How JindoCache handles object metadata. ALWAYS: metadata is always read from OSS, providing strong consistency at the cost of higher latency. ONCE: metadata is read from OSS once and cached locally, reducing latency but risking stale reads if OSS data changes. If cacheStrategy is DHT, set this to ONCE. | ALWAYS
    readPolicy | The read strategy. Must be CACHE_ASIDE: reads are served from cache when available, and fetched from OSS on a cache miss. | CACHE_ASIDE
    writePolicy | The write strategy. WRITE_AROUND: writes go directly to OSS, bypassing the cache; use this for data written once and rarely re-read. WRITE_THROUGH: writes go to both the cache and OSS simultaneously; use this when you need cache consistency with slightly higher write latency. CACHE_ONLY: writes go only to the cache and are not persisted to OSS; requires metaPolicy to be ONCE. | WRITE_AROUND

  3. Refresh JindoCache with the new CacheSet configuration.

    jindocache -refreshCacheSet -path /path/cacheset.xml

    If the command succeeds, the output contains "Successfully refresh cacheset !!!". For a full list of JindoCache commands, see JindoCache CLI usage notes.

  4. Verify that the CacheSets are registered.

    jindocache -listCacheSet
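Before refreshing, it can help to check a cacheset.xml file against the rules in the parameter table. The following is a minimal sketch using only the Python standard library; `check_cachesets` is a hypothetical helper, not a JindoCache tool. Two of its checks are assumptions of this sketch rather than documented rules: it flags duplicate names within one file (the table only says a same-named CacheSet is overwritten) and it expects every path to start with oss://.

```python
import xml.etree.ElementTree as ET

# Allowed values as listed in the parameter table.
CACHE_STRATEGIES = {"DISTRIBUTED", "DHT"}
META_POLICIES = {"ALWAYS", "ONCE"}
WRITE_POLICIES = {"WRITE_AROUND", "WRITE_THROUGH", "CACHE_ONLY"}


def _text(element, path):
    """Return the stripped text of a child element, or '' if it is missing."""
    return (element.findtext(path) or "").strip()


def check_cachesets(xml_text):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    seen_names = set()
    for cs in ET.fromstring(xml_text).findall("cacheset"):
        name = _text(cs, "name")
        label = name or "<unnamed>"
        strategy = _text(cs, "cacheStrategy")
        meta = _text(cs, "metaPolicy/type")
        write = _text(cs, "writePolicy")

        if not name:
            problems.append("a cacheset has no name")
        elif name in seen_names:
            problems.append(f"{label}: duplicate name in this file")  # assumption
        seen_names.add(name)
        if not _text(cs, "path").startswith("oss://"):
            problems.append(f"{label}: path is not an oss:// prefix")  # assumption
        if strategy not in CACHE_STRATEGIES:
            problems.append(f"{label}: unknown cacheStrategy {strategy!r}")
        if meta not in META_POLICIES:
            problems.append(f"{label}: unknown metaPolicy type {meta!r}")
        if _text(cs, "readPolicy") != "CACHE_ASIDE":
            problems.append(f"{label}: readPolicy must be CACHE_ASIDE")
        if write not in WRITE_POLICIES:
            problems.append(f"{label}: unknown writePolicy {write!r}")
        if strategy == "DHT" and meta != "ONCE":
            problems.append(f"{label}: cacheStrategy DHT requires metaPolicy type ONCE")
        if write == "CACHE_ONLY" and meta != "ONCE":
            problems.append(f"{label}: writePolicy CACHE_ONLY requires metaPolicy type ONCE")
    return problems


sample = """<cachesets>
  <cacheset>
    <name>name2</name>
    <path>oss://emr-test/dir2</path>
    <cacheStrategy>DHT</cacheStrategy>
    <metaPolicy><type>ALWAYS</type></metaPolicy>
    <readPolicy>CACHE_ASIDE</readPolicy>
    <writePolicy>WRITE_AROUND</writePolicy>
  </cacheset>
</cachesets>"""
for problem in check_cachesets(sample):
    print(problem)  # prints: name2: cacheStrategy DHT requires metaPolicy type ONCE
```

The sample deliberately pairs DHT with ALWAYS to trigger one of the documented constraints. A file that passes these checks may still be rejected by JindoCache for reasons this sketch does not know about.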

Enable transparent caching in Hadoop

Set the fs.xengine configuration item in Hadoop-Common so that JindoCache intercepts OSS read requests.

In the EMR console, go to the Hadoop-Common service page, click the Configure tab, and then click the core-site.xml tab. Add or update the following configuration item. For details on managing configuration items, see Manage configuration items.

Configuration item | Value | Description
fs.xengine | jindocache | Routes OSS requests through JindoCache. If left blank, the client reads directly from OSS without caching.

This configuration takes effect immediately—no JindoCache restart is required.

After you save the configuration, all jobs that access OSS paths covered by your CacheSets automatically use the cache on subsequent reads.
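One way to confirm the setting is to read fs.xengine back from a copy of core-site.xml. The sketch below assumes the standard Hadoop configuration layout (property elements with name and value children inside a configuration root); `get_hadoop_property` is a hypothetical helper, and the inline sample stands in for the real file, whose location on cluster nodes is not given on this page.

```python
import xml.etree.ElementTree as ET


def get_hadoop_property(core_site_xml, key):
    """Return the value of a property in Hadoop-style XML, or None if it is absent."""
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == key:
            return (prop.findtext("value") or "").strip()
    return None


# Inline sample standing in for the cluster's core-site.xml.
core_site = """<configuration>
  <property>
    <name>fs.xengine</name>
    <value>jindocache</value>
  </property>
</configuration>"""

value = get_hadoop_property(core_site, "fs.xengine")
if value == "jindocache":
    print("transparent caching is enabled")
else:
    print("OSS is read directly, without caching")
# prints: transparent caching is enabled
```

A missing or blank value falls into the second branch, matching the table's note that a blank fs.xengine means reads bypass the cache.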

FAQ

How do I configure an AccessKey pair for cross-account OSS access?

JindoCache accesses OSS in password-free mode by default. To access an OSS bucket owned by a different account, configure the AccessKey ID, AccessKey secret, and endpoint for that bucket.

  1. In the EMR console, click EMR on ECS in the left-side navigation pane. The following steps take you to the JindoCache service page.

  2. In the top navigation bar, select the region where your cluster resides and select a resource group.

  3. On the EMR on ECS page, find your cluster and click Services in the Actions column.

  4. On the Services tab, find JindoCache and click Configure.

  5. On the Configure tab, click the common tab, and then click Add Configuration Item.

  6. In the Add Configuration Item dialog box, add the following configuration items. Replace XXX with the name of the OSS bucket. For details on adding configuration items and applying changes, see Manage configuration items.

    Configuration item | Description
    jindocache.oss.bucket.XXX.accessKeyId | The AccessKey ID used to access the bucket.
    jindocache.oss.bucket.XXX.accessKeySecret | The AccessKey secret used to access the bucket.
    jindocache.oss.bucket.XXX.endpoint | The endpoint of the bucket (for example, oss-cn-hangzhou-internal.aliyuncs.com).
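Because the bucket name is embedded in each configuration item name, a small helper can generate the three items and reduce typing mistakes. This is an illustrative sketch: `bucket_credential_items` is a hypothetical function, the bucket name reuses emr-test from the earlier example, and the credential strings are placeholders, not real values.

```python
def bucket_credential_items(bucket, access_key_id, access_key_secret, endpoint):
    """Build the three per-bucket configuration items, with XXX replaced by the bucket name."""
    if not bucket:
        raise ValueError("bucket name must not be empty")
    prefix = f"jindocache.oss.bucket.{bucket}"
    return {
        f"{prefix}.accessKeyId": access_key_id,
        f"{prefix}.accessKeySecret": access_key_secret,
        f"{prefix}.endpoint": endpoint,
    }


items = bucket_credential_items(
    bucket="emr-test",                        # bucket name from the earlier example
    access_key_id="<your-access-key-id>",     # placeholder
    access_key_secret="<your-access-key-secret>",  # placeholder
    endpoint="oss-cn-hangzhou-internal.aliyuncs.com",
)
for name in items:
    print(name)
```

The printed names are what you would enter in the Add Configuration Item dialog box; keep the actual secret out of scripts and version control.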
