E-MapReduce: Use the transparent caching feature of JindoCache to accelerate access to OSS

Last Updated: Mar 29, 2024

This topic describes how to use the transparent caching feature of JindoCache to accelerate access to Alibaba Cloud Object Storage Service (OSS). Storage resources of E-MapReduce (EMR) clusters are used to cache OSS objects.

Prerequisites

  • An EMR cluster is created with the JindoCache service selected. For more information, see Create a cluster.

  • OSS is activated. For more information, see Activate OSS.

Limits

Files are stored in OSS as objects.

Procedure

JindoCache uses CacheSets to manage caching policies. Each CacheSet applies a set of caching policies to a parent path, so you can use different policies for data that is stored in different paths based on your business requirements. You can define one or more CacheSets.
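The relationship between a path and a CacheSet can be illustrated with a small Python sketch. It is not JindoCache code: the `CacheSet` class and the `find_cacheset` function are hypothetical names used only for illustration, and the rule of preferring the longest parent path when several CacheSets match is an assumption, not documented behavior.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CacheSet:
    """Hypothetical in-memory view of one <cacheset> entry."""
    name: str
    path: str            # parent path, for example oss://emr-test/dir1
    cache_strategy: str  # DISTRIBUTED or DHT


def find_cacheset(object_path: str, cachesets: Sequence[CacheSet]) -> Optional[CacheSet]:
    """Return the CacheSet whose parent path contains object_path, or None."""
    matches = []
    for cs in cachesets:
        parent = cs.path.rstrip("/")
        # Match the parent path itself or anything below it. Appending "/"
        # keeps oss://emr-test/dir1 from matching oss://emr-test/dir10.
        if object_path == parent or object_path.startswith(parent + "/"):
            matches.append(cs)
    # Assumption: if several CacheSets match, prefer the longest parent path.
    return max(matches, key=lambda cs: len(cs.path.rstrip("/")), default=None)


cachesets = [
    CacheSet("name1", "oss://emr-test/dir1", "DISTRIBUTED"),
    CacheSet("name2", "oss://emr-test/dir2", "DHT"),
]

for path in (
    "oss://emr-test/dir1/part-0000.parquet",
    "oss://emr-test/dir2/lookup/ids.csv",
    "oss://emr-test/dir10/other.txt",
):
    match = find_cacheset(path, cachesets)
    print(path, "->", match.name if match else "no CacheSet")
```

The object names in the loop are made up. The two CacheSets mirror the cacheset.xml sample used in this topic.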

  1. Select caching policies.

    1. Log on to your cluster. For more information, see Log on to a cluster.

    2. Add the cacheset.xml file.

      In this example, the cacheset.xml file is stored in the /path directory.

      <?xml version="1.0" encoding="UTF-8"?>
      <cachesets>
          <cacheset>
              <name>name1</name>
              <path>oss://emr-test/dir1</path>
              <cacheStrategy>DISTRIBUTED</cacheStrategy>
              <metaPolicy>
                  <type>ALWAYS</type>
              </metaPolicy>
              <readPolicy>CACHE_ASIDE</readPolicy>
              <writePolicy>WRITE_AROUND</writePolicy>
          </cacheset>
          <cacheset>
              <name>name2</name>
              <path>oss://emr-test/dir2</path>
              <cacheStrategy>DHT</cacheStrategy>
              <metaPolicy>
                  <type>ONCE</type>
              </metaPolicy>
              <readPolicy>CACHE_ASIDE</readPolicy>
              <writePolicy>WRITE_AROUND</writePolicy>
          </cacheset>
      </cachesets>

      Parameter: name
      Description: The name of the CacheSet. The name must be unique. If a CacheSet with the same name already exists, its configurations are overwritten.
      Example: name1

      Parameter: path
      Description: The parent path to which the CacheSet applies. The policies that are defined in the CacheSet are used for data in all child paths under this parent path.
      Example: oss://emr-test/dir1

      Parameter: cacheStrategy
      Description: The caching policy. Valid values: DISTRIBUTED and DHT. DHT is short for distributed hash table. The DHT caching policy is used to accelerate access to small, read-only files. You can select a policy based on your business requirements.
      Example: DISTRIBUTED

      Parameter: metaPolicy
      Description: The metadata caching policy. You can select a policy based on your business requirements. Valid values:
      • ALWAYS: Metadata is not cached and is always read from the remote storage.
      • ONCE: Metadata is cached. After the metadata is read from the remote storage for the first time, it is always read from the local storage.
      Note: If the cacheStrategy parameter is set to DHT, this parameter must be set to ONCE.
      Example: ALWAYS

      Parameter: readPolicy
      Description: The policy for reading files. This parameter can be set only to CACHE_ASIDE, which indicates that files are preferentially read from the cache.
      Example: CACHE_ASIDE

      Parameter: writePolicy
      Description: The policy for writing files. Valid values:
      • WRITE_AROUND: writes files to the remote storage.
      • CACHE_ONLY: writes files to the cache.
        Note: If you set this parameter to CACHE_ONLY, the metaPolicy parameter must be set to ONCE.
      • WRITE_THROUGH: writes files to both the cache and the remote storage.
      Example: WRITE_AROUND

    3. Run the following command to refresh CacheSets in JindoCache:

      jindocache -refreshCacheSet -path /path/cacheset.xml

      If the command succeeds, the output contains the message Successfully refresh cacheset !!!. For more information about JindoCache commands, see JindoCache CLI usage notes.

    4. Run the following command to view information about CacheSets in the system:

      jindocache -listCacheSet

  2. Configure JindoSDK.

    Configure the OSS implementation class of JindoCache in Hadoop-Common. Go to the core-site.xml tab on the Configure tab of the Hadoop-Common service page in the EMR console and modify the configuration item that is described in the following table. For more information, see Manage configuration items.

    Configuration item: fs.xengine
    Description: Set the value to jindocache. If you leave this configuration item empty, the client does not cache data and communicates directly with the backend.

    Note

    In this step, the configuration item is configured on the client. The configuration takes effect without the need to restart JindoCache.

    After you complete the preceding configuration, jobs that access OSS use the caching feature. Because caching is transparent, you do not need to modify job configurations. The first time a job reads data from OSS, the data is automatically cached in JindoCache. Subsequent reads of the same data can hit the cache, which improves read performance.
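Before you refresh CacheSets, it can help to check a cacheset.xml file against the parameter rules described in the procedure above. The following Python sketch is one way to do that, using only the valid values and the two ONCE constraints stated in this topic. It is not part of JindoCache: `validate_cachesets` and the sample documents are hypothetical names introduced for illustration, and JindoCache itself may apply additional checks.

```python
import xml.etree.ElementTree as ET
from typing import List

# Valid values as listed in the parameter descriptions of this topic.
CACHE_STRATEGIES = {"DISTRIBUTED", "DHT"}
META_POLICIES = {"ALWAYS", "ONCE"}
READ_POLICIES = {"CACHE_ASIDE"}
WRITE_POLICIES = {"WRITE_AROUND", "CACHE_ONLY", "WRITE_THROUGH"}


def _text(element: ET.Element, tag_path: str) -> str:
    """Return the stripped text of a child element, or "" if it is missing."""
    return (element.findtext(tag_path) or "").strip()


def validate_cachesets(xml_text: str) -> List[str]:
    """Return a list of problems; an empty list means no rule was violated."""
    problems: List[str] = []
    seen_names = set()
    root = ET.fromstring(xml_text)
    for cs in root.findall("cacheset"):
        name = _text(cs, "name")
        strategy = _text(cs, "cacheStrategy")
        meta = _text(cs, "metaPolicy/type")
        read = _text(cs, "readPolicy")
        write = _text(cs, "writePolicy")

        if name in seen_names:
            problems.append(f"{name}: name is not unique")
        seen_names.add(name)
        if strategy not in CACHE_STRATEGIES:
            problems.append(f"{name}: invalid cacheStrategy {strategy!r}")
        if meta not in META_POLICIES:
            problems.append(f"{name}: invalid metaPolicy type {meta!r}")
        if read not in READ_POLICIES:
            problems.append(f"{name}: readPolicy must be CACHE_ASIDE")
        if write not in WRITE_POLICIES:
            problems.append(f"{name}: invalid writePolicy {write!r}")
        # Cross-parameter rules from the notes in the parameter descriptions.
        if strategy == "DHT" and meta != "ONCE":
            problems.append(f"{name}: cacheStrategy DHT requires metaPolicy ONCE")
        if write == "CACHE_ONLY" and meta != "ONCE":
            problems.append(f"{name}: writePolicy CACHE_ONLY requires metaPolicy ONCE")
    return problems


GOOD = """<cachesets>
  <cacheset>
    <name>name2</name>
    <path>oss://emr-test/dir2</path>
    <cacheStrategy>DHT</cacheStrategy>
    <metaPolicy><type>ONCE</type></metaPolicy>
    <readPolicy>CACHE_ASIDE</readPolicy>
    <writePolicy>WRITE_AROUND</writePolicy>
  </cacheset>
</cachesets>"""

# Same CacheSet, but DHT is combined with ALWAYS, which the note forbids.
BAD = GOOD.replace("<type>ONCE</type>", "<type>ALWAYS</type>")

print(validate_cachesets(GOOD))
print(validate_cachesets(BAD))
```

To check a real file, read its contents and pass them to `validate_cachesets` before you run the refresh command.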

FAQ

How do I configure an AccessKey pair that is used to access OSS?

JindoCache allows you to access OSS in password-free mode. To access OSS across accounts, you must configure the AccessKey ID, AccessKey secret, and endpoint that are used for authorization.

  1. Go to the common tab on the Configure tab of the JindoCache service page.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

    2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.

    3. On the EMR on ECS page, find the desired cluster and click Services in the Actions column.

    4. On the Services tab, find the JindoCache service and click Configure.

    5. On the Configure tab, click the common tab.

  2. Add configuration items and make the configurations take effect.

    1. On the common tab, click Add Configuration Item.

    2. In the Add Configuration Item dialog box, add the configuration items that are described in the following table.

      For more information about how to add configuration items and make configurations take effect, see Manage configuration items.

      Note

      XXX specifies the name of an OSS bucket.

      Configuration item: jindocache.oss.bucket.XXX.accessKeyId
      Description: The AccessKey ID that is used to access OSS.

      Configuration item: jindocache.oss.bucket.XXX.accessKeySecret
      Description: The AccessKey secret that is used to access OSS.

      Configuration item: jindocache.oss.bucket.XXX.endpoint
      Description: The endpoint of OSS. Example: oss-cn-hangzhou-internal.aliyuncs.com.
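Because the three configuration item names differ only in the bucket name and the suffix, they can be generated for each bucket. The following Python sketch builds the key-value pairs to enter in the Add Configuration Item dialog box. `bucket_credential_items` is a hypothetical helper, the bucket name emr-test is taken from the earlier sample, and the credential values are placeholders.

```python
from typing import Dict


def bucket_credential_items(
    bucket: str, access_key_id: str, access_key_secret: str, endpoint: str
) -> Dict[str, str]:
    """Build the per-bucket configuration items, replacing XXX with the bucket name."""
    prefix = f"jindocache.oss.bucket.{bucket}"
    return {
        f"{prefix}.accessKeyId": access_key_id,
        f"{prefix}.accessKeySecret": access_key_secret,
        f"{prefix}.endpoint": endpoint,
    }


items = bucket_credential_items(
    bucket="emr-test",
    access_key_id="<yourAccessKeyId>",          # placeholder, not a real credential
    access_key_secret="<yourAccessKeySecret>",  # placeholder, not a real credential
    endpoint="oss-cn-hangzhou-internal.aliyuncs.com",
)

for key, value in items.items():
    print(f"{key} = {value}")
```

Avoid hard-coding real AccessKey pairs in scripts. Load them from a secure source at run time instead.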