JindoFS SDK V4.0 is compatible with Hadoop FileSystem operations and supports buckets that have the HDFS service activated, which improves the performance of Object Storage Service (OSS). The JindoFS SDK enables computing applications based on Apache Hadoop, such as MapReduce, Hive, Spark, and Flink, to use these buckets as the default file system without the need to modify or recompile code.

Prerequisites

HDFS is activated for a bucket when you create the bucket. For more information, see Create buckets.

Step 1: Grant permissions to RAM roles

  1. Authorize the computing application to manage buckets that have HDFS activated.
    The first time you use HDFS, you must perform the following operations in the Resource Access Management (RAM) console to create a RAM role and grant it permissions to manage buckets that have HDFS activated. If you prefer to script the authorization, a programmatic sketch is provided after this procedure.
    1. Create a RAM role named AliyunOSSDlsDefaultRole.
      1. Log on to the RAM console.
      2. In the left-side navigation pane, choose Identities > Roles.
      3. Click Create Role, select Alibaba Cloud Service for Trusted entity type, and then click Next.
      4. Set Role Type to Normal Service Role and RAM Role Name to AliyunOSSDlsDefaultRole. Then, select OSS for Select Trusted Service.
      5. Click OK. After the role is created, click Close.
    2. Create a custom policy named AliyunOSSDlsRolePolicy.
      1. In the left-side navigation pane, choose Permissions > Policies.
      2. On the Policies page, click Create Policy.
      3. On the Create Custom Policy page, click JSON and enter the following policy content. Then, click Next Step and set Name to AliyunOSSDlsRolePolicy.
        {
          "Version": "1",
          "Statement": [
            {
              "Effect": "Allow",
              "Action": "oss:*",
              "Resource": [
                "acs:oss:*:*:*/.dlsdata",
                "acs:oss:*:*:*/.dlsdata*"
              ]
            }
          ]
        }
      4. Click OK.
    3. Grant permissions to the RAM role by using the custom policy.
      1. In the left-side navigation pane, choose Identities > Roles.
      2. Click Input and Attach on the right side of the RAM role AliyunOSSDlsDefaultRole.
      3. In the Add Permissions panel, set Type to Custom Policy and Policy Name to AliyunOSSDlsRolePolicy.
      4. Click OK.
  2. Authorize the RAM role to access the bucket that has HDFS activated.
    1. Create a RAM user. For more information, see Create a RAM user.
    2. Create a custom policy and enter the following policy content:
      {
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": "oss:*",
                  "Resource": [
                      "acs:oss:*:*:*/.dlsdata",
                      "acs:oss:*:*:*/.dlsdata*"
                  ]
              },
              {
                  "Effect": "Allow",
                  "Action": [
                      "oss:GetBucketInfo",
                      "oss:PostDataLakeStorageFileOperation"
                  ],
                  "Resource": "*"
              }
          ],
          "Version": "1"
      }

      For more information, see Create a custom policy.

    3. Grant permissions to the RAM user. For more information, see Grant permissions to a RAM user.
    If you want to use an existing role, such as AliyunEMRDefaultRole, to access buckets that have HDFS activated, grant permissions to that role by referring to the preceding steps.
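
As an alternative to the console, the policy creation and attachment in the preceding substeps can also be scripted with the Alibaba Cloud RAM SDK for Java. The following is only a minimal sketch: it assumes the aliyun-java-sdk-core and aliyun-java-sdk-ram dependencies are available, assumes the AliyunOSSDlsDefaultRole role has already been created, and uses placeholder credentials that you must replace.

    import com.aliyuncs.DefaultAcsClient;
    import com.aliyuncs.profile.DefaultProfile;
    import com.aliyuncs.ram.model.v20150501.AttachPolicyToRoleRequest;
    import com.aliyuncs.ram.model.v20150501.CreatePolicyRequest;

    public class DlsRamSetup {
        public static void main(String[] args) throws Exception {
            // Credentials of an account that can manage RAM; placeholders to replace.
            DefaultAcsClient client = new DefaultAcsClient(
                    DefaultProfile.getProfile("cn-hangzhou", "<yourAccessKeyId>", "<yourAccessKeySecret>"));

            // Create the custom policy that restricts access to the .dlsdata directory.
            CreatePolicyRequest createPolicy = new CreatePolicyRequest();
            createPolicy.setPolicyName("AliyunOSSDlsRolePolicy");
            createPolicy.setPolicyDocument("{\"Version\":\"1\",\"Statement\":[{\"Effect\":\"Allow\","
                    + "\"Action\":\"oss:*\",\"Resource\":[\"acs:oss:*:*:*/.dlsdata\",\"acs:oss:*:*:*/.dlsdata*\"]}]}");
            client.getAcsResponse(createPolicy);

            // Attach the custom policy to the AliyunOSSDlsDefaultRole role.
            AttachPolicyToRoleRequest attach = new AttachPolicyToRoleRequest();
            attach.setPolicyType("Custom");
            attach.setPolicyName("AliyunOSSDlsRolePolicy");
            attach.setRoleName("AliyunOSSDlsDefaultRole");
            client.getAcsResponse(attach);
        }
    }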

Step 2: Download and install the JAR package

  1. To download the latest version of the JindoFS SDK JAR package, visit GitHub.
  2. Run the following command to install the SDK package to the classpath directory of Hadoop.
    cp ./jindosdk-*.jar <HADOOP_HOME>/share/hadoop/hdfs/lib/

Step 3: Configure the DLS implementation class and AccessKey pair

  1. Configure the JindoSDK DLS implementation class in core-site.xml of Hadoop.
    <configuration>
        <property>
            <name>fs.AbstractFileSystem.oss.impl</name>
            <value>com.aliyun.jindodata.dls.DLS</value>
        </property>
    
        <property>
            <name>fs.oss.impl</name>
            <value>com.aliyun.jindodata.dls.JindoDlsFileSystem</value>
        </property>
    </configuration>
  2. Configure the AccessKey ID, the AccessKey secret, and the endpoint of the bucket that has HDFS enabled in core-site.xml of Hadoop. You can also set the same properties on a Hadoop Configuration object in code, as sketched after this procedure.
    <configuration>
        <property>
            <name>fs.dls.accessKeyId</name>
            <value>xxx</value>
        </property>
    
        <property>
            <name>fs.dls.accessKeySecret</name>
            <value>xxx</value>
        </property>
    
        <property>
            <name>fs.dls.endpoint</name>
            <value>cn-xxx.oss-dls.aliyuncs.com</value>
        </property>
    </configuration>
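
If you want to verify the configuration from a standalone program, or set the properties in code instead of in core-site.xml, you can apply the same keys to a Hadoop Configuration object before obtaining the FileSystem. The following Java sketch is a minimal example; the class name DlsConfigCheck is illustrative, the AccessKey pair, endpoint, and examplebucket are placeholders, and the JindoSDK JAR must be on the classpath.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class DlsConfigCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same settings as in core-site.xml; replace the placeholder values with your own.
            conf.set("fs.AbstractFileSystem.oss.impl", "com.aliyun.jindodata.dls.DLS");
            conf.set("fs.oss.impl", "com.aliyun.jindodata.dls.JindoDlsFileSystem");
            conf.set("fs.dls.accessKeyId", "<yourAccessKeyId>");
            conf.set("fs.dls.accessKeySecret", "<yourAccessKeySecret>");
            conf.set("fs.dls.endpoint", "cn-xxx.oss-dls.aliyuncs.com");

            // Obtain a FileSystem instance for the bucket. This fails fast if the
            // implementation class, credentials, or endpoint are misconfigured.
            FileSystem fs = FileSystem.get(URI.create("oss://examplebucket/"), conf);
            System.out.println("Connected to " + fs.getUri());
            fs.close();
        }
    }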

Step 4: Use Hadoop Shell to access OSS

The following examples show how to perform common OSS operations by using Hadoop Shell commands. A sketch of the equivalent Hadoop FileSystem API calls in Java follows the list.

  • Upload a local file to an OSS bucket
    hadoop fs -put <path> oss://<bucket>/

    For example, you can run the following command to upload a local file in the root directory to a bucket named examplebucket:

    hadoop fs -put examplefile.txt oss://examplebucket/
  • View objects and directories in a specified bucket
    hadoop fs -ls oss://<bucket>/

    For example, you can run the following command to view objects and directories in the bucket named examplebucket:

    hadoop fs -ls oss://examplebucket/
  • Create a directory in a specified path of a specified bucket
    hadoop fs -mkdir oss://<bucket>/<path>

    For example, you can run the following command to create a directory named dir/ in the bucket named examplebucket:

    hadoop fs -mkdir oss://examplebucket/dir/
  • Delete data in a specified path of a bucket, including the directories and the objects in the directories
    hadoop fs -rm -r oss://<bucket>/<path>

    For example, you can run the following command to delete a directory named destfolder/ and all objects in the directory in the bucket named examplebucket:

    hadoop fs -rm -r oss://examplebucket/destfolder/
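
The same operations can be performed from application code through the Hadoop FileSystem API. The following Java sketch assumes that core-site.xml has been configured as described in Step 3; the class name DlsShellEquivalents is illustrative, and examplebucket, examplefile.txt, dir, and destfolder are the same placeholders used in the commands above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DlsShellEquivalents {
        public static void main(String[] args) throws Exception {
            // Picks up fs.oss.impl and the fs.dls.* settings from core-site.xml.
            Configuration conf = new Configuration();
            Path bucketRoot = new Path("oss://examplebucket/");
            FileSystem fs = bucketRoot.getFileSystem(conf);

            // Equivalent of: hadoop fs -put examplefile.txt oss://examplebucket/
            fs.copyFromLocalFile(new Path("examplefile.txt"), bucketRoot);

            // Equivalent of: hadoop fs -ls oss://examplebucket/
            for (FileStatus status : fs.listStatus(bucketRoot)) {
                System.out.println(status.getPath());
            }

            // Equivalent of: hadoop fs -mkdir oss://examplebucket/dir/
            fs.mkdirs(new Path(bucketRoot, "dir"));

            // Equivalent of: hadoop fs -rm -r oss://examplebucket/destfolder/
            fs.delete(new Path(bucketRoot, "destfolder"), true /* recursive */);

            fs.close();
        }
    }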

Step 5: (Optional) Optimize performance

You can add the following configuration items to core-site.xml of Hadoop based on your business requirements. These configuration items are supported only by JindoFS SDK V4.0 and later.

<configuration>

    <property>
        <!-- The directories to which the client writes temporary files. You can configure multiple directories separated with commas (,). When multiple users share the client, each user must have read and write permissions on these directories. -->
        <name>client.temp-data-dirs</name>
        <value>/tmp/</value>
    </property>

    <property>
        <!-- Specifies whether to enable automatic cleanup of temporary files. -->
        <name>tmpfile.cleaner.enable</name>
        <value>true</value>
    </property>

    <property>
        <!-- The maximum number of retries after a request to OSS fails. -->
        <name>fs.dls.retry.count</name>
        <value>5</value>
    </property>

    <property>
        <!-- The timeout period to access OSS. Unit: milliseconds. -->
        <name>fs.dls.timeout.millisecond</name>
        <value>30000</value>
    </property>

    <property>
        <!-- The timeout period to connect to OSS. Unit: milliseconds. -->
        <name>fs.dls.connection.timeout.millisecond</name>
        <value>30000</value>
    </property>

    <property>
        <!-- The number of threads used to upload a single object to OSS. -->
        <name>fs.dls.upload.thread.concurrency</name>
        <value>5</value>
    </property>

    <property>
        <!-- The number of concurrent tasks initiated to upload objects to OSS. -->
        <name>fs.dls.upload.queue.size</name>
        <value>5</value>
    </property>

    <property>
        <!-- The maximum number of concurrent tasks initiated to upload objects to OSS in a process. -->
        <name>fs.dls.upload.max-pending-tasks-per-stream</name>
        <value>16</value>
    </property>

    <property>
        <!-- The number of concurrent tasks initiated to download objects from OSS. -->
        <name>fs.dls.download.queue.size</name>
        <value>5</value>
    </property>

    <property>
        <!-- The maximum number of concurrent tasks initiated to download objects from OSS in a process. -->
        <name>fs.dls.download.thread.concurrency</name>
        <value>16</value>
    </property>

    <property>
        <!-- The size of the buffer used to prefetch data from OSS. Unit: bytes. -->
        <name>fs.dls.read.readahead.buffer.size</name>
        <value>1048576</value>
    </property>

    <property>
        <!-- The number of readahead buffers used to prefetch data from OSS at the same time. -->
        <name>fs.dls.read.readahead.buffer.count</name>
        <value>4</value>
    </property>

</configuration>
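
These keys can also be set on the Configuration object of an individual job before the FileSystem is created, which avoids changing the cluster-wide core-site.xml. Whether a given key takes effect when set at the job level depends on the JindoSDK version, so treat the following Java sketch as an assumption to verify in your environment; the class name DlsTuningExample and the chosen values are illustrative only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DlsTuningExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Example per-job overrides; core-site.xml still provides the defaults.
            conf.setInt("fs.dls.upload.thread.concurrency", 10);
            conf.setInt("fs.dls.read.readahead.buffer.count", 8);
            conf.setInt("fs.dls.timeout.millisecond", 60000);

            // The FileSystem created from this Configuration uses the tuned values.
            FileSystem fs = new Path("oss://examplebucket/").getFileSystem(conf);
            // ... run the job's read and write logic here ...
            fs.close();
        }
    }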