
Object Storage Service: Connect non-EMR clusters to OSS-HDFS

Last Updated: Apr 09, 2024

OSS-HDFS (JindoFS) is fully compatible with Hadoop Distributed File System (HDFS) API operations and supports directory-level operations. JindoSDK allows Apache Hadoop-based computing and analysis applications, such as MapReduce, Hive, Spark, and Flink, to access OSS-HDFS. This topic describes how to deploy JindoSDK on an Elastic Compute Service (ECS) instance and then perform basic operations related to OSS-HDFS.

Prerequisites

  • An ECS instance is created. For more information, see Create an instance.

  • OSS-HDFS is enabled for a bucket. For more information, see Enable OSS-HDFS.

Procedure

  1. Connect to the ECS instance. For more information, see Connect to an instance.

  2. Download the JindoSDK JAR package. For more information, visit GitHub.

  3. Decompress the JindoSDK JAR package.

    The following sample code provides an example on how to decompress a package named jindosdk-x.x.x-linux.tar.gz. If you use another version of JindoSDK, replace the package name with the name of the corresponding JAR package.

    tar zxvf jindosdk-x.x.x-linux.tar.gz
    Note

    x.x.x indicates the version number of the JAR package.
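
    After decompression, you can confirm that the JindoSDK JAR files are in place. The following check is a minimal sketch that assumes the archive was extracted in the current directory:

    # List a few of the extracted JAR files to confirm the package layout.
    ls jindosdk-x.x.x-linux/lib | head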

  4. Configure environment variables.

    1. Configure JINDOSDK_HOME.

      The following sample code assumes that the package is decompressed to the /usr/lib/jindosdk-x.x.x-linux directory:

      export JINDOSDK_HOME=/usr/lib/jindosdk-x.x.x-linux
    2. Configure HADOOP_CLASSPATH.

      export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${JINDOSDK_HOME}/lib/*
      Important

      Specify the installation directory of the package and configure environment variables on all required nodes.
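
    The preceding exports apply only to the current shell session. As a minimal sketch, assuming a Bash shell, you can append them to ~/.bashrc so that they persist across sessions:

    # Persist the JindoSDK environment variables for future shell sessions.
    echo 'export JINDOSDK_HOME=/usr/lib/jindosdk-x.x.x-linux' >> ~/.bashrc
    echo 'export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${JINDOSDK_HOME}/lib/*' >> ~/.bashrc
    source ~/.bashrc
    # Confirm that the JindoSDK JARs are on the Hadoop classpath.
    echo "$HADOOP_CLASSPATH"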

  5. Configure the implementation classes of OSS-HDFS and specify the AccessKey pair that you want to use to access the bucket.

    1. Run the following command to open the Hadoop configuration file core-site.xml:

      vim /usr/local/hadoop/etc/hadoop/core-site.xml
    2. Configure the JindoSDK DLS implementation classes in the core-site.xml file.

      <configuration>
          <property>
              <name>fs.AbstractFileSystem.oss.impl</name>
              <value>com.aliyun.jindodata.oss.JindoOSS</value>
          </property>
      
          <property>
              <name>fs.oss.impl</name>
              <value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
          </property>
      </configuration>
    3. In the core-site.xml file, configure the AccessKey ID and AccessKey secret that are used to access the bucket for which OSS-HDFS is enabled.

      <configuration>
          <property>
              <name>fs.oss.accessKeyId</name>
              <value>xxx</value>
          </property>
      
          <property>
              <name>fs.oss.accessKeySecret</name>
              <value>xxx</value>
          </property>
      </configuration>
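
    To confirm that the credentials are saved, you can search for the AccessKey properties in the configuration file. This quick check uses the file path from the preceding step; the output contains your AccessKey secret, so do not share it:

    # Print the AccessKey properties from core-site.xml (the output contains secrets).
    grep -A 1 'fs.oss.accessKey' /usr/local/hadoop/etc/hadoop/core-site.xml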
  6. Configure the endpoint of OSS-HDFS.

    You must configure the endpoint when you use OSS-HDFS to access buckets in Object Storage Service (OSS). We recommend that you configure the access path in the following format: oss://<Bucket>.<Endpoint>/<Object>. Example: oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/exampleobject.txt. After you configure the access path, JindoSDK calls the corresponding OSS-HDFS operations based on the endpoint specified in the access path.

    You can also configure the endpoint of OSS-HDFS by using other methods. The endpoints configured by using different methods take effect in a specific order of precedence. For more information, see the Appendix 1: Other methods used to configure the endpoint of OSS-HDFS section of this topic.
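
    Before you run workloads, you can verify that the preceding configuration works by listing the root directory of the bucket. The bucket name and region in this sketch are from the preceding examples; replace them with your own:

    # List the bucket root to confirm that JindoSDK can reach OSS-HDFS.
    hdfs dfs -ls oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/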

  7. Run HDFS Shell commands to perform common operations that are related to OSS-HDFS.

    • Upload local files

      Run the following command to upload a file named examplefile.txt from the local root directory to a bucket named examplebucket:

      hdfs dfs -put examplefile.txt oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/
    • Create directories

      Run the following command to create a directory named dir/ in a bucket named examplebucket:

      hdfs dfs -mkdir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/
    • Query objects or directories

      Run the following command to query the objects or directories in a bucket named examplebucket:

      hdfs dfs -ls oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/
    • Query the size of objects or directories

      Run the following command to query the size of all objects or directories in a bucket named examplebucket:

      hdfs dfs -du oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/
    • Query the content of an object

      Run the following command to query the content of an object named localfile.txt in a bucket named examplebucket:

      hdfs dfs -cat oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/localfile.txt
      Important

      The content of the queried object is displayed on the screen in plain text. If the content is encoded, use the HDFS API for Java to read and decode the content.
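
      If you prefer not to write Java code, a workaround is to download the object and inspect or decode it locally. This sketch reuses the example object above and assumes that the file utility is available on the instance:

      # Download the object, then inspect its encoding locally.
      hdfs dfs -get oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/localfile.txt /tmp/
      file /tmp/localfile.txt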

    • Copy objects or directories

      Run the following command to copy a directory named subdir1 from the root directory of a bucket named examplebucket to a directory named subdir2 in the same bucket. The source directory subdir1, the objects in it, and the structure and content of its subdirectories remain unchanged.

      hdfs dfs -cp oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/subdir1  oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/subdir2/subdir1
    • Move objects or directories

      Run the following command to move a directory named srcdir, together with the objects and subdirectories in it, from the root directory of a bucket named examplebucket to another directory named destdir:

      hdfs dfs -mv oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/srcdir  oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/destdir
    • Download objects

      Run the following command to download an object named exampleobject.txt from a bucket named examplebucket to the local directory named /tmp on your computer:

      hdfs dfs -get oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/exampleobject.txt  /tmp/
    • Delete objects or directories

      Run the following command to delete a directory named destfolder/ and all objects in the directory from a bucket named examplebucket:

      hdfs dfs -rm -r oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/destfolder/

Appendix 1: Other methods used to configure the endpoint of OSS-HDFS

Apart from the preceding method used to configure the endpoint in the access path, you can use the following methods to configure the endpoint:

  • Use bucket-level endpoints

    If you use an access path in the oss://<Bucket>/<Object> format, the access path does not contain an endpoint. In this case, you can configure a bucket-level endpoint in the core-site.xml configuration file of Hadoop to point to the endpoint of OSS-HDFS.

    <configuration>
        <property>
            <!-- In this example, examplebucket is used as the name of the bucket for which OSS-HDFS is enabled. Specify your actual bucket name.  -->
            <name>fs.oss.bucket.examplebucket.endpoint</name>
            <!-- In this example, the China (Hangzhou) region is used. Specify your actual region.  -->
            <value>cn-hangzhou.oss-dls.aliyuncs.com</value>
        </property>
    </configuration>
  • Use the default OSS endpoint

    If you use the access path in the oss://<Bucket>/<Object> format and do not configure a bucket-level endpoint, the default OSS endpoint is used to access OSS-HDFS. Configure the default OSS endpoint in the core-site.xml configuration file of Hadoop as follows:

    <configuration>
        <property>
            <name>fs.oss.endpoint</name>
            <!-- In this example, the China (Hangzhou) region is used. Specify your actual region.  -->
            <value>cn-hangzhou.oss-dls.aliyuncs.com</value>
        </property>
    </configuration>
Note

If you configure the endpoint by using more than one method, the configurations take effect in the following order of precedence: the endpoint specified in the access path > the bucket-level endpoint > the default OSS endpoint.

Appendix 2: Performance tuning

You can add the following configuration items to the core-site.xml file of Hadoop based on your requirements. Only JindoSDK 4.0 and later support these configuration items.

<configuration>

    <property>
          <!-- Directories to which the client writes temporary files. You can configure multiple directories that are separated by commas (,). Read and write permissions must be granted in environments that involve multiple users. -->
        <name>fs.oss.tmp.data.dirs</name>
        <value>/tmp/</value>
    </property>

    <property>
          <!-- The number of retries on failed access to OSS. -->
        <name>fs.oss.retry.count</name>
        <value>5</value>
    </property>

    <property>
          <!-- The timeout period to access OSS. Unit: milliseconds. -->
        <name>fs.oss.timeout.millisecond</name>
        <value>30000</value>
    </property>

    <property>
          <!-- The timeout period to connect to OSS. Unit: milliseconds. -->
        <name>fs.oss.connection.timeout.millisecond</name>
        <value>3000</value>
    </property>

    <property>
          <!-- The number of concurrent threads that are used to upload a single object to OSS. -->
        <name>fs.oss.upload.thread.concurrency</name>
        <value>5</value>
    </property>

    <property>
          <!-- The number of concurrent tasks that are initiated to upload objects to OSS. -->
        <name>fs.oss.upload.queue.size</name>
        <value>5</value>
    </property>

    <property>
          <!-- The maximum number of concurrent tasks that are initiated to upload objects to OSS in a process. -->
        <name>fs.oss.upload.max.pending.tasks.per.stream</name>
        <value>16</value>
    </property>

    <property>
          <!-- The number of concurrent tasks that are initiated to download objects from OSS. -->
        <name>fs.oss.download.queue.size</name>
        <value>5</value>
    </property>

    <property>
          <!-- The number of concurrent threads that are used to download a single object from OSS. -->
        <name>fs.oss.download.thread.concurrency</name>
        <value>16</value>
    </property>

    <property>
          <!-- The size of the buffer that is used to prefetch data from OSS. -->
        <name>fs.oss.read.readahead.buffer.size</name>
        <value>1048576</value>
    </property>

    <property>
          <!-- The number of buffers that are used to prefetch data from OSS at the same time. -->
        <name>fs.oss.read.readahead.buffer.count</name>
        <value>4</value>
    </property>

</configuration>
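
After you apply the tuning parameters, you can get a rough sense of their effect by timing a large upload. The following is a hypothetical test that creates a 1 GB file of zeros; the bucket and endpoint are from the preceding examples:

# Create a 1 GB test file and time its upload to OSS-HDFS.
dd if=/dev/zero of=/tmp/testfile bs=1M count=1024
time hdfs dfs -put /tmp/testfile oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/
# Remove the test object after the measurement.
hdfs dfs -rm oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/testfile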