E-MapReduce: Migrate data from HDFS to OSS or OSS-HDFS

Last Updated: Nov 24, 2023

This topic describes how to use Jindo DistCp to migrate data from Hadoop Distributed File System (HDFS) to Alibaba Cloud Object Storage Service (OSS) or OSS-HDFS.

Prerequisites

The required environments are prepared and the related tools are downloaded. For more information, see Use Jindo DistCp.

Precautions

JindoSDK 4.4.0 and later use different domain names to access OSS or OSS-HDFS, depending on the scenario. For data reads and writes, the standard internal endpoint of OSS is used by default. If you run the distcp command outside the Alibaba Cloud internal network, you must set fs.oss.data.endpoint to the public endpoint of OSS in the core-site.xml file of Hadoop-Common:

<configuration>
    <property>
        <name>fs.oss.data.endpoint</name>
        <value>oss-cn-xxx.aliyuncs.com</value>
    </property>
</configuration>
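
To see which data endpoint Hadoop actually resolves on a node, a quick check along the following lines may help. This is a minimal sketch and not part of the Jindo DistCp tooling: endpoint_kind is a hypothetical helper, and it assumes that internal OSS endpoints end in -internal.aliyuncs.com. The hdfs getconf call runs only if the hdfs CLI is on the PATH.

```shell
# Classify an OSS endpoint by its host name.
# Assumption: internal OSS endpoints end in "-internal.aliyuncs.com".
endpoint_kind() {
  case "$1" in
    *-internal.aliyuncs.com) echo internal ;;
    *.aliyuncs.com)          echo public ;;
    *)                       echo unknown ;;
  esac
}

# Print the value that Hadoop resolves for fs.oss.data.endpoint, if hdfs is available.
if command -v hdfs >/dev/null 2>&1; then
  value="$(hdfs getconf -confKey fs.oss.data.endpoint 2>/dev/null || true)"
  echo "fs.oss.data.endpoint=${value:-<unset>} ($(endpoint_kind "$value"))"
fi
```

If the key is unset, the default described above applies, so a job that runs outside the internal network would likely need the public endpoint configured first.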

Procedure

  1. Log on to the E-MapReduce (EMR) console and specify the AccessKey pair that is used to access OSS or OSS-HDFS.

    Find the Hadoop-Common service of your cluster. On the Configure tab of the Hadoop-Common service page, click core-site.xml and add the configuration items that are described in the following table. This way, you do not need to repeatedly specify the AccessKey pair. For more information about how to add configuration items, see Manage configuration items.

    • fs.oss.accessKeyId: the AccessKey ID that is used to access OSS or OSS-HDFS.

    • fs.oss.accessKeySecret: the AccessKey secret that is used to access OSS or OSS-HDFS.

  2. Copy data to OSS or OSS-HDFS.

    1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.

    2. Go to the /opt/apps/JINDOSDK/jindosdk-current/tools directory.

      Note the full name of the JAR package in this directory. The name is in the jindo-distcp-tool-${version}.jar format. Example: jindo-distcp-tool-4.6.11.jar.

    3. Run the following command to copy the source directory and data in HDFS to OSS or OSS-HDFS:

      hadoop jar jindo-distcp-tool-${version}.jar --src /data --dest oss://destBucket.cn-xxx.oss-dls.aliyuncs.com/dir/ --parallelism 10

      The following list describes the parameters.

      • --src: the source path in HDFS. Example: /data

      • --dest: the destination path in OSS or OSS-HDFS. Examples:

        • OSS: oss://destBucket/

        • OSS-HDFS: oss://destBucket.cn-xxx.oss-dls.aliyuncs.com/

        Note: If you want to access OSS-HDFS, we recommend that you specify the access path in the oss://<Bucket>.<Endpoint>/<Object> format, such as oss://mydlsbucket.cn-shanghai.oss-dls.aliyuncs.com/Test. JindoSDK also allows you to configure the endpoint that is used to access OSS-HDFS by using other methods. For more information, see Configure an endpoint to access OSS-HDFS (JindoFS).

      • --parallelism: the task parallelism. You can adjust the value of this parameter based on the cluster resources. Example: 10
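
Steps 2 and 3 of the copy procedure can be scripted roughly as follows. This is a sketch under stated assumptions: find_distcp_jar and oss_hdfs_uri are hypothetical helper names, TOOLS_DIR is an illustrative variable that defaults to the tools directory named above, and the final command is only printed (a dry run) so that you can review it before you run it.

```shell
# Print the first jindo-distcp-tool JAR found in the given directory.
# Returns a non-zero status if no matching file exists.
find_distcp_jar() {
  for jar in "$1"/jindo-distcp-tool-*.jar; do
    [ -f "$jar" ] && { echo "$jar"; return 0; }
  done
  return 1
}

# Build an OSS-HDFS path in the oss://<Bucket>.<Endpoint>/<Object> format.
oss_hdfs_uri() {  # arguments: bucket, endpoint, object path
  echo "oss://$1.$2/$3"
}

TOOLS_DIR="${TOOLS_DIR:-/opt/apps/JINDOSDK/jindosdk-current/tools}"
if jar_path="$(find_distcp_jar "$TOOLS_DIR")"; then
  # Placeholder bucket and endpoint, as in the example command above.
  dest="$(oss_hdfs_uri destBucket cn-xxx.oss-dls.aliyuncs.com dir/)"
  # Dry run: print the command instead of running it.
  echo hadoop jar "$jar_path" --src /data --dest "$dest" --parallelism 10
fi
```

Resolving the JAR name with a glob avoids hard-coding the version, which differs between JindoSDK releases.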

Advanced operations

  • Copy incremental files.

    If a Jindo DistCp job is interrupted and some files are not copied to the destination directory, you can rerun the job with the --update option to copy these files. If files are added to the source directory after a copy, you can also use the --update option to copy the incremental files to the destination directory.

    hadoop jar jindo-distcp-tool-${version}.jar --src /data --dest oss://destBucket.cn-xxx.oss-dls.aliyuncs.com/dir/ --update --parallelism 20

  • Specify a YARN queue and a bandwidth.

    Run the following command to specify a YARN queue and a bandwidth for a Jindo DistCp job based on your business requirements:

    hadoop jar jindo-distcp-tool-${version}.jar --src /data --dest oss://destBucket.cn-xxx.oss-dls.aliyuncs.com/dir/ --hadoopConf mapreduce.job.queuename=yarnQueue --bandWidth 100 --parallelism 10

    • --hadoopConf mapreduce.job.queuename: the name of the YARN queue.

    • --bandWidth: the bandwidth for a single Elastic Compute Service (ECS) instance, in MB/s.
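
Because an interrupted job can be resumed with --update, the incremental copy lends itself to a simple retry wrapper. The following is a minimal sketch, not a feature of Jindo DistCp: retry and copy_incremental are hypothetical helpers, DISTCP_JAR and DISTCP_DEST are illustrative environment variables, and the sketch assumes that the hadoop jar command exits with a non-zero status when the job fails.

```shell
# Run a command until it succeeds, at most $1 times in total.
retry() {
  attempts="$1"; shift
  n=1
  while ! "$@"; do
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    echo "retrying (attempt $n of $attempts)" >&2
  done
}

# Incremental copy, using the same options as the --update command above.
copy_incremental() {
  hadoop jar "$DISTCP_JAR" --src /data --dest "$DISTCP_DEST" --update --parallelism 20
}

# Runs only when both hypothetical variables are set.
if [ -n "${DISTCP_JAR:-}" ] && [ -n "${DISTCP_DEST:-}" ]; then
  retry 3 copy_incremental
fi
```

Each rerun uses --update, so a retry resumes the copy as described under "Copy incremental files".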

References

If you run into issues when you use Jindo DistCp, see FAQ about Jindo DistCp for troubleshooting.