This topic describes how to use Jindo DistCp to migrate data from Hadoop Distributed File System (HDFS) to Object Storage Service (OSS). Jindo DistCp is a data copy tool provided by Alibaba Cloud.

Background information

HDFS is used as the underlying storage for a large amount of data in traditional big data architectures. The DistCp tool provided by Hadoop is used to migrate or copy data in HDFS. However, Hadoop DistCp cannot take advantage of the features of OSS, which results in low efficiency and poor data consistency. In addition, Hadoop DistCp provides only basic features, which cannot meet all user requirements.

Jindo DistCp is used to copy files in a distributed file system, and you can use it to copy files within or between large-scale clusters. Jindo DistCp uses MapReduce to distribute files, handle errors, and restore data. Lists of files and directories are used as the input of the map and reduce tasks, and each task copies a portion of the files and directories in the input list. Jindo DistCp allows you to copy data between HDFS clusters, between HDFS and OSS, and between OSS buckets. It also provides various custom parameters and policies for data copying.

Compared with Hadoop DistCp, Jindo DistCp has the following advantages in data migration from HDFS to OSS:
  • High efficiency. The data migration speed of Jindo DistCp can reach 1.59 times that of Hadoop DistCp.
  • Rich basic features. Jindo DistCp provides multiple copy methods and various scenario-based optimization policies.
  • Deep integration with OSS. Jindo DistCp takes advantage of OSS features so that you can perform various operations on files, such as compressing files and converting their storage class to Archive.
  • File copying without changing file names. This ensures data consistency.
  • High compatibility. Jindo DistCp is applicable to various scenarios and can be used to replace Hadoop DistCp. Jindo DistCp supports Hadoop 2.7.x and Hadoop 3.x.

Prerequisites

  • If you use a self-managed Elastic Compute Service (ECS) cluster, a Hadoop 2.7.x or Hadoop 3.x environment is available and MapReduce jobs can be run in the environment. You can confirm the Hadoop version as shown after this list.
  • If you use Alibaba Cloud E-MapReduce (EMR):
    • For EMR V3.28.0, Bigfoot 2.7.0, or later, you can run Shell commands to use Jindo DistCp. For more information, see Use Jindo DistCp.
    • For versions earlier than EMR V3.28.0 or Bigfoot 2.7.0, you may encounter compatibility issues. In that case, submit a ticket for technical support.
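
You can confirm the Hadoop version of a self-managed environment by running the following command on a cluster node:

    # Print the Hadoop version to confirm that it is 2.7.x or 3.x.
    hadoop version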

Step 1: Download the JAR package of Jindo DistCp
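
Download the JAR package of Jindo DistCp to a node in your cluster on which Hadoop commands can be run. The following command is a minimal sketch; the placeholder <jindo-distcp-download-url> stands for the actual download link of the jindo-distcp-3.7.3.jar package, which you can obtain from the official Jindo DistCp documentation:

    # Download the Jindo DistCp JAR package to the current directory. The URL is a placeholder.
    wget <jindo-distcp-download-url> -O jindo-distcp-3.7.3.jar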

Step 2: Configure the AccessKey pair used to access OSS

You can configure the AccessKey pair by using one of the following methods:
  • Configure the AccessKey pair directly in the command.

    For example, to copy a file from HDFS to a specified path in OSS, configure --ossKey, --ossSecret, and --ossEndPoint in the following command:

    hadoop jar jindo-distcp-3.7.3.jar --src /data/incoming/examplefile --dest oss://examplebucket/example_file --ossKey LTAI5t7h6SgiLSganP2m**** --ossSecret KZo149BD9GLPNiDIEmdQ7dyNKG**** --ossEndPoint oss-cn-hangzhou.aliyuncs.com
  • Configure the AccessKey pair by using a configuration file.

    Configure --ossKey, --ossSecret, and --ossEndPoint in the core-site.xml file of Hadoop. The following code provides an example on how to configure the AccessKey pair in the configuration file:

    <configuration>
        <property>
            <name>fs.oss.accessKeyId</name>
            <value>xxx</value>
        </property>
    
        <property>
            <name>fs.oss.accessKeySecret</name>
            <value>xxx</value>
        </property>
    
        <property>
            <name>fs.oss.endpoint</name>
            <!-- If you access OSS from an ECS instance, we recommend that you use an internal endpoint of OSS in the oss-cn-xxx-internal.aliyuncs.com format. -->
            <value>oss-cn-xxx.aliyuncs.com</value>
        </property>
    </configuration>
  • Use the password-free feature of JindoFS SDK.

    If you use the password-free feature of JindoFS SDK, you do not need to store your AccessKey pair in plaintext, which improves data security. For more information, see Use the password-free feature of JindoFS SDK.
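
After you configure the AccessKey pair by using any of the preceding methods, you can verify that OSS is accessible from the cluster. The following command is a minimal sketch that assumes an OSS-compatible FileSystem implementation, such as the JindoFS SDK, is configured for the oss:// scheme in core-site.xml:

    # List the destination bucket to confirm that the credentials and endpoint work.
    hadoop fs -ls oss://examplebucket/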

Step 3: Migrate or copy data

Jindo DistCp V3.7.3 is used in the following examples. Replace the version number with the version that you actually use.

  • Migrate or copy full data.

    The following command provides an example on how to migrate or copy full data from the /data/incoming directory in HDFS to the oss://examplebucket/incoming path in OSS:

    hadoop jar jindo-distcp-3.7.3.jar --src /data/incoming --dest oss://examplebucket/incoming --ossKey LTAI5t7h6SgiLSganP2m**** --ossSecret KZo149BD9GLPNiDIEmdQ7dyNKG**** --ossEndPoint oss-cn-hangzhou.aliyuncs.com --parallelism 10

    The following list describes the parameters and options in the command.

    --src: The source path in HDFS of the data to migrate or copy. Example: /data/incoming
    --dest: The destination path in OSS of the data to migrate or copy. Example: oss://examplebucket/incoming
    --ossKey: The AccessKey ID used to access OSS. For information about how to obtain an AccessKey ID, see Obtain an AccessKey pair. Example: LTAI5t7h6SgiLSganP2m****
    --ossSecret: The AccessKey secret used to access OSS. For information about how to obtain an AccessKey secret, see Obtain an AccessKey pair. Example: KZo149BD9GLPNiDIEmdQ7dyNKG****
    --ossEndPoint: The endpoint of the region in which the bucket is located. For more information about the regions and endpoints of OSS, see Regions and endpoints. Notice: If you access OSS from an ECS instance, we recommend that you use an internal endpoint of OSS in the oss-cn-xxx-internal.aliyuncs.com format. Example: oss-cn-hangzhou.aliyuncs.com
    --parallelism: The number of data migration or data copying tasks that can run concurrently. Set this value based on the amount of resources in your cluster. Example: 10
  • Migrate or copy incremental data.

    If you want to migrate or copy only the incremental data from the source path after full data migration or copying, you can use the --update option.

    The following command provides an example on how to migrate or copy only incremental data from the /data/incoming directory in HDFS to the oss://examplebucket/incoming path in OSS:

    hadoop jar jindo-distcp-3.7.3.jar --src /data/incoming --dest oss://examplebucket/incoming --ossKey LTAI5t7h6SgiLSganP2m**** --ossSecret KZo149BD9GLPNiDIEmdQ7dyNKG**** --ossEndPoint oss-cn-hangzhou.aliyuncs.com --update --parallelism 10

    By default, checksum verification is enabled when you use the --update option. Jindo DistCp compares the names, sizes, and checksums of the files in the source path and the destination path. If any of these items are inconsistent, incremental data migration or copying is automatically started.

    To prevent Jindo DistCp from comparing the checksums of the files in the source path and the destination path, add the --disableChecksum option to the command. Example:

    hadoop jar jindo-distcp-3.7.3.jar --src /data/incoming --dest oss://examplebucket/incoming --ossKey LTAI5t7h6SgiLSganP2m**** --ossSecret KZo149BD9GLPNiDIEmdQ7dyNKG**** --ossEndPoint oss-cn-hangzhou.aliyuncs.com --update --disableChecksum --parallelism 10
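
If you need to synchronize incremental data on a regular basis, you can wrap the --update command in a script. The following shell script is a minimal sketch; the paths, bucket name, credentials, and log file path are examples:

#!/bin/bash
# Incrementally synchronize the /data/incoming directory in HDFS to OSS
# and append the job output of each run to a log file.
hadoop jar jindo-distcp-3.7.3.jar \
  --src /data/incoming \
  --dest oss://examplebucket/incoming \
  --ossKey LTAI5t7h6SgiLSganP2m**** \
  --ossSecret KZo149BD9GLPNiDIEmdQ7dyNKG**** \
  --ossEndPoint oss-cn-hangzhou.aliyuncs.com \
  --update --parallelism 10 >> /var/log/jindo-distcp-sync.log 2>&1

You can then schedule the script by using a tool such as cron.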

Appendix 1: Parameters and options supported by Jindo DistCp

Jindo DistCp provides a variety of parameters and options. You can run the following command to obtain information about the parameters and options:

hadoop jar jindo-distcp-3.7.3.jar --help

The following list describes the parameters and options.

--src: The source path of the files to copy. Example: --src oss://exampleBucket/sourceDir
--dest: The destination path of the files to copy. Example: --dest oss://exampleBucket/destDir
--parallelism: The number of copy tasks that can run concurrently. Set this value based on the amount of resources in your cluster. Example: --parallelism 10
--policy: The storage class of the files after they are copied to OSS. Valid values:
  • ia: Infrequent Access (IA)
  • archive: Archive
  • coldArchive: Cold Archive
  Example: --policy archive
--srcPattern: Specifies a regular expression that filters the files to copy. The regular expression must match the full path of a file. Example: --srcPattern .*\.log
--deleteOnSuccess: Specifies that the files in the source path are deleted after they are copied to the destination path. Example: --deleteOnSuccess
--outputCodec: The compression method for the files to copy. The compression codecs available in the current version are gzip, gz, lzo, lzop, and snappy. Jindo DistCp also supports the following keywords:
  • none: saves the files uncompressed. If the files are compressed, Jindo DistCp decompresses them.
  • keep: keeps the files compressed as they are. This is the default keyword.
  Note If you want to use the Lempel–Ziv–Oberhumer (LZO) compression algorithm in an open source Hadoop cluster, you must install the native library of gplcompression and the Hadoop-LZO package. Otherwise, we recommend that you use another compression method.
  Example: --outputCodec gzip
--srcPrefixesFile: A file that contains the list of files to copy. The files in the list are prefixed with the path specified by --src. Example: --srcPrefixesFile file:///opt/folders.txt
--outputManifest: Specifies that a gzip manifest file is generated in the directory specified by --dest to record information about the copied files. Example: --outputManifest=manifest-2020-04-17.gz
--requirePreviousManifest: Specifies whether the copy operation reads the files that were copied by previous copy operations. Valid values:
  • false: does not read the previously copied files and copies full data.
  • true: reads the previously copied files and copies only incremental data.
  Example: --requirePreviousManifest=false
--previousManifest: Specifies the manifest file that records the previously copied files so that only incremental data is copied. Example: --previousManifest=oss://exampleBucket/manifest-2020-04-16.gz
--copyFromManifest: Specifies that the files recorded in the manifest file are copied. This option is usually used together with --previousManifest. Example: --previousManifest oss://exampleBucket/manifest-2020-04-16.gz --copyFromManifest
--groupBy: Specifies a regular expression that groups the files that match specific conditions. Example: --groupBy='.*/([a-z]+).*.txt'
--targetSize: The threshold for the file size after the files are grouped. Unit: MB. Example: --targetSize=10
--enableBalancePlan: Use this option when the files to copy are similar in size, for example, when all files are larger than 10 GB or all files are smaller than 10 KB. Example: --enableBalancePlan
--enableDynamicPlan: Use this option when the files to copy differ significantly in size, for example, when files larger than 10 GB and files smaller than 10 KB must be copied in the same task. Example: --enableDynamicPlan
--enableTransaction: Guarantees data consistency between jobs. By default, only data consistency between tasks is guaranteed. Example: --enableTransaction
--diff: Checks whether all files are copied and generates a list of the files that fail to be copied. Example: --diff
--ossKey: The AccessKey ID used to access OSS. Example: --ossKey LTAI5t7h6SgiLSganP2m****
--ossSecret: The AccessKey secret used to access OSS. Example: --ossSecret KZo149BD9GLPNiDIEmdQ7dyNKG****
--ossEndPoint: The endpoint of the region in which the bucket is located. Example: --ossEndPoint oss-cn-hangzhou.aliyuncs.com
--cleanUpPending: Cleans up files that are incompletely copied to OSS. The cleanup takes some time. Example: --cleanUpPending
--queue: The name of a YARN queue. Example: --queue examplequeue1
--bandwidth: The bandwidth for the data copy task. Unit: MB. Example: --bandwidth 6
--disableChecksum: Disables checksum verification. Example: --disableChecksum
--enableCMS: Enables the alerting feature of CloudMonitor. Example: --enableCMS
--update: Specifies that only incremental data is migrated to the destination path. Incremental data refers to the data that is added to the source path after the last full data migration. Example: --update
--filters: Specifies the path of a file. Each line in the file contains a regular expression that matches the files that do not need to be copied or compared. Example: --filters /path/to/filterfile.txt
--tmp: Specifies the directory that stores temporary files when you use Jindo DistCp. Example: --tmp /data
--overwrite: Specifies that the files in the source path overwrite the files with the same names in the destination path. Example: --overwrite
--ignore: Specifies that exceptions are ignored during data migration so that the migration is not interrupted. Errors are reported as Jindo DistCp counters. If the --enableCMS option is used, you receive notifications in the specified form. Example: --ignore
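
You can combine several of these parameters and options in one command. The following command is a hypothetical example that copies only the files whose names end with .log, compresses them with gzip, and converts their storage class to Archive; the paths and AccessKey pair are examples:

# Copy .log files only, compress them with gzip, and store them in the Archive storage class.
hadoop jar jindo-distcp-3.7.3.jar --src /data/incoming --dest oss://examplebucket/incoming --ossKey LTAI5t7h6SgiLSganP2m**** --ossSecret KZo149BD9GLPNiDIEmdQ7dyNKG**** --ossEndPoint oss-cn-hangzhou.aliyuncs.com --srcPattern '.*\.log' --outputCodec gzip --policy archive --parallelism 10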

Appendix 2: Sample scenarios

Scenario 1: How do I verify data integrity after data is transmitted to OSS by using Jindo DistCp?

You can use the following methods to verify data integrity:

  • Method 1: Use Jindo DistCp counters

    The counters in the Jindo DistCp job output, such as BYTES_EXPECTED and FILES_EXPECTED, can be used to verify data integrity (see the sketch after this list).

    Example
        JindoDistcpCounter
            BYTES_COPIED=10000
            BYTES_EXPECTED=10000
            FILES_COPIED=11
            FILES_EXPECTED=11
            ...
        Shuffle Errors
            BAD_ID=0
            CONNECTION=0
            IO_ERROR=0
            WRONG_LENGTH=0
            WRONG_MAP=0
            WRONG_REDUCE=0

    The following list describes the counters that may be included in the preceding example.

    BYTES_COPIED: The number of bytes that are copied.
    BYTES_EXPECTED: The number of bytes to copy.
    FILES_COPIED: The number of files that are copied.
    FILES_EXPECTED: The number of files to copy.
    FILES_SKIPPED: The number of files that are skipped when only incremental data is copied.
    BYTES_SKIPPED: The number of bytes that are skipped when only incremental data is copied.
    COPY_FAILED: The number of files that fail to be copied. An alert is triggered when the value is not 0.
    BYTES_FAILED: The number of bytes that fail to be copied.
    DIFF_FILES: The number of files that are different between the source path and the destination path. An alert is triggered when the value is not 0.
    DIFF_FAILED: The number of files that fail to be compared. The number is added to the DIFF_FILES value.
    SRC_MISS: The number of files that do not exist in the source path. The number is added to the DIFF_FILES value.
    DST_MISS: The number of files that do not exist in the destination path. The number is added to the DIFF_FILES value.
    LENGTH_DIFF: The number of files that have the same name but different sizes in the source path and the destination path. The number is added to the DIFF_FILES value.
    CHECKSUM_DIFF: The number of files that fail checksum verification. The number is added to the COPY_FAILED value.
    SAME_FILES: The number of files that are identical in the source path and the destination path.
  • Method 2: Use the --diff option

    You can use the --diff option to compare the names and sizes of files in the source path and the destination path. Example:

    hadoop jar jindo-distcp-3.7.3.jar --src /data/incoming --dest oss://examplebucket/incoming --ossKey LTAI5t7h6SgiLSganP2m**** --ossSecret KZo149BD9GLPNiDIEmdQ7dyNKG**** --ossEndPoint oss-cn-hangzhou.aliyuncs.com --diff
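
To apply Method 1, you can save the job output to a file and extract the counters from it. The following commands are a minimal sketch; distcp.log is an example file name:

# Save the Jindo DistCp job output while the job runs.
hadoop jar jindo-distcp-3.7.3.jar --src /data/incoming --dest oss://examplebucket/incoming --ossKey LTAI5t7h6SgiLSganP2m**** --ossSecret KZo149BD9GLPNiDIEmdQ7dyNKG**** --ossEndPoint oss-cn-hangzhou.aliyuncs.com --parallelism 10 2>&1 | tee distcp.log
# If BYTES_COPIED equals BYTES_EXPECTED and FILES_COPIED equals FILES_EXPECTED, the data is complete.
grep -E '(BYTES|FILES)_(COPIED|EXPECTED)' distcp.log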

Scenario 2: Which parameters can I configure to resume data migration from HDFS to OSS in the case of a migration failure?

  1. You can use the --diff option to check whether all files in the specified path in HDFS are migrated to OSS.
    hadoop jar jindo-distcp-3.7.3.jar --src /data/incoming --dest oss://examplebucket/incoming --ossKey LTAI5t7h6SgiLSganP2m**** --ossSecret KZo149BD9GLPNiDIEmdQ7dyNKG**** --ossEndPoint oss-cn-hangzhou.aliyuncs.com --diff

    If the following command output is displayed, all files have been migrated. Otherwise, a manifest file is generated in the current working directory.

    INFO distcp.JindoDistCp: Jindo DistCp job exit with 0
  2. You can use the --copyFromManifest and --previousManifest options to migrate the files listed in the manifest file.
    hadoop jar jindo-distcp-3.7.3.jar --src /data/incoming --dest oss://examplebucket/incoming --previousManifest=file:///opt/manifest-2020-04-17.gz --copyFromManifest --parallelism 20

    In this command, file:///opt/manifest-2020-04-17.gz specified by --previousManifest is a path on the local disk of the node on which you run the command.

Scenario 3: Which parameters can I configure to specify the storage class of files to migrate to OSS as IA, Archive, or Cold Archive?

You can use the --policy option to specify the storage class of files to migrate to OSS. The IA storage class is specified in the following example:
hadoop jar jindo-distcp-3.7.3.jar --src /data/incoming --dest oss://examplebucket/incoming --ossKey LTAI5t7h6SgiLSganP2m**** --ossSecret KZo149BD9GLPNiDIEmdQ7dyNKG**** --ossEndPoint oss-cn-hangzhou.aliyuncs.com --policy ia --parallelism 10

To set the storage class of the files to Archive, replace --policy ia with --policy archive. To set the storage class to Cold Archive, replace --policy ia with --policy coldArchive. Cold Archive is supported only in specific regions. For more information, see Cold Archive.

Scenario 4: I have a clear understanding of data distribution (such as the proportion of large files to small files) in the source path. Which parameters can I configure to speed up data transmission?

  • Files to migrate include a large number of small files and a small number of extra large files.

    For example, files in the source path in HDFS include 500,000 files of 100 KB and 10 files of 5 TB. In this case, you can use the --enableDynamicPlan option to speed up data transmission.

    hadoop jar jindo-distcp-3.7.3.jar --src /data/incoming --dest oss://examplebucket/incoming --ossKey LTAI5t7h6SgiLSganP2m**** --ossSecret KZo149BD9GLPNiDIEmdQ7dyNKG**** --ossEndPoint oss-cn-hangzhou.aliyuncs.com --enableDynamicPlan --parallelism 10
  • Files to migrate have little difference in size.

    For example, files in the source path in HDFS include 100 files of 200 KB. In this case, you can use the --enableBalancePlan option to speed up data transmission.

    hadoop jar jindo-distcp-3.7.3.jar --src /data/incoming --dest oss://examplebucket/incoming --ossKey LTAI5t7h6SgiLSganP2m**** --ossSecret KZo149BD9GLPNiDIEmdQ7dyNKG**** --ossEndPoint oss-cn-hangzhou.aliyuncs.com --enableBalancePlan --parallelism 10
Note --enableDynamicPlan and --enableBalancePlan cannot be used together.

Scenario 5: After data is migrated or copied, which parameters can I configure to delete specific data from the source path?

You can use the --deleteOnSuccess option to delete specific data from the source path after the data is migrated or copied to the destination path.

hadoop jar jindo-distcp-3.7.3.jar --src /data/incoming --dest oss://examplebucket/incoming --ossKey LTAI5t7h6SgiLSganP2m**** --ossSecret KZo149BD9GLPNiDIEmdQ7dyNKG**** --ossEndPoint oss-cn-hangzhou.aliyuncs.com --deleteOnSuccess --parallelism 10