How to Use JindoDistCp for Offline Data Migration to a Data Lake

This article describes offline data migration to a data lake with JindoDistCp and explains how the tool improves migration performance in different scenarios.

A data lake works like a big pool that centrally stores heterogeneous data of all types. On Alibaba Cloud, a data lake uses Object Storage Service (OSS) as its storage layer. Enterprises can quickly build their own data lake with Alibaba Cloud services and pay according to the capacity they use. The next important step is migrating heterogeneous data into the data lake.

In traditional big data practices, users often employ HDFS as the underlying storage for massive heterogeneous data. Most of this data can be moved into a data lake backed by OSS through offline data migration. Hadoop’s DistCp is the most common tool for offline data migration and copying. However, it cannot fully utilize the features of OSS, which results in poor efficiency and consistency. In addition, its relatively simple feature set fails to meet users’ demands. An efficient and fully functional offline migration tool is therefore crucial for moving data into a data lake.

JindoDistCp, a data lake offline data migration tool built on Alibaba Cloud’s JindoFS SDK, is available to all users. Developed by the Alibaba Cloud E-MapReduce (EMR) team, it copies files in a distributed manner within and between large-scale clusters. JindoDistCp uses MapReduce to distribute files, handle errors, and recover data. It takes lists of files and directories as the input of map/reduce tasks, and each task copies a portion of the files in the source list. JindoDistCp supports data copying among HDFS, S3, and OSS, and provides various customized copying parameters and strategies.

To optimize data copying from HDFS and S3 to OSS in a data lake, JindoDistCp implements No-Rename copying and ensures data consistency through a customized CopyCommitter. It covers all functions of S3DistCp and HadoopDistCp and offers better performance than HadoopDistCp. It aims to serve as an efficient, stable, and secure tool for offline data migration to a data lake. This article introduces how to use JindoDistCp for basic offline data migration and how to improve migration performance in different scenarios. JindoDistCp was previously available only within EMR services. It is now available to all users of Alibaba Cloud OSS and HDFS and is officially maintained and supported. We welcome all users to integrate and use JindoDistCp.

HadoopDistCp

HadoopDistCp is the distributed data migration tool that ships with Hadoop. It provides basic file copying, overwrite copying, control over map parallelism, and a configurable log output path. Hadoop 2.x added some optimizations, such as a selectable copying strategy. The default strategy, uniformsize, balances the total file size assigned to each map; if dynamic is specified, DynamicInputFormat is applied instead. These features optimize data copying between HDFS clusters but do not improve writes to object storage systems such as OSS.
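
As a rough illustration of the options described above, the following sketch assembles a HadoopDistCp command line and prints it for review instead of running it. The helper name build_distcp_cmd, the cluster addresses, the log directory, and the map count are placeholders chosen for this example; the flags themselves (-m, -strategy, -log, -overwrite) are standard DistCp options.

```shell
#!/bin/sh
# Sketch only: build a Hadoop DistCp command line as a string.
# All paths and numbers passed in below are placeholders.
build_distcp_cmd() {
    maps="$1"; strategy="$2"; log_dir="$3"; src="$4"; dest="$5"
    # -m          caps the number of simultaneous map tasks
    # -strategy   uniformsize (the default) or dynamic (DynamicInputFormat)
    # -log        directory where DistCp writes its copy logs
    # -overwrite  replaces files that already exist at the destination
    echo "hadoop distcp -m $maps -strategy $strategy -log $log_dir -overwrite $src $dest"
}

# Print the command for review; nothing is executed against the cluster.
build_distcp_cmd 20 dynamic /tmp/distcp-logs \
    hdfs://nn1:8020/data/incoming hdfs://nn2:8020/data/backup
```

Once the printed command looks right, it can be pasted into a shell on a node where the hadoop client is installed.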

S3DistCp

S3DistCp is an AWS tool that extends HadoopDistCp to work with S3. It has some practical features, such as incremental data copying, specifying a compression format during copying, aggregating data based on patterns, and copying based on a file list.

JindoDistCp

JindoDistCp is an easy-to-use tool for copying distributed files, mainly used in EMR clusters. It migrates data from HDFS and S3 to OSS and offers more customized functions than HadoopDistCp and S3DistCp. It integrates deeply with OSS and customizes the CopyCommitter to implement No-Rename copying, which reduces the time required for offline data migration to a data lake.

Why Choose JindoDistCp?

1) JindoDistCp provides high efficiency in offline data migration. In test scenarios, it is up to 1.59 times faster than HadoopDistCp.

2) JindoDistCp provides a rich set of basic functions, multiple copying methods, and various scenario-based optimization strategies.

3) Through deep integration with OSS, JindoDistCp can write migrated files directly as archive, low-frequency (infrequent access), or compressed data without any additional operations.

4) No-Rename copying ensures data consistency.

5) JindoDistCp covers a wide range of scenarios and supports multiple Hadoop versions, so it can serve as a substitute for HadoopDistCp.

Performance Comparison: JindoDistCp vs. HadoopDistCp

The EMR team compared the performance of JindoDistCp and HadoopDistCp for offline data migration from HDFS to OSS. Using TestDFSIO, the benchmark tool that ships with Hadoop, the team generated 1,000 files for each of three file sizes: 10 MB, 500 MB, and 1 GB. These files were then copied from HDFS to OSS with both tools.

JindoDistCp offers significantly better performance than HadoopDistCp, with a maximum acceleration of 1.59 times in the test scenario.

Toolkit Use

1) Download the jar Package

Download the latest jar package, jindo-distcp-x.x.x.jar, from the GitHub repo.

Note: Because the underlying SDK uses native code, the jar package currently works only on Linux and macOS. A new version will be available soon to support more operating systems.

2) Configure AK for Accessing OSS

Users can supply the AccessKey (AK) ID, AccessKey secret, and OSS endpoint by passing --ossKey, --ossSecret, and --ossEndPoint on the command line when running the program.

Example:

hadoop jar jindo-distcp-2.7.3.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com
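
To keep credentials out of scripts, the command above can be assembled from environment variables. The sketch below is one way to do that, not part of JindoDistCp: the function name jindo_distcp_cmd and the variable names OSS_ACCESS_KEY_ID, OSS_ACCESS_KEY_SECRET, and OSS_ENDPOINT are assumptions made for this example. It uses only the flags shown in this article and prints the command rather than running it.

```shell
#!/bin/sh
# Sketch only: build the JindoDistCp command from environment variables.
# The three variable names are hypothetical; JindoDistCp does not read them itself.
jindo_distcp_cmd() {
    jar="$1"; src="$2"; dest="$3"
    # Stop early if any credential is unset or empty.
    for name in OSS_ACCESS_KEY_ID OSS_ACCESS_KEY_SECRET OSS_ENDPOINT; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: $name" >&2
            return 1
        fi
    done
    # echo joins its arguments with single spaces, giving one command line.
    echo "hadoop jar $jar --src $src --dest $dest" \
         "--ossKey $OSS_ACCESS_KEY_ID --ossSecret $OSS_ACCESS_KEY_SECRET" \
         "--ossEndPoint $OSS_ENDPOINT"
}
```

After exporting the three variables, calling jindo_distcp_cmd jindo-distcp-2.7.3.jar /data/incoming/hourly_table oss://yang-hhht/hourly_table prints the full command. Values passed as command-line flags remain visible to other users on the same host while the job is starting, so this only removes the secrets from the script file.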

Users can also pre-configure the AK, secret, and endpoint of OSS in the core-site.xml file of Hadoop to avoid passing them on every run.

<configuration>
    <property>
        <name>fs.jfs.cache.oss-accessKeyId</name>
        <value>xxx</value>
    </property>

    <property>
        <name>fs.jfs.cache.oss-accessKeySecret</name>
        <value>xxx</value>
    </property>

    <property>
        <name>fs.jfs.cache.oss-endpoint</name>
        <value>oss-cn-xxx.aliyuncs.com</value>
    </property>
</configuration>
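
Before launching a job, it can help to confirm that all three properties are present in the file. The following is a minimal sketch of such a check; the function name check_oss_config is made up for this example, and the check is a plain text search rather than an XML parse, so it would also match a property that has been commented out.

```shell
#!/bin/sh
# Sketch only: list which of the three OSS properties are absent from a
# core-site.xml file. Returns 0 when all are present, 1 otherwise.
check_oss_config() {
    conf_file="$1"
    missing=0
    for key in fs.jfs.cache.oss-accessKeyId \
               fs.jfs.cache.oss-accessKeySecret \
               fs.jfs.cache.oss-endpoint; do
        # -F treats the pattern as a literal string; -q suppresses output.
        if ! grep -qF "<name>$key</name>" "$conf_file"; then
            echo "missing property: $key"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call would be check_oss_config "$HADOOP_CONF_DIR/core-site.xml", assuming HADOOP_CONF_DIR points at the directory that holds the Hadoop configuration files.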

Moreover, we recommend configuring the password-free feature so that the AK is not stored in plaintext, which improves data security.

Instruction Manual

JindoDistCp provides various practical functions and parameters. For more information, see the JindoDistCp User Guide.
