Jindo DistCp is fully open for use

With the full release of the Alibaba Cloud JindoFS SDK, Jindo DistCp, an Alibaba Cloud data migration tool built on the JindoFS SDK, is now fully open to users. Jindo DistCp is a distributed file copy tool developed by the Alibaba Cloud E-MapReduce team for copying data within and between large-scale clusters. It uses MapReduce for file distribution, error handling, and recovery, taking lists of files and directories as the input of map/reduce tasks; each task copies a portion of the files in the source list. It currently fully supports the hdfs -> oss, hdfs -> hdfs, oss -> hdfs, and oss -> oss copy scenarios, provides a variety of copy parameters and copy strategies, and focuses in particular on optimizing data copies from hdfs to oss. Through a customized CopyCommitter, it implements No-Rename copy and guarantees the consistency of the copied data. Its functionality is fully aligned with S3 DistCp and HDFS DistCp, and its performance is greatly improved over HDFS DistCp; the goal is to provide an efficient, stable, and secure data copy tool.

This article mainly introduces how to use Jindo DistCp for basic file copying and how to improve copy performance in different scenarios. It is worth mentioning that Jindo DistCp was previously limited to internal use within the E-MapReduce product; it is now open to all Alibaba Cloud OSS/HDFS users, with official maintenance and technical support. Users are welcome to integrate and use it.

Big data and data migration tools

In the traditional big data field, HDFS is often used as the underlying storage, holding data at large scale. For data migration and data copy scenarios, the most commonly used tool is the DistCp tool that ships with Hadoop. However, it does not take good advantage of the characteristics of object storage systems such as OSS, which makes it inefficient and unable to guarantee consistency, and the options it provides are relatively simple and cannot meet users' needs. An efficient, feature-rich data migration tool therefore becomes an important factor when migrating software stacks and moving businesses to the cloud.

Hadoop DistCp

Hadoop DistCp is the distributed data migration tool bundled with Hadoop. It provides basic file copy, overwrite copy, configurable map parallelism, log output paths, and similar functions. DistCp received some optimizations in Hadoop 2.x, such as selectable copy strategies: by default, the uniform-size strategy is used (each map copies roughly the same total file size), and if dynamic is specified, DynamicInputFormat is used instead. These functions optimize data copying between ordinary hdfs clusters, but lack optimizations for writing data to object storage systems such as OSS.
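For reference, a typical Hadoop DistCp invocation that opts into the dynamic strategy looks like the following; the cluster addresses and paths are placeholders:

hadoop distcp -strategy dynamic -m 20 hdfs://namenode1:8020/data/src hdfs://namenode2:8020/data/dst

Here -strategy dynamic switches from the default uniform-size splitting to DynamicInputFormat, and -m caps the number of concurrent map tasks.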

S3 DistCp

S3 DistCp is a DistCp tool provided by AWS for S3. S3DistCp is an extension of Hadoop DistCp, optimized to work with S3 and extended with several practical functions, including incremental file copying, specifying the compression codec used when copying, aggregating data based on a pattern, and copying according to a file list. S3 DistCp is tied to the S3 object storage system and is currently only available inside AWS EMR; it is not open to ordinary users.
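For comparison, on an AWS EMR cluster an S3DistCp job is typically submitted as below; the bucket name and pattern are placeholders, and the options shown (--srcPattern, --outputCodec) come from the AWS EMR documentation:

s3-dist-cp --src hdfs:///data/incoming --dest s3://my-bucket/incoming --srcPattern .*\.log --outputCodec gzip

--srcPattern selects source files by regular expression, and --outputCodec recompresses them with the given codec during the copy.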

Jindo DistCp

Jindo DistCp is a simple, easy-to-use distributed file copy tool. It is currently used mainly in E-MapReduce clusters, primarily for migrating data from hdfs to OSS. Compared with Hadoop DistCp and S3 DistCp, Jindo DistCp adds many optimizations and personalized functions. In addition, it deeply integrates with the characteristics of OSS object storage: through a customized CopyCommitter, it implements No-Rename copy and greatly reduces the time consumed by data migration to the cloud. Jindo DistCp is now open to the public, giving all users an OSS data migration tool for moving data to the cloud.

Why use Jindo DistCp?

1. High efficiency, with up to a 1.59x speedup in test scenarios.

2. Rich basic functions, providing multiple copy methods and scenario-specific optimization strategies.

3. Deep integration with OSS, providing direct archive, infrequent access (IA), compression, and other operations on files (see the example command after this list).

4. No-Rename copy, ensuring the consistency of the copied data.

5. Comprehensive scenario coverage; it can fully replace Hadoop DistCp and supports multiple Hadoop versions (if you have questions, please submit an issue).
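As a sketch of point 3 above, the command below copies data to OSS while asking OSS to store the files with the Archive storage class. The --policy option and its archive value follow the Jindo DistCp option set, but treat them as an assumption and verify them against the --help output of your jar version:

hadoop jar jindo-distcp-2.7.3.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --policy archive --key yourkey --secret yoursecret --endPoint oss-cn-hangzhou.aliyuncs.com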

How compatible is Jindo DistCp?

Jindo DistCp currently supports Hadoop 2.7+ and the latest Hadoop 3.x, with a separate jar provided for each line. It depends on the Hadoop environment and does not conflict with Hadoop DistCp. Inside Alibaba Cloud EMR, Jindo DistCp is available out of the box and users do not need to download the jar package; elsewhere, after downloading the jar package, users can use it by supplying the OSS AK either as command parameters or in the Hadoop configuration file.

How much does Jindo DistCp improve performance?

We compared the performance of Jindo DistCp and Hadoop DistCp. The test takes hdfs to oss as the main scenario and uses the TestDFSIO tool that ships with Hadoop to generate 1,000 files of 10 MB, 1,000 files of 500 MB, and 1,000 files of 1 GB, which are then copied from hdfs to oss.

Analysis of the test results shows that Jindo DistCp delivers a clear performance improvement over Hadoop DistCp, with a maximum speedup of 1.59x in the test scenarios.

How to use Jindo DistCp

1. Download the jar package

Go to the GitHub repo and download the latest jar package, jindo-distcp-x.x.jar.

Note: at present, the jar package only supports the Linux and macOS operating systems, because the underlying SDK uses native code.

2. Configure OSS access AK

You can specify the AK directly when running the program through the --key, --secret, and --endPoint parameter options.

An example command is as follows:

hadoop jar jindo-distcp-2.7.3.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --key yourkey --secret yoursecret --endPoint oss-cn-hangzhou.aliyuncs.com

You can also pre-configure the OSS AccessKey ID, AccessKey secret, and endpoint in Hadoop's core-site.xml file, so that you do not have to fill in the AK each time you run the tool:

<configuration>
    <property>
        <name>fs.jfs.cache.oss-accessKeyId</name>
        <value>xxx</value>
    </property>
    <property>
        <name>fs.jfs.cache.oss-accessKeySecret</name>
        <value>xxx</value>
    </property>
    <property>
        <name>fs.jfs.cache.oss-endpoint</name>
        <value>oss-cn-xxx.aliyuncs.com</value>
    </property>
</configuration>
In addition, we recommend configuring the credential-free (password-free) feature, which avoids saving the AccessKey in plain text and improves security.
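Whether the AK comes from core-site.xml or from the credential-free feature, the AK options can then be dropped from the command line; a minimal sketch reusing the earlier paths:

hadoop jar jindo-distcp-2.7.3.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table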
