A data lake works like a big pool that centrally stores heterogeneous data of all types. On Alibaba Cloud, a data lake uses Object Storage Service (OSS) as its storage layer. Enterprises can quickly build their own data lake with Alibaba Cloud services and pay according to the capacity they use. The next important step is migrating heterogeneous data into the data lake.
In traditional big data practices, users often employ HDFS as the underlying storage for massive heterogeneous data. Most of this data can be moved into a data lake backed by OSS through offline data migration. Hadoop’s DistCp is the most common offline data migration and copying tool. However, it cannot fully utilize the features of OSS, which results in poor efficiency and consistency. Its relatively simple feature set also fails to meet users’ demands. An efficient and fully functional offline data migration tool is therefore crucial for improving the efficiency of data migration into a data lake.
JindoDistCp, an offline data migration tool for data lakes built on Alibaba Cloud’s JindoFS SDK, is available to all users. Developed by the Alibaba Cloud E-MapReduce (EMR) team, it copies files within and between large-scale clusters. JindoDistCp uses MapReduce to distribute files, handle errors, and recover data. It takes lists of files and directories as the input of map/reduce tasks, and each task copies a specific part of the files in the source list. JindoDistCp supports data copying among HDFS, S3, and OSS, and provides various customized copying parameters and strategies.
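To make the idea of "each task copies a specific part of the source list" concrete, here is a minimal shell sketch. It is only an illustration and not JindoDistCp's actual splitting logic: the function name `assign_files` and the round-robin rule are assumptions chosen for simplicity.

```shell
# Illustrative only: round-robin assignment of source files to copy tasks.
# assign_files is a hypothetical helper, not part of JindoDistCp.
# Usage: assign_files <task_count>
# Reads one file path per line on stdin and prints "task_index<TAB>path".
assign_files() {
  awk -v n="$1" '{ printf "%d\t%s\n", (NR - 1) % n, $0 }'
}
```

For example, `printf '%s\n' /a /b /c | assign_files 2` assigns `/a` and `/c` to task 0 and `/b` to task 1.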
When copying data from HDFS and S3 to OSS in a data lake, JindoDistCp implements No-Rename copying and ensures data consistency through a customized CopyCommitter. It covers all functions of S3DistCp and HadoopDistCp and offers better performance than HadoopDistCp. It aims to serve as an efficient, stable, and secure tool for offline data migration to a data lake. This article introduces how to use JindoDistCp for basic offline data migration and how to improve migration performance in different scenarios. JindoDistCp was previously available only within EMR services. It is now available to all users of Alibaba Cloud OSS and HDFS, with official maintenance and support. We welcome all users to integrate and use JindoDistCp.
HadoopDistCp is Hadoop’s distributed data migration tool. It provides basic file copying, overwrite copying, map parallelism settings, and a configurable log output path. Hadoop 2.x adds some optimizations, such as a selectable copying policy: the default policy, uniform size, balances the total file size across maps, and if dynamic is specified, DynamicInputFormat is applied. These features optimize data copying between HDFS clusters but do not improve data writing for object storage systems such as OSS.
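The idea of balancing file size across maps can be sketched in a few lines of shell. This is a greedy illustration under stated assumptions, not Hadoop's actual uniform size implementation: the helper name `balance_by_size` is hypothetical, and the input is assumed to be one `size path` pair per line, with no spaces in the paths.

```shell
# Illustrative only: greedy size balancing, NOT Hadoop's real input format code.
# balance_by_size is a hypothetical helper.
# Usage: balance_by_size <map_count>
# Reads "size path" lines on stdin and prints "map_index<TAB>path",
# always giving the next file to the map with the smallest load so far.
balance_by_size() {
  awk -v n="$1" '
    BEGIN { for (i = 0; i < n; i++) load[i] = 0 }
    {
      best = 0
      for (i = 1; i < n; i++) if (load[i] < load[best]) best = i
      load[best] += $1
      printf "%d\t%s\n", best, $2
    }'
}
```

Feeding it the larger files first tends to give a more even split, because the small files are then used to fill the remaining gaps.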
S3DistCp is an AWS extension of HadoopDistCp that is built to work with S3. Its practical features include incremental copying of data, specifying the compression format during copying, aggregating data based on patterns, and copying based on a file list.
JindoDistCp is an easy-to-use distributed file copying tool, mainly used in EMR clusters. It migrates data from HDFS and S3 to OSS. Compared with HadoopDistCp and S3DistCp, it offers more customized functions. It integrates deeply with OSS features and customizes the CopyCommitter to implement No-Rename copying, which reduces the time required for offline data migration to a data lake.
1) JindoDistCp migrates offline data efficiently. In test scenarios, it accelerates offline data migration by up to 1.59 times.
2) JindoDistCp provides a rich set of basic functions, multiple copying methods, and various scenario-based optimization strategies.
3) Through deep integration with OSS, JindoDistCp can store migrated files in the archive, low-frequency, and compressed modes without any additional operations.
4) No-Rename copying ensures data consistency.
5) JindoDistCp covers all kinds of scenarios and supports multiple Hadoop versions as a substitute for HadoopDistCp.
The EMR team ran a performance comparison test between JindoDistCp and HadoopDistCp. For the offline data migration scenario from HDFS to OSS, the team used Hadoop’s built-in benchmark, TestDFSIO, to generate three datasets of 1,000 files each, with file sizes of 10 MB, 500 MB, and 1 GB. These files were then copied from HDFS to OSS to compare performance.
JindoDistCp offers significantly better performance than HadoopDistCp, with a maximum acceleration of 1.59 times in the test scenario.
Download the latest JAR package from the GitHub repo.
Note: Because the underlying layer of the SDK uses native code, the JAR package works only on Linux and macOS. A new version that supports more operating systems will be available soon.
Users can supply the OSS AccessKey ID (AK), secret, and endpoint on the command line by specifying --ossKey, --ossSecret, and --ossEndPoint when running the program.
hadoop jar jindo-distcp-2.7.3.jar --src /data/incoming/hourly_table --dest oss://yang-hhht/hourly_table --ossKey yourkey --ossSecret yoursecret --ossEndPoint oss-cn-hangzhou.aliyuncs.com
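To keep credentials out of scripts and shell history, the command above can be assembled from environment variables. The following is a minimal sketch: the helper name `build_jindo_distcp_cmd` and the variable names `OSS_KEY`, `OSS_SECRET`, `OSS_ENDPOINT`, and `JINDO_DISTCP_JAR` are assumptions made for illustration. Only the flags shown in the command above are used. The function prints the command instead of running it, so it can be reviewed first.

```shell
# Illustrative only: build the JindoDistCp command line from environment variables.
# build_jindo_distcp_cmd and the variable names are hypothetical, not part of the tool.
# Usage: build_jindo_distcp_cmd <src_path> <dest_oss_path>
build_jindo_distcp_cmd() {
  # Refuse to build the command if any credential setting is missing.
  if [ -z "${OSS_KEY:-}" ] || [ -z "${OSS_SECRET:-}" ] || [ -z "${OSS_ENDPOINT:-}" ]; then
    echo "OSS_KEY, OSS_SECRET, and OSS_ENDPOINT must be set" >&2
    return 1
  fi
  # The jar name defaults to the one used in this article.
  printf 'hadoop jar %s --src %s --dest %s --ossKey %s --ossSecret %s --ossEndPoint %s\n' \
    "${JINDO_DISTCP_JAR:-jindo-distcp-2.7.3.jar}" \
    "$1" "$2" "$OSS_KEY" "$OSS_SECRET" "$OSS_ENDPOINT"
}
```

Note that the printed command contains the secret, so avoid writing its output to shared logs.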
Users can also pre-configure the AK, secret, and endpoint of OSS in Hadoop’s core-site.xml file to avoid passing them on every run.
<configuration>
    <property>
        <name>fs.jfs.cache.oss-accessKeyId</name>
        <value>xxx</value>
    </property>
    <property>
        <name>fs.jfs.cache.oss-accessKeySecret</name>
        <value>xxx</value>
    </property>
    <property>
        <name>fs.jfs.cache.oss-endpoint</name>
        <value>oss-cn-xxx.aliyuncs.com</value>
    </property>
</configuration>
Moreover, we recommend configuring the password-free feature so that users’ AK is not stored in plaintext, which improves data security.
JindoDistCp provides various other practical functions and parameters. For more information, see the JindoDistCp User Guide.