Hadoop DistCp (distributed copy) is a tool for replicating data between large clusters or within a cluster. Hadoop DistCp uses MapReduce to distribute the copy work, handle and recover from errors, and report results during replication. This topic describes the differences between Hadoop DistCp and Jindo DistCp, how to use Hadoop DistCp, and frequently asked questions (FAQ) about Hadoop DistCp.
Differences between Hadoop DistCp and Jindo DistCp
| Type | Description | Scenario |
|---|---|---|
| Hadoop DistCp | The built-in DistCp tool in open source Hadoop. The tool is used to replicate data between large clusters or within clusters. | The tool is applicable to scenarios in which data is replicated between Hadoop Distributed File System (HDFS) clusters. |
| Jindo DistCp | The data migration tool of JindoFS. The tool supports Object Storage Service (OSS), OSS-HDFS, and data sources that are compatible with the API operations of Amazon Simple Storage Service (Amazon S3). | |
How to use Hadoop DistCp
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
For more information, see DistCp Guide.
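The basic command above can be extended with tuning options. The following is a minimal sketch, not a recommended configuration: it assembles a command that uses the DistCp -m option (maximum number of map tasks) and -bandwidth option (MB per second per map), and only prints the result. The function name build_distcp_cmd and the fallback values 20 and 100 are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assemble a DistCp command line without running it.
# $1 = source URI, $2 = target URI,
# $3 = maximum number of map tasks (-m), $4 = bandwidth per map in MB/s (-bandwidth).
build_distcp_cmd() {
  local src="$1" dst="$2" maps="${3:-20}" bandwidth="${4:-100}"
  echo "hadoop distcp -m ${maps} -bandwidth ${bandwidth} ${src} ${dst}"
}

# Print the command for review before running it on a cluster.
build_distcp_cmd hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo 10 50
```

Printing the command first makes it possible to review the paths and limits before a large copy job is submitted.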
FAQ
How do I migrate data between HDFS clusters by using Hadoop DistCp?
Before you use Hadoop DistCp to migrate data between HDFS clusters, make sure that the network connection between the clusters is established. For more information, see E-MapReduce data migration solution.
What do I do if an error message similar to "ACLs not supported on at least one file system" is returned?
Error message: org.apache.hadoop.tools.CopyListing$AclsNotSupportedException: ACLs not supported for file system: hdfs://xx.xx.xx.xx:8020
- Check whether access control lists (ACLs) to be synchronized exist in the source cluster.
If ACLs to be synchronized exist, include the a flag in the -p option of the distcp command (for example, -pa) so that the ACLs are preserved. If the returned message indicates that a specific cluster does not support ACLs, ACL support is not enabled for that cluster. If ACL support is not enabled for the destination cluster, you can modify the configurations and restart the NameNode. If the returned message indicates that the source cluster does not support ACLs, ACL support is not enabled for the source cluster, so no ACLs need to be synchronized. In this case, you only need to remove the a flag from the -p option.
- Check whether the values of the dfs.permissions.enabled and dfs.namenode.acls.enabled parameters of the source cluster are the same as those of the destination cluster.
If the values of the parameters of the source and destination clusters are different, change the values of the parameters to be the same for the source and destination clusters, or do not synchronize ACLs.
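The checks above can be sketched as a small script. This is an illustration under stated assumptions: the helper name choose_preserve_opt is hypothetical, and the two input values are assumed to be the dfs.namenode.acls.enabled settings of the source and destination clusters. The helper adds the ACL flag (a) to the -p option only when both clusters have ACL support enabled, and prints the resulting command instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick the DistCp preserve option from the ACL settings.
# $1 and $2 are the dfs.namenode.acls.enabled values of the source and
# destination clusters ("true" or "false").
choose_preserve_opt() {
  local src_acls="$1" dst_acls="$2"
  if [ "$src_acls" = "true" ] && [ "$dst_acls" = "true" ]; then
    echo "-pugpa"   # preserve user, group, permission, and ACLs
  else
    echo "-pugp"    # preserve the same attributes without ACLs
  fi
}

# Example with hard-coded values. On a real cluster, each value could be read
# on the corresponding side with:
#   hdfs getconf -confKey dfs.namenode.acls.enabled
opt="$(choose_preserve_opt true false)"
echo "hadoop distcp ${opt} hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo"
```

Dropping the a flag when either side lacks ACL support avoids the AclsNotSupportedException at the cost of not copying ACLs.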
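What do I do if an out of memory (OOM) error occurs when I run DistCp?
Increase the maximum heap size of the Hadoop client process before you run the command. The following example sets the maximum heap size to 1,024 MB:
export HADOOP_CLIENT_OPTS="-Xmx1024m"
hadoop distcp /source /target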