Use the Apache Hadoop distributed copy (DistCp) tool to migrate full or incremental data from a self-managed Hadoop Distributed File System (HDFS) cluster to LindormDFS (LDPS). For more information about DistCp, see the DistCp Guide.
Prerequisites
Before you begin, make sure you have:
LindormDFS activated for your Lindorm instance. See Activate LindormDFS.
The Hadoop configuration updated to point to LindormDFS. See Use open source HDFS clients to connect to and use LindormDFS.
The DistCp tool prepared and available on your self-managed Hadoop cluster.
Verify connectivity
Run the following command on the self-managed Hadoop cluster to verify that it can reach LindormDFS:
hadoop fs -ls hdfs://<instance-id>/
Replace <instance-id> with your Lindorm instance ID. If the command lists the files in LindormDFS, the cluster is connected and you can proceed with the migration.
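If the connectivity check is part of a larger migration script, it can be wrapped so that the script stops early when LindormDFS is unreachable. The following is a minimal sketch, not an official tool: the function name check_lindormdfs and the HADOOP_CMD variable are illustrative assumptions, and the only real command used is hadoop fs -ls.

```shell
# Sketch: report whether the LindormDFS root directory can be listed.
# HADOOP_CMD and check_lindormdfs are hypothetical names used for illustration.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

check_lindormdfs() {
  # $1: Lindorm instance ID (the <instance-id> placeholder in the command above)
  if "$HADOOP_CMD" fs -ls "hdfs://$1/" >/dev/null; then
    echo "connected"
  else
    echo "unreachable"
    return 1
  fi
}
```

Calling check_lindormdfs with your instance ID before starting DistCp lets the script exit on a non-zero status instead of launching a copy job that cannot reach the destination.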
Migrate data to LindormDFS
If the self-managed Hadoop cluster runs on Elastic Compute Service (ECS) instances in the same virtual private cloud (VPC) as LindormDFS, you can migrate data to LindormDFS over the VPC. Run the following DistCp command to copy data:
hadoop distcp -m 1000 -bandwidth 30 hdfs://oldcluster:8020/user/hive/warehouse hdfs://<instance-id>/user/hive/warehouse

| Parameter | Description |
|---|---|
| -m 1000 | Maximum number of parallel Map tasks. Increase this value to speed up migration on large clusters; decrease it to reduce load on the source cluster. |
| -bandwidth 30 | Bandwidth limit per Map task, in MB/s. |
| hdfs://oldcluster:8020/... | Source path. Replace oldcluster with the IP address or domain name of a NameNode in the self-managed Hadoop cluster. |
| hdfs://<instance-id>/... | Destination path. Replace <instance-id> with your Lindorm instance ID. |
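The command above performs a full copy. For an incremental pass, DistCp provides the -update option, which copies only files that are missing at the destination or differ from the source. The sketch below assumes the same -m and -bandwidth values as the full copy; the function name incremental_copy and the HADOOP_CMD variable are illustrative, not part of Hadoop.

```shell
# Sketch: incremental DistCp pass using the -update option.
# HADOOP_CMD and incremental_copy are hypothetical names used for illustration.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

incremental_copy() {
  # $1: source path, $2: destination path
  # -update skips files that already exist unchanged at the destination.
  "$HADOOP_CMD" distcp -update -m 1000 -bandwidth 30 "$1" "$2"
}
```

For example, incremental_copy hdfs://oldcluster:8020/user/hive/warehouse "hdfs://<instance-id>/user/hive/warehouse" would synchronize the contents of the source directory into the destination directory. Note that with -update, DistCp copies the contents of the source directory rather than the directory itself, so the destination path should name the matching directory.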
FAQ
How do I estimate migration time for large datasets?
Migration time depends on the total data size and the network throughput between the self-managed cluster and LindormDFS. Migrate a few representative directories first, measure the time, and extrapolate to estimate the full duration.
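The extrapolation described above is a simple proportion: estimated time = sample time × (total size / sample size). The following sketch shows the arithmetic only. The function name estimate_seconds is an illustrative assumption, and the estimate is linear, so it will be off if the sample directories differ from the rest of the data (for example, in the proportion of small files).

```shell
# Sketch: linear extrapolation of migration time from a sample run.
# estimate_seconds is a hypothetical helper name used for illustration.
estimate_seconds() {
  # $1: bytes copied in the sample run
  # $2: seconds the sample run took
  # $3: total bytes to migrate
  awk -v sb="$1" -v ss="$2" -v tb="$3" \
    'BEGIN { printf "%.0f\n", tb / sb * ss }'
}
```

The byte counts can be taken from the first column of hadoop fs -du -s <path>. For instance, if a sample of 100 bytes takes 60 seconds, estimate_seconds 100 60 1000 prints 600 for a total of 1000 bytes.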
If you can only migrate during specific maintenance windows, split the source directory into smaller subdirectories and migrate them in sequence across multiple windows.
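Migrating subdirectories in sequence can be sketched as a loop that runs one DistCp job per subdirectory and stops at the first failure. This is an illustration under stated assumptions: migrate_subdirs and HADOOP_CMD are hypothetical names, the subdirectory list is supplied by you, and -update is added so that rerunning the loop in a later window skips files that were already copied.

```shell
# Sketch: migrate a list of subdirectories one after another.
# HADOOP_CMD and migrate_subdirs are hypothetical names used for illustration.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

migrate_subdirs() {
  # $1: source root, $2: destination root, remaining arguments: subdirectory names
  src_root="$1"
  dst_root="$2"
  shift 2
  for sub in "$@"; do
    # One DistCp job per subdirectory; abort the sequence if a job fails.
    "$HADOOP_CMD" distcp -update -m 1000 -bandwidth 30 \
      "${src_root}/${sub}" "${dst_root}/${sub}" || return 1
  done
}
```

A call such as migrate_subdirs hdfs://oldcluster:8020/user/hive/warehouse "hdfs://<instance-id>/user/hive/warehouse" db1 db2 would copy db1 and then db2, where db1 and db2 stand in for your own subdirectory names. Passing a different subset of names in each maintenance window spreads the migration across windows.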
How do I handle client writes during full migration?
Stop all client writes to the self-managed cluster before starting a full migration. If stopping writes is not feasible, configure clients to write simultaneously to both the self-managed cluster and LindormDFS during the migration period. Once migration completes, update the client configuration to write only to LindormDFS.