
Migrate data from a self-managed HDFS cluster to a Lindorm file database

Last Updated: Jul 09, 2021

This topic describes how to import data from an open source Hadoop Distributed File System (HDFS) cluster to a database that is powered by the ApsaraDB for Lindorm (Lindorm) file engine.

Background information

In some scenarios, you may need to import data from a self-managed Hadoop cluster to a Lindorm file database.

Scenarios

Import data from a self-managed Hadoop cluster that runs on an Elastic Compute Service (ECS) instance to a Lindorm file database.

Before you begin

  1. Activate the file engine for your Lindorm instance. For more information, see Activate the file engine service.

  2. Modify the Hadoop configuration. For more information, see Use an open source HDFS client to access the file engine.

  3. Check the connectivity between the self-managed Hadoop cluster and the Lindorm file database.

    Run the following command on the self-managed Hadoop cluster to test the connectivity to the Lindorm file database:

    hadoop fs -ls hdfs://${Instance ID}/

    Replace ${Instance ID} with your Lindorm instance ID. If the files in your Lindorm file database are returned, the Hadoop cluster is connected to the Lindorm file database. You can also verify write access, as sketched after this list.

  4. Prepare the migration tool.

    You can use the Apache Hadoop distributed copy (DistCp) tool to migrate full data or incremental data from a self-managed Hadoop cluster to a Lindorm file database. For more information about DistCp, see DistCp Guide.
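
In addition to the read test in step 3, you can optionally confirm that the cluster can write to the Lindorm file database before you start the migration. The following commands are a minimal sketch that creates and then deletes a hypothetical test directory; replace ${Instance ID} with your Lindorm instance ID.

hadoop fs -mkdir -p hdfs://${Instance ID}/tmp/connectivity_test
hadoop fs -rm -r hdfs://${Instance ID}/tmp/connectivity_test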

Migrate data from a Hadoop cluster

If the ECS instance on which the self-managed Hadoop cluster is deployed and the file database are in the same virtual private cloud (VPC), you can import data to the file database over the VPC. Run the following command to migrate data:

hadoop distcp -m 1000 -bandwidth 30 hdfs://oldcluster:8020/user/hive/warehouse hdfs://${Instance ID}/user/hive/warehouse

In the command, oldcluster specifies the IP address or domain name of a NameNode in the self-managed Hadoop cluster, and ${Instance ID} must be replaced with your Lindorm instance ID. The -m parameter specifies the maximum number of concurrent map tasks, and the -bandwidth parameter limits the bandwidth that each map task can use, in MB/s.
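
If new data is written to the self-managed Hadoop cluster while the full copy is running, you can run DistCp again with the -update option to copy only the files that are missing from, or that differ at, the destination. The following command is a sketch that reuses the placeholder paths from the command above; note that with -update, DistCp copies the contents of the source directory into the destination directory, so check the DistCp Guide for the exact semantics before you rely on it.

hadoop distcp -update -m 1000 -bandwidth 30 hdfs://oldcluster:8020/user/hive/warehouse hdfs://${Instance ID}/user/hive/warehouse

To check the result, you can compare the directory, file, and byte counts of the source and destination paths. The counts should match if no new data has been written to the source since the last copy.

hadoop fs -count hdfs://oldcluster:8020/user/hive/warehouse
hadoop fs -count hdfs://${Instance ID}/user/hive/warehouse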

Solutions to common problems

  • The total migration time depends on the amount of data in the self-managed Hadoop cluster and the transmission speed between the self-managed Hadoop cluster and the Lindorm file engine. If you need to migrate a large amount of data, we recommend that you first migrate a few selected directories to estimate the time that is required to migrate the full data. If you can migrate data only within specific time windows, you can split the source directory into smaller directories and migrate them one by one, for example with the loop sketched at the end of this topic.

  • We recommend that you do not write data to the self-managed Hadoop cluster while you migrate full data. To handle data that is generated during this period, you can configure your clients to write data to both the self-managed Hadoop cluster and the Lindorm file database, or modify your business client configuration to write data only to the Lindorm file database.
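
As noted in the first item above, one way to split a large migration is to copy each subdirectory of the source path in its own DistCp job. The following shell loop is a minimal sketch; replace ${Instance ID} with your Lindorm instance ID and adjust the paths, map count, and bandwidth to your environment.

# List the subdirectories of the source path (skipping the "Found N items" header line)
# and start one DistCp job per subdirectory.
for dir in $(hadoop fs -ls hdfs://oldcluster:8020/user/hive/warehouse | awk 'NR>1 {print $NF}'); do
  hadoop distcp -m 100 -bandwidth 30 ${dir} hdfs://${Instance ID}/user/hive/warehouse/$(basename ${dir})
done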