
E-MapReduce:Migrate data

Last Updated:Jul 18, 2025

E-MapReduce (EMR) allows you to migrate data from a self-managed on-premises Kudu cluster to an EMR cluster. This topic describes how to migrate data from such a cluster to an EMR Hadoop cluster on which the Kudu service is deployed.

Prerequisites

  • A self-managed Kudu cluster is created.

  • An EMR Hadoop cluster that contains the Kudu service is created. For more information, see Create a cluster.

Background information

EMR Kudu is compatible with Apache Kudu 1.10 and 1.11. You can use the backup and restore tools provided by Apache Kudu to migrate data. The following figure shows the data migration process.

Procedure

  1. Run the following command to view the names of the Kudu tables to be migrated:

    kudu table list {YourKuduMasterAddress}
    Note

    {YourKuduMasterAddress} specifies the internal IP addresses of the master nodes of the self-managed Kudu cluster. Separate multiple addresses with commas (,).
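    The following is a minimal sketch of this step. The IP addresses are placeholders, and port 7051 is the default Kudu master RPC port; replace them with the actual internal addresses of your self-managed cluster.

    ```shell
    # Placeholder master addresses of the self-managed Kudu cluster;
    # 7051 is the default Kudu master RPC port.
    KUDU_MASTERS="192.168.0.1:7051,192.168.0.2:7051,192.168.0.3:7051"

    # Print the name of every table on the source cluster, one per line.
    kudu table list "$KUDU_MASTERS"
    ```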

  2. Use the backup tool provided by Apache Kudu to back up tables in the self-managed Kudu cluster.

    The backup tool automatically performs an incremental or a full backup based on existing backup information.

    Note

    The first time you back up specified tables in a self-managed Kudu cluster, all data of the tables is backed up. Subsequent backups copy only the incremental data of the tables in the self-managed Kudu cluster, which saves storage space and reduces backup time.

    In the following commands, --kuduMasterAddresses specifies the internal IP addresses of the master nodes of the self-managed Kudu cluster.

    • Use Object Storage Service (OSS) as intermediate storage

      spark-submit --class org.apache.kudu.backup.KuduBackup kudu-backup2_2.11-1.10.0.jar \
        --kuduMasterAddresses master-1-host,master-2-host,master-3-host \
        --rootPath oss://{your_BucketName}/kudu-backups {YourTableList}
      Note

      {your_BucketName} specifies the name of your OSS bucket. {YourTableList} specifies the tables that you want to back up.

    • Use Hadoop Distributed File System (HDFS) as intermediate storage

      spark-submit --class org.apache.kudu.backup.KuduBackup kudu-backup2_2.11-1.10.0.jar \
        --kuduMasterAddresses master-1-host,master-2-host,master-3-host \
        --rootPath hdfs://{YourHDFSCluster}/kudu-backups {YourTableList}
      Note

      {YourHDFSCluster} specifies the address of your Hadoop cluster.
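    If you need to override the automatic incremental behavior described above, the Apache Kudu backup tool accepts a forceFull option. The command below is a sketch based on the Apache Kudu 1.10 backup tool; verify that the option is available in your version before relying on it. All other values are the same placeholders as in the commands above.

    ```shell
    # Force a full backup even when earlier backups exist under the root path.
    # --forceFull is an Apache Kudu backup tool option; confirm it in your version.
    spark-submit --class org.apache.kudu.backup.KuduBackup kudu-backup2_2.11-1.10.0.jar \
      --kuduMasterAddresses master-1-host,master-2-host,master-3-host \
      --forceFull true \
      --rootPath hdfs://{YourHDFSCluster}/kudu-backups {YourTableList}
    ```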

  3. Check data in the Kudu backup directory.
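    You can perform this check with the Hadoop CLI on the EMR cluster. The bucket and cluster names below are placeholders; the backup tool writes the data for each table under the root path you specified in the previous step.

    ```shell
    # List the backup root path; you should see entries for each backed-up table.
    # OSS as intermediate storage (bucket name is a placeholder):
    hadoop fs -ls oss://your-bucket-name/kudu-backups
    # HDFS as intermediate storage (cluster address is a placeholder):
    hadoop fs -ls hdfs://your-hdfs-cluster/kudu-backups
    ```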

  4. Run the restore command on the EMR cluster to import the backup data into the EMR Kudu cluster.

    Sample code:

    • Use OSS as intermediate storage

      spark-submit --class org.apache.kudu.backup.KuduRestore kudu-backup2_2.11-1.10.0.jar \
        --kuduMasterAddresses master-1-host,master-2-host,master-3-host \
        --rootPath oss://{your_BucketName}/kudu-backups {YourTableList}
      Note

      In the preceding command, --kuduMasterAddresses specifies the internal IP addresses of the master nodes of the EMR Kudu cluster into which you restore the data.

    • Use HDFS as intermediate storage

      spark-submit --class org.apache.kudu.backup.KuduRestore kudu-backup2_2.11-1.10.0.jar \
        --kuduMasterAddresses master-1-host,master-2-host,master-3-host \
        --rootPath hdfs://{YourHDFSCluster}/kudu-backups {YourTableList}
  5. Check that the table names and table data in the EMR Kudu cluster are consistent with those in the self-managed Kudu cluster. You can query the tables by using a compute engine to verify the data.
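    One way to sketch this verification is to compare the table lists of the two clusters with the kudu CLI. The master addresses below are placeholders; for the data itself, spot-check row counts through your compute engine.

    ```shell
    # Placeholders: replace with the master addresses of each cluster.
    SRC_MASTERS="src-master-1:7051,src-master-2:7051,src-master-3:7051"
    DST_MASTERS="emr-master-1:7051,emr-master-2:7051,emr-master-3:7051"

    # Compare table names between the source and destination clusters.
    kudu table list "$SRC_MASTERS" | sort > /tmp/src_tables.txt
    kudu table list "$DST_MASTERS" | sort > /tmp/dst_tables.txt
    diff /tmp/src_tables.txt /tmp/dst_tables.txt && echo "Table lists match"
    ```

    For the table data, run the same aggregate query (for example, a COUNT(*) per table) against both clusters with your compute engine and compare the results.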