You can migrate data from a self-managed Kudu cluster in your on-premises environment
to an E-MapReduce (EMR) cluster in which the Kudu service is deployed. This topic
describes the migration procedure in detail.
Prerequisites
- A self-managed Kudu cluster is created.
- An EMR Hadoop cluster is created, and Kudu is selected from the optional services
when you create the cluster. For more information, see Create a cluster.
Background information
EMR Kudu supports Apache Kudu 1.10 and 1.11. You can use backup and restore tools
provided by Apache Kudu to migrate data. The following figure shows the data migration
process.
Procedure
- Run the following command to view the names of the Kudu tables to be migrated:
kudu table list {YourKuduMasterAddress}
Note: {YourKuduMasterAddress} specifies the internal IP addresses of the masters in the self-managed Kudu cluster. Separate multiple IP addresses with commas (,).
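For example, if the self-managed cluster has three masters, the command looks similar to the following. The IP addresses below are placeholders; replace them with the actual master addresses of your cluster.

```shell
# List all tables in the self-managed Kudu cluster.
# 192.168.0.1-3 are placeholder internal IP addresses of the Kudu masters.
kudu table list 192.168.0.1:7051,192.168.0.2:7051,192.168.0.3:7051
```

The command prints one table name per line; this is the list you pass to the backup tool in the next step.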
- Use the backup tool provided by Apache Kudu to back up tables in the self-managed
Kudu cluster.
Perform incremental or full backups based on your business requirements.
- Use Object Storage Service (OSS) as intermediate storage
spark-submit --class org.apache.kudu.backup.KuduBackup kudu-backup2_2.11-1.10.0.jar \
--kuduMasterAddresses master-1-host,master-2-host,master-3-host \
--rootPath oss://{your_BucketName}/kudu-backups {YourTableList}
Note: {your_BucketName} specifies the name of your OSS bucket. {YourTableList} specifies the list of tables that you want to back up.
- Use HDFS as intermediate storage
spark-submit --class org.apache.kudu.backup.KuduBackup kudu-backup2_2.11-1.10.0.jar \
--kuduMasterAddresses master-1-host,master-2-host,master-3-host \
--rootPath hdfs://{YourHDFSCluster}/kudu-backups {YourTableList}
Note: {YourHDFSCluster} specifies the address of your Hadoop cluster.
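By default, the first run of the Apache Kudu backup tool against a given --rootPath performs a full backup, and subsequent runs perform incremental backups. To force a full backup on a later run, the tool accepts a --forceFull flag. A sketch, with placeholder host and bucket names:

```shell
# Force a full backup instead of an incremental one.
# Host names and the bucket name are placeholders; replace them with your own.
spark-submit --class org.apache.kudu.backup.KuduBackup kudu-backup2_2.11-1.10.0.jar \
--kuduMasterAddresses master-1-host,master-2-host,master-3-host \
--forceFull \
--rootPath oss://{your_BucketName}/kudu-backups {YourTableList}
```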
- Check that the backup data has been written to the backup directory in the intermediate storage.
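For example, assuming HDFS is used as the intermediate storage, you can list the backup directory with a standard HDFS command. The path is the --rootPath that you specified in the backup command.

```shell
# List the per-table backup directories under the backup root path.
hadoop fs -ls hdfs://{YourHDFSCluster}/kudu-backups
```

If OSS is used instead, browse the corresponding path in your OSS bucket and confirm that a subdirectory exists for each backed-up table.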
- Run the restore command on the EMR Kudu cluster to import data from the intermediate
storage to the EMR Kudu cluster.
Sample code:
- OSS used as intermediate storage
spark-submit --class org.apache.kudu.backup.KuduRestore kudu-backup2_2.11-1.10.0.jar \
--kuduMasterAddresses master-1-host,master-2-host,master-3-host \
--rootPath oss://{your_BucketName}/kudu-backups {YourTableList}
- HDFS used as intermediate storage
spark-submit --class org.apache.kudu.backup.KuduRestore kudu-backup2_2.11-1.10.0.jar \
--kuduMasterAddresses master-1-host,master-2-host,master-3-host \
--rootPath hdfs://{YourHDFSCluster}/kudu-backups {YourTableList}
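If the EMR Kudu cluster already contains tables with the same names, the Apache Kudu restore tool supports a --tableSuffix option that restores each table under a new name. A sketch, with placeholder host names and the hypothetical suffix _restored:

```shell
# Restore tables with a "_restored" suffix to avoid name collisions.
# Host names, the bucket name, and the suffix are placeholders.
spark-submit --class org.apache.kudu.backup.KuduRestore kudu-backup2_2.11-1.10.0.jar \
--kuduMasterAddresses master-1-host,master-2-host,master-3-host \
--tableSuffix _restored \
--rootPath oss://{your_BucketName}/kudu-backups {YourTableList}
```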
- Use a compute engine to query the tables in the EMR Kudu cluster, and verify that the table names and table data are consistent with those in the self-managed Kudu cluster.
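One simple consistency check is to compare the table lists of the two clusters and spot-check row counts with a compute engine. A sketch, assuming the kudu CLI and impala-shell are available; {SelfManagedMasterAddress}, {EMRMasterAddress}, and {YourTableName} are placeholders:

```shell
# Compare the table lists of the self-managed and EMR Kudu clusters.
kudu table list {SelfManagedMasterAddress} | sort > old_tables.txt
kudu table list {EMRMasterAddress} | sort > new_tables.txt
diff old_tables.txt new_tables.txt

# Spot-check the row count of a migrated table with a compute engine such as Impala.
impala-shell -q "SELECT COUNT(*) FROM {YourTableName};"
```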