You can migrate data from a self-managed Kudu cluster that is deployed on-premises to an E-MapReduce (EMR) cluster in which the Kudu service is deployed. This topic describes the procedure in detail.

Prerequisites

  • A self-managed Kudu cluster is created.
  • An EMR Hadoop cluster is created, and Kudu is selected from the optional services when you create the cluster. For more information, see Create a cluster.

Background information

EMR Kudu supports Apache Kudu 1.10 and 1.11. You can use the backup and restore tools provided by Apache Kudu to migrate data. The following figure shows the data migration process.

(Figure: Kudu data migration process)

Procedure

  1. Run the following command to view the names of the Kudu tables that you want to migrate:
    kudu table list {YourKuduMasterAddress}
    Note {YourKuduMasterAddress} specifies the internal IP addresses of the Kudu masters in the self-managed Kudu cluster. Separate multiple IP addresses with commas (,).
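    For example, with hypothetical master IP addresses, the command and its output might look like the following. The table names are sample values:
    kudu table list 192.168.0.1,192.168.0.2,192.168.0.3
    my_kudu_table_1
    my_kudu_table_2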
  2. Use the backup tool provided by Apache Kudu to back up tables in the self-managed Kudu cluster.
    Perform incremental or full backups based on your business requirements. The first time the backup job runs for a table, it performs a full backup. Subsequent runs perform incremental backups by default.
    • Use Object Storage Service (OSS) as intermediate storage
      spark-submit --class org.apache.kudu.backup.KuduBackup kudu-backup2_2.11-1.10.0.jar \
        --kuduMasterAddresses master1-host,master-2-host,master-3-host \
        --rootPath oss://{your_BucketName}/kudu-backups {YourTableList}
      Note {your_BucketName} specifies the name of your OSS bucket. {YourTableList} specifies the list of tables that you want to back up. Separate multiple table names with spaces.
    • Use HDFS as intermediate storage
      spark-submit --class org.apache.kudu.backup.KuduBackup kudu-backup2_2.11-1.10.0.jar \
        --kuduMasterAddresses master1-host,master-2-host,master-3-host \
        --rootPath hdfs://{YourHDFSCluster}/kudu-backups {YourTableList}
      Note {YourHDFSCluster} specifies the NameNode address or nameservice of your HDFS cluster.
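    For example, the following command backs up two tables in a single run. The master IP addresses, bucket name, and table names are hypothetical values:
      spark-submit --class org.apache.kudu.backup.KuduBackup kudu-backup2_2.11-1.10.0.jar \
        --kuduMasterAddresses 192.168.0.1,192.168.0.2,192.168.0.3 \
        --rootPath oss://emr-kudu-bucket/kudu-backups my_kudu_table_1 my_kudu_table_2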
  3. Check that the backup data has been written to the Kudu backup root directory in the intermediate storage.
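    For example, if OSS is used as the intermediate storage and your cluster is configured to access OSS, you can list the backup root directory. The bucket name is a hypothetical value:
    hadoop fs -ls oss://emr-kudu-bucket/kudu-backups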
  4. Run the restore command on the EMR Kudu cluster to import data from the intermediate storage to the EMR Kudu cluster. In the following commands, set --kuduMasterAddresses to the master addresses of the EMR Kudu cluster.
    Sample code:
    • OSS used as intermediate storage
      spark-submit --class org.apache.kudu.backup.KuduRestore kudu-backup2_2.11-1.10.0.jar \
        --kuduMasterAddresses master1-host,master-2-host,master-3-host \
        --rootPath oss://{your_BucketName}/kudu-backups {YourTableList}
    • HDFS used as intermediate storage
      spark-submit --class org.apache.kudu.backup.KuduRestore kudu-backup2_2.11-1.10.0.jar \
        --kuduMasterAddresses master1-host,master-2-host,master-3-host \
        --rootPath hdfs://{YourHDFSCluster}/kudu-backups {YourTableList}
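    For example, the following command restores two tables from OSS. The EMR master hostnames, bucket name, and table names are hypothetical values:
      spark-submit --class org.apache.kudu.backup.KuduRestore kudu-backup2_2.11-1.10.0.jar \
        --kuduMasterAddresses emr-header-1,emr-header-2,emr-header-3 \
        --rootPath oss://emr-kudu-bucket/kudu-backups my_kudu_table_1 my_kudu_table_2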
  5. Use a compute engine, such as Spark, to query the EMR Kudu cluster and check whether the table names and table data are consistent with those in the self-managed Kudu cluster.
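    The following is a minimal verification sketch that assumes the kudu-spark2 integration is available on the EMR cluster. The master hostnames, port, and table name are hypothetical values:
    # List the tables on the EMR Kudu cluster and compare the list with the source cluster.
    kudu table list emr-header-1,emr-header-2,emr-header-3
    # Start a Spark shell with the Kudu-Spark integration.
    spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.10.0
    // In the Spark shell: read a restored table and count its rows, then compare the count with the source cluster.
    val df = spark.read.format("kudu")
      .options(Map("kudu.master" -> "emr-header-1:7051", "kudu.table" -> "my_kudu_table_1"))
      .load
    df.count()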