MaxCompute Migration Service (MMS) migrates data from various data sources to MaxCompute. It integrates with the MaxCompute Spark engine to simplify large-scale data migration from self-managed data sources, which reduces configuration complexity and operations and maintenance (O&M) costs.
Overview
Architecture
MaxCompute Migration Service (MMS) migrates both metadata and data.
Metadata migration: MMS retrieves metadata from the data source using metadata APIs, such as the Hive Metastore SDK and the Databricks SDK. It then generates MaxCompute Data Definition Language (DDL) statements and executes them in MaxCompute to complete the metadata migration.
Data migration: After the metadata is synchronized, MMS generates and submits one or more Spark jobs that run on MaxCompute based on the migration job configuration. These Spark jobs pull data from the data source and write it to the target tables in MaxCompute. This process is managed by the MMS service, which eliminates the need for Spark job development and O&M.
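The metadata migration step can be pictured as turning table metadata into a DDL statement. The following Python snippet is a minimal sketch of that idea, not the MMS implementation. The Column and TableMeta classes and the build_create_table_ddl function are hypothetical names, and the sketch assumes that source type names can be passed through unchanged, which a real migration would replace with an explicit type mapping.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    type: str  # source type name, for example "string" or "bigint"


@dataclass
class TableMeta:
    name: str
    columns: list[Column]
    partition_columns: list[Column] = field(default_factory=list)


def build_create_table_ddl(table: TableMeta) -> str:
    """Render a CREATE TABLE statement from table metadata.

    Assumption: type names are upper-cased and passed through as-is.
    """
    cols = ", ".join(f"{c.name} {c.type.upper()}" for c in table.columns)
    ddl = f"CREATE TABLE IF NOT EXISTS {table.name} ({cols})"
    if table.partition_columns:
        parts = ", ".join(
            f"{c.name} {c.type.upper()}" for c in table.partition_columns
        )
        ddl += f" PARTITIONED BY ({parts})"
    return ddl + ";"


if __name__ == "__main__":
    orders = TableMeta(
        name="orders",
        columns=[Column("id", "bigint"), Column("note", "string")],
        partition_columns=[Column("ds", "string")],
    )
    print(build_create_table_ddl(orders))
```

Keeping the metadata in a small data structure and rendering the DDL in one place makes it easy to inspect the statement before it is executed against the destination project.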
Migration flow
The following figure illustrates the workflow of MaxCompute Migration Service (MMS). The process includes the following core steps:
Load metadata: After you configure a data source, MMS connects to the external data source to read and load the metadata, such as table schemas and partition information. MMS then stores the metadata in its own database for later use.
Create a migration job: MMS supports three types of migration jobs: full database migration, partial table migration, and partial partition migration. Each migration job is split into multiple subtasks that run concurrently to migrate data.
Transfer data and metadata: Each concurrent subtask independently pulls data from the data source. It first creates the corresponding target table or partition in the destination project and then writes the data.
Verify data (Optional): After the data is migrated, MMS can perform a data verification step. It verifies data integrity by comparing the number of rows in the source and destination tables or partitions.
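The concurrent transfer step in the workflow above can be sketched in Python as follows. This is an illustration under stated assumptions, not how MMS is implemented: the destination is an in-memory dictionary, run_subtask and run_job are hypothetical names, and each subtask is assumed to write to its own target table or partition.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subtask(subtask: dict, destination: dict) -> dict:
    """Create the target if needed, then write the subtask's rows to it."""
    target = subtask["target"]
    destination.setdefault(target, [])           # stand-in for "create table or partition"
    destination[target].extend(subtask["rows"])  # stand-in for "pull and write data"
    return {"target": target, "rows_written": len(subtask["rows"])}


def run_job(subtasks: list[dict], max_workers: int = 4):
    """Run all subtasks concurrently and return the destination and per-subtask results."""
    destination: dict = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input subtasks.
        results = list(pool.map(lambda s: run_subtask(s, destination), subtasks))
    return destination, results


if __name__ == "__main__":
    job = [
        {"target": "orders/ds=20240101", "rows": [(1, "a"), (2, "b")]},
        {"target": "orders/ds=20240102", "rows": [(3, "c")]},
    ]
    dest, results = run_job(job)
    for r in results:
        print(r["target"], r["rows_written"])
```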
Glossary
Data source
The object to be migrated, such as one or more Hive databases. Different data sources have different data layers. MMS maps the data layers of different data sources to three layers: Database, Schema, and Table. The schema is a property of the table. The following table describes the data layers.
Data source    Data layer
Hive           Database.Table
MaxCompute     Project.Schema.Table or Project.Table

The following table lists the data retrieval APIs used by different data sources:
Data source type    Data retrieval API
MaxCompute          Storage API, SQL
BigQuery            Storage Read API
Hive                HDFS or S3
Databricks          Azure Blob Storage, Databricks JDBC
Migration job
A migration job defines the objects to be migrated, which can be a database, multiple tables, or multiple partitions.
Migration task
After you select the objects to migrate and submit the migration job, MMS splits the job into multiple independent migration tasks based on the configuration. A migration task is the actual unit of execution. The task types include Spark and SQL jobs. Each task can correspond to a non-partitioned table or multiple partitions of a partitioned table. The task execution process includes metadata migration, data migration, and data verification.
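The splitting of a job into tasks can be illustrated with a short Python sketch. The split_into_tasks function and the partitions_per_task parameter are hypothetical. The sketch assumes a fixed batch size, whereas MMS decides the actual split based on the job configuration.

```python
def split_into_tasks(
    tables: dict[str, list[str]], partitions_per_task: int = 2
) -> list[dict]:
    """Split a job into tasks.

    `tables` maps each table name to its list of partitions.
    A table with an empty partition list is treated as non-partitioned
    and becomes a single task; a partitioned table is split into
    batches of at most `partitions_per_task` partitions.
    """
    tasks = []
    for table, partitions in tables.items():
        if not partitions:
            tasks.append({"table": table, "partitions": []})
            continue
        for start in range(0, len(partitions), partitions_per_task):
            batch = partitions[start:start + partitions_per_task]
            tasks.append({"table": table, "partitions": batch})
    return tasks


if __name__ == "__main__":
    job = {
        "dim_user": [],
        "fact_orders": ["ds=20240101", "ds=20240102", "ds=20240103"],
    }
    for task in split_into_tasks(job):
        print(task)
```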
Data verification
After data migration is complete, MMS performs verification to ensure data consistency. The verification method involves executing SELECT COUNT(*) on both the source and destination to compare the number of rows in a table or partition. This verifies data integrity. The verification results are recorded in the task logs.
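The row-count comparison can be tried out locally with the following Python sketch. It uses two in-memory SQLite databases as stand-ins for the source and destination, which is an assumption made only so that the example runs anywhere. The count_rows and verify_counts functions are hypothetical names, and the table name and filter are interpolated directly into the SQL, so they must come from trusted input.

```python
import sqlite3
from typing import Optional


def count_rows(conn: sqlite3.Connection, table: str, where: Optional[str] = None) -> int:
    """Run SELECT COUNT(*) on a table, optionally restricted to one partition."""
    sql = f"SELECT COUNT(*) FROM {table}"
    if where:
        sql += f" WHERE {where}"
    return conn.execute(sql).fetchone()[0]


def verify_counts(source, destination, table: str, where: Optional[str] = None) -> bool:
    """Return True when the source and destination row counts match."""
    return count_rows(source, table, where) == count_rows(destination, table, where)


if __name__ == "__main__":
    source = sqlite3.connect(":memory:")
    destination = sqlite3.connect(":memory:")
    for conn in (source, destination):
        conn.execute("CREATE TABLE orders (id INTEGER, ds TEXT)")
    source.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, "20240101"), (2, "20240101"), (3, "20240102")],
    )
    # The destination is missing the row for ds = '20240102'.
    destination.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, "20240101"), (2, "20240101")],
    )
    print(verify_counts(source, destination, "orders", "ds = '20240101'"))
    print(verify_counts(source, destination, "orders"))
```

A matching count does not prove that the row contents are identical. It only confirms that no rows were dropped or duplicated in the compared table or partition.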
Migration steps
Step 1: Complete the prerequisites before you use MaxCompute Migration Service (MMS).
Step 2: Prepare the data source that you want to migrate. For more information about how to configure different data sources, see Manage the data source.
Step 3: Create and run a migration job.
Step 4: Use the migration monitoring feature in MMS to view the progress and speed of the migration job.
FAQ
What fees are incurred when I use MMS for data migration?
MaxCompute Migration Service (MMS) is free of charge. The following fees are incurred during data migration:
Computing resource fees at the destination: MMS submits Spark or SQL jobs in a MaxCompute project to perform data migration. These jobs consume MaxCompute computing units (CUs) and are billed based on MaxCompute billing standards. The supported billing methods are pay-as-you-go and subscription.
Network traffic fees: Data is transferred over network connections during migration, which incurs network traffic fees.
Data read fees at the source: During data migration, MMS calls the data retrieval APIs of the source to read data. This may incur data read fees from the source, based on its billing rules.
How do I choose between MMS and DataWorks Data Integration for data migration?
MMS: MMS is suitable for one-time, large-batch data migration.
DataWorks Data Integration: This service is suitable for scheduled and continuous data synchronization and integration. It also supports a wide range of data sources.