MaxCompute provides MaxCompute Migration Assist 4.0 (MMA 4.0), a data migration tool that enables you to migrate data from various sources, such as Hive and BigQuery, to MaxCompute. This topic describes how to use MMA 4.0 for data migration.
Feature introduction
MMA is a tool used to migrate data to MaxCompute. The supported data sources include:
Hive
Databricks
BigQuery
The following figure shows the process for migrating data to MaxCompute by using MMA.
Migration principle
Metadata migration:
MMA retrieves metadata from the data source by calling metadata APIs through the BigQuery SDK, Hive Metastore SDK, and Databricks SDK.
MMA creates MaxCompute DDL statements based on the retrieved metadata and executes these DDL statements in MaxCompute to complete the metadata migration.
Data migration:
Method 1: Pull mode.
In pull mode, MMA specifies the objects to be migrated through Spark jobs. These jobs run on MaxCompute, read data from the data source, and write the data to MaxCompute. The following table shows the methods for reading data from different data sources:
Data source | Data reading method |
BigQuery | Read API |
Hive on AWS S3 | S3 |
Hive on HDFS | HDFS |
Method 2: Push mode.
In push mode, MMA submits SQL and Spark jobs to the data source. These jobs run on the data source, read data, and write the data to MaxCompute. The following table shows the methods for reading data from different data sources:
Data source | Data reading method |
Hive | Runs a UDTF and writes data to MaxCompute through Tunnel commands and the Tunnel SDK. |
Databricks | Runs Databricks Spark and writes data to MaxCompute through the Storage API. |
MMA terminology
The following diagram shows the MMA architecture.
Data sources
The objects to be migrated include databases in Hive, projects in BigQuery, and catalogs in Databricks. Different data sources have different data layers. MMA maps the data layers of these data sources to databases, schemas, and tables. The schema is treated as a property of a table.
Data source | Data layer |
Hive | Database and table |
BigQuery | Project, dataset, and table |
Databricks | Catalog, schema, and table |
Tasks and subtasks
MMA migration objects can be a database, multiple tables, or multiple partitions. After you select the objects, MMA generates tasks and subtasks. These tasks contain migration operation configurations, such as the objects and methods. Tasks are divided into subtasks, which are the actual execution units. A subtask corresponds to either a non-partitioned table or multiple partitions of a partitioned table. The subtask execution process includes metadata migration, data migration, and data verification.
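The task/subtask split described above can be sketched as a pair of data structures. This is a hypothetical illustration; the class and field names are assumptions, not MMA's actual internal model:

```python
# Hypothetical sketch of the task/subtask split: a task carries the
# migration configuration, and each subtask covers either one
# non-partitioned table or a batch of partitions of one partitioned table.
from dataclasses import dataclass, field

@dataclass
class Subtask:
    table: str
    # Empty for a non-partitioned table; otherwise the partitions
    # this subtask migrates.
    partitions: list[str] = field(default_factory=list)

@dataclass
class Task:
    name: str
    source: str
    target_project: str
    subtasks: list[Subtask] = field(default_factory=list)

task = Task(
    name="demo",
    source="bigquery",
    target_project="my_mc_project",
    subtasks=[
        Subtask("t_plain"),
        Subtask("t_part", ["p=20240101", "p=20240102"]),
    ],
)
```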
Data verification
After migrating data to MaxCompute, MMA performs verification by executing the SELECT COUNT(*) statement on both the source and target to obtain the row count of the object (such as a table or partition) and comparing the results.
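The verification step can be sketched as follows. This is a minimal illustration; the query-runner callables stand in for the real source and MaxCompute clients, which are not shown:

```python
# Minimal sketch of MMA-style row-count verification: run the same
# COUNT(*) query on the source and the target and compare the results.
# The callables are stand-ins for real query clients (an assumption,
# not MMA's actual API).
from typing import Callable

def verify_row_count(obj: str,
                     run_source_query: Callable[[str], int],
                     run_target_query: Callable[[str], int]) -> bool:
    """Return True if the source and target row counts match for `obj`
    (a table or a single partition)."""
    query = f"SELECT COUNT(*) FROM {obj}"
    return run_source_query(query) == run_target_query(query)

# Usage with fake runners standing in for the two engines:
ok = verify_row_count("sales.orders",
                      run_source_query=lambda q: 1_000,
                      run_target_query=lambda q: 1_000)
```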
Prerequisites
A MaxCompute project is created. For more information, see Create a MaxCompute project.
The corresponding user is granted the following operation permissions for the target MaxCompute project:
Object | Permission |
Project | List, CreateTable, CreateInstance |
Table | Describe, Select, Alter, Update, Drop |
Instance | Read, Write |
For specific authorization methods, see Authorization methods. You can also assign the admin role to the Alibaba Cloud account. The process is as follows:
Log on to the MaxCompute console. In the upper-left corner of the console, select a region.
Click Manage in the Actions column of the target project to access the Project Settings page.
Switch to the Role Permissions tab, click Manage Members in the Actions column of the admin role, and assign the admin role to the corresponding user.
The following operating environments are prepared:
Operating system:
A Linux operating system is required.
System configuration: For fewer than 5 million table partitions, we recommend a configuration with 8 cores and 16 GB of memory. For more than 5 million table partitions, a higher configuration is required.
JDK version:
Java 8 or Java 11.
MySQL version:
V5.7 or later.
Limits
The limits for BigQuery migration are as follows:
Data types
BIGNUMERIC, JSON, INTERVAL, GEOGRAPHY, and RANGE types are not supported.
During migration, the TIME type is converted to STRING type.
Table schema
The length of table and column names must be less than or equal to 128 characters. Names can only include uppercase and lowercase letters, numbers, and underscores.
The maximum number of columns in a table is 1200.
Installation and configuration
Access MMA_Releases and select the mma-<3.x.x>-<datasource>.jar file for your data source, such as mma-3.1.0-bigquery.jar.
Download and unzip the lib.zip file.
Create a config.ini file and add the following parameter configurations. An example of the config.ini file is as follows:
```ini
[mysql]
host = mysql-host
port = 3306
; can be any database name
db = mmav3
username = user
password = pass

[mma]
listening_port = 6060
```

Parameter descriptions:
Category | Parameter name | Description |
mysql | host | The IP address for logging on to MySQL. |
mysql | port | The port of the MySQL server. Default value: 3306. |
mysql | db | The name of the MySQL database. |
mysql | username | The username for logging on to the MySQL database. |
mysql | password | The password for logging on to the MySQL database. |
mma | listening_port | The listening port of the MMA server. Default value: 6060. |
Organize the file directory as follows:

```
/path/to/mma
├── mma-3.1.0-<datasource>.jar       # MMA main program
├── config.ini                       # MMA configuration file
└── lib/
    └── spark/                       # Spark migration task dependency directory
        ├── log/                     # Automatically generated at runtime; Spark submission logs
        ├── spark-3.1.1-odps0.34.1/  # MaxCompute Spark 3.1.1 dependency
        ├── bigquery-spark-job.jar   # Spark migration task
        └── mma-spark-client.jar     # Spark submission program
```
Run the program
You can start, access, and stop the MMA program by using commands.
When running the program, you need to replace <datasource> in the command with the corresponding data source type.
Start MMA
We recommend that you use nohup to start the program. The command is as follows:

```shell
nohup java -jar mma-3.1.0-<datasource>.jar -c config.ini > nohup.log 2>&1 &
```

Stop MMA
You can use the following command to stop the MMA program. Running migration tasks are interrupted when you stop the MMA program. The interrupted tasks are re-executed after you restart MMA, without affecting the final results.

```shell
ps aux | grep mma-3.1.0-<datasource>.jar | grep -v grep | awk '{print $2}' | xargs kill -9
```

Access MMA
After you start the MMA program, you can access it by entering http://ip_of_the_service:6060 in your browser, where 6060 is the port configured in config.ini. After MMA opens in the browser, configure the following parameters on the MaxCompute Config page:
MMA does not provide access authentication. Therefore, you need to restrict access to MMA through network policies. For example, if MMA runs on Alibaba Cloud Elastic Compute Service (ECS), you need to control access by using the ECS security group.
Configuration item | Configuration description |
mc.endpoint | The endpoint for MMA to access MaxCompute. The server where MMA is located must be able to connect to the endpoint specified by mc.endpoint. For specific endpoint information, see Endpoints. |
mc.data.endpoint | Optional. Used for UDTF to access MaxCompute when migrating data through Hive UDTF. The nodes of the Hive cluster must be able to connect to this address. |
mc.tunnel.endpoint | Optional. Used for UDTF to access MaxCompute when migrating data through Hive UDTF. The nodes of the Hive cluster must be able to connect to this address. |
mc.auth.access.id | The AccessKey ID for accessing MaxCompute. You can obtain it from the AccessKey management page. |
mc.auth.access.key | The AccessKey secret for accessing MaxCompute. |
mc.default.project | The default MaxCompute project used by MMA to execute SQL commands. |
mc.tunnel.quota | Optional. Typically not required. |
mc.projects | A comma-separated list of MaxCompute project names to which data is migrated. |
task.max.num | The maximum number of concurrent MMA tasks for data migration. This parameter is crucial for adjusting migration speed. |
auth.type | Optional. Used only for migrating Hive data through Hive UDTF. |
auth.ak.hdfs.path | Optional. |
spark.dependency.root | The root directory for Spark dependencies. |
Migration operations for different data sources
Only BigQuery data migration is supported.
Migrate BigQuery
Add data source
On the Datasources page, click Add Datasource on the right to enter the Add Datasource page.
Set Datasource Type to BIGQUERY, and click Next.
Configure the data source according to the following parameter descriptions.
Parameter name
Description
Name
A custom name for the data source. The name can contain only letters, digits, and Chinese characters.
BigQuery service account key file json
Create a service account in the BigQuery IAM console and download the authentication JSON. For details, see Service account overview.
BigQuery project name
The project name authorized by the service account.
Change Range Partition Table as: 1.Cluster(default), 2.Partition
The migration method for BigQuery range partition tables. Only migration to MaxCompute partitioned tables is supported.
The default value is Partition.
Keep partition of time-unit column-partitioned table as normal column
For BigQuery time-unit column-partitioned tables, this item specifies whether to retain the partition column as a normal column when saving to MaxCompute.
The default value is true (to retain).
Keep partition of ingestion-time partitioned table as normal column
For BigQuery ingestion-time partitioned tables, this item specifies whether to retain the pseudo column as a normal column when saving to MaxCompute.
The default value is false (not to retain).
MaxCompute Spark NetworkLink Name, format is region:vpcId
The network connection name. For details, see the "Access over a VPC (dedicated connection)" section in Network connection process.
The configuration format is <regionId>:<vpcId>. For example, if the VPC ID is vpc-uf68aeg673p0k******** and the region is Shanghai, configure this as cn-shanghai:vpc-uf68aeg673p0k********. For the correspondence between regions and region IDs, see Correspondence between regions and region IDs.

Meta update timer
Configure the scheduled pull of data source metadata. Two options:
Daily: The timer runs once a day at a specified minute within 24 hours.
Hourly: The timer runs once an hour at a specified minute within 60 minutes.
Api concurrent access number of meta
The concurrency for accessing the data source metadata API. The default value is 10.
Database whitelist
The BigQuery databases that need to be migrated. Multiple values are separated by commas.
Database blacklist
The BigQuery databases that do not need to be migrated. Multiple values are separated by commas.
Table blacklist
The BigQuery database tables that do not need to be migrated. The format for a single table is dbname.tablename. Multiple tables are separated by commas.

Table whitelist

The BigQuery database tables that need to be migrated. The format for a single table is dbname.tablename. Multiple tables are separated by commas.
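How a dbname.tablename whitelist and blacklist select tables can be illustrated with a small sketch. The semantics assumed here (the blacklist always excludes a table, and a non-empty whitelist must contain it) are an illustration, not MMA's documented precedence rules:

```python
# Hypothetical sketch of dbname.tablename whitelist/blacklist filtering.
# Assumed semantics: the blacklist always excludes a table; if the
# whitelist is non-empty, it must contain the table.
def should_migrate(db: str, table: str,
                   whitelist: set[str], blacklist: set[str]) -> bool:
    name = f"{db}.{table}"
    if name in blacklist:
        return False
    return not whitelist or name in whitelist

# With a whitelist of {"sales.orders"}, only that table is selected.
selected = should_migrate("sales", "orders", {"sales.orders"}, set())
```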
Update data sources
If the data source metadata changes, you need to manually update the data source by clicking Update in the Operation column of the target data source on the Datasources page in MMA.
Modify the data source configuration
You can modify the data source configuration:
On the Datasources page in MMA, find the target data source and click its name to access the details page.
On the Configuration tab, modify relevant configurations of the data source.
Create a migration task
Migrate multiple tables
On the Datasources page in MMA, find the target data source and click its name to access the details page.
On the Data tab, find the target database name and click its name to access the details page.
On the Tables tab, select the tables to be migrated on the table list, and then click New Migration.
In the New Migration Task dialog, configure the parameters and click OK.
Parameter name
Description
Task name
The task name.
Type
The method of migration, which is through the BigQuery Storage Read API.
MC project
The target MaxCompute project for data migration.
MC Schema
Optional. The schema within the target MaxCompute project. For schema details, see Schema-related operations.
Table list
The list of table names to be migrated. Multiple tables are separated by commas.
Only new partition
If enabled, the task will skip partitions that have already been successfully migrated.
Timer
Configure a timer for the task. Two options:
Daily: The timer runs once a day at a specified minute within 24 hours.
Hourly: The timer runs once an hour at a specified minute within 60 minutes.
Only schema
Only create the corresponding tables and partitions in MaxCompute without migrating the data.
Enable verification
MMA performs a row count verification by executing select count(*) on the source and target partitions to ensure the success of the data migration.

Merge partitions
Typically not needed.
Partition filter
For details, see Partition filtering expression description.
Table name mapping
The new name for a table after migration to the target MaxCompute project.
Column name mapping
The new name for a table column after migration to MaxCompute.
Table name mapping rule
Set a name mapping rule by using this configuration. The format is prefix${table}suffix, and both prefix and suffix can be empty. For example, if the value is pre_${table}_1 and the source table name is test, the table name in MaxCompute is pre_test_1 after migration.

In the top navigation bar, choose Migration Tasks > Task List to manage the created migration tasks. For more details, see Migration task management.
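The prefix${table}suffix mapping rule can be sketched as a simple substitution (illustrative only; the function name is an assumption, not MMA's API):

```python
# Sketch of the prefix${table}suffix mapping rule: `${table}` is
# replaced by the source table name; prefix and suffix may be empty.
def apply_table_mapping(rule: str, table: str) -> str:
    if "${table}" not in rule:
        raise ValueError("mapping rule must contain ${table}")
    return rule.replace("${table}", table)

# The example from the text: pre_${table}_1 applied to table `test`.
mapped = apply_table_mapping("pre_${table}_1", "test")  # "pre_test_1"
```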
Migrate a single database
On the Datasources page in MMA, find the target data source and click its name to access the details page.
On the Data tab, find the target database name and click Migrate in the Operation column.
In the New Migration Task dialog, configure the parameters as follows and click OK:
Parameter name
Description
Task name
The task name.
Type
The migration method. Select Storage read api.
Mc project
The target MaxCompute project for data migration.
MC Schema
Optional. The schema within the target MaxCompute project. For schema details, see Schema-related operations.
Table whitelist
The list of table names that need to be migrated.
Table blacklist
The list of table names that do not need to be migrated.
Only new partition
If enabled, the task will skip partitions that have already been successfully migrated.
Timer
Configure a timer for the task. Two options:
Daily: The timer runs once daily at a specified minute within 24 hours.
Hourly: The timer runs hourly at a specified minute within 60 minutes.
Only schema
Only create the corresponding tables and partitions in MaxCompute without migrating the data.
Enable verification
MMA performs a row count verification by executing select count(*) on the source and target partitions to ensure the success of the data migration.

Merge partitions
This option is usually not needed.
Partition filter
For details, see Partition filtering expression description.
Table name mapping
The new name for a table after migration to the target MaxCompute project.
Column name mapping
The new name for a table column after migration to MaxCompute.
Table name mapping rule
Set a name mapping rule by using this configuration. The format is prefix${table}suffix, and both prefix and suffix can be empty. For example, if the value is pre_${table}_1 and the source table name is test, the table name in MaxCompute is pre_test_1 after migration.

In the top navigation bar, choose Migration Tasks > Task List to manage the created migration tasks. For more details, see Migration task management.
Migration task management
Manage tasks
In the top navigation bar, choose Migration Tasks > Task List to view all tasks. You can perform actions such as Delete, Stop, Start, and Retry on tasks with different statuses.
Manage subtasks
You can manage all subtasks on the Subtask List page.
You can access the Subtask List page in either of the following ways:
In the top navigation bar, choose Migration Tasks > Subtask List to access the Subtask List page and view all task subtasks.
On the Task List page, click a task name in the Name column to access the Subtask List page of the task.
Filter subtasks by Name, Datasource, Source database, Source table, and Status.
Find the target subtask and click Details in the Operation column to view or download its task logs.

For subtasks in the DataCopyFailed status, you can perform Retry or Rerun operations.
Executing a subtask involves three stages: schema migration, data migration, and data verification. If a subtask fails during the data migration stage, the Retry operation re-executes this stage, whereas the Rerun operation restarts the subtask from the schema migration stage.
References
Partition filtering expression description
The following example explains the partition filtering expression.
For example, p1 >= '2022-03-04' and (p2 = 10 or p3 > 20) and p4 in ('abc', 'cde').
p1, p2, p3, and p4 are partition names.
Partition values include strings and numbers, with strings enclosed in quotes. Except for INT/BIGINT type partition columns, all other partition values are represented as strings.
Comparison operators include >, >=, =, <, <=, and <>.
The expression supports the IN operator.
Logical operators include AND and OR.
Parentheses are supported to alter the precedence of operations.
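To make the example concrete, the filter above can be hand-translated into a Python predicate and evaluated against sample partition values (an illustration of the semantics, not MMA code):

```python
# The example filter
#   p1 >= '2022-03-04' and (p2 = 10 or p3 > 20) and p4 in ('abc', 'cde')
# hand-translated into a predicate. Partition values are strings except
# for INT/BIGINT partition columns, so p1 and p4 compare as strings.
def matches(p1: str, p2: int, p3: int, p4: str) -> bool:
    return (p1 >= "2022-03-04"
            and (p2 == 10 or p3 > 20)
            and p4 in ("abc", "cde"))

hit = matches("2022-03-05", 10, 0, "abc")    # all three conditions hold
miss = matches("2022-03-01", 10, 30, "abc")  # p1 is before the cutoff
```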
Correspondence between regions and region IDs
Region | Region ID |
China (Hangzhou) | cn-hangzhou |
China (Shanghai) | cn-shanghai |
China (Beijing) | cn-beijing |
China (Zhangjiakou) | cn-zhangjiakou |
China (Ulanqab) | cn-wulanchabu |
China (Shenzhen) | cn-shenzhen |
China (Chengdu) | cn-chengdu |
China (Hong Kong) | cn-hongkong |
China East 2 Finance | cn-shanghai-finance-1 |
China North 2 Ali Gov 1 | cn-north-2-gov-1 |
China South 1 Finance | cn-shenzhen-finance-1 |
Japan (Tokyo) | ap-northeast-1 |
Singapore | ap-southeast-1 |
Malaysia (Kuala Lumpur) | ap-southeast-3 |
Indonesia (Jakarta) | ap-southeast-5 |
Germany (Frankfurt) | eu-central-1 |
UK (London) | eu-west-1 |
US (Silicon Valley) | us-west-1 |
US (Virginia) | us-east-1 |
UAE (Dubai) | me-east-1 |