
Migration Service (MMA 4.0)

Last Updated: Nov 25, 2024

MaxCompute provides MaxCompute Migration Assist 4.0 (MMA 4.0), a data migration tool that enables you to migrate data from various sources, such as Hive and BigQuery, to MaxCompute. This topic describes how to use MMA 4.0 for data migration.

Feature introduction

MMA is a tool used to migrate data to MaxCompute. The supported data sources include:

  • Hive

  • Databricks

  • BigQuery

The following figure shows the process for migrating data to MaxCompute by using MMA.

[Figure: process for migrating data to MaxCompute by using MMA]

Migration principle

[Figure: MMA migration principle]

  • Metadata migration:

    1. MMA retrieves metadata from the data source by calling metadata APIs through the BigQuery SDK, Hive Metastore SDK, or Databricks SDK.

    2. MMA creates MaxCompute DDL statements based on the retrieved metadata and executes these DDL statements in MaxCompute to complete the metadata migration.

  • Data migration:

    • Method 1: Pull mode.

      In pull mode, MMA submits Spark jobs that specify the objects to be migrated. These jobs run on MaxCompute, read data from the data source, and write the data to MaxCompute. The following list shows how data is read from each data source:

      • BigQuery: Read API

      • Hive on AWS S3: S3

      • Hive on HDFS: HDFS

    • Method 2: Push mode.

      In push mode, MMA submits SQL and Spark jobs to the data source. These jobs run on the data source, read the data, and write it to MaxCompute. The following list shows how data is read from each data source:

      • Hive: runs a UDTF and writes data to MaxCompute through Tunnel commands and the Tunnel SDK.

      • Databricks: runs Databricks Spark jobs and writes data to MaxCompute through the Storage API.

MMA terminology

The following diagram shows the MMA architecture.

[Figure: MMA architecture]

Data sources

The objects to be migrated include databases in Hive, projects in BigQuery, and catalogs in Databricks. Data sources organize data in different layers, and MMA maps these layers to databases, schemas, and tables in MaxCompute, where a schema is a property of a table.

The data layers of each data source are as follows:

  • Hive: database and table

  • BigQuery: project, dataset, and table

  • Databricks: catalog, schema, and table

Tasks and subtasks

The objects of an MMA migration can be a database, multiple tables, or multiple partitions. After you select the objects, MMA generates tasks and subtasks. A task contains the migration configuration, such as the objects to migrate and the migration method, and is divided into subtasks, which are the actual execution units. A subtask corresponds to either a non-partitioned table or multiple partitions of a partitioned table. A subtask runs through metadata migration, data migration, and data verification.

Data verification

After data is migrated to MaxCompute, MMA verifies the result by executing a SELECT COUNT(*) statement on both the source and the target to obtain the row count of each migrated object (such as a table or partition) and comparing the two results.

Prerequisites

  • A MaxCompute project is created. For more information, see Create a MaxCompute project.

  • The corresponding user is granted the following operation permissions for the target MaxCompute project:

    • Project: List, CreateTable, CreateInstance

    • Table: Describe, Select, Alter, Update, Drop

    • Instance: Read, Write

    For specific authorization methods, see Authorization methods; a hedged GRANT sketch also follows this list. You can also assign the admin role to the Alibaba Cloud account. The process is as follows:

    1. Log on to the MaxCompute console. In the upper-left corner of the console, select a region.

    2. Click Manage in the Actions column of the target project to access the Project Settings page.

    3. Switch to the Role Permissions tab, click Manage Members in the Actions column of the admin role, and assign the admin role to the corresponding user.

  • The following operating environments are prepared:

    Operating system:

    • A Linux operating system is required.

    • System configuration: For fewer than 5 million table partitions, we recommend a configuration with 8 cores and 16 GB of memory. For more than 5 million table partitions, a higher configuration is required.

    JDK version:

    • Java 8 or Java 11.

    MySQL version:

    • V5.7 or later.

Limits

The limits for BigQuery migration are as follows:

  • Data types

    • BIGNUMERIC, JSON, INTERVAL, GEOGRAPHY, and RANGE types are not supported.

    • During migration, the TIME type is converted to STRING type.

  • Table schema

    • The length of table and column names must be less than or equal to 128 characters. Names can only include uppercase and lowercase letters, numbers, and underscores.

    • The maximum number of columns in a table is 1200.

Installation and configuration

  1. Access MMA_Releases and select the mma-<3.x.x>-<datasource>.jar file for your data source, such as mma-3.1.0-bigquery.jar.

  2. Download and unzip the lib.zip file.

  3. Create a config.ini file and add the following parameter configurations. An example of the config.ini file is as follows:

    [mysql]
    host = mysql-host
    port = 3306
    ; can be any database name
    db = mmav3 
    username = user
    password = pass
    
    [mma]
    listening_port = 6060

    Parameter descriptions:

    [mysql] section:

    • host: the IP address used to log on to MySQL.

    • port: the port of the MySQL server. Default: 3306.

    • db: the name of the MySQL database.

    • username: the username for logging on to the MySQL database.

    • password: the password for logging on to the MySQL database.

    [mma] section:

    • listening_port: the listening port of the MMA server. Default: 6060.

  4. Organize the file directory as follows:

    /path/to/mma
    |
    ├── mma-3.1.0-<datasource>.jar            # MMA main program
    ├── config.ini                                  # MMA configuration file
    └── lib/                                       
        └── spark/                                  # Spark migration task dependency directory
            ├── log/                                # Automatically generated during program runtime, spark submission log
            ├── spark-3.1.1-odps0.34.1/             # MaxCompute Spark-3.1.1 dependency
            ├── bigquery-spark-job.jar              # Spark migration task
            └── mma-spark-client.jar                # Spark submission program

Run the program

You can start, access, and stop the MMA program by using commands.

Note

When running the program, you need to replace <datasource> in the command with the corresponding data source type.

Start MMA

We recommend that you use nohup to start the program so that it keeps running after you log off. The command is as follows:

nohup java -jar mma-3.1.0-<datasource>.jar -c config.ini > nohup.log 2>&1 &

Stop MMA

You can use the following command to stop the MMA program. Running migration tasks are interrupted when you stop the MMA program. Interrupted tasks are re-executed after the MMA program restarts, without affecting the final results.

ps aux | grep mma-3.1.0-<datasource>.jar | grep -v grep | awk '{print $2}' | xargs kill -9

Access MMA

After you start the MMA program, you can access it by entering http://ip_of_the_service:6060 in your browser, where 6060 is the port configured in config.ini. After MMA opens in the browser, you must configure the following parameters on the Maxcompute Config page:

Note

MMA does not provide access authentication. Therefore, you need to restrict access to MMA through network policies. For example, if MMA runs on Alibaba Cloud Elastic Compute Service (ECS), you need to control access by using the ECS security group.

• mc.endpoint: the endpoint for MMA to access MaxCompute. The server where MMA runs must be able to connect to this endpoint. For specific endpoint information, see Endpoints.

• mc.data.endpoint: optional. Used by the UDTF to access MaxCompute when data is migrated through Hive UDTF. The nodes of the Hive cluster must be able to connect to this address.

• mc.tunnel.endpoint: optional. Used by the UDTF to access MaxCompute when data is migrated through Hive UDTF. The nodes of the Hive cluster must be able to connect to this address.

• mc.auth.access.id: the AccessKey ID for accessing MaxCompute. You can obtain the AccessKey ID on the AccessKey management page.

• mc.auth.access.key: the AccessKey secret for accessing MaxCompute.

• mc.default.project: the default MaxCompute project used by MMA to execute SQL commands.

• mc.tunnel.quota: optional. Typically not required.

• mc.projects: a comma-separated list of the MaxCompute project names to which data is migrated.

• task.max.num: the maximum number of concurrent MMA tasks for data migration. This parameter is crucial for adjusting migration speed.

• auth.type: optional. Used only for migrating Hive data through Hive UDTF. Valid values:

  • BearerToken: used for the public cloud.

  • AK: used for Apsara Stack.

• auth.ak.hdfs.path: optional. When the auth.type value is AK, the file path that stores the AccessKey pair. The file format is as follows:

  access_id=xxx
  access_key=xxx

• spark.dependency.root: the root directory for Spark dependencies.

Migration operations for different data sources

Currently, only BigQuery data migration is supported.

Migrate BigQuery

Add data source

  1. On the Datasources page, click Add Datasource on the right to enter the Add Datasource page.

  2. Set Datasource Type to BIGQUERY, and click Next.

  3. Configure the data source according to the following parameter descriptions.

    • Name
      A custom name for the data source. Only letters, numbers, and Chinese characters are allowed.

    • BigQuery service account key file json
      Create a service account in the BigQuery IAM console and download its authentication JSON key file; a hedged gcloud sketch follows this list. For details, see Service account overview.

    • BigQuery project name
      The project name authorized by the service account.

    • Change Range Partition Table as: 1.Cluster(default), 2.Partition
      The migration method for BigQuery range partition tables. Only migration to MaxCompute partitioned tables is supported. The default value is Partition.

    • Keep partition of time-unit column-partitioned table as normal column
      For BigQuery time-unit column-partitioned tables, specifies whether to retain the partition column as a normal column when the table is saved to MaxCompute. The default value is true (retain).

    • Keep partition of ingestion-time partitioned table as normal column
      For BigQuery ingestion-time partitioned tables, specifies whether to retain the pseudo column as a normal column when the table is saved to MaxCompute. The default value is false (do not retain).

    • MaxCompute Spark NetworkLink Name, format is region:vpcId
      The network connection name. For details, see the "Access over a VPC (dedicated connection)" section in Network connection process. The configuration format is <regionId>:<vpcId>. For example, if the VPC ID is vpc-uf68aeg673p0k******** and the region is Shanghai, configure this item as cn-shanghai:vpc-uf68aeg673p0k********. For the mapping between regions and region IDs, see Correspondence between regions and region IDs.

    • Meta update timer
      Configures the scheduled pull of data source metadata. Two options:
      1. Daily: the timer runs once a day at a specified minute within 24 hours.
      2. Hourly: the timer runs once an hour at a specified minute within 60 minutes.

    • Api concurrent access number of meta
      The concurrency for accessing the data source metadata API. The default value is 10.

    • Database whitelist
      The BigQuery databases to be migrated. Separate multiple values with commas.

    • Database blacklist
      The BigQuery databases that are excluded from migration. Separate multiple values with commas.

    • Table blacklist
      The tables that are excluded from migration, in the dbname.tablename format. Separate multiple tables with commas.

    • Table whitelist
      The tables to be migrated, in the dbname.tablename format. Separate multiple tables with commas.

Update data sources

If the data source metadata changes, you need to manually update the data source by clicking Update in the Operation column of the target data source on the Datasources page in MMA.

Modify the data source configuration

You can modify the data source configuration:

  1. On the Datasources page in MMA, find the target data source and click its name to access the details page.

  2. On the Configuration tab, modify relevant configurations of the data source.

Create a migration task

Migrate multiple tables
  1. On the Datasources page in MMA, find the target data source and click its name to access the details page.

  2. On the Data tab, find the target database name and click its name to access the details page.

  3. On the Tables tab, select the tables to be migrated in the table list, and then click New Migration.

  4. In the New Migration Task dialog, configure the parameters and click OK.

    • Task name: the task name.

    • Type: the migration method, which uses the BigQuery Storage Read API.

    • MC project: the target MaxCompute project for data migration.

    • MC Schema: optional. The schema within the target MaxCompute project. For schema details, see Schema-related operations.

    • Table list: the list of table names to be migrated. Separate multiple tables with commas.

    • Only new partition: if enabled, the task skips partitions that have already been migrated successfully.

    • Timer: configures a timer for the task. Two options:
      1. Daily: the timer runs once a day at a specified minute within 24 hours.
      2. Hourly: the timer runs once an hour at a specified minute within 60 minutes.

    • Only schema: only creates the corresponding tables and partitions in MaxCompute without migrating the data.

    • Enable verification: MMA performs a row count verification by executing select count(*) on the source and target partitions to ensure that the data migration succeeded.

    • Merge partitions: typically not needed.

    • Partition filter: for details, see Partition filtering expression description.

    • Table name mapping: the new name of a table after migration to the target MaxCompute project.

    • Column name mapping: the new name of a table column after migration to MaxCompute.

    • Table name mapping rule: sets a name mapping rule. The format is prefix${table}suffix, and both prefix and suffix can be empty. For example, if the value is pre_${table}_1 and the source table name is test, the table name in MaxCompute is pre_test_1 after migration.

  5. In the top navigation bar, choose Migration Tasks > Task List to manage the created migration tasks. For more details, see Migration task management.

Migrate a single database
  1. On the Datasources page in MMA, find the target data source and click its name to access the details page.

  2. On the Data tab, find the target database name and click Migrate in the Operation column.

  3. In the New Migration Task dialog, configure the parameters as follows and click OK:

    • Task name: the task name.

    • Type: the migration method. Select Storage read api.

    • Mc project: the target MaxCompute project for data migration.

    • MC Schema: optional. The schema within the target MaxCompute project. For schema details, see Schema-related operations.

    • Table whitelist: the list of table names to be migrated.

    • Table blacklist: the list of table names that are excluded from migration.

    • Only new partition: if enabled, the task skips partitions that have already been migrated successfully.

    • Timer: configures a timer for the task. Two options:
      1. Daily: the timer runs once a day at a specified minute within 24 hours.
      2. Hourly: the timer runs once an hour at a specified minute within 60 minutes.

    • Only schema: only creates the corresponding tables and partitions in MaxCompute without migrating the data.

    • Enable verification: MMA performs a row count verification by executing select count(*) on the source and target partitions to ensure that the data migration succeeded.

    • Merge partitions: this option is usually not needed.

    • Partition filter: for details, see Partition filtering expression description.

    • Table name mapping: the new name of a table after migration to the target MaxCompute project.

    • Column name mapping: the new name of a table column after migration to MaxCompute.

    • Table name mapping rule: sets a name mapping rule. The format is prefix${table}suffix, and both prefix and suffix can be empty. For example, if the value is pre_${table}_1 and the source table name is test, the table name in MaxCompute is pre_test_1 after migration.

  4. In the top navigation bar, choose Migration Tasks > Task List to manage the created migration tasks. For more details, see Migration task management.

Migration task management

Manage tasks

In the top navigation bar, choose Migration Tasks > Task List to view all tasks. You can perform actions such as Delete, Stop, Start, and Retry on tasks with different statuses.

Manage subtasks

You can manage all subtasks on the Subtask List page.

  1. Access the Subtask List page by using either of the following methods:

    • In the top navigation bar, choose Migration Tasks > Subtask List to access the Subtask List page and view all task subtasks.

    • On the Task List page, click a task name in the Name column to access the Subtask List page of the task.

  2. Filter subtasks by Name, Datasource, Source database, Source table, and Status.

  3. Find the target subtask and click Details in the Operation column to view or download its Task logs.

  4. For subtasks whose Status is DataCopyFailed, you can perform the Retry or Rerun operation.

    A subtask runs in three stages: schema migration, data migration, and data verification. If a subtask fails during the data migration stage, the Retry operation re-executes only that stage, whereas the Rerun operation restarts the subtask from the schema migration stage.

References

Partition filtering expression description

The following example explains the partition filtering expression.

For example: p1 >= '2022-03-04' and (p2 = 10 or p3 > 20) and p4 in ('abc', 'cde').

  • p1, p2, p3, and p4 are partition column names.

  • Partition values can be strings or numbers, and strings are enclosed in quotation marks. Except for partition columns of the INT and BIGINT types, all partition values are represented as strings.

  • Comparison operators include: >, >=, =, <, <=, <>.

  • The expression supports the IN operator.

  • Logical operators include: AND and OR.

  • Parentheses are supported to alter the precedence of operations.

Correspondence between regions and region IDs

  • China (Hangzhou): cn-hangzhou
  • China (Shanghai): cn-shanghai
  • China (Beijing): cn-beijing
  • China (Zhangjiakou): cn-zhangjiakou
  • China (Ulanqab): cn-wulanchabu
  • China (Shenzhen): cn-shenzhen
  • China (Chengdu): cn-chengdu
  • China (Hong Kong): cn-hongkong
  • China East 2 Finance: cn-shanghai-finance-1
  • China North 2 Ali Gov 1: cn-north-2-gov-1
  • China South 1 Finance: cn-shenzhen-finance-1
  • Japan (Tokyo): ap-northeast-1
  • Singapore: ap-southeast-1
  • Malaysia (Kuala Lumpur): ap-southeast-3
  • Indonesia (Jakarta): ap-southeast-5
  • Germany (Frankfurt): eu-central-1
  • UK (London): eu-west-1
  • US (Silicon Valley): us-west-1
  • US (Virginia): us-east-1
  • UAE (Dubai): me-east-1