By Yunhua, Intelligent Product Expert at Alibaba Cloud
The following content is based on the PowerPoint slides of Yunhua's presentation on MaxCompute Migration Assist (MMA). This article covers two parts:
1) Features, Technical Architecture, and Principles of MMA
2) Data Migration Demonstration on MMA
MaxCompute Migration Assist (MMA) is a MaxCompute data migration tool that is used for batch processing, storage, data integration, and job orchestration and scheduling. MMA has a migration evaluation and analysis feature that automatically generates migration evaluation reports, which help you determine data type mapping compatibility issues when synchronizing data from Hive to MaxCompute, such as syntax issues.
MMA supports automatic data migration, batch table creation, and automatic batch data migration. It also provides a job syntax analysis feature to check whether Hive SQL can be run on MaxCompute. In addition, MMA supports workflow migration, job migration and transformation for the mainstream data integration tool Sqoop, and automatic creation of DataWorks data integration jobs.
The following figure displays the MMA architecture. The left side shows the customer's Hadoop cluster, and the right side shows Alibaba Cloud big data services, mainly DataWorks and MaxCompute.
MMA runs on your Hadoop cluster, and your server must be able to access the Hive Server. Post-deployment on a host, the MMA client automatically obtains the Hive metadata. It reads the Hive metadata from MySQL and automatically converts it to MaxCompute Data Definition Language (DDL) statements.
Next, run DDL statements to create tables on MaxCompute in batches, start batch synchronization jobs, and submit concurrent Hive SQL jobs to the Hive Server. You can call a user-defined function (UDF) based on the Hive SQL job. The UDF integrates the Tunnel SDK to write data to MaxCompute tables in batches based on Tunnel. When migrating jobs and workflows, you can check workflow jobs based on the Hive metadata that MMA discovers automatically. This includes batch converting workflow configurations in workflow components to DataWorks workflow configurations for generating DataWorks workflows. After these steps, data is migrated to jobs and workflows. After the migration completion, you need to connect to the business system based on the MaxCompute and DataWorks architectures.
MMA supports the batch migration of data and workflows through the client and server. The MMA client installed on your server provides the following features:
Accordingly, MMA contains four components:
MMA requires JDK V1.6 or later and Python V3 or later, as shown in the following figure. The host that runs MMA submits Hive SQL jobs through the Hive client. The host should be able to access the Hive Server and connect to MaxCompute. The right side of the following figure shows a scenario where the customer found an issue when synchronizing data based on MMA. In this example, you have an IDC and Elastic Compute Service (ECS) instance in Alibaba Cloud and have connected the IDC to Alibaba Cloud through a private line. Before installing MMA, you can directly access MaxCompute from the ECS instance but will fail to access MaxCompute from machines in the IDC. In this case, add a virtual border router (VBR) to the private line, and subsequently, connect the IDC to the ECS instance and even to MaxCompute through the network.
Download the compiled toolkit, as shown in the following figure. You can also download the source code from the GitHub address available on the MMA website, and then compile it locally based on your Hive version.
odps-data-carrier.zippackage locally. After decompression, the following directories are displayed:
odps_ddl_runner(used to batch create tables), and
hive_udtf_sql_runner(used to synchronize data). The libs directory contains the JAR package and library that MMA depends on. The
res/console/bindirectory contains the ODPSCMD tool and the
The local Hive has three databases, of which the
dma_demo database has five tables. These five tables are automatically batch synchronized to MaxCompute. Create a project in the DataWorks console. Open a new command window and connect to the new project by running the local ODPSCMD tool. Decompress the package on the host where MMA Hive has been installed, and access the
odps-data-carrier directory. Run the
bin/meta-carrier -h command to view the parameter description.
-d is used to specify a database to extract metadata. If it is not specified, the metadata of all databases in Hive is pulled.
-o is used to specify the output directory,
-t is used to specify the table, and
-u is used for specifying the URL, which is the address of Hive Metastore. Specify the address when you start the test. The thrift protocol is used for connection because the address is a thrift address. In addition, Hive metadata is stored locally. Therefore, you only need to pull the metadata from the
dma_demo database and use
-o to specify the directory. Tree Meta allows you to view the structure of metadata directories. The
dma_demo directory is generated in the metadata directory, with the same name as the database. The JSON file in the
dma_demo directory describes the metadata of the database. The two tables in
partition_meta are partition tables, and the table in
table_meta is a non-partition table that records the metadata of all table sets.
bin/meta-processor -hcommand to view the parameters.
-iindicates the input parameter while
-oindicates the output directory. The
-iparameter is the output result of the first command, indicating the metadata directory that stores Hive metadata pulled by meta-carrier, such as
-ooutput. Save the execution result of
bin/meta-processorin the output directory. View the tree structure of the output directory. The
dma_demodirectory named after a MaxCompute project is generated in the output directory. The
.sqlfile in the
dma_demois used for batch data migration. The
dma_demodirectory also contains the
odps_ddldirectory, which is used to subsequently batch create tables. The
.sqlfile in the
odps_ddldirectory contains table creation statements.
odps_ddl_runner.pyto create tables and partitions in batches: After generating the DDL statement, create tables in batches. Tables are batch created using the ODPSCMD tool (client tool). The
odps_configfile in the first-level directory of the toolkit contains basic parameters, among which
end_pointare required. After setting the parameters, create tables in batches. Run the
python36 bin/odps_ddl_runner.py -hcommand on the batch table creation tool. The meta-processor automatically generates an input parameter. The odpscmd parameter is not required because the directory is located by default. When creating tables, start the ODPSCMD tool and submit the table creation statements to MaxCompute through the client tool. Run show table to check whether the five tables were created and then check whether the partitions were created. If partitions are created on Hive and MaxCompute, the table structures on the two sides must be the same.
hive_udtf_sql_runner.pyto migrate data: Run the
python36 bin/hive_udtf_sql_runner.pycommand to read SQL statements in the
.sqlfiles in the output directory. Check the parameters in the
python36 bin/hive_udtf_sql_runner.pycommand, where
input_allis used to batch migrate all data in the output directory. If you only want to migrate data from a single table or a single partition, use the
input_single_fileparameter. The parallelism parameter indicates the degree of parallelism (DOP). After data is migrated, check whether the table contains data in MaxCompute. Compare the data in MaxCompute with that in the corresponding table on Hive. If the size is the same, the data in the corresponding table in Hive is the same as that in MaxCompute, indicating that the data was completely migrated.
python36 bin/odps_ddl_runner.pycommand to create a table, specify the output directory, and create tables in batches. Now, you have created the inventory table and its five partitions. If no data exists in the created partitions, use the
input_single_fileparameter to specify the SQL directory of a specified partition (for example, the second partition) as
output/dma_demo/hive_udtf_sql/single_partition/inventory_1.sql. Check the execution result and compare the data in the second partition in Hive with the single partition data migrated to MaxCompute. If the data is consistent, it implies that the migration complete.
MMA V1.0 does not provide workflow migration as a service. Currently, you must use an offline tool. Generate the directory according to the template, as shown in the following figure. If you use open-source components to migrate workflows, you can store the configuration in the corresponding directory according to the template. If you do not use open-source components, such as an in-house workflow scheduling and orchestration service, you can generate workflow data based on the directory structure of the template, compress the data into a ZIP archive, and upload it to DataWorks. At present, MMA V1.0 requires you to package the file into a ZIP archive and upload it. The backend automatically parses and loads it to the DataWorks workflow. When the upload is completed, the DataWorks batch generates MaxCompute tables according to the MaxCompute DDL SQL statements. Then, DataWorks initiates a DataX data synchronization job to complete the batch data migration.
The following figure shows the configurable project description file
project.xml, where project information is customizable, and the workflow description file
workflow.xml, which contains custom workflow parameters. Modify the parameters in the configuration file.
After the MaxCompute DDL creation, the system generates both the DDL SQL statement and the migration evaluation report
report.html. The migration evaluation report is a compatibility report, which specifies whether the mappings between the data structures of Hive tables and the data structures of MaxCompute tables are risky and identifies the level of risk. The report also provides details and warning messages, such as incompatible data types or syntax warnings. Before migration, view this report to evaluate the migration risks.
In this article, you have learned how to synchronize Hive data to MaxCompute using MaxCompute Migration Assist (MMA) on Alibaba Cloud. We have also explored the features, technical architecture, and implementation principles of MMA with a demonstration of the data migration using MMA.
Alibaba Cloud MaxCompute - December 7, 2018
Alibaba Clouder - November 13, 2017
Alibaba EMR - August 28, 2019
Alibaba Cloud MaxCompute - June 24, 2019
Apache Flink Community China - June 28, 2020
Alibaba Clouder - February 25, 2019
This all-in-one omnichannel data solution helps brand merchants formulate brand strategies, monitor brand operation, and increase customer base.Learn More
Deploy custom Alibaba Cloud solutions for business-critical scenarios with Quick Start templates.Learn More
ApsaraDB for HBase is a NoSQL database engine that is highly optimized and 100% compatible with the community edition of HBase.Learn More
Alibaba Cloud equips financial services providers with professional solutions with high scalability and high availability features.Learn More
More Posts by Alibaba Cloud MaxCompute