Data Migration from Hive to MaxCompute & Kafka Database Synchronization

How to Synchronize Data from Hive to MaxCompute?

In this post, you'll learn how to use Alibaba Cloud's MaxCompute Migration Assist (MMA) to synchronize Hive data to MaxCompute. Along the way, we look at MMA's features, technical architecture, and working principles.

1) Features, Technical Architecture, and Principles of MMA

1.1 MMA Features

MaxCompute Migration Assist (MMA) is a MaxCompute data migration tool that covers batch processing, storage, data integration, and job orchestration and scheduling. Its migration evaluation and analysis feature automatically generates migration evaluation reports. These reports help you identify compatibility issues, such as data type mapping and syntax issues, before you synchronize data from Hive to MaxCompute.

MMA supports batch table creation and automatic batch data migration. It also provides a job syntax analysis feature that checks whether Hive SQL can run on MaxCompute. In addition, MMA supports workflow migration, job migration and transformation for the mainstream data integration tool Sqoop, and automatic creation of DataWorks data integration jobs.
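To make the idea of a migration evaluation report concrete, here is a minimal sketch of a data type compatibility check. It is not MMA's implementation: the `SUPPORTED` set, the function names, and the report shape are all assumptions made for illustration.

```python
# Hypothetical set of Hive base types that this sketch treats as directly
# mappable. It is an assumption, not MMA's actual compatibility table.
SUPPORTED = {
    "tinyint", "smallint", "int", "bigint", "float", "double", "decimal",
    "boolean", "string", "varchar", "char", "date", "timestamp",
}


def base_type(hive_type: str) -> str:
    """Strip parameters such as (10,2) or <...> and normalize the case."""
    for sep in ("(", "<"):
        hive_type = hive_type.split(sep, 1)[0]
    return hive_type.strip().lower()


def evaluate(schemas: dict) -> dict:
    """Return {table: [(column, hive_type), ...]} for columns that need review."""
    report = {}
    for table, columns in schemas.items():
        flagged = [(c, t) for c, t in columns if base_type(t) not in SUPPORTED]
        if flagged:
            report[table] = flagged
    return report
```

Tables whose columns all use a type in the set do not appear in the report, so an empty result means nothing was flagged under these assumptions.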

[Figure: MMA features]

1.2 MMA Architecture

The following figure displays the MMA architecture. The left side shows the customer's Hadoop cluster, and the right side shows Alibaba Cloud big data services, mainly DataWorks and MaxCompute.

MMA runs on your Hadoop cluster, and the server that hosts it must be able to access the Hive Server. After you deploy MMA on a host, the MMA client automatically obtains the Hive metadata: it reads the metadata from MySQL and converts it to MaxCompute Data Definition Language (DDL) statements.
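The metadata-to-DDL conversion can be pictured with a small sketch. The type mapping below is an illustrative assumption, not MMA's real mapping table, and `build_ddl` is a hypothetical helper that only formats a statement from column metadata that has already been read.

```python
# Illustrative Hive-to-MaxCompute type mapping (an assumption for this sketch,
# not the mapping that MMA ships with).
HIVE_TO_MAXCOMPUTE = {
    "tinyint": "TINYINT",
    "smallint": "SMALLINT",
    "int": "INT",
    "bigint": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
    "string": "STRING",
    "date": "DATE",
    "timestamp": "TIMESTAMP",
}


def to_maxcompute_type(hive_type: str) -> str:
    """Map a simple Hive type name; raise for types this sketch does not cover."""
    key = hive_type.strip().lower()
    if key not in HIVE_TO_MAXCOMPUTE:
        raise ValueError(f"no mapping for Hive type: {hive_type!r}")
    return HIVE_TO_MAXCOMPUTE[key]


def build_ddl(table: str, columns, partitions=()) -> str:
    """Render a CREATE TABLE statement from (name, hive_type) pairs."""
    cols = ", ".join(f"{n} {to_maxcompute_type(t)}" for n, t in columns)
    ddl = f"CREATE TABLE IF NOT EXISTS {table} ({cols})"
    if partitions:
        parts = ", ".join(f"{n} {to_maxcompute_type(t)}" for n, t in partitions)
        ddl += f" PARTITIONED BY ({parts})"
    return ddl + ";"
```

Raising on an unmapped type, rather than silently falling back to a default, keeps lossy conversions visible so that they can be reviewed by hand.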

Next, run the DDL statements to create tables on MaxCompute in batches, start batch synchronization jobs, and submit concurrent Hive SQL jobs to the Hive Server. Each Hive SQL job calls a user-defined function (UDF) that integrates the Tunnel SDK and writes data to MaxCompute tables in batches through Tunnel.

When migrating jobs and workflows, you can check workflow jobs against the Hive metadata that MMA discovers automatically. This includes batch converting the workflow configurations in workflow components into DataWorks workflow configurations, from which the DataWorks workflows are generated. After these steps, the data, jobs, and workflows are migrated. Once the migration is complete, you need to connect your business system to the new architecture based on MaxCompute and DataWorks.
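The concurrent submission of synchronization jobs can be sketched with a thread pool. This is only an illustration of the concurrency pattern: `sync_table` is a hypothetical stand-in that performs no real I/O, whereas a real job would submit Hive SQL to the Hive Server.

```python
from concurrent.futures import ThreadPoolExecutor


def sync_table(table: str) -> str:
    """Stand-in for one synchronization job (hypothetical; does no real I/O)."""
    # A real job would submit Hive SQL to the Hive Server at this point.
    return f"{table}: done"


def run_batch(tables, max_workers: int = 4):
    """Run one job per table concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the input iterable.
        return list(pool.map(sync_table, tables))
```

The `max_workers` value caps how many jobs run at once, which is the knob you would tune to avoid overloading the source cluster.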

[Figure: MMA architecture]

1.3 Technical Architecture and Principles of MMA Agent

MMA supports the batch migration of data and workflows through the client and server. The MMA client installed on your server provides the following features:

Automatically obtain the Hive metadata
Generate DDL and user-defined table function (UDTF) statements
Create tables in batches and migrate Hive data in batches

Accordingly, MMA contains four components:

Meta Carrier automatically extracts the Hive metadata and generates a Hive Metastore structure locally.
Meta Processor batch converts Hive metadata into MaxCompute DDL statements based on the results generated by Meta Carrier, including the table creation statements and data type conversion statements.
The built-in ODPS Console component allows you to batch create MaxCompute tables by using the MaxCompute DDL statements generated by Meta Processor.
Finally, the Data Carrier batch creates Hive SQL jobs. Each Hive SQL job is equivalent to the concurrent data synchronization of multiple tables or partitions.
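Since each Hive SQL job covers several tables or partitions, some planning step has to decide which units go into which job. The sketch below shows one simple way to do that with fixed-size groups. The function name and the grouping rule are assumptions for illustration, not the Data Carrier's actual strategy.

```python
def plan_jobs(units, units_per_job: int):
    """Split a list of table or partition names into fixed-size job groups."""
    if units_per_job < 1:
        raise ValueError("units_per_job must be at least 1")
    return [
        units[i:i + units_per_job]
        for i in range(0, len(units), units_per_job)
    ]
```

For example, three units with `units_per_job=2` produce two groups, the second of which holds the single leftover unit.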

[Figure: Technical Architecture and Principles of MMA Agent]

How to Synchronize Data from Message Queue for Apache Kafka to MaxCompute?

In this post, you'll learn how to synchronize data from Message Queue for Apache Kafka to MaxCompute on Alibaba Cloud and get a general understanding of Message Queue for Apache Kafka. We'll also go through the configuration approaches and the operations involved, from design to launch.

Background

1. Objective

For daily operations, many enterprises use Message Queue for Apache Kafka to collect the behavior logs and business data generated by apps or websites and then process them offline or in real time. Generally, the logs and data are delivered to MaxCompute for modeling and business processing to obtain user features, sales rankings, and regional order distributions, and the results are displayed in data reports.

2. Solutions

There are two ways to synchronize data from Message Queue for Apache Kafka to MaxCompute. In the first, business data and behavior logs are uploaded to DataHub through Message Queue for Apache Kafka and Flume, then transferred to MaxCompute, and finally displayed in Quick BI. In the second, business data and behavior logs are transferred through Message Queue for Apache Kafka, DataWorks, and MaxCompute, and finally displayed in Quick BI.

In this post, we use the second process. Data is synchronized from DataWorks to MaxCompute using one of two solutions: custom resource groups or exclusive resource groups. Custom resource groups are used to migrate data to the cloud on complex networks. Exclusive resource groups are used when integration resources are insufficient.

[Figure: Solutions]
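Whichever path you choose, each Kafka message eventually becomes a row in a MaxCompute table. The sketch below shows one possible flattening. The `KafkaRecord` type, the column names, and the daily `ds` partition are all assumptions made for illustration; they do not describe the schema that DataWorks produces.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class KafkaRecord:
    """Minimal stand-in for a consumed Kafka message (hypothetical type)."""
    key: Optional[bytes]
    value: bytes
    partition: int
    offset: int
    timestamp_ms: int


def to_row(record: KafkaRecord) -> dict:
    """Flatten one record into a row for a target table (assumed layout)."""
    event_time = datetime.fromtimestamp(record.timestamp_ms / 1000, tz=timezone.utc)
    return {
        # Kafka keys are optional, so keep None when no key was set.
        "msg_key": record.key.decode("utf-8") if record.key is not None else None,
        "msg_value": record.value.decode("utf-8"),
        "kafka_partition": record.partition,
        "kafka_offset": record.offset,
        "event_time": event_time.strftime("%Y-%m-%d %H:%M:%S"),
        "ds": event_time.strftime("%Y%m%d"),  # daily partition value in UTC
    }
```

Keeping the Kafka partition and offset in the row makes it possible to trace a row back to its source message and to detect duplicates after a retried synchronization.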

MaxCompute Learning Path

Alibaba Cloud MaxCompute is a big data processing platform that processes and stores massive amounts of structured data in batches to provide effective data warehousing solutions. Start your MaxCompute journey here to discover infinite possibilities with Alibaba Cloud.

Related Products

MaxCompute

MaxCompute (previously known as ODPS) is a general purpose, fully managed, multi-tenancy data processing platform for large-scale data warehousing. MaxCompute supports various data importing solutions and distributed computing models, enabling users to effectively query massive datasets, reduce production costs, and ensure data security.
