
DataWorks:Batch synchronization from a single EMR Hive table to MaxCompute

Last Updated: Mar 06, 2026

This topic uses the synchronization of a single E-MapReduce (EMR) Hive table to MaxCompute as an example to demonstrate best practices for configuring data sources, establishing network connectivity, and setting up a batch synchronization node.

Background

Hive can be used to store, query, and analyze large-scale data in Hadoop. It maps structured data files to database tables and provides SQL query capabilities by converting SQL statements into MapReduce jobs.

Prerequisites

  • You have purchased a serverless resource group.

  • You have created a Hive data source and a MaxCompute data source. For more information, see Data source configuration.

  • You have established a network connection between the resource group and the data source. For more information, see Network connectivity solutions.

    Note

    If you use a public endpoint to connect an exclusive resource group to EMR, you must configure the security group rules of the EMR cluster to allow access from the elastic IP address (EIP) of the exclusive resource group. The inbound rules of the security group must allow access to the required EMR cluster ports, such as 10000, 9093, and 8020.

Limitations

Syncing source data to MaxCompute external tables is not supported.

Procedure

Step 1: Create a node and configure the task

For the general steps to create a node and use the Codeless UI, see Configure a task in the Codeless UI.

Step 2: Configure the data source and destination

Configure the data source (Hive)

This section describes the key parameters for configuring the Hive data source, which is a Hive table in this example.


Hive Read Method

  • Read data from HDFS files: The Hive Reader plugin accesses the Hive Metastore service to parse information about the configured table, such as its HDFS file storage path, file format, and delimiters, and then reads data directly from the HDFS files.

  • Read data using Hive JDBC (supports conditional filtering): The Hive Reader plugin connects to the HiveServer2 service through the Hive JDBC client to read data. This method lets you filter data with a WHERE clause and read data directly by using SQL.

Note

The HDFS method is more efficient. The JDBC method generates MapReduce jobs, which results in lower synchronization performance. Note that the HDFS method does not support conditional filtering or reading from views. Choose the method that best suits your requirements.
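For example, the JDBC read method can push a filter into the read, whereas the HDFS method always reads the full table or partition. A hypothetical filtered read (the table and column names are illustrative, not from your environment):

```sql
-- Hypothetical WHERE condition that the JDBC read method can apply;
-- the HDFS read method would have to read the entire table or partition instead.
SELECT id, name, amount
FROM ods_orders
WHERE ds = '20220101' AND amount > 100;
```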

Table

Ensure that the Table Schema is identical in both the development and production environments for the Hive data source.

Note

This section displays the list of tables and the Table Schema from your development environment. If the table definitions differ between your development and production environments, the task might be configured correctly in the development environment but fail in production with errors such as "table not found" or "column not found".

Parquet schema

If the Hive table is stored in the Parquet format, you must configure the corresponding Parquet schema.
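A Parquet schema describes each column's type in Parquet's message syntax. A hypothetical schema for a three-column table (the message and field names are illustrative):

```
message hive_table_schema {
  optional int64  id;
  optional binary name (UTF8);
  optional double amount;
}
```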

Configure the data destination (MaxCompute)

This section describes the key parameters for configuring the MaxCompute Data Destination.

Note

Parameters not described in the following table can be left at their default values.


Tunnel Resource Group

The MaxCompute Tunnel Quota used for data transfer. By default, 'Public transport resources' is selected, which is the free quota provided by MaxCompute. If your exclusive Tunnel Quota becomes unavailable due to overdue payments or expiration, the task automatically switches to 'Public transport resources' at runtime.

Table

Select the target MaxCompute table. If you are using a standard DataWorks Workspace, ensure that a MaxCompute table with the same name and a consistent Table Schema exists in both your development and production environments.

You can also click Generate Destination Table Schema. The system automatically creates a table to receive the data. You can manually adjust the table creation statement.

Note


  • If the target MaxCompute table does not exist in the development environment, you will not be able to find it in the destination table drop-down list when configuring the batch synchronization node.

  • If the target MaxCompute table does not exist in the production environment, the Sync Task fails after publishing because the table is not found.

  • If the Table Schema is inconsistent between the development and production environments, the column mapping during the scheduled run may differ from the mapping configured in the batch synchronization node, leading to incorrect data writes.
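If you create the destination table yourself instead of clicking Generate Destination Table Schema, a minimal MaxCompute DDL for a partitioned destination table might look like the following sketch (the table and column names are illustrative):

```sql
-- Hypothetical destination table; run the same DDL in both the development
-- and production environments so that the table schemas stay consistent.
CREATE TABLE IF NOT EXISTS ods_orders_mc (
  id     BIGINT,
  name   STRING,
  amount DOUBLE
)
PARTITIONED BY (ds STRING);
```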

Partition Information

If the destination table is partitioned, specify the value for the partition column.

  • The value can be a fixed value, such as ds=20220101.

  • The value can be a Scheduling System Parameter, such as ds=${bizdate}. The system automatically replaces the parameter with its scheduled value at runtime.

Write Method

Specifies whether to overwrite or append data in the target table.
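The two write methods correspond to MaxCompute's INSERT OVERWRITE and INSERT INTO semantics: overwrite replaces the existing data in the destination table or partition on each run, while append adds new rows on each run. Expressed in MaxCompute SQL (the table name is illustrative):

```sql
-- Overwrite: replaces existing data in the target partition, so re-running is safe.
INSERT OVERWRITE TABLE ods_orders_mc PARTITION (ds='20220101') SELECT ...;
-- Append: adds rows on every run, so re-running duplicates data.
INSERT INTO TABLE ods_orders_mc PARTITION (ds='20220101') SELECT ...;
```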

Step 3: Configure and validate the task

  • Field Mapping: Typically, you can use Map Fields with the Same Name or Map Fields in the Same Line. If the order or names of the fields in the source and destination differ, you can adjust the mappings manually.

  • Channel Control: Set the Policy for Dirty Data Records to reject any dirty data to ensure data quality. You can initially keep the default values for other parameters.

Step 4: Configure and debug the task

  1. On the right side of the batch synchronization node configuration page, click Run Configuration. Set the Resource Group and Script Parameters for the debug run. Then, click Run in the top toolbar to test if the sync pipeline runs successfully.

  2. In the left-side navigation pane, click the icon to the right of Personal Directory to create a file with a .sql extension. Run the following SQL statement to query the destination table and verify that the data meets expectations.

    SELECT * FROM <your_maxcompute_destination_table> WHERE pt=<your_partition> LIMIT 20;

Step 5: Configure scheduling and publish the task

On the right side of the batch synchronization node, click Scheduling Settings. Configure the parameters for periodic runs as described in Scheduling Configuration. Then, click Publish in the top toolbar and follow the on-screen instructions to publish the task.