DataWorks: Batch synchronization from a single EMR Hive table to MaxCompute

Last Updated: Mar 26, 2026

This guide walks you through configuring a batch synchronization node to copy a single E-MapReduce (EMR) Hive table to MaxCompute — covering data source setup, network connectivity, node configuration, and scheduling.

Limitations

Syncing source data to MaxCompute external tables is not supported.

Prerequisites

Before you begin, make sure you have:

  • A serverless resource group

  • A Hive data source and a MaxCompute data source — see Data source configuration

  • Network connectivity between the resource group and the data source — see Network connectivity solutions

    Note

    If you use a public endpoint to connect an exclusive resource group to EMR, configure the security group rules for the EMR cluster to allow inbound access from the elastic IP address (EIP) of the exclusive resource group. The inbound rules must open ports 10000, 9093, and 8020.
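
Before running the task, it can help to confirm that the three ports from the note accept TCP connections. The following is a minimal sketch, not part of DataWorks: the function names and the example address are hypothetical, and the check is only meaningful when run from a host whose traffic is subject to the same security group rules as the resource group.

```python
import socket

# Ports named in the note above.
REQUIRED_PORTS = (10000, 9093, 8020)


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_emr_ports(host: str, ports=REQUIRED_PORTS, timeout: float = 3.0) -> dict:
    """Map each port to True (reachable) or False (blocked or closed)."""
    return {port: port_is_open(host, port, timeout) for port in ports}


# Example usage (hypothetical address; replace with your EMR node's public IP):
#   for port, ok in check_emr_ports("203.0.113.10").items():
#       print(port, "reachable" if ok else "blocked or closed")
```

A False result does not say whether the security group or the service itself is the cause, so treat it as a prompt to review the inbound rules rather than as a diagnosis.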

Step 1: Create a node and configure a task

For the general steps to create a node and use the Codeless UI, see Configure a task in the Codeless UI.

Step 2: Configure the data source and destination

Configure the data source (Hive)

Choose a read method

Two read methods are available. Choose based on whether you need row filtering or view support:

| Item | HDFS | JDBC |
| --- | --- | --- |
| How it works | Hive Reader accesses Hive Metastore to get the table's HDFS file path, format, and delimiters, then reads directly from Hadoop Distributed File System (HDFS) files | Hive Reader connects to HiveServer2 via the Hive JDBC client and reads data using SQL |
| Performance | Higher | Lower (generates a MapReduce program) |
| Conditional filtering (`where` clause) | Not supported | Supported |
| Reading views | Not supported | Supported |
| UI label | Read data from HDFS files. | Read data using Hive JDBC (supports conditional filtering). |

Use HDFS for maximum throughput when you don't need filtering. Use JDBC when you need to filter rows with a where clause or read from Hive views.
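
The selection rule above can be written as a small decision function. This is an illustrative sketch only: `choose_hive_read_method` is a hypothetical helper, not a DataWorks API, and it simply returns the UI label to select.

```python
def choose_hive_read_method(needs_where_filter: bool, reads_view: bool) -> str:
    """Return the UI label of the read method that fits the requirements."""
    if needs_where_filter or reads_view:
        # Only the JDBC method supports `where` filtering and Hive views.
        return "Read data using Hive JDBC (supports conditional filtering)."
    # Otherwise prefer the higher-throughput HDFS file read.
    return "Read data from HDFS files."
```

For example, a full-table copy of a regular table maps to the HDFS label, while any task that reads a view maps to the JDBC label.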

Key parameters

| Parameter | Description | Required |
| --- | --- | --- |
| Hive Read Method | Select "Read data from HDFS files." or "Read data using Hive JDBC (supports conditional filtering)." See the comparison table above. | Yes |
| Table | Select the Hive table to sync. The UI shows tables and schema from the development environment only. Make sure the table schema is identical in both development and production environments. If they differ, the task may fail in production with "table not found" or "column not found" errors. | Yes |
| Parquet schema | Required if the Hive table is stored in Parquet format. | Conditional |

Configure the data destination (MaxCompute)

Note

Parameters not listed in the following table can be left at their default values.

| Parameter | Description | Required |
| --- | --- | --- |
| Tunnel Resource Group | The MaxCompute tunnel quota used for data transfer. Defaults to Public transport resources, the free quota provided by MaxCompute. If your exclusive tunnel quota becomes unavailable due to overdue payments or expiration, the task automatically switches to Public transport resources at runtime. | Yes |
| Table | Select the target MaxCompute table. In a standard DataWorks workspace, a table with the same name and a consistent schema must exist in both development and production environments. You can also click Generate Destination Table Schema to let the system create a table automatically, then adjust the table creation statement as needed. | Yes |
| Partition Information | Required if the destination table is partitioned. Enter a fixed value (for example, ds=20220101) or a scheduling parameter (for example, ds=${bizdate}). The system substitutes scheduling parameters at runtime. | Conditional |
| Write Method | Select overwrite to replace existing data or append to add to it. | Yes |

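
To make the runtime substitution of the partition value concrete, here is a rough sketch of how an expression such as ds=${bizdate} might resolve. It assumes that ${bizdate} is the day before the scheduled run date in yyyymmdd format; confirm this against the scheduling parameter documentation for your workspace. `resolve_partition` is a hypothetical helper, not a DataWorks function.

```python
from datetime import date, timedelta


def resolve_partition(expr: str, scheduled_date: date) -> str:
    """Replace ${bizdate} in a partition expression for a given run date.

    Assumption: bizdate is the day before the scheduled date, as yyyymmdd.
    """
    bizdate = (scheduled_date - timedelta(days=1)).strftime("%Y%m%d")
    return expr.replace("${bizdate}", bizdate)


# A run scheduled for 2022-01-02 would write to partition ds=20220101.
print(resolve_partition("ds=${bizdate}", date(2022, 1, 2)))
```

A fixed value such as ds=20220101 contains no placeholder, so it passes through unchanged.
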
Note

Watch for these common issues with the destination table:

  • If the table doesn't exist in the development environment, it won't appear in the destination table drop-down list.

  • If the table doesn't exist in the production environment, the sync task fails after publishing.

  • If the table schema differs between development and production, column mapping during scheduled runs may deviate from the configured mapping, causing incorrect data writes.

Step 3: Configure and validate the task

Field mapping: Use Map Fields with the Same Name or Map Fields in the Same Line to auto-map columns. If the field order or names differ between source and destination, adjust the mappings manually.

Channel control: Set Policy for Dirty Data Records to reject dirty data to protect data quality. Leave other parameters at their defaults initially.
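
The difference between the two auto-mapping options for field mapping can be illustrated with plain lists of column names. This is a conceptual sketch under the assumption that same-name mapping matches names exactly and same-line mapping pairs columns by position; the actual UI behavior (for example, case handling) may differ, and both function names are hypothetical.

```python
def map_same_name(source_cols, dest_cols):
    """Pair each source column with the destination column of the same name."""
    dest = set(dest_cols)
    return [(col, col) for col in source_cols if col in dest]


def map_same_line(source_cols, dest_cols):
    """Pair columns by position; extra columns on either side stay unmapped."""
    return list(zip(source_cols, dest_cols))


source = ["id", "name", "age"]
destination = ["id", "age", "name"]

print(map_same_name(source, destination))
print(map_same_line(source, destination))
```

With these inputs, same-name mapping pairs every column with its namesake, while same-line mapping pairs name with age and age with name. That mismatch is the case where manual adjustment is needed.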

Step 4: Debug the task

  1. On the right side of the batch synchronization node configuration page, click Run Configuration. Set the Resource Group and Script Parameters for the debug run, then click Run in the top toolbar to verify the sync pipeline runs successfully.

  2. Run a spot-check query against the destination table:

    1. In the left-side navigation pane, click the [icon], then click the [icon] to the right of Personal Directory to create a file with a .sql extension.

    2. Run the following query and verify the results match expectations:

    SELECT * FROM <your_maxcompute_destination_table> WHERE pt=<your_partition> LIMIT 20;

    Note

    To query data this way, bind the destination MaxCompute project as a computing resource for DataWorks. On the right side of the .sql file editing page, click Run Configuration, specify the Type, Computing Resources, and Resource Group, then click Run in the top toolbar.
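
If you spot-check several tables or partitions, the query in this step can be generated rather than typed by hand. The sketch below only builds the SQL text; it does not run anything. It assumes a string-typed partition column (so the value is quoted) and an unqualified table name, and `build_spot_check_query` is a hypothetical helper.

```python
import re

# Conservative patterns: reject anything that is not a simple identifier/value.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_PARTITION_VALUE = re.compile(r"[A-Za-z0-9_-]+")


def build_spot_check_query(table: str, partition_col: str,
                           partition_value: str, limit: int = 20) -> str:
    """Build a SELECT that samples rows from one partition of a table."""
    for ident in (table, partition_col):
        if not _IDENTIFIER.fullmatch(ident):
            raise ValueError(f"unexpected identifier: {ident!r}")
    if not _PARTITION_VALUE.fullmatch(partition_value):
        raise ValueError(f"unexpected partition value: {partition_value!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (f"SELECT * FROM {table} "
            f"WHERE {partition_col}='{partition_value}' LIMIT {limit};")


# Hypothetical table and partition names, for illustration only.
print(build_spot_check_query("orders_copy", "pt", "20220101"))
```

Rejecting unusual characters instead of escaping them keeps the helper simple and avoids relying on dialect-specific quoting rules.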

Step 5: Schedule and publish the task

On the right side of the batch synchronization node, click Scheduling Settings. Configure the periodic run parameters as described in Scheduling configuration. Then click Publish in the top toolbar and follow the on-screen instructions to publish the task.