This guide walks you through configuring a batch synchronization node to copy a single E-MapReduce (EMR) Hive table to MaxCompute — covering data source setup, network connectivity, node configuration, and scheduling.
Limitations
Syncing source data to MaxCompute external tables is not supported.
Prerequisites
Before you begin, make sure you have:
- A Hive data source and a MaxCompute data source. See Data source configuration.
- Network connectivity between the resource group and the data source. See Network connectivity solutions.
Note: If you use a public endpoint to connect an exclusive resource group to EMR, configure the security group rules for the EMR cluster to allow inbound access from the elastic IP address (EIP) of the exclusive resource group. The inbound rules must open ports 10000, 9093, and 8020.
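As a rough way to confirm that the three ports in the note accept TCP connections, a small probe like the following can help. It is a minimal sketch using only the Python standard library. The function name `check_ports` is hypothetical, and the result only reflects reachability from the machine that runs it, which may differ from what the exclusive resource group sees through its EIP.

```python
import socket

# Ports named in the note above. The host is whatever endpoint your EMR
# cluster exposes; it is an input here, not something this sketch knows.
REQUIRED_PORTS = (10000, 9093, 8020)


def check_ports(host, ports=REQUIRED_PORTS, timeout=3.0):
    """Return {port: bool}, True when a TCP connection to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

For example, `check_ports("emr-master.example.internal")` (a placeholder hostname) returns a dictionary keyed by port. A `False` entry suggests reviewing the security group rule for that port.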
Step 1: Create a node and configure a task
For the general steps to create a node and use the Codeless UI, see Configure a task in the Codeless UI.
Step 2: Configure the data source and destination
Configure the data source (Hive)
Choose a read method
Two read methods are available. Choose based on whether you need row filtering or view support:
| Aspect | HDFS | JDBC |
|---|---|---|
| How it works | Hive Reader accesses Hive Metastore to get the table's HDFS file path, format, and delimiters, then reads directly from Hadoop Distributed File System (HDFS) files | Hive Reader connects to HiveServer2 via the Hive JDBC client and reads data using SQL |
| Performance | Higher | Lower — generates a MapReduce program |
| Conditional filtering (`where` clause) | Not supported | Supported |
| Reading views | Not supported | Supported |
| UI label | Read data from HDFS files. | Read data using Hive JDBC (supports conditional filtering). |
Use HDFS for maximum throughput when you don't need filtering. Use JDBC when you need to filter rows with a `where` clause or read from Hive views.
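The selection rule above can be written down as a tiny decision helper. This is only an illustration of the comparison table, not part of DataWorks. The function name and the dictionary are hypothetical, and the label strings are copied from the "UI label" row.

```python
# UI labels as listed in the comparison table.
UI_LABELS = {
    "HDFS": "Read data from HDFS files.",
    "JDBC": "Read data using Hive JDBC (supports conditional filtering).",
}


def choose_read_method(needs_where_filter: bool, reads_view: bool) -> str:
    """Return "JDBC" when row filtering or view access is needed, else "HDFS"."""
    if needs_where_filter or reads_view:
        return "JDBC"
    # Neither feature is needed, so prefer the faster direct file read.
    return "HDFS"
```

Calling `UI_LABELS[choose_read_method(needs_where_filter=True, reads_view=False)]` gives the label to pick in the Hive Read Method parameter.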
Key parameters
| Parameter | Description | Required |
|---|---|---|
| Hive Read Method | Select "Read data from HDFS files." or "Read data using Hive JDBC (supports conditional filtering)." See the comparison table above. | Yes |
| Table | Select the Hive table to sync. The UI shows tables and schema from the development environment only. Make sure the table schema is identical in both development and production environments — if they differ, the task may fail in production with "table not found" or "column not found" errors. | Yes |
| Parquet schema | Required if the Hive table is stored in Parquet format. | Conditional |
Configure the data destination (MaxCompute)
Parameters not listed in the following table can be left at their default values.
| Parameter | Description | Required |
|---|---|---|
| Tunnel Resource Group | The MaxCompute tunnel quota used for data transfer. Defaults to Public transport resources, the free quota provided by MaxCompute. If your exclusive tunnel quota becomes unavailable due to overdue payments or expiration, the task automatically switches to Public transport resources at runtime. | Yes |
| Table | Select the target MaxCompute table. In a standard DataWorks workspace, a table with the same name and a consistent schema must exist in both development and production environments. You can also click Generate Destination Table Schema to let the system create a table automatically — adjust the table creation statement as needed. | Yes |
| Partition Information | Required if the destination table is partitioned. Enter a fixed value (for example, ds=20220101) or a scheduling parameter (for example, ds=${bizdate}). The system substitutes scheduling parameters at runtime. | Conditional |
| Write Method | Select overwrite to replace existing data or append to add to it. | Yes |
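
To make the runtime substitution in Partition Information concrete, the sketch below mimics how `ds=${bizdate}` might resolve. It assumes that `${bizdate}` expands to the day before the scheduled run date in `yyyymmdd` format. Verify that against the scheduling parameter documentation for your workspace. The function `resolve_partition` is hypothetical and handles only this one parameter.

```python
from datetime import date, timedelta


def resolve_partition(expr: str, scheduled_day: date) -> str:
    """Replace ${bizdate} in a partition expression such as 'ds=${bizdate}'."""
    # Assumption: bizdate is the day before the scheduled run date, yyyymmdd.
    bizdate = (scheduled_day - timedelta(days=1)).strftime("%Y%m%d")
    # Fixed values contain no placeholder and pass through unchanged.
    return expr.replace("${bizdate}", bizdate)
```

Under that assumption, an instance scheduled for 2 January 2022 would write to the same partition as the fixed value `ds=20220101`.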
Watch for these common issues with the destination table:
- If the table doesn't exist in the development environment, it won't appear in the destination table drop-down list.
- If the table doesn't exist in the production environment, the sync task fails after publishing.
- If the table schema differs between development and production, column mapping during scheduled runs may deviate from the configured mapping, causing incorrect data writes.
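One way to catch the third issue before publishing is to compare the column lists of the development and production tables. The helper below is a minimal sketch. It assumes you have already collected each schema as an ordered list of `(column_name, type)` pairs by whatever means you prefer, and `diff_schemas` is a hypothetical name, not a DataWorks API.

```python
def diff_schemas(dev, prod):
    """Compare two ordered lists of (column_name, type) and describe mismatches."""
    problems = []
    if len(dev) != len(prod):
        problems.append(f"column count differs: dev={len(dev)} prod={len(prod)}")
    # Compare position by position, because a reordered column can shift the mapping.
    for i, (d, p) in enumerate(zip(dev, prod)):
        if d != p:
            problems.append(f"position {i}: dev={d[0]} {d[1]}, prod={p[0]} {p[1]}")
    return problems
```

An empty result means the two lists agree in column order, names, and types. Any entry points at a position worth checking before the task runs on a schedule.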
Step 3: Configure and validate the task
Field mapping: Use Map Fields with the Same Name or Map Fields in the Same Line to auto-map columns. If the field order or names differ between source and destination, adjust the mappings manually.
Channel control: Set Policy for Dirty Data Records to reject dirty data to protect data quality. Leave other parameters at their defaults initially.
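The difference between the two auto-mapping options is easiest to see on a small example. The functions below sketch the semantics that the option names suggest: pairing by identical column name versus pairing by position. They are illustrative only. The actual UI may treat letter case or unmatched columns differently.

```python
def map_same_name(source_cols, dest_cols):
    """Pair each source column with the destination column of the same name."""
    dest = set(dest_cols)
    return [(c, c) for c in source_cols if c in dest]


def map_same_line(source_cols, dest_cols):
    """Pair columns by position: first with first, second with second, and so on."""
    return list(zip(source_cols, dest_cols))
```

With source columns `["id", "name", "ts"]` and destination columns `["name", "id", "dt"]`, name-based mapping pairs only `id` and `name`, while position-based mapping pairs `id` with `name`, which is the kind of case that calls for manual adjustment.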
Step 4: Debug the task
- On the right side of the batch synchronization node configuration page, click Run Configuration. Set the Resource Group and Script Parameters for the debug run, then click Run in the top toolbar to verify the sync pipeline runs successfully.
- Run a spot-check query against the destination table:
  - In the left-side navigation pane, click [icon], then click [icon] to the right of Personal Directory to create a file with a .sql extension.
  - Run the following query and verify the results match expectations:

        SELECT * FROM <your_maxcompute_destination_table> WHERE pt=<your_partition> LIMIT 20;

    Note: To query data this way, bind the destination MaxCompute project as a computing resource for DataWorks. On the right side of the .sql file editing page, click Run Configuration, specify the Type, Computing Resources, and Resource Group, then click Run in the top toolbar.
Step 5: Schedule and publish the task
On the right side of the batch synchronization node, click Scheduling Settings. Configure the periodic run parameters as described in Scheduling configuration. Then click Publish in the top toolbar and follow the on-screen instructions to publish the task.