DataWorks:Synchronize data from a single EMR Hive table to MaxCompute in offline mode

Last Updated: Nov 04, 2025

This topic describes the best practices for data source configuration, network connectivity, and sync task configuration, using the scenario of synchronizing data from a single EMR Hive table to MaxCompute in offline mode as an example.

Background information

Hive is a Hadoop-based data warehouse tool that is used to extract, transform, and load data. You can use Hive to store, query, and analyze large-scale data that is stored in Hadoop. Hive maps structured data files to database tables, supports SQL queries, and converts SQL statements into MapReduce tasks. Data Integration supports data synchronization between Hive and other types of data sources.

Prerequisites

  • You have purchased a Serverless resource group or an exclusive resource group for Data Integration.

  • You have created a Hive data source and a MaxCompute data source. For more information, see Data Source Configuration.

  • You have established network connectivity between the resource group and the data source. For more information, see Network Connectivity Solution Overview.

    Note

    If the exclusive resource group for Data Integration connects to the EMR cluster over the Internet, you must add a security group rule for the EMR cluster. The rule must allow inbound access from the elastic IP addresses (EIPs) of the resource group to the access ports of the EMR cluster, such as ports 10000, 9093, and 8020.

Limits

Data synchronization from a source to a MaxCompute foreign table is not supported.

Procedure

Note

This topic uses operations in the Data Development (DataStudio) (New Version) interface as an example to demonstrate how to configure an offline sync task.

1. Create an offline sync node

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. In the navigation pane on the left, open the workspace directory tree, click the icon to the right of Project Folder, and then select New Node > Data Integration > Offline Sync. Enter a custom Name for the offline sync node and click Confirm.

2. Configure network and resources

  1. In the Network and Resource Configuration step, select the Data Source, Resource Group, and Data Destination for the sync task. You can also set the number of CUs for the task in the Task Resource Usage section.

    • For Data Source, select the Hive data source that you added.

    • For Data Destination, select the MaxCompute data source that you added.

    • For Resource Group, select a resource group that can connect to both the Hive and MaxCompute data sources. You can also specify the number of CUs that the task occupies.

  2. On the Data Source and Data Destination cards, click Test Connectivity.

  3. After you confirm that the data source and the data destination are successfully connected, click Next.

3. Configure the source and destination

Configure source (Hive) parameters

The data source is a Hive table. The key configuration parameters are described below.

  • Data Source: The Hive data source that you selected in the previous step is displayed by default.

  • Method to Read Data of Hive Data Source: Select one of the following methods:

    • Read Data based on HDFS Files: Hive Reader connects to the Hive metastore to obtain the storage path, format, and column delimiter of the HDFS files that correspond to your Hive table, and then reads data directly from those HDFS files.

    • Read Data by Using Hive JDBC: Hive Reader connects to HiveServer2 through Hive Java Database Connectivity (JDBC) and reads data. This method lets you specify a WHERE clause to filter data and execute SQL statements to read data.

    Note

    Reading data based on HDFS files is more efficient. Reading data by using Hive JDBC generates MapReduce programs, which lowers synchronization efficiency. However, the HDFS-based method does not support filter conditions or reading from views. Select a method based on your requirements.

  • Table: Select the Hive table whose data you want to synchronize. The table in the production environment and the table in the development environment must have the same schema.

    Note

    The tables and table schemas of the Hive data source in the development environment are displayed. If the schema of the table that you select differs from the schema of the table in the production environment, an error indicating that the table or a column does not exist occurs after you submit the task to the production environment.

  • parquet schema: If the Hive table is stored in the Parquet format, you must configure the parquet schema parameter. To check how a table is stored, you can use a query like the sketch after this list.
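
If you are not sure how a Hive table is stored, you can inspect its metadata before you choose a read method. The following is a minimal sketch that assumes a hypothetical Hive table named ods_orders with a hypothetical partition column ds; run it in a Hive client such as Beeline.

    -- Hypothetical table name; replace it with your own Hive table.
    -- The output includes the HDFS location, the input and output format (for example, Parquet),
    -- and the field delimiter. These values determine whether HDFS-based reading applies and
    -- whether the parquet schema parameter must be configured.
    DESCRIBE FORMATTED ods_orders;

    -- If you plan to read data by using Hive JDBC with a WHERE filter, you can preview
    -- the rows that the filter would select. The partition column ds is hypothetical.
    SELECT * FROM ods_orders WHERE ds = '20220101' LIMIT 10;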

Configure destination (MaxCompute) parameters

The data destination is a MaxCompute table. The key configuration parameters are described below.

Note

You can keep the default values for the parameters that are not described below.

  • Data Source: The MaxCompute data source that you selected in the previous step is displayed by default. If you use a standard mode DataWorks workspace, the names of the development and production projects are displayed separately.

  • Table: Select the destination MaxCompute table. If you use a standard mode DataWorks workspace, make sure that a MaxCompute table with the same name and schema exists in both the development and production environments.

    You can also generate the destination table schema with one click. The system automatically creates a table to receive the data, and you can manually adjust the generated table creation statement. A sample DDL is sketched after this list.

    Note

    • If the destination MaxCompute table does not exist in the development environment, the table does not appear in the destination table drop-down list of the offline sync node.

    • If the destination MaxCompute table does not exist in the production environment, the data sync task fails to run after it is submitted and published because the table cannot be found.

    • If the table schemas in the development and production environments are inconsistent, the column mapping that is used during scheduled runs of the sync task may differ from the column mapping that you configured for the offline sync node. This can cause data to be written incorrectly.

  • Partition Information: If the destination table is a partitioned table, enter a value for the partition key column.

    • The value can be a static value, such as ds=20220101.

    • The value can be a scheduling system parameter, such as ds=${bizdate}, which is automatically replaced when the task runs.
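
If you create the destination table manually instead of generating it, make sure that its partition definition matches the partition expression that you configure here. The following is a minimal MaxCompute SQL sketch; the table name ods_hive_orders and its columns are hypothetical and must be adjusted to match the schema of the Hive source table.

    -- Hypothetical destination table partitioned by ds, matching a partition
    -- expression such as ds=${bizdate}. Adjust the columns to match the Hive source.
    CREATE TABLE IF NOT EXISTS ods_hive_orders
    (
        order_id BIGINT,
        buyer_id STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (ds STRING);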

4. Configure field mapping

After you configure the source and destination, you must configure the mappings between the source and destination fields. You can select Same Name Mapping, Same Row Mapping, Cancel Mapping, or Manually Edit Mappings.

5. Configure channel control

Offline sync tasks support settings such as Maximum Concurrency and Dirty Data Policy. In this example, Dirty Data Policy is set to Do Not Tolerate Dirty Data, and the other settings use their default values. For more information, see Codeless UI Configuration.

6. Debug and run the task

  1. On the right side of the configuration tab for the offline synchronization node, click Debug Configuration. Set the Resource Group and Script Parameters for the debug run, and then click Run in the top toolbar to test whether the synchronization task runs as expected.

  2. In the navigation pane on the left, open the workspace directory tree and click the icon to the right of Personal Folder to create a file with the .sql extension. Then, execute the following SQL query to verify that the data in the destination table is as expected.

    SELECT * FROM <Destination_MaxCompute_table_name> WHERE pt=<Specified_partition> LIMIT 20;
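
    For example, with the hypothetical destination table ods_hive_orders from the earlier sketch and a partition written as ds=20220101, the check might look like the following.

    -- Hypothetical table and partition values, for illustration only.
    SELECT * FROM ods_hive_orders WHERE ds = '20220101' LIMIT 20;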

7. Configure scheduling and publish the task

On the right side of the offline sync task tab, click Scheduling Configuration. After you set the required scheduling configuration parameters for periodic runs, click Publish in the top toolbar and follow the on-screen instructions to complete the publishing process.