This topic describes the best practices for data source configuration, network connectivity, and sync task configuration, using the scenario of synchronizing data from a single EMR Hive table to MaxCompute in offline mode as an example.
Background information
Hive is a Hadoop-based data warehouse tool that is used to extract, transform, and load data. You can use Hive to store, query, and analyze large-scale data that is stored in Hadoop. Hive maps structured data files to database tables, supports SQL queries, and converts SQL statements into MapReduce tasks. Data Integration supports data synchronization between Hive and other types of data sources.
Prerequisites
You have purchased a Serverless resource group or an exclusive resource group for Data Integration.
You have created a Hive data source and a MaxCompute data source. For more information, see Data Source Configuration.
You have established network connectivity between the resource group and the data source. For more information, see Network Connectivity Solution Overview.
Note: If you connect the exclusive resource group for Data Integration to EMR over the Internet, you must add a security group rule for the EMR cluster. The rule must allow inbound access from the EIPs of the resource group to the EMR cluster access ports, such as ports 10000, 9093, and 8020.
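Concretely, such a rule takes roughly the following shape (the field names below are illustrative; configure the actual rule in the security group settings of the EMR cluster's ECS instances):

```
Direction:   Inbound
Protocol:    TCP
Port range:  10000/10000, 9093/9093, 8020/8020
Source:      <EIP of the resource group>/32
Action:      Allow
```

Restricting the Source field to the specific EIPs of the resource group, rather than 0.0.0.0/0, limits the exposure of the cluster access ports.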
Limits
Data synchronization from a source to a MaxCompute foreign table is not supported.
Procedure
This topic uses operations in the Data Development (DataStudio) (New Version) interface as an example to demonstrate how to configure an offline sync task.
1. Create an offline sync node
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the desired workspace and choose Shortcuts > Data Development in the Actions column.
In the left-side navigation pane, click the Data Development icon. Then, click the Create icon to the right of Project Folder and choose Create Node > Data Integration > Offline Synchronization. Enter a Name for the offline sync node and click Confirm.
2. Configure network and resources
In the Network and Resource Configuration step, select the Data Source, Resource Group, and Data Destination for the sync task. You can also set the number of CUs for the task in the Task Resource Usage section.
For Data Source, select the Hive data source that you added.
For Data Destination, select the MaxCompute data source that you added.
For Resource Group, select a resource group that is connected to both the Hive and MaxCompute data sources. You can also specify the number of CUs that this task occupies.
On the Data Source and Data Destination cards, click Test Connectivity.

After you confirm that the data source and the data destination are successfully connected, click Next.
3. Configure the source and destination
Configure source (Hive) parameters
The data source is a Hive table. The key configuration parameters are described below.

| Parameter | Key configuration points |
| --- | --- |
| Data Source | The Hive data source that you selected in the previous step is displayed by default. |
| Method to Read Data of Hive Data Source | Select whether to read data based on HDFS files or by using Hive JDBC. Note: Reading data based on HDFS files delivers higher synchronization efficiency. Reading data by using Hive JDBC generates MapReduce programs, which lowers synchronization efficiency. However, if you read data based on HDFS files, you cannot specify filter conditions or read views. Select a synchronization method as needed. |
| Table | Select the Hive table whose data you want to synchronize. The table in the production environment and the table in the development environment must have the same schema. Note: The tables and table schemas of the Hive data source in the development environment are displayed. If the table that you select has a different schema from the table in the production environment, an error indicating that the table or a column does not exist occurs after you submit the task to the production environment. |
| parquet schema | If the Hive table is stored in the Parquet format, you must configure the parquet schema parameter. |
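For reference, the parquet schema parameter is written in the Parquet message syntax. A minimal sketch, assuming a hypothetical source table with an ID, a name, and an amount column (adjust the field names and types to match your actual Hive table):

```
message hive_schema {
  optional int64  id;
  optional binary name;
  optional double amount;
}
```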
Configure destination (MaxCompute) parameters
The data destination is a MaxCompute table. The key configuration parameters are described below.

You can keep the default values for the parameters that are not described in the following table.
| Configuration item | Key configuration |
| --- | --- |
| Data Source | The MaxCompute data source that you selected in the previous step is displayed by default. If you use a standard-mode DataWorks workspace, the names of the development and production projects are displayed separately. |
| Table | Select the destination MaxCompute table. If you use a standard-mode DataWorks workspace, make sure that a MaxCompute table with the same name and schema exists in both the development and production environments. You can also use the one-click table generation feature to generate the destination table schema. The system automatically creates a table to receive data, and you can manually adjust the table creation statement before it runs. |
| Partition Information | If the destination table is a partitioned table, enter a value for the partition key column, such as a constant or a scheduling parameter like ${bizdate}. |
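If you create the destination table manually instead, the MaxCompute DDL is conceptually similar to the following sketch (the table name, columns, and lifecycle are hypothetical; match them to your Hive source schema):

```sql
-- Destination table partitioned by a string partition key "pt".
CREATE TABLE IF NOT EXISTS ods_user_info (
    id     BIGINT COMMENT 'user id',
    name   STRING COMMENT 'user name',
    amount DOUBLE COMMENT 'order amount'
)
PARTITIONED BY (pt STRING)
LIFECYCLE 30;
```

The partition key column declared here is what you reference in the Partition Information configuration of the sync task.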
4. Configure field mapping
After you configure the source and destination, you must configure the mappings between the source and destination fields. You can select Same Name Mapping, Same Row Mapping, Cancel Mapping, or Manually Edit Mappings.
5. Configure channel control
Offline sync tasks support settings such as Maximum Concurrency and Dirty Data Policy. In this example, Dirty Data Policy is set to Do Not Tolerate Dirty Data, and the other settings use their default values. For more information, see Codeless UI Configuration.
6. Debug and run the configuration
On the right side of the configuration tab for the offline synchronization node, click Debug Configuration. Set the Resource Group and Script Parameters for the debug run, and then click Run in the top toolbar to test whether the synchronization task runs as expected.
In the left-side navigation pane, you can click the Data Development icon and then click the Create icon to the right of Personal Folder to create a file with the .sql extension. Then, execute the following SQL query to verify that the data in the destination table is as expected.
Note: This query method requires you to attach the destination MaxCompute project to DataWorks as a computing resource.
On the right side of the .sql file configuration tab, click Debug Configuration. Specify the data source Type, Computing Resource, and Resource Group, and then click Run in the top toolbar.
SELECT * FROM <Destination_MaxCompute_table_name> WHERE pt=<Specified_partition> LIMIT 20;
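For example, with a hypothetical destination table named ods_user_info that is partitioned by pt, the verification query could look like the following (the table name and partition value are placeholders):

```sql
-- Spot-check the first 20 rows of the partition written by the sync task.
SELECT * FROM ods_user_info WHERE pt = '20240101' LIMIT 20;
```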
7. Configure scheduling and publish the task
On the right side of the offline sync task tab, click Scheduling Configuration. After you set the required scheduling configuration parameters for periodic runs, click Publish in the top toolbar and follow the on-screen instructions to complete the publishing process.