This topic provides an example of how to synchronize data from a single MaxCompute table to ApsaraDB for ClickHouse in offline mode. It describes the best practices for data source configuration, network connectivity, and sync task configuration.
Background
ApsaraDB for ClickHouse is a column-oriented database for online analytical processing (OLAP). Data Integration lets you synchronize data to and from ApsaraDB for ClickHouse. This topic uses the offline synchronization of a single MaxCompute table to ApsaraDB for ClickHouse as an example to walk through the complete process.
Limitations
For offline synchronization of a single table, only ApsaraDB for ClickHouse is supported as the ClickHouse destination.
Prerequisites
You have purchased a Serverless resource group or an exclusive resource group for Data Integration.
You have created a MaxCompute data source and a ClickHouse data source. For more information, see Data Source Configuration.
You have established network connectivity between the resource group and the data source. For more information, see Network Connectivity Solution.
Procedure
This topic provides an example of how to configure an offline sync task on the DataStudio (new version) UI.
1. Create an offline synchronization node
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the desired workspace and go to Data Studio from the Actions column.
In the navigation pane on the left, click the Data Development icon. To the right of Project Directory, click the + icon and choose the option to create an offline synchronization node. Enter a custom Name for the node and click Confirm.
2. Configure the network and resources
In the Configure Network and Resources step, select the Source, Resource Group, and Destination for the sync task.
For Source, select the MaxCompute data source that you added. For Destination, select the ClickHouse data source that you added. For Resource Group, select the resource group that is connected to both the MaxCompute and ClickHouse data sources. You can allocate a specific number of CUs for Task Resource Usage.
On the Source and Destination cards, click Test Connectivity.

After you confirm that the connections to the source and destination are successful, click Next.
3. Configure the source and destination
Configure source (MaxCompute) parameters
The following are the key parameters for configuring the source MaxCompute table.

| Configuration item | Configuration details |
| --- | --- |
| Data Source | The MaxCompute data source that you selected in the previous step is displayed by default. If your DataWorks workspace is in standard mode, the names of the development and production projects are displayed separately. |
| Tunnel Resource Group | This tutorial uses the Public Transmission Resource by default. If you have an exclusive Tunnel Quota, you can select it from the drop-down list. |
| Table | Select the MaxCompute table to synchronize. If your DataWorks workspace is in standard mode, make sure that a MaxCompute table with the same name and schema exists in both the development and production environments of MaxCompute. |
| Filter Method | You can filter data by using Partition Filter or Data Filtering. |
| Partition Info | This parameter is required when you set Filter Method to Partition Filter. Enter the value of the partition key column. |
| When Partition Does Not Exist | The policy for handling the synchronization task when the specified partition does not exist. Select an option based on your business requirements. |
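For example, when Filter Method is set to Partition Filter, the partition value is typically a scheduling parameter rather than a hard-coded date, so that each scheduled run reads its own partition. A minimal sketch, assuming a partition key column named pt (a placeholder for your own column name) and the bizdate scheduling parameter:

```
pt=${bizdate}
```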
Configure destination (ClickHouse) parameters
The following are the key parameters for configuring the destination ClickHouse table.

| Configuration item | Configuration details |
| --- | --- |
| Data Source | The ClickHouse data source that you selected in the previous step is displayed by default. |
| Table | Select the ClickHouse table to synchronize. We recommend that the table have the same schema in the development and production environments of the ClickHouse data source. Note: This list shows tables from the development environment of the ClickHouse data source. If the table definitions in the development and production environments differ, the task may report that the table or a column does not exist after it is submitted to the production environment, even if it was configured correctly in the development environment. |
| Primary Key Or Unique Key Conflict Handling | The policy for handling records whose primary key or unique key conflicts with existing data. Select an option based on your business requirements. |
| Pre-SQL Statement | You can execute SQL statements as needed before the data synchronization task runs. For example, before synchronizing data by day, you can clear the data in the corresponding daily partition to ensure that the partition is empty before the data is written. |
| Post-import Statements | You can execute SQL statements as needed after the data synchronization task runs. |
| Batch Size (Bytes) / Batch Size (Records) | Data is written to ClickHouse in batches. These parameters specify the upper limits for the batch size in bytes and in records. When the read data reaches either limit, the batch is considered complete and is written to ClickHouse. We recommend that you set Batch Size (Bytes) to 16777216 (16 MB) and Batch Size (Records) to a value large enough, based on the size of a single record, that the write is triggered by the byte limit. For example, if a single record is 1 KB, set Batch Size (Bytes) to 16777216 (16 MB) and Batch Size (Records) to 20000 (which is greater than 16 MB / 1 KB = 16384). The write is then triggered by the byte limit, and a batch is written every 16 MB. |
| When Batch Write To ClickHouse Is Abnormal | The error handling policy to apply when a batch write to ClickHouse fails. Select an option based on your business requirements. |
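The batch-trigger behavior described above can be sketched as follows. This is a minimal illustration of the "whichever limit is hit first" rule, not the actual writer implementation; the function name and defaults are ours:

```python
def plan_batches(record_sizes, max_bytes=16 * 1024 * 1024, max_records=20000):
    """Group records into batches; a batch closes when either limit is reached."""
    batches = []  # list of (record_count, byte_count) per completed batch
    cur_bytes = cur_records = 0
    for size in record_sizes:
        cur_bytes += size
        cur_records += 1
        if cur_bytes >= max_bytes or cur_records >= max_records:
            batches.append((cur_records, cur_bytes))
            cur_bytes = cur_records = 0
    if cur_records:  # flush the final partial batch
        batches.append((cur_records, cur_bytes))
    return batches

# With 1 KB records and the recommended settings, the 16 MB byte limit
# closes each full batch at 16384 records, before the 20000-record limit.
batches = plan_batches([1024] * 20000)
```

This illustrates why the record limit should exceed max_bytes divided by the typical record size: otherwise the record limit fires first and batches stay smaller than intended.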
4. Configure field mappings
After you select the source and destination, you must specify the column mappings between them. You can select Map By Name, Map By Position, Unmap, or Auto Layout.
5. Configure channel control
Offline sync tasks support settings such as Task Concurrency and Dirty Data Policy. In this tutorial, the Dirty Data Policy is set to Do Not Tolerate Dirty Data, and the other configurations use the default settings. For more information, see Configure an offline sync task in codeless UI.
6. Configure and run the debug task
On the right side of the offline sync node editing page, click Run Configuration. Set the Resource Group and Script Parameters for the debug run. Then, click Run in the top toolbar to test whether the sync link runs successfully.
In the navigation pane on the left, click the Data Development icon. To the right of Personal Directory, click the + icon and create a file with a .sql extension. Run the following SQL statement to query the data in the destination table and check whether it meets your expectations.
Note: This query method requires that the target ClickHouse instance is attached to DataWorks as a computing resource.
On the right side of the .sql file editing page, click Run Configuration. Specify the data source Type, Computing Resource, and Resource Group. Then, click Run in the top toolbar.
SELECT * FROM <clickhouse_target_table_name> LIMIT 20;
7. Configure scheduling and publish the node
Click Scheduling Configuration on the right side of the offline sync task page. After you configure the scheduling configuration parameters that are required for periodic runs, click Publish in the top toolbar and follow the on-screen instructions to complete the publishing process.
Appendix: Adjust memory parameters
If the sync speed does not increase significantly after you increase the concurrency, you can manually adjust the memory parameters of the sync task. Perform the following steps:
In the toolbar at the top of the offline sync task page, click Code Editor to switch the task from the codeless UI to the code editor.

In the setting section of the JSON script, add the jvmOption parameter in the format -Xms${heapMem} -Xmx${heapMem} -Xmn${newMem}.
In the codeless UI, the system calculates ${heapMem} as 768 MB + (concurrency - 1) × 256 MB. In the code editor, we recommend that you set ${heapMem} to a larger value and ${newMem} to one third of ${heapMem}. For example, if the concurrency is 8, the default ${heapMem} calculated in the codeless UI is 2560 MB. In the code editor, you can set a larger value, such as jvmOption set to -Xms3072m -Xmx3072m -Xmn1024m.
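The sizing rule above can be sketched in a few lines. This is a minimal illustration of the arithmetic; the function names are ours and are not part of DataWorks:

```python
def default_heap_mb(concurrency):
    # Codeless-UI default: 768 MB + (concurrency - 1) * 256 MB
    return 768 + (concurrency - 1) * 256

def jvm_option(heap_mb, new_mb=None):
    # -Xmn defaults to one third of the heap, per the guidance above
    if new_mb is None:
        new_mb = heap_mb // 3
    return f"-Xms{heap_mb}m -Xmx{heap_mb}m -Xmn{new_mb}m"

# concurrency of 8 -> codeless-UI default heap of 2560 MB;
# in the code editor you might raise it to 3072 MB:
option = jvm_option(3072, 1024)  # "-Xms3072m -Xmx3072m -Xmn1024m"
```

Setting -Xms and -Xmx to the same value keeps the heap from resizing while the task runs.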