This topic uses an example to demonstrate how to batch-synchronize data from a single MaxCompute table to ApsaraDB for ClickHouse and outlines best practices for Data Source Configuration, network connectivity, and Batch Synchronization Task configuration.
ApsaraDB for ClickHouse overview
ApsaraDB for ClickHouse is a column-oriented database designed for Online Analytical Processing (OLAP). Data Integration supports synchronizing data from ApsaraDB for ClickHouse to other destinations and from other sources to ApsaraDB for ClickHouse. This topic provides an end-to-end example of batch-synchronizing data from a single MaxCompute table to ApsaraDB for ClickHouse.
Limitations
Only single-table batch synchronization is supported for ApsaraDB for ClickHouse.
Prerequisites
A MaxCompute data source and an ApsaraDB for ClickHouse data source. For more information, see Data Source Configuration.
Network connectivity between the resource group and the data sources. For more information, see Network Connectivity Solution.
Procedure
This topic demonstrates how to configure a Batch Synchronization Task in the DataStudio (new version) UI.
Create a node and configure the task
This topic does not detail the general steps for creating and configuring a node by using the codeless UI. For this information, see Configure a node in the codeless UI.
Configure the source and destination
In the data source and resource sections, set the Source to your MaxCompute data source and the Destination to your ApsaraDB for ClickHouse data source. Then, select a Resource Group and test the connectivity.
Configure Source (MaxCompute) parameters
The key parameters for the source MaxCompute table are described below.
| Parameter | Description |
| --- | --- |
| Tunnel Resource Group | By default, Public Transmission Resource is used. If you have an exclusive Tunnel Quota, you can select it from the drop-down list. |
| Table | Select the MaxCompute table that you want to synchronize. If you use a standard DataWorks workspace, make sure that a MaxCompute table with the same name and schema exists in both the development and production environments. |
| Filtering Method | Supports Partition Filtering and Data Filter. |
| Partition | Required when Filtering Method is set to Partition Filtering. Enter the value of the partition column. |
| If partitions do not exist | Specifies the policy to apply when a specified partition does not exist. |
Configure Destination (ApsaraDB for ClickHouse) parameters
The key parameters for the destination ApsaraDB for ClickHouse table are described below.
| Parameter | Description |
| --- | --- |
| Table | Select the ApsaraDB for ClickHouse table to which you want to synchronize data. We recommend that the table schemas in the development and production environments of the ApsaraDB for ClickHouse data source be identical. Note: The table list displayed here comes from the data source in the development environment. If the table definitions in your development and production environments differ, the task may appear to be configured correctly but fail after it is published to the production environment, with an error indicating that the table or a column does not exist. |
| Primary or Unique Key Conflict Handling | Specifies how records that conflict with an existing primary or unique key are handled. |
| Statement Run Before Writing | You can run SQL statements before and after the data synchronization task as needed. For example, before a daily synchronization, you can run a statement that clears the corresponding daily partition so that it is empty before new data is written. |
| Statement Run After Writing | See the description of Statement Run Before Writing. |
| Batch Insert Size (Bytes) | Data is written to ApsaraDB for ClickHouse in batches. These two parameters define the upper limits for the size in bytes and the number of records per batch. A batch write is triggered when the cached data reaches either limit. We recommend that you set Batch Insert Size (Bytes) to 16777216 (16 MB) and set Data Records Per Write to a value larger than 16 MB divided by your average record size, so that batch writes are triggered primarily by the byte limit. For example, if a single record is about 1 KB, you can set Batch Insert Size (Bytes) to 16777216 (16 MB) and Data Records Per Write to 20000, which is greater than 16 MB / 1 KB = 16384. In this case, a write is triggered each time the batch reaches 16 MB. |
| Data Records Per Write | See the description of Batch Insert Size (Bytes). |
| If a batch write fails | Specifies the policy for handling exceptions that occur during batch writes to ApsaraDB for ClickHouse. |
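To illustrate the pre-write statement described above, the following sketch assumes a hypothetical destination table named default.ods_orders that is partitioned by day. The table name, columns, and partition value are examples only; adapt them to your own schema.

```sql
-- Hypothetical ClickHouse destination table, partitioned by day so that
-- each daily synchronization can clear and rewrite a single partition.
CREATE TABLE IF NOT EXISTS default.ods_orders
(
    order_id   UInt64,
    order_date Date,
    amount     Decimal(18, 2)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(order_date)
ORDER BY order_id;

-- Example Statement Run Before Writing: drop the partition for the
-- business date so that the daily write is idempotent. 20240601 is a
-- placeholder partition value.
ALTER TABLE default.ods_orders DROP PARTITION 20240601;
```

With this setup, rerunning the task for the same business date first removes that day's partition and then rewrites it, so repeated runs do not produce duplicate rows.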
Configure field mapping
After you select a source and a destination, you need to specify the mappings between the source and destination columns. You can select Map Fields with the Same Name, Map Fields in the Same Line, Delete All Mappings, or Auto Layout.
Advanced settings
You can configure settings for offline synchronization tasks, such as Expected Maximum Concurrency and Policy for Dirty Data Records. In this tutorial, the Policy for Dirty Data Records is set to Disallow Dirty Data Records, and the other settings use their default values. For more information, see Codeless UI configuration.
Configure and run the task
Click Run Configuration on the right side of the Batch Synchronization node edit page, configure the Resource Group and Script Parameters for the debug run, and then click Run in the top toolbar to test whether the synchronization link runs successfully.
To verify the result, click the icon in the left navigation bar, and then click the New icon to the right of Personal Directory to create an SQL file. Run the following SQL statement to query the data in the destination table and verify that it is as expected.
Note: To query data this way, you must bind ApsaraDB for ClickHouse as a computing resource in DataWorks.
On the .sql file editing page, click Run Configuration on the right side, specify the data source Type, Computing Resources, and Resource Group, and then click Run in the top toolbar.
SELECT * FROM <your_clickhouse_destination_table_name> LIMIT 20;
Configure scheduling and publish the task
Click Scheduling Settings to the right of the offline synchronization task. After you configure the Scheduling Configuration parameters for scheduled runs, click Publish in the top toolbar. In the Publish panel, follow the on-screen prompts to complete the publication.
Appendix: Adjust memory parameters
If increasing the concurrency does not significantly improve the synchronization throughput, you can manually adjust the memory parameters for the synchronization task. Follow these steps:
Click Code Editor on the top toolbar of the Offline Synchronization Task page to switch the task from the codeless UI to the code editor.
In the setting section of the script's JSON segment, add the jvmOption parameter. The parameter is in the format -Xms${heapMem} -Xmx${heapMem} -Xmn${newMem}.
In the codeless UI, the system calculates the value of ${heapMem} by using the formula 768 MB + (concurrency level - 1) * 256 MB. We recommend that you set ${heapMem} to a larger value in the code editor and set ${newMem} to one-third of the value of ${heapMem}. For example, if the concurrency level is 8, the default value for ${heapMem} in the codeless UI is 2560 MB, and you can set a larger value in the code editor, such as setting the jvmOption parameter to -Xms3072m -Xmx3072m -Xmn1024m.
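As a sketch of where the parameter goes, the setting section of the script might look like the following after you add jvmOption. The speed block is illustrative and depends on your own task configuration; only the jvmOption line is the addition described above.

```json
{
  "setting": {
    "jvmOption": "-Xms3072m -Xmx3072m -Xmn1024m",
    "speed": {
      "concurrent": 8
    }
  }
}
```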