
DataWorks:Batch-synchronize a single MaxCompute table to ClickHouse

Last Updated:Mar 06, 2026

This topic uses an example to demonstrate how to batch-synchronize data from a single MaxCompute table to ApsaraDB for ClickHouse and outlines best practices for Data Source Configuration, network connectivity, and Batch Synchronization Task configuration.

ApsaraDB for ClickHouse overview

ApsaraDB for ClickHouse is a column-oriented database designed for Online Analytical Processing (OLAP). Data Integration supports synchronizing data from ApsaraDB for ClickHouse to other destinations and from other sources to ApsaraDB for ClickHouse. This topic provides an end-to-end example of batch-synchronizing data from a single MaxCompute table to ApsaraDB for ClickHouse.

Limitations

For ApsaraDB for ClickHouse, only single-table batch synchronization is supported.

Prerequisites

Procedure

Note

This topic demonstrates how to configure a Batch Synchronization Task in the DataStudio (new version) UI.

Create a node and configure the task

This topic does not detail the general steps for creating and configuring a node by using the codeless UI. For this information, see Configure a node in the codeless UI.

Configure the source and destination

In the data source and resource sections, set the Source to your MaxCompute data source and the Destination to your ApsaraDB for ClickHouse data source. Then, select a Resource Group and test the connectivity.

Configure Source (MaxCompute) parameters

The key parameters for the source MaxCompute table are described below.

Parameter

Description

Tunnel Resource Group

By default, Public Transmission Resource is used. If you have an exclusive Tunnel Quota, you can select it from the drop-down list.

Table

Select the MaxCompute table that you want to synchronize. If you are using a standard DataWorks workspace, ensure that a MaxCompute table with the same name and table schema exists in both the development and production environments.

  • If the source table does not exist in the development environment, you cannot find it in the drop-down list during configuration.

  • If the source table does not exist in the production environment, the synchronization task fails after it is published and scheduled to run because the table cannot be found.

  • If the table schemas in the development and production environments are inconsistent, the field mapping used during task execution may differ from the mapping that you configure. This can cause incorrect data to be written.

Filtering Method

Supports Partition Filtering and Data Filter.

  • Partition Filtering: Filters the source data to synchronize by using a partition expression. When you select this method, you must also configure the Partition and If Partitions Do Not Exist parameters.

  • Data Filter: Specifies the source data to synchronize by using an SQL WHERE clause. (Do not include the WHERE keyword.)

Partition

This parameter is required when Filtering Method is set to Partition Filtering. Enter the value of the partition column.

  • The value can be a fixed value, such as ds=20220101.

  • The value can be a scheduling system parameter, such as ds=${bizdate}. When the task runs, the system automatically replaces this parameter with its corresponding value.

  • You can specify multiple partitions.
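To make the scheduling-parameter behavior concrete, the following sketch shows how a partition expression such as ds=${bizdate} could be resolved at run time. This is an illustration only, not the actual DataWorks substitution code; the yyyymmdd format is an assumption.

```python
import re
from datetime import date

def resolve_partition(expr: str, biz_date: date) -> str:
    """Illustrative sketch: replace the ${bizdate} scheduling parameter
    with the business date in yyyymmdd format, roughly as the scheduling
    system does before the synchronization task runs."""
    return re.sub(r"\$\{bizdate\}", biz_date.strftime("%Y%m%d"), expr)

# A task instance whose business date is 2022-01-01:
print(resolve_partition("ds=${bizdate}", date(2022, 1, 1)))  # ds=20220101
```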

If Partitions Do Not Exist

Specifies the policy to apply when a specified partition does not exist. Valid values:

  • Error: The task reports an error and exits.

  • Ignore: The missing partitions are ignored and the task runs normally.

Configure Destination (ApsaraDB for ClickHouse) parameters

The key parameters for the destination ApsaraDB for ClickHouse table are described below.

Parameter

Description

Table

Select the ApsaraDB for ClickHouse table to which you want to synchronize data. We recommend that the table schemas for the ApsaraDB for ClickHouse data source be identical in the development and production environments.

Note

The table list from the ApsaraDB for ClickHouse data source in the development environment is displayed here. If the table definitions in your development and production environments differ, the task may appear to configure correctly but fail after being published to the production environment with an error indicating that the table or a column does not exist.

Primary or Unique Key Conflict Handling

If you select insert into, the integration task validates primary key uniqueness and constraint conflicts at runtime.

Statement Run Before Writing

Statement Run After Writing

You can run SQL statements before and after the data synchronization task as needed. For example, before a daily synchronization, you can run a pre-write statement that clears the corresponding daily partition, ensuring it is empty before new data is written.
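As one hedged illustration of a pre-write statement, the sketch below builds a ClickHouse ALTER TABLE ... DROP PARTITION statement for the business date. The table name and the assumption that the table is partitioned by a ds string column in yyyymmdd format are hypothetical; adjust the statement to your own table's partitioning scheme.

```python
from datetime import date

def pre_write_statement(table: str, biz_date: date) -> str:
    """Sketch: build a Statement Run Before Writing that clears the daily
    partition before new data is written. Assumes a hypothetical table
    partitioned by a ds string column in yyyymmdd format."""
    ds = biz_date.strftime("%Y%m%d")
    # ALTER TABLE ... DROP PARTITION is a common way to clear a single
    # partition in ClickHouse; verify it matches your table definition.
    return f"ALTER TABLE {table} DROP PARTITION '{ds}'"

print(pre_write_statement("my_db.my_table", date(2022, 1, 1)))
# ALTER TABLE my_db.my_table DROP PARTITION '20220101'
```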

Batch Insert Size (Bytes)

Data Records Per Write

Data is written to ApsaraDB for ClickHouse in batches. These two parameters define the upper limits for the size in bytes and the number of records per batch. When the cached data reaches either the byte size or the record count, a batch write is triggered.

We recommend that you set Batch Insert Size (Bytes) to 16777216 (16 MB) and set Data Records Per Write to a large value based on your single-record size. This ensures that batch writes are primarily triggered by the batch size in bytes.

For example, if a single record is 1 KB, you can set Batch Insert Size (Bytes) to 16777216 (16 MB) and Data Records Per Write to 20000 (which is greater than 16 MB / 1 KB = 16384). In this case, a write is triggered each time the cached data reaches 16 MB.
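The dual-threshold batching described above can be sketched as follows. This is not the actual Data Integration code, only a minimal model of the "flush when either limit is reached" behavior:

```python
class BatchBuffer:
    """Sketch of dual-threshold batching: a flush is triggered when either
    the byte size or the record count of the cached data reaches its limit."""

    def __init__(self, max_bytes=16 * 1024 * 1024, max_records=20000):
        self.max_bytes = max_bytes
        self.max_records = max_records
        self.records, self.size = [], 0
        self.flushes = 0

    def add(self, record: bytes):
        self.records.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes or len(self.records) >= self.max_records:
            self.flush()

    def flush(self):
        if self.records:
            # In the real task, this is where the batch write to
            # ApsaraDB for ClickHouse would happen.
            self.flushes += 1
            self.records, self.size = [], 0

# With 1 KB records and the recommended settings, flushes are driven by
# the 16 MB byte limit (every 16384 records), not the 20000-record limit.
buf = BatchBuffer()
for _ in range(16384):
    buf.add(b"x" * 1024)
print(buf.flushes)  # 1
```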

If a batch write fails

Specifies the policy for handling exceptions that occur during batch writes to ApsaraDB for ClickHouse:

  • Try single-row write. If it still fails, record it as dirty data.: Single-row writes can identify a small amount of dirty data, but they place a significant load on ClickHouse. Select this option with caution.

  • Exit on Sync Task Failure: To ensure ClickHouse stability, you can choose to have the task exit when a write error occurs. By configuring alerts, you can be notified promptly and manually intervene to handle the error.

  • Count batch write failures as dirty data.: If you want to ensure that a task runs to completion and can tolerate discarded batch data, you can choose to count the batch data as dirty data. This allows you to use the dirty data limit to control whether the task exits.
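The first policy (fall back to single-row writes and collect the rows that still fail as dirty data) can be sketched as below. The writer callbacks are hypothetical stand-ins, not a real Data Integration API:

```python
def write_with_fallback(batch, write_batch, write_row):
    """Sketch of the 'try single-row write' policy: if a batch write fails,
    retry each row individually and collect rows that still fail as dirty
    data. write_batch and write_row are hypothetical writer callbacks."""
    try:
        write_batch(batch)
        return []  # batch succeeded; no dirty data
    except Exception:
        dirty = []
        for row in batch:
            try:
                # Row-by-row retry: far higher load on ClickHouse than
                # batch writes, which is why this option needs caution.
                write_row(row)
            except Exception:
                dirty.append(row)
        return dirty

# Simulated run: the batch is rejected, and only row 2 is truly bad.
def fail_batch(rows):
    raise ValueError("batch rejected")

def row_writer(row):
    if row == 2:
        raise ValueError("bad row")

print(write_with_fallback([1, 2, 3], fail_batch, row_writer))  # [2]
```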

Configure field mapping

After you select a source and a destination, you need to specify the mappings between the source and destination columns. You can select Map Fields with the Same Name, Map Fields in the Same Line, Delete All Mappings, or Auto Layout.

Advanced settings

You can configure settings for offline synchronization tasks, such as Expected Maximum Concurrency and Policy for Dirty Data Records. In this tutorial, the Policy for Dirty Data Records is set to Disallow Dirty Data Records, and the other settings use their default values. For more information, see Codeless UI configuration.

Configure and run the task

  1. Click Run Configuration on the right side of the Batch Synchronization node edit page, configure the Resource Group and Script Parameters for the debug run, and then click Run in the top toolbar to test if the synchronization link runs successfully.

  2. In the left-side navigation pane, click the new icon to the right of Personal Directory to create an SQL file. Execute the following SQL statement to query the data in the destination table and verify that the data is as expected.

    SELECT * FROM <your_clickhouse_destination_table_name> LIMIT 20;

Configure scheduling and publish the task

Click Scheduling Settings to the right of the offline synchronization task. After you configure the Scheduling Configuration parameters for scheduled runs, click Publish in the top toolbar. In the Publish panel, follow the on-screen prompts to complete the publication.

Appendix: Adjust memory parameters

If increasing the concurrency does not significantly improve the synchronization throughput, you can manually adjust the memory parameters for the synchronization task. Follow these steps:

  1. Click Code Editor on the top toolbar of the Offline Synchronization Task page to switch the task from the codeless UI to the code editor.


  2. In the setting section of the script's JSON segment, add the jvmOption parameter. The parameter is in the format of -Xms${heapMem} -Xmx${heapMem} -Xmn${newMem}.


In the codeless UI, the system calculates the value of ${heapMem} by using the formula 768 MB + (concurrency level - 1) * 256 MB. We recommend that you set ${heapMem} to a larger value in the code editor and set ${newMem} to one-third of the value of ${heapMem}. For example, if the concurrency level is 8, the default value for ${heapMem} in the codeless UI is 2560 MB, and you can set a larger value in the code editor, such as setting the jvmOption parameter to -Xms3072m -Xmx3072m -Xmn1024m.
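The default heap formula above can be checked with a short sketch. It computes the codeless-UI default jvmOption value for a given concurrency level; in the code editor you would then typically choose a larger heap, such as -Xms3072m -Xmx3072m -Xmn1024m:

```python
def default_jvm_option(concurrency: int) -> str:
    """Sketch of the codeless-UI default heap formula described above:
    heapMem = 768 MB + (concurrency - 1) * 256 MB, with newMem set to
    roughly one third of heapMem."""
    heap_mb = 768 + (concurrency - 1) * 256
    new_mb = heap_mb // 3
    return f"-Xms{heap_mb}m -Xmx{heap_mb}m -Xmn{new_mb}m"

# Concurrency 8 gives the 2560 MB default mentioned in the text.
print(default_jvm_option(8))  # -Xms2560m -Xmx2560m -Xmn853m
```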