
DataWorks:Synchronize data from a single MaxCompute table to ClickHouse in offline mode

Last Updated:Feb 08, 2026

This topic provides an example of how to synchronize data from a single MaxCompute table to ApsaraDB for ClickHouse in offline mode. It describes the best practices for data source configuration, network connectivity, and sync task configuration.

Background

ApsaraDB for ClickHouse is a column-oriented database for online analytical processing (OLAP). Data Integration lets you synchronize data to and from ApsaraDB for ClickHouse. This topic provides an example of how to synchronize data from a single MaxCompute table to ApsaraDB for ClickHouse in offline mode to describe the complete process.

Limitations

For offline synchronization of a single table to ClickHouse, only ApsaraDB for ClickHouse is supported as the destination.

Prerequisites

Procedure

Note

This topic provides an example of how to configure an offline sync task on the DataStudio (new version) UI.

1. Create an offline synchronization node

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. In the navigation pane on the left, click the image icon. To the right of Project Directory, click the image icon and choose New Node > Data Integration > Offline Synchronization. Enter a custom Name for the offline sync node and click Confirm.

2. Configure the network and resources

  1. In the Configure Network and Resources step, select the Source, Resource Group, and Destination for the sync task.

    • For Source, select the MaxCompute data source that you added.

    • For Destination, select the ClickHouse data source that you added.

    • For Resource Group, select the resource group that is connected to both the MaxCompute and ClickHouse data sources. You can allocate a specific number of CUs for the Task Resource Usage.

  2. On the Source and Destination cards, click Test Connectivity.


  3. After you confirm that the connections to the source and destination are successful, click Next.

3. Configure the source and destination

Configure source (MaxCompute) parameters

The following are the key parameters for configuring the source MaxCompute table.


Configuration item

Configuration details

Data Source

The MaxCompute data source that you selected in the previous step is displayed by default. If your DataWorks workspace is in standard mode, the names of the development and production projects are displayed separately.

Tunnel Resource Group

This tutorial uses the Public Transmission Resource by default. If you have an exclusive Tunnel Quota, you can select it from the drop-down list.

Table

Select the MaxCompute table to synchronize. If your DataWorks workspace is in standard mode, make sure that a MaxCompute table with the same name and schema exists in both the development and production environments of MaxCompute.

Note


  • If the table to be synchronized does not exist in the development environment, the table cannot be found in the drop-down list for the source table of the offline synchronization node.

  • If the table to be synchronized does not exist in the production environment, the data synchronization task fails after it is submitted and published because the task cannot find the table at runtime.

  • If the table schemas in the development and production environments are inconsistent, the column mappings at runtime may differ from the mappings configured for the offline synchronization node. This can lead to incorrect data writes.

Filter Method

You can filter data using Partition Filter or Data Filtering.

  • Partition Filter: Filters the synchronization scope of the source data using a partition filter expression. If you select this method, you must also configure the Partition Info and When Partition Does Not Exist parameters.

  • Data Filtering: Specifies the synchronization scope of the source data using a SQL WHERE clause. You do not need to enter the WHERE keyword.
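For example, if you select Data Filtering, you might enter a condition such as the following. The column names are hypothetical, and the WHERE keyword itself is omitted:

```sql
-- Hypothetical filter: synchronize only paid orders for the data timestamp.
-- ${bizdate} is replaced with the scheduling parameter value at runtime.
ds = '${bizdate}' AND order_status = 'PAID'
```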

Partition Info

This parameter is required when you set Filter Method to Partition Filter. You can enter the value of the partition key column.

  • The value can be a static field, such as ds=20220101.

  • The value can be a system parameter, such as ds=${bizdate}. The system parameter is automatically replaced with its value at runtime.

When Partition Does Not Exist

The policy to handle the synchronization task when a partition does not exist. The following options are supported:

  • Error.

  • Ignore Non-existent Partitions And Run The Task As Normal.

Configure destination (ClickHouse) parameters

The following are the key parameters for configuring the destination ClickHouse table.


Configuration item

Configuration details

Data Source

The ClickHouse data source that you selected in the previous step is displayed by default.

Table

Select the ClickHouse table to synchronize. We recommend that you ensure the table to be synchronized has the same table schema in the development and production environments of the ClickHouse data source.

Note

This section displays the list of tables from the development environment of the ClickHouse data source. If the table definitions in your development and production environments for the ClickHouse data source are different, the task may report that the table or a column does not exist after being submitted to the production environment, even if it was configured correctly in the development environment.
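For reference, the following is a minimal sketch of a possible destination table in ApsaraDB for ClickHouse. The database, table, and column names are hypothetical; choose the engine, partition key, and sort key that match your workload:

```sql
-- Hypothetical destination table. Partitioning by toYYYYMMDD(ds) lets a
-- daily Pre-SQL statement drop the partition before each write.
CREATE TABLE default.ods_orders
(
    order_id     UInt64,
    ds           Date,
    order_status String,
    amount       Decimal(18, 2)
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ds)
ORDER BY order_id;
```

Create the same table in both the development and production environments of the data source so that the column mappings configured in the development environment remain valid after publishing.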

Primary Key Or Unique Key Conflict Handling

When you select insert into, the integration task checks for primary key uniqueness and constraint violations at runtime.

Pre-SQL Statement

Post-import Statements

You can execute SQL statements as needed before and after the data synchronization task runs. For example, before synchronizing data by day, you can clear the data in the corresponding daily partition to ensure that the partition is empty before the data is written.
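For example, a Pre-SQL Statement that clears the daily partition of a hypothetical day-partitioned table before the write might look like the following. The table name is an assumption, and ${bizdate} is replaced with the scheduling parameter value at runtime:

```sql
-- Hypothetical: drop the partition for the current data timestamp so the sync
-- task writes into an empty partition. Assumes PARTITION BY toYYYYMMDD(ds).
ALTER TABLE default.ods_orders DROP PARTITION '${bizdate}';
```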

Batch Size (Bytes)

Batch Size (Records)

Data is written to ClickHouse in batches. These parameters specify the upper limits for the batch size in bytes and the number of records. When the read data reaches either the byte limit or the record count limit, a batch is considered complete and is written to ClickHouse.

We recommend that you set Batch Size (Bytes) to 16777216 (16 MB) and set Batch Size (Records) to a value large enough, based on the size of a single record, that writes are triggered by the byte limit.

For example, if a single record is 1 KB, set Batch Size (Bytes) to 16777216 (16 MB) and Batch Size (Records) to 20000 (greater than 16 MB / 1 KB = 16384). In this case, the write is triggered by the batch size in bytes, and a write occurs every 16 MB.

When Batch Write To ClickHouse Is Abnormal

When a batch write to ClickHouse is abnormal, you can select an error handling policy:

  • Try Single-record Writes. If A Single-record Write Still Fails, Count It As Dirty Data: Single-record writes can identify a small amount of dirty data but place a significant load on ClickHouse. Select this option with caution.

  • Fail And Exit The Synchronization Task: For ClickHouse stability, you can choose to have the task fail and exit when a write error occurs. Configure alerts to be notified promptly and handle the exceptions manually.

  • Count The Batch As Dirty Data: If you want to ensure the task completes and can tolerate the loss of batch data, you can choose to count the batch as dirty data. Use the number of allowed dirty data records to control whether the task exits.

4. Configure field mappings

After you select the source and destination, you must specify the column mappings between them. You can select Map By Name, Map By Position, Unmap, or Auto Layout.

5. Configure channel control

Offline sync tasks support settings such as Task Concurrency and Dirty Data Policy. In this tutorial, the Dirty Data Policy is set to Do Not Tolerate Dirty Data, and the other configurations use the default settings. For more information, see Configure an offline sync task in codeless UI.

6. Configure and run the debug task

  1. On the right side of the offline sync node editing page, click Run Configuration. Set the Resource Group and Script Parameters for the debug run. Then, click Run in the top toolbar to test whether the sync link runs successfully.

  2. In the navigation pane on the left, click the image icon. To the right of Personal Directory, click the image icon and create a file with a .sql extension. Run the following SQL statement to query the data in the destination table and check whether it meets your expectations.

    SELECT * FROM <clickhouse_target_table_name> LIMIT 20;

7. Configure scheduling and publish the node

Click Scheduling Configuration on the right side of the offline sync task page. After you configure the scheduling parameters required for periodic runs, click Publish in the top toolbar and follow the on-screen instructions to complete the publishing process.

Appendix: Adjust memory parameters

If the sync speed does not increase significantly after you increase the concurrency, you can manually adjust the memory parameters of the sync task. Perform the following steps:

  1. In the toolbar at the top of the offline sync task page, click Code Editor to switch the task from the codeless UI to the code editor.


  2. In the setting section of the JSON script, add the jvmOption parameter, in the format -Xms${heapMem} -Xmx${heapMem} -Xmn${newMem}.


In the codeless UI, the system calculates the value of ${heapMem} using the formula: 768 MB + (concurrency - 1) × 256 MB. We recommend that you set ${heapMem} to a larger value in the code editor and set ${newMem} to one-third of the value of ${heapMem}. For example, if the concurrency is 8, the default value calculated for ${heapMem} in the codeless UI is 2560 MB. In the code editor, you can set a larger value. For example, set jvmOption to -Xms3072m -Xmx3072m -Xmn1024m.
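Based on the example above, the setting section of the JSON script might look like the following sketch; the reader, writer, and other parts of the script are omitted:

```json
{
  "setting": {
    "jvmOption": "-Xms3072m -Xmx3072m -Xmn1024m"
  }
}
```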