
DataWorks:Synchronize data from a single MaxCompute table to ClickHouse in offline mode

Last Updated:Feb 08, 2026

This topic provides an example of how to synchronize data from a single MaxCompute table to ApsaraDB for ClickHouse in offline mode. It describes the best practices for data source configuration, network connectivity, and sync task configuration.

Background

ApsaraDB for ClickHouse is a column-oriented database for online analytical processing (OLAP). Data Integration lets you synchronize data to and from ApsaraDB for ClickHouse. This topic provides an example of how to synchronize data from a single MaxCompute table to ApsaraDB for ClickHouse in offline mode to describe the complete process.

Limitations

For offline synchronization of a single table to ClickHouse, only ApsaraDB for ClickHouse is supported as the destination.

Prerequisites

Procedure

Note

This topic provides an example of how to configure an offline sync task on the DataStudio (new version) UI.

1. Create an offline synchronization node

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. In the navigation pane on the left, click the image icon. To the right of Project Directory, click the image icon and choose New Node > Data Integration > Offline Synchronization. Enter a custom Name for the offline sync node and click Confirm.

2. Configure the network and resources

  1. In the Configure Network and Resources step, select the Source, Resource Group, and Destination for the sync task.

    • For Source, select the MaxCompute data source that you added.

    • For Destination, select the ClickHouse data source that you added.

    • For Resource Group, select the resource group that is connected to both the MaxCompute and ClickHouse data sources. You can allocate a specific number of CUs for the Task Resource Usage.

  2. On the Source and Destination cards, click Test Connectivity.


  3. After you confirm that the connections to the source and destination are successful, click Next.

3. Configure the source and destination

Configure source (MaxCompute) parameters

The following are the key parameters for configuring the source MaxCompute table.


Configuration item

Configuration details

Data Source

The MaxCompute data source that you selected in the previous step is displayed by default. If your DataWorks workspace is in standard mode, the names of the development and production projects are displayed separately.

Tunnel Resource Group

This tutorial uses the Public Transmission Resource by default. If you have an exclusive Tunnel Quota, you can select it from the drop-down list.

Table

Select the MaxCompute table to synchronize. If your DataWorks workspace is in standard mode, make sure that a MaxCompute table with the same name and schema exists in both the development and production environments of MaxCompute.

Note


  • If the table to be synchronized does not exist in the development environment, the table cannot be found in the drop-down list for the source table of the offline synchronization node.

  • If the table to be synchronized does not exist in the production environment, the data synchronization task fails after it is submitted and published because the task cannot find the table at runtime.

  • If the table schemas in the development and production environments are inconsistent, the column mappings at runtime may differ from the mappings configured for the offline synchronization node. This can lead to incorrect data writes.

Filter Method

You can filter data using Partition Filter or Data Filtering.

  • Partition Filter: Filters the synchronization scope of the source data using a partition filter expression. If you select this method, you must also configure the Partition Info and When Partition Does Not Exist parameters.

  • Data Filtering: Specifies the synchronization scope of the source data using a SQL WHERE clause. You do not need to enter the WHERE keyword.
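For example, if you select Data Filtering, you might enter a condition such as the following. The column names are hypothetical, and the WHERE keyword itself is omitted:

```sql
-- Hypothetical filter: synchronize only paid orders for the data timestamp.
-- ${bizdate} is replaced with the scheduling parameter value at runtime.
ds = '${bizdate}' AND order_status = 'PAID'
```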

Partition Info

This parameter is required when you set Filter Method to Partition Filter. You can enter the value of the partition key column.

  • The value can be a static field, such as ds=20220101.

  • The value can be a system parameter, such as ds=${bizdate}. The system parameter is automatically replaced with its value at runtime.

When Partition Does Not Exist

The policy to handle the synchronization task when a partition does not exist. The following options are supported:

  • Error.

  • Ignore Non-existent Partitions And Run The Task As Normal.

Configure destination (ClickHouse) parameters

The following are the key parameters for configuring the destination ClickHouse table.


Configuration item

Configuration details

Data Source

The ClickHouse data source that you selected in the previous step is displayed by default.

Table

Select the ClickHouse table to synchronize. We recommend that you ensure the table to be synchronized has the same table schema in the development and production environments of the ClickHouse data source.

Note

This section displays the list of tables from the development environment of the ClickHouse data source. If the table definitions in your development and production environments for the ClickHouse data source are different, the task may report that the table or a column does not exist after being submitted to the production environment, even if it was configured correctly in the development environment.
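For reference, the following is a minimal sketch of a possible destination table in ApsaraDB for ClickHouse. The database, table, and column names are hypothetical; choose the engine, partition key, and sort key that match your workload:

```sql
-- Hypothetical destination table. Partitioning by toYYYYMMDD(ds) lets a
-- daily Pre-SQL statement drop the partition before each write.
CREATE TABLE default.ods_orders
(
    order_id     UInt64,
    ds           Date,
    order_status String,
    amount       Decimal(18, 2)
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ds)
ORDER BY order_id;
```

Create the same table in both the development and production environments of the data source so that the column mappings configured in the development environment remain valid after publishing.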

Primary Key Or Unique Key Conflict Handling

When you select insert into, the integration task checks for primary key uniqueness and constraint violations at runtime.

Pre-SQL Statement

Post-import Statements

You can execute SQL statements as needed before and after the data synchronization task runs. For example, before synchronizing data by day, you can clear the data in the corresponding daily partition to ensure that the partition is empty before the data is written.
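For example, a Pre-SQL Statement that clears the daily partition of a hypothetical day-partitioned table before the write might look like the following. The table name is an assumption, and ${bizdate} is replaced with the scheduling parameter value at runtime:

```sql
-- Hypothetical: drop the partition for the current data timestamp so the sync
-- task writes into an empty partition. Assumes PARTITION BY toYYYYMMDD(ds).
ALTER TABLE default.ods_orders DROP PARTITION '${bizdate}';
```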

Batch Size (Bytes)

Batch Size (Records)

Data is written to ClickHouse in batches. These parameters specify the upper limits for the batch size in bytes and the number of records. When the read data reaches either the byte limit or the record count limit, a batch is considered complete and is written to ClickHouse.

We recommend that you set Batch Size (Bytes) to 16777216 (16 MB) and set Batch Size (Records) to a value large enough, based on the size of a single record, that writes are triggered by the byte limit.

For example, if a single record is 1 KB, set Batch Size (Bytes) to 16777216 (16 MB) and Batch Size (Records) to 20000 (greater than 16 MB / 1 KB = 16384). In this case, the write is triggered by the batch size in bytes, and a write occurs every 16 MB.

When Batch Write To ClickHouse Is Abnormal

When a batch write to ClickHouse is abnormal, you can select an error handling policy:

  • Try Single-record Writes. If A Single-record Write Still Fails, Count It As Dirty Data: Single-record writes can identify a small amount of dirty data but place a significant load on ClickHouse. Select this option with caution.

  • Fail And Exit The Synchronization Task: For ClickHouse stability, you can choose to have the task fail and exit when a write error occurs. Configure alerts to be notified promptly and handle the exceptions manually.

  • Count The Batch As Dirty Data: If you want to ensure the task completes and can tolerate the loss of batch data, you can choose to count the batch as dirty data. Use the number of allowed dirty data records to control whether the task exits.

4. Configure field mappings

After you select the source and destination, you must specify the column mappings between them. You can select Map By Name, Map By Position, Unmap, or Auto Layout.

5. Configure channel control

Offline sync tasks support settings such as Task Concurrency and Dirty Data Policy. In this tutorial, the Dirty Data Policy is set to Do Not Tolerate Dirty Data, and the other configurations use the default settings. For more information, see Configure an offline sync task in codeless UI.

6. Configure and run the debug task

  1. On the right side of the offline sync node editing page, click Run Configuration. Set the Resource Group and Script Parameters for the debug run. Then, click Run in the top toolbar to test whether the sync link runs successfully.

  2. In the navigation pane on the left, click the image icon. To the right of Personal Directory, click the image icon and create a file with a .sql extension. Run the following SQL statement to query the data in the destination table and check whether it meets your expectations.

    SELECT * FROM <clickhouse_target_table_name> LIMIT 20;

7. Configure scheduling and publish the node

Click Scheduling Configuration on the right side of the offline sync task page. After you configure the scheduling parameters required for periodic runs, click Publish in the top toolbar and follow the on-screen instructions to complete the publishing process.

Appendix: Adjust memory parameters

If the sync speed does not increase significantly after you increase the concurrency, you can manually adjust the memory parameters of the sync task. Perform the following steps:

  1. In the toolbar at the top of the offline sync task page, click Code Editor to switch the task from the codeless UI to the code editor.


  2. In the setting section of the JSON script, add the jvmOption parameter, in the format -Xms${heapMem} -Xmx${heapMem} -Xmn${newMem}.


In the codeless UI, the system calculates the value of ${heapMem} using the formula: 768 MB + (concurrency - 1) × 256 MB. We recommend that you set ${heapMem} to a larger value in the code editor and set ${newMem} to one-third of the value of ${heapMem}. For example, if the concurrency is 8, the default value calculated for ${heapMem} in the codeless UI is 2560 MB. In the code editor, you can set a larger value. For example, set jvmOption to -Xms3072m -Xmx3072m -Xmn1024m.
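Based on the example above, the setting section of the JSON script might look like the following sketch; the reader, writer, and other parts of the script are omitted:

```json
{
  "setting": {
    "jvmOption": "-Xms3072m -Xmx3072m -Xmn1024m"
  }
}
```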