This topic describes how to synchronize data from DataWorks to ApsaraDB for ClickHouse.
Background information
You can use the batch data synchronization feature of DataWorks to synchronize data from various data sources to ApsaraDB for ClickHouse. For more information about the data sources supported by batch data synchronization, see Supported data source types, Reader plug-ins, and Writer plug-ins.
Prerequisites
A DataWorks workspace is created and MaxCompute is selected as the compute engine. For more information, see Create a workspace.
An exclusive resource group for Data Integration is created and configured. For more information, see Create and use an exclusive resource group for Data Integration.
An ApsaraDB for ClickHouse cluster is created, and a username and a password of a database account are created. For more information, see Create an ApsaraDB for ClickHouse cluster and Manage accounts.
You have the permissions to log on to the database. For more information, see Grant permissions.
If a RAM user wants to use DataWorks, the owner of the Alibaba Cloud account must assign a role to the RAM user. For more information, see Add workspace members and assign roles to them.
Usage notes
ApsaraDB for ClickHouse supports only exclusive resource groups for Data Integration.
If you want to synchronize a table that has been synchronized before, run
TRUNCATE TABLE <Table name>;
to clear synchronized data in the ApsaraDB for ClickHouse table.
Procedure
Configure data sources.
You must configure data sources for MaxCompute and ApsaraDB for ClickHouse.
NoteFor more information, see Add a MaxCompute data source and Add a ClickHouse data source.
Create a MaxCompute table.
Log on to the DataWorks console.
In the left-side navigation pane, click Workspaces.
In the top navigation bar, select the region in which your workspace resides. Find the workspace and click DataStudio in the Actions column.
On the DataStudio page, move the pointer over the icon and choose .
In the Create Table dialog box, select a path from the Path drop-down list and configure the Name parameter. In this example, odptabletest1 is used as the table name. Click Create.
In the General section, configure the parameters.
The following table describes the parameters.
Parameter Description Display Name The display name of the table. Theme The folders that are used to store and manage the table. You can specify the level-1 and level-2 folders to store the table. The Level-1 Theme and Level-2 Theme parameters can be used to categorize tables based on business categories. You can store tables of the same business category in the same folder. Note The level-1 and level-2 themes in the Workspace Tables pane of the DataStudio page help you better manage tables in folders. You can quickly find the current table in the Workspace Tables pane by theme. If no theme is available, you can create one. For information about how to create a theme, see Create or manage folders for tables.Click DDL in the toolbar.
In the DDL dialog box, enter the following statements and click Generate Table Schema:
CREATE TABLE IF NOT EXISTS odptabletest1 ( v1 TINYINT, v2 SMALLINT );
Click Commit to Development Environment and Commit to Production Environment in sequence.
Write data into the MaxCompute table.
On the DataStudio page, click Ad Hoc Query in the left-side navigation pane.
Move the pointer over the icon and choose .
In the Create Node dialog box, select a path from the Path drop-down list and configure the Name parameter.
Click Commit.
On the edit page of the node, enter the following statements to write data into the MaxCompute table.
insert into odptabletest1 values (1,"a"),(2,"b"),(3,"c"),(4,"d");
Click the icon in the toolbar.
In the Estimate MaxCompute Computing Cost dialog box, click Run.
Create an ApsaraDB for ClickHouse table.
Log on to the ApsaraDB for ClickHouse console.
In the upper-left corner, select the region where the cluster that you want to manage is deployed.
On the Clusters page, click the Default Instances tab. Find the cluster that you want to manage and click the ID of the cluster.
On the Cluster Information page, click Log On to Database in the upper-right corner.
In the Log on to Database Instance dialog box, enter the username and password of your database account and click Login.
Enter the following statements and click Execute(F8). Sample statements:
create table default.dataworktest ON CLUSTER default ( v1 Int, v2 String ) ENGINE = MergeTree ORDER BY v1;
NoteThe structure type of the ApsaraDB for ClickHouse table must correspond to that of MaxCompute.
Create a workflow.
If you already have a workflow, skip this step.
On the DataStudio page, click Scheduled Workflow in the left-side navigation pane.
Move the pointer over the icon and select Create Workflow.
In the Create Workflow dialog box, configure the Workflow Name parameter.
ImportantThe name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
Click Create.
Create a batch synchronization node.
Click the newly created workflow and right-click Data Integration.
Choose .
In the Create Node dialog box, configure the Name parameter, and select a path from the Path drop-down list.
ImportantThe node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
Click Commit.
Configure the data source and destination.
Source: Select a supported data source. In this example, MaxCompute is selected.
The following table describes the parameters.
Parameter
Description
Connection
The type and name of the data source.
Production Project Name
The name of the project in the production environment. You cannot change the value.
Table
The table that you want to synchronize.
Partition Key Column
If your daily incremental data is stored in the partitions of a specific date, you can specify the partition information to synchronize the daily incremental data. For example, set v17 to ${bizdate}.
NoteDataWorks cannot map the fields in partitioned MaxCompute tables. If you want to read data from a partitioned MaxCompute table, you must specify each desired partition when you configure MaxCompute Reader.
NoteFor more information about the parameters, see MaxCompute Reader.
Target: Select ClickHouse.
The following table describes the parameters.
Parameter
Description
Connection
The type and name of the data source. Select ClickHouse.
Table
The table into which you want to import the synchronized data.
Primary key or unique key conflict handling
Set this value to insert into (Insert).
Pre sql
The SQL statement that you want to execute before the synchronization node is run.
Post sql
The SQL statement that you want to execute after the synchronization node is run.
Batch insert byte size
The maximum number of bytes to be inserted.
Number of batches
The number of data entries to be inserted in one batch.
Mappings (optional): You can select the field mapping. The Field on the left side corresponds to the Field on the right side.
NoteFor information about the parameters, see Map the fields in the source and destination tables.
(Optional) Channel: Configure the maximum transmission rate and dirty data check rules.
NoteFor information about the parameters, see Configure channel control policies.
Configure the resource group for Data Integration.
Click Resource Group configuration on the right and select a group from the Exclusive Resource Group drop-down list.
Run and save the synchronization task.
Click the icon in the toolbar to save the synchronization task.
Click the icon in the toolbar to run the synchronization task.
Verify the data synchronization result
Log on to the ApsaraDB for ClickHouse console.
In the upper-left corner, select the region where the cluster that you want to manage is deployed.
On the Clusters page, click the Default Instances tab. On the tab that appears, click the ID of the cluster that you want to manage.
On the Cluster Information page, click Log On to Database in the upper-right corner.
In the Log on to Database Instance dialog box, enter the username and password of your database account and click Login.
Enter the following query statement and click Execute(F8) to check whether the data has been synchronized.
SELECT * FROM dataworktest;
The following result is returned.
NoteIf the result is returned after you execute the query statement, the data has been synchronized from DataWorks to ApsaraDB for ClickHouse.