This topic describes how to use the data integration tool of DataWorks to import data to ApsaraDB for ClickHouse.

Background information

The data integration tool allows you to configure data import tasks in wizard mode or script mode. In wizard mode, you configure import and export logic on a visual interface. In script mode, you configure import and export logic in a JSON script. ApsaraDB for ClickHouse supports only the script mode. The following sections describe how to synchronize data by using DataWorks.

Step 1: Create a data source connection

In most cases, when you use DataWorks to synchronize data, you must create data source connections for both the source and destination databases. However, ApsaraDB for ClickHouse supports only the script mode, in which the connection information of the destination cluster is specified directly in the script. Therefore, you need to create a data source connection only for the source database, not for the ApsaraDB for ClickHouse cluster. Perform the following operations:

  1. Go to the Data Integration page of a workspace.
  2. Create a data source connection. For more information, see Add a data source.
Note You need to configure a data source only for the source database, but not for ApsaraDB for ClickHouse.

Step 2: Create a data synchronization task

On the Data Integration page, create a data synchronization task. For more information, see Create a sync node.

Note ApsaraDB for ClickHouse supports only the script mode, in which you configure data synchronization logic in a JSON script. Click Switch to the code editor to change to the script mode.

Step 3: Configure data synchronization parameters

Data synchronization logic is configured in the JSON script. A reader plug-in represents the data source. For more information about how to configure reader plug-ins, see Configure the reader.

To import data to ApsaraDB for ClickHouse, use the clickhousewriter plug-in as the writer. The following example shows how to configure the writer in a data synchronization task:

{
    "job": {
        "setting": {
            "speed": {
                "byte":10485760
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column" : [
                            {
                                "value": "DataX",
                                "type": "string"
                            },
                            {
                                "value": 19890604,
                                "type": "long"
                            },
                            {
                                "value": "1989-06-04 00:00:00",
                                "type": "date"
                            },
                            {
                                "value": 123.123,
                                "type": "double"
                            }
                        ],
                        "sliceRecordCount": 100000
                    }
                },
                "writer": {
                    "name": "clickhousewriter",
                    "parameter": {
                        "username": "test",
                        "password": "test",
                        "column": ["str_col", "long_col", "create_time", "double_col"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:clickhouse://11.157.212.86:3001",
                                "table": ["test_tbl"]
                            }
                        ],
                        "preSql": [],
                        "postSql": [],

                        "batchSize": 65536,
                        "batchByteSize": 134217728,
                        "dryRun": false,
                        "writeMode": "insert"
                    }
                }
            }
        ]
    }
}
The following list describes the key parameters in the example:
  • jdbcUrl: the JDBC connection string of the ApsaraDB for ClickHouse cluster. In most cases, use the VPC endpoint to ensure high transmission speed. If the public endpoint is used, the transmission speed may be limited by the network bandwidth. The connection string can contain the database name, in the format jdbc:clickhouse://<host>:<port>/<db_name>. The port in the connection string must be the HTTP port of the ApsaraDB for ClickHouse cluster.
  • batchSize: the maximum number of rows in a batch. Writing data in batches can significantly improve the insertion speed.
  • batchByteSize: the maximum number of bytes in a batch. If both batchByteSize and batchSize are specified, a batch is committed when either limit is reached first.
  • preSql: the SQL statements to execute before the synchronization task starts.
  • postSql: the SQL statements to execute immediately after the synchronization task is complete.
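For example, a writer connection that specifies the database name in jdbcUrl and clears the destination table before writing might look like the following sketch. The host, port, database name, credentials, and table name are placeholders that you must replace with your own values:

```json
{
    "writer": {
        "name": "clickhousewriter",
        "parameter": {
            "username": "test",
            "password": "test",
            "column": ["str_col", "long_col", "create_time", "double_col"],
            "connection": [
                {
                    "jdbcUrl": "jdbc:clickhouse://<host>:<http_port>/<db_name>",
                    "table": ["test_tbl"]
                }
            ],
            "preSql": ["TRUNCATE TABLE test_tbl"],
            "postSql": []
        }
    }
}
```

Because the database name is part of jdbcUrl, the table array only needs the bare table name. TRUNCATE TABLE in preSql is one common way to make the task re-runnable; omit it if you want to append to existing data.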

Step 4: Configure scheduling parameters

As with ETL tasks, you must configure scheduling parameters for data synchronization tasks in script mode. The main parameters include task dependencies and scheduling cycles. For more information, see Configure scheduling parameters.

Step 5: Create an exclusive resource group

To schedule tasks for the ApsaraDB for ClickHouse cluster, you must create an exclusive resource group. Click Resource Group configuration to go to the Resource Group configuration page.

Click Create an exclusive data integration Resource Group to go to the Exclusive Resource Groups tab. Configure the parameters to create an exclusive resource group. After the resource group is created, its status changes to Running. Then, click Modify Workspace to bind the resource group to the workspace where the data synchronization task resides.

To create an exclusive resource group, select Exclusive Resource Groups and enter basic information such as the resource group name and description. Click Pay to purchase the resource group. After the purchase is complete, select the latest order in the Select Order Number field, and select the zone where the resource group resides in the Zone field.

On the buy page, select the region and zone of the ApsaraDB for ClickHouse cluster and select Exclusive Resource Groups. Select appropriate resource specifications based on the number of data synchronization tasks and the data volume, and then submit the order.

After the exclusive resource group is created, you must bind it to a VPC. This step is required for the resource group to connect to your databases. Bind the exclusive resource group to the same VPC and vSwitch as the ApsaraDB for ClickHouse cluster.

Step 6: Publish the data synchronization task

After the exclusive resource group is created and bound to the workspace, you can publish the data synchronization task. On the task configuration page, click Save and Submit to publish the task.

For more information about how to run and troubleshoot a data synchronization task, see the general O&M configurations for data synchronization and ETL tasks in DataWorks, and Run a node and troubleshoot errors.

Troubleshoot data synchronization issues

Possible causes for slow writes:
  • Parameters are inappropriately configured. ApsaraDB for ClickHouse performs best with large batch sizes and low write concurrency. In most cases, the batch size can be tens of thousands or even hundreds of thousands of rows, depending on the size of a single row. A common estimate is about 100 bytes per row, but you should adjust the estimate based on the characteristics of your actual data. We recommend that you configure no more than 10 concurrent write operations. You can modify the parameters and try again.
  • The exclusive resource group uses low ECS instance specifications. For example, low CPU or memory specifications may limit the concurrency or outbound bandwidth of the resource group. A large batch size combined with low memory may cause frequent Java garbage collection in DataWorks processes. You can check the DataWorks output logs for these symptoms.
  • Data is read slowly from the data source. You can search for totalWaitReaderTime and totalWaitWriterTime in the DataWorks output logs. If the value of totalWaitReaderTime is significantly greater than the value of totalWaitWriterTime, the read side is the bottleneck.
  • A public endpoint is used. The bandwidth of a public endpoint is limited and cannot sustain high-throughput data import and export. Replace the public endpoint with a VPC endpoint.
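Based on the causes above, a job tuned for large-batch, low-concurrency writes might look like the following sketch. A batchSize of 100,000 rows at an estimated 100 bytes per row yields batches of roughly 10 MB. The channel setting shown here for controlling concurrency is an assumption and should be verified against the script-mode reference for your DataWorks version:

```json
{
    "setting": {
        "speed": {
            "channel": 4,
            "byte": 10485760
        }
    },
    "writer": {
        "name": "clickhousewriter",
        "parameter": {
            "batchSize": 100000,
            "batchByteSize": 134217728,
            "writeMode": "insert"
        }
    }
}
```

If increasing batchSize causes memory pressure or garbage collection in the DataWorks process, reduce the batch size or use an exclusive resource group with higher memory specifications.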
Possible causes for connection failures:
  • The IP addresses of the exclusive resource group are not added to the whitelist of the ApsaraDB for ClickHouse cluster.
  • The exclusive resource group is not bound to the VPC where the cluster resides.