All Products
Search
Document Center

ApsaraDB for ClickHouse:Synchronize data from DataWorks

Last Updated:Jan 30, 2024

This topic describes how to synchronize data from DataWorks to ApsaraDB for ClickHouse.

Background information

You can use the batch data synchronization feature of DataWorks to synchronize data from various data sources to ApsaraDB for ClickHouse. For more information about the data sources supported by batch data synchronization, see Supported data source types, Reader plug-ins, and Writer plug-ins.

Prerequisites

Usage notes

  • ApsaraDB for ClickHouse supports only exclusive resource groups for Data Integration.

  • If you want to synchronize a table that has been synchronized before, run TRUNCATE TABLE <Table name>; to clear synchronized data in the ApsaraDB for ClickHouse table.

Procedure

  1. Configure data sources.

    You must configure data sources for MaxCompute and ApsaraDB for ClickHouse.

    Note

    For more information, see Add a MaxCompute data source and Add a ClickHouse data source.

  2. Create a MaxCompute table.

    1. Log on to the DataWorks console.

    2. In the left-side navigation pane, click Workspaces.

    3. In the top navigation bar, select the region in which your workspace resides. Find the workspace and click DataStudio in the Actions column.

    4. On the DataStudio page, move the pointer over the 新建 icon and choose Create Table > MaxCompute > Table.

    5. In the Create Table dialog box, select a path from the Path drop-down list and configure the Name parameter. In this example, odptabletest1 is used as the table name. Click Create.

    6. In the General section, configure the parameters.

      基本属性

      The following table describes the parameters.

      ParameterDescription
      Display NameThe display name of the table.
      ThemeThe folders that are used to store and manage the table. You can specify the level-1 and level-2 folders to store the table. The Level-1 Theme and Level-2 Theme parameters can be used to categorize tables based on business categories. You can store tables of the same business category in the same folder.
      Note The level-1 and level-2 themes in the Workspace Tables pane of the DataStudio page help you better manage tables in folders. You can quickly find the current table in the Workspace Tables pane by theme. If no theme is available, you can create one. For information about how to create a theme, see Create or manage folders for tables.
    7. Click DDL in the toolbar.

    8. In the DDL dialog box, enter the following statements and click Generate Table Schema:

      CREATE TABLE IF NOT EXISTS odptabletest1
      (    
      v1  TINYINT,    
      v2  SMALLINT                
      );
    9. Click Commit to Development Environment and Commit to Production Environment in sequence.

  3. Write data into the MaxCompute table.

    1. On the DataStudio page, click Ad Hoc Query in the left-side navigation pane.

    2. Move the pointer over the 新建 icon and choose Create > ODPS SQL.

    3. In the Create Node dialog box, select a path from the Path drop-down list and configure the Name parameter.

    4. Click Commit.

    5. On the edit page of the node, enter the following statements to write data into the MaxCompute table.

      insert into odptabletest1 values (1,"a"),(2,"b"),(3,"c"),(4,"d");
    6. Click the 执行 icon in the toolbar.

    7. In the Estimate MaxCompute Computing Cost dialog box, click Run.

  4. Create an ApsaraDB for ClickHouse table.

    1. Log on to the ApsaraDB for ClickHouse console.

    2. In the upper-left corner, select the region where the cluster that you want to manage is deployed.

    3. On the Clusters page, click the Default Instances tab. Find the cluster that you want to manage and click the ID of the cluster.

    4. On the Cluster Information page, click Log On to Database in the upper-right corner.

    5. In the Log on to Database Instance dialog box, enter the username and password of your database account and click Login.

    6. Enter the following statements and click Execute(F8). Sample statements:

      create table default.dataworktest ON CLUSTER default (
      v1 Int, 
      v2 String
      ) ENGINE = MergeTree ORDER BY v1;
      Note

      The structure type of the ApsaraDB for ClickHouse table must correspond to that of MaxCompute.

  5. Create a workflow.

    If you already have a workflow, skip this step.

    1. On the DataStudio page, click Scheduled Workflow in the left-side navigation pane.

    2. Move the pointer over the 新建 icon and select Create Workflow.

    3. In the Create Workflow dialog box, configure the Workflow Name parameter.

      Important

      The name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).

    4. Click Create.

  6. Create a batch synchronization node.

    1. Click the newly created workflow and right-click Data Integration.

    2. Choose Create > Batch Synchronization.

    3. In the Create Node dialog box, configure the Name parameter, and select a path from the Path drop-down list.

      Important

      The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).

    4. Click Commit.

  7. Configure the data source and destination.

    1. Source: Select a supported data source. In this example, MaxCompute is selected.

      选择数据源

      The following table describes the parameters.

      Parameter

      Description

      Connection

      The type and name of the data source.

      Production Project Name

      The name of the project in the production environment. You cannot change the value.

      Table

      The table that you want to synchronize.

      Partition Key Column

      If your daily incremental data is stored in the partitions of a specific date, you can specify the partition information to synchronize the daily incremental data. For example, set v17 to ${bizdate}.

      Note

      DataWorks cannot map the fields in partitioned MaxCompute tables. If you want to read data from a partitioned MaxCompute table, you must specify each desired partition when you configure MaxCompute Reader.

      Note

      For more information about the parameters, see MaxCompute Reader.

    2. Target: Select ClickHouse.

      选择数据源

      The following table describes the parameters.

      Parameter

      Description

      Connection

      The type and name of the data source. Select ClickHouse.

      Table

      The table into which you want to import the synchronized data.

      Primary key or unique key conflict handling

      Set this value to insert into (Insert).

      Pre sql

      The SQL statement that you want to execute before the synchronization node is run.

      Post sql

      The SQL statement that you want to execute after the synchronization node is run.

      Batch insert byte size

      The maximum number of bytes to be inserted.

      Number of batches

      The number of data entries to be inserted in one batch.

    3. Mappings (optional): You can select the field mapping. The Field on the left side corresponds to the Field on the right side.

      字段映射

      Note

      For information about the parameters, see Map the fields in the source and destination tables.

    4. (Optional) Channel: Configure the maximum transmission rate and dirty data check rules.

      通道控制

      Note

      For information about the parameters, see Configure channel control policies.

  8. Configure the resource group for Data Integration.

    Click Resource Group configuration on the right and select a group from the Exclusive Resource Group drop-down list.集成资源组配置

  9. Run and save the synchronization task.

    1. Click the 保存 icon in the toolbar to save the synchronization task.

    2. Click the 运行 icon in the toolbar to run the synchronization task.image

Verify the data synchronization result

  1. Log on to the ApsaraDB for ClickHouse console.

  2. In the upper-left corner, select the region where the cluster that you want to manage is deployed.

  3. On the Clusters page, click the Default Instances tab. On the tab that appears, click the ID of the cluster that you want to manage.

  4. On the Cluster Information page, click Log On to Database in the upper-right corner.

  5. In the Log on to Database Instance dialog box, enter the username and password of your database account and click Login.

  6. Enter the following query statement and click Execute(F8) to check whether the data has been synchronized.

    SELECT * FROM dataworktest;

    The following result is returned.查询结果

    Note

    If the result is returned after you execute the query statement, the data has been synchronized from DataWorks to ApsaraDB for ClickHouse.