
Using DTS to Synchronize Data from MySQL to ClickHouse

This article introduces the best practices and common pitfalls when using Alibaba Cloud DTS for data migration from MySQL to ClickHouse.

Introduction

In the current wave of enterprise digital transformation, there is a growing need for online statistical analysis of large volumes of business data, which traditional databases often cannot serve with adequate performance or real-time results. With the growth of the ClickHouse community, more developers are looking to speed up analysis by syncing data from MySQL to ClickHouse. After testing the MaterializedMySQL engine released by ClickHouse, I encountered several limitations:

  • For ApsaraDB for ClickHouse Community Edition, only clusters of version 20.8 and later support the MaterializeMySQL engine.
  • The synchronization process cannot be visualized. The synchronization task is displayed without statuses. You cannot directly view the synchronization performance and latency.
  • Tables without primary keys are not supported by the MaterializeMySQL engine.
  • The data source ApsaraDB RDS for MySQL cluster and the destination ApsaraDB for ClickHouse cluster must be in the same VPC network.
  • ETL is not supported. Object name mapping and appended columns are also not supported.
  • Data filtering is not supported. You cannot select which DML or DDL operations are to be synchronized. You cannot use filters to select part of the data in a table.
  • The synchronization task cannot be paused but can only be deleted and reconfigured. After the task is deleted, the databases and tables synchronized to the destination database are also deleted. Resumable upload is not supported.
  • Synchronization of certain DDL operations is not supported, such as creating a table, modifying the comment of a table, and modifying the comment of a column.

To overcome these restrictions, the Data Transmission Service (DTS) team has developed a synchronization link from MySQL to ClickHouse. In this article, I will walk you through how to use DTS to smoothly and reliably sync data from MySQL to ClickHouse.

Advantages of Using Alibaba Cloud DTS for MySQL-to-ClickHouse Synchronization

To address the problems of using the MaterializedMySQL engine, Alibaba Cloud DTS has developed a synchronization link between MySQL and ClickHouse. Compared with MaterializedMySQL, DTS for MySQL-to-ClickHouse synchronization has the following advantages:

  • High compatibility: The destination database is compatible with ApsaraDB for ClickHouse Community Edition.
  • Visualized synchronization process: You can view the synchronization process in the Alibaba Cloud DTS console. For example, you can check whether the synchronization task is in the process of schema migration, full migration, or incremental migration. You can also view the synchronized RPS and BPS and check whether the synchronization task fails or has latency.
  • Support for migration of tables without primary keys.
  • Flexible selection of VPC networks: Alibaba Cloud DTS is a third-party migration tool. Therefore, the data source ApsaraDB RDS for MySQL cluster and the destination ApsaraDB for ClickHouse cluster may belong to different VPC networks.
  • Support for features such as ETL, object name mapping, and appended columns.
  • Support for data filtering. You can synchronize only the data of a table with specified conditions. You can also synchronize only certain DML or DDL operations.
  • You can pause and restart a synchronization task at any time. Resumable upload is supported. The destination databases and tables are not deleted after the task is deleted.
  • Support for synchronization of common DDL operations, such as creating a table, altering a table, truncating a table, and dropping a table.

Learning to Use Alibaba Cloud DTS for MySQL-to-ClickHouse Synchronization

When you are using Alibaba Cloud DTS to synchronize data from MySQL to ClickHouse, you may encounter a series of issues during the preparation, task creation, and O&M. This article compiles best practices and common errors to assist you in using Alibaba Cloud DTS for data migration with ease and to facilitate an efficient data analysis experience with ClickHouse.

1 Best Practices

1.1 Add Permissions to a Database Account

A common cause of failed DTS tasks, seen frequently in support tickets for ClickHouse links, is a database account with insufficient permissions. During the precheck, DTS verifies the permissions of the database account used to configure the task; if the verification fails, the task cannot start. Before configuring a task, grant the following permissions to the source and destination database accounts:

Database   | Schema Synchronization         | Full Synchronization           | Incremental Synchronization
MySQL      | The SELECT permission          | The SELECT permission          | The SELECT permission on the objects to be synchronized; the REPLICATION CLIENT, REPLICATION SLAVE, and SHOW VIEW permissions; and the permissions to create databases and tables, which allow DTS to create a database named dts to record heartbeat data during data synchronization
ClickHouse | The read and write permissions | The read and write permissions | The read and write permissions
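
As a sketch, the MySQL-side grants might look like the following. The account name, host, and source database name are placeholders; adjust them to your environment:

```sql
-- Hypothetical DTS account; replace the name, host, and password.
CREATE USER 'dts_sync'@'%' IDENTIFIED BY '********';

-- Schema/full synchronization: SELECT on the objects to be synchronized.
GRANT SELECT ON source_db.* TO 'dts_sync'@'%';

-- Incremental synchronization: binlog access and view definitions.
GRANT REPLICATION CLIENT, REPLICATION SLAVE, SHOW VIEW ON *.* TO 'dts_sync'@'%';

-- Allow DTS to create the dts heartbeat database and its tables.
GRANT CREATE ON *.* TO 'dts_sync'@'%';
```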

1.2 Try to Use the Alibaba Cloud DTS Console to Create Tasks

DTS MySQL-to-ClickHouse links have specific requirements for the dbList parameter. If you use the DTS console to configure tasks, a standardized dbList is generated for you. If you need to use the DTS OpenAPI to configure tasks for business reasons, you must specify dbList in the following format:

dbList={
  "source_db_name": {
    "name": "target_db_name",
    "all": false,
    "state": "open",
    "Table": {
      "source_table_name": {
        "name": "target_table_name",
        "all": true,
        "primary_key": "id",
        "message_key": "id",
        "partition_key": "sipHash64(id)",
        "part_key": "id",
        "type": "partition"
      }
    },
    "num": 1
  }
}

Among them,

  • primary_key refers to the primary key,
  • message_key refers to the sort key,
  • part_key refers to the partition key,
  • and partition_key refers to the distribution key, that is, the shard key.

1.3 Try to Use DTS for Schema Migration

The schema of ClickHouse has a significant impact on data writing performance and stability. When you use DTS to migrate data from MySQL to ClickHouse, you should use the schema migration feature provided by DTS to avoid errors caused by schemas that DTS does not support.

If you need to create a custom schema for business reasons, it must meet the following requirements of DTS:

  • For ClickHouse Community Edition, you must create both a local table and a distributed table. The name of the distributed table must be the same as that of the destination table in dbList, and the name of the local table must be the distributed table name suffixed with _local.
  • You must add two appended columns, _sign and _version. The following table describes them:

Column   | Data Type | Default Value | Description
_sign    | Int8      | 1             | Records the type of the DML operation: 1 for Insert, -1 for Delete, and 1 for Update
_version | UInt64    | 1             | The timestamp when the data is written to ClickHouse
  • DTS supports only engines in the MergeTree family. The specific requirements are as follows:

ClickHouse Type   | Cluster Type      | Source Has a Primary Key | Destination ClickHouse Engine
Community Edition | Single replica    | Yes                      | MergeTree
Community Edition | Single replica    | No                       | ReplacingMergeTree
Community Edition | Multiple replicas | Yes                      | ReplicatedMergeTree
Community Edition | Multiple replicas | No                       | ReplicatedReplacingMergeTree
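
For reference, a custom schema that meets these requirements on a single-replica Community Edition cluster whose source table has no primary key might look like the following sketch. The cluster name default and all database, table, and column names are assumptions:

```sql
-- Local table: holds the data on each node. Name = distributed table name + "_local".
CREATE TABLE target_db.orders_local ON CLUSTER default
(
    `id`       Int64,
    `amount`   Float64,
    `_sign`    Int8   DEFAULT 1,  -- DML type: 1 = insert/update, -1 = delete
    `_version` UInt64 DEFAULT 1   -- write timestamp used for versioning
)
ENGINE = ReplacingMergeTree(_version)
ORDER BY id;

-- Distributed table: its name must match the destination table name in dbList.
CREATE TABLE target_db.orders ON CLUSTER default
AS target_db.orders_local
ENGINE = Distributed(default, target_db, orders_local, sipHash64(id));
```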

1.4 Select the Appropriate Partition Key

When you configure a MySQL-to-ClickHouse migration task, you can specify a column as the partition key to separate data; ClickHouse creates a new file directory for each key value. Partitioning is a data management technique that lets you logically separate the data in a table, for example by day, and DROP PARTITION lets you quickly delete a subset of the data. However, if you select an inappropriate partition key, the number of parts created by ClickHouse may exceed the limits set by the parts_to_throw_insert and max_parts_in_total parameters, and the "Too many inactive parts (N). Parts cleaning are processing significantly slower than inserts" error may occur.

To avoid the preceding issue, the DTS schema migration has the following optimizations:

  • If you do not specify a partition key, no PARTITION BY clause is added when the table is created during schema migration; that is, the data is not partitioned.
  • If you specify a partition key, the specified column is used as the partition key.

    • If the source type of the specified partition key is BIGINT, a PARTITION BY intDiv(XXX, 18014398509481984) clause is added when the table is created.
    • If the source type of the specified partition key is TINYINT, SMALLINT, or MEDIUMINT, a PARTITION BY intDiv(XXX, 4194304) clause is added when the table is created.
    • If the source type is a time type (DATE, DATETIME, or TIMESTAMP), a PARTITION BY toYYYYMM(XXX) clause is added when the table is created.
    • For any other source type, no PARTITION BY clause is added when the table is created; that is, the data is not partitioned.

If you create the destination tables yourself instead of using DTS schema migration, you must select an appropriate partition key to avoid the "Too many inactive parts" error when DTS writes data.
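
For instance, for a DATETIME source column, the table generated by DTS would include a month-based partition key along these lines (database, table, and column names are illustrative):

```sql
CREATE TABLE target_db.events_local
(
    `id`          Int64,
    `update_time` DateTime,
    `_sign`       Int8   DEFAULT 1,
    `_version`    UInt64 DEFAULT 1
)
ENGINE = ReplacingMergeTree(_version)
-- One partition per month keeps the part count bounded.
PARTITION BY toYYYYMM(update_time)
ORDER BY id;
```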

1.5 Select an Appropriate Shard Key

During data writing, the distributed table distributes data to the local table on each host node according to the shard key. The shard key must evaluate to an integer value, such as a specific integer column. If the shard key value is not of integer type, an error is reported when the table is created during schema migration. For example, if you use a String column as the shard key, the error "Sharding expression has type String, but should be one of integer type" is reported.

To avoid the preceding issue, the DTS schema migration is optimized as follows:

  • If you do not specify a shard key, rand() is used as the shard key.
  • If you specify a shard key:

    • If the shard key consists of multiple columns, sipHash64(A, B, C) is used as the shard key.
    • If the shard key is a single column of integer type, the column itself is used as the shard key.
    • If the shard key is a single column that is not of integer type, sipHash64(A) is used as the shard key.
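
As an illustration, if the chosen shard key were the non-integer columns city and name, the distributed table created by DTS would wrap them in sipHash64 roughly as follows (cluster, database, table, and column names are hypothetical):

```sql
-- sipHash64 turns the non-integer columns into the integer value
-- that the Distributed engine requires for sharding.
CREATE TABLE target_db.users ON CLUSTER default
AS target_db.users_local
ENGINE = Distributed(default, target_db, users_local, sipHash64(city, name));
```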

2 Common Pitfalls

2.1 What to Do When Data is Duplicated?

ClickHouse does not enforce a unique primary key: even if multiple rows have the same primary key, ClickHouse allows them to be written. DTS may introduce duplicate data in the following two scenarios:

  • During full migration, DTS divides the data in a table into several slices and concurrently pulls and writes them. If you pause and then restart the task while DTS is migrating a slice, DTS re-migrates that slice from the beginning, which duplicates part of the slice's data in the destination.
  • During incremental migration, because mutations are heavy operations, DTS uses the same approach as the MaterializeMySQL engine: versioned updates and deletes based on the ReplacingMergeTree engine. When a row is updated, an identical row is inserted with _sign set to 1; when a row is deleted, an identical row is inserted with _sign set to -1. When parts are merged, ReplacingMergeTree removes duplicate rows, retaining the row with the largest _version value in each group of duplicates.

Based on the preceding principles, we can use the following methods to remove duplicate data:

  • Execute the OPTIMIZE TABLE xxx statement to force ClickHouse to merge parts and remove duplicate data.
  • Add FINAL when querying data. To filter out deleted rows, also add _sign > 0. For example: SELECT * FROM table_name FINAL WHERE _sign > 0;
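
Assuming the _sign/_version schema described in section 1.3 (database and table names are illustrative), the two approaches look like this:

```sql
-- Force a merge so ReplacingMergeTree drops superseded row versions.
-- OPTIMIZE ... FINAL can be expensive on large tables; run it off-peak.
OPTIMIZE TABLE target_db.orders_local FINAL;

-- Or deduplicate at query time: FINAL collapses rows that share a sort key,
-- and _sign > 0 filters out rows that were marked as deleted.
SELECT *
FROM target_db.orders FINAL
WHERE _sign > 0;
```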

2.2 Why Is the Data of DateTime Type Inconsistent with the Source?

The time types in ClickHouse (Date, Date32, DateTime, and DateTime64) are based on Unix timestamps. The following table lists the supported ranges of these types in ApsaraDB for ClickHouse (the values shown reflect the UTC+8 time zone). If a time value in ApsaraDB RDS for MySQL falls outside these ranges, it is converted into them, so the time in ClickHouse may be inconsistent with that in MySQL.

ClickHouse Data Type | Minimum Value       | Maximum Value
DateTime64           | 1925-01-01 08:00:00 | 2283-11-12 07:59:59
DateTime             | 1970-01-01 08:00:00 | 2106-02-07 14:28:15
Date32               | 1925-01-01          | 2283-11-11
Date                 | 1970-01-01          | 2149-06-06


Recommended Spec & Free Trial for ApsaraDB for ClickHouse:
https://www.alibabacloud.com/product/clickhouse
