Synchronize a self-managed MySQL to a self-managed Kafka using Data Transmission Service

Kafka is a distributed message queue service that features high throughput and high scalability. It is widely used in big data scenarios such as log collection, monitoring data aggregation, stream processing, and online and offline analysis, and is an important part of the big data ecosystem. This topic describes how to use Data Transmission Service (DTS) to synchronize data from a self-managed MySQL database that is connected over Express Connect, VPN Gateway, or Smart Access Gateway to a self-managed Kafka cluster. Synchronizing data to Kafka lets you extend your message processing capabilities.

Prerequisites

The version of the self-managed Kafka cluster is 0.10.1.0 to 2.7.0.
The engine version of the self-managed MySQL database is 5.1, 5.5, 5.6, 5.7, or 8.0.
The self-managed MySQL database must be connected to an Alibaba Cloud VPC. For instructions, see Connect to DTS from on-premises via CEN.
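The version requirements above can be checked before you purchase a task. The following is a minimal sketch, not part of DTS: the function names are hypothetical, and the version strings are assumed to look like the output of `SELECT VERSION()` for MySQL (for example `5.7.44-log`) and a dotted release number for Kafka.

```python
# Supported ranges as stated in the prerequisites.
KAFKA_MIN = (0, 10, 1, 0)
KAFKA_MAX = (2, 7, 0, 0)
MYSQL_SERIES = {(5, 1), (5, 5), (5, 6), (5, 7), (8, 0)}


def _kafka_tuple(version):
    """Parse '2.7.0' into a 4-part tuple, padding missing parts with zeros."""
    parts = tuple(int(p) for p in version.split("."))
    return parts + (0,) * (4 - len(parts))


def kafka_version_supported(version):
    """Return True if the Kafka version is within 0.10.1.0 to 2.7.0."""
    return KAFKA_MIN <= _kafka_tuple(version) <= KAFKA_MAX


def mysql_version_supported(version):
    """Return True if the MySQL release series is 5.1, 5.5, 5.6, 5.7, or 8.0."""
    major, minor = version.split(".")[:2]
    return (int(major), int(minor)) in MYSQL_SERIES


if __name__ == "__main__":
    print(kafka_version_supported("2.4.1"))
    print(mysql_version_supported("8.0.36"))
```

Only the first two components of the MySQL version are compared, because the prerequisite names release series rather than patch levels.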

Usage notes

During initial full data synchronization, DTS consumes read and write resources of the source and destination databases, which increases the database load. If database performance is poor, instance specifications are low, or business traffic is heavy (for example, the source database has many slow SQL queries or tables without primary keys, or the destination database experiences deadlocks), the added load may even make the database service unavailable. Before you synchronize data, evaluate the performance of the source and destination databases. We recommend that you synchronize data during off-peak hours, for example, when the CPU utilization of both databases is below 30%.
If a source table lacks a primary key or unique constraint, and no combination of its columns is unique, duplicate data may be written to the destination database.

Billing

Synchronization type	Pricing
Schema synchronization and full data synchronization	Free of charge.
Incremental data synchronization	Charged. For more information, see Billing overview.

Limitations

Only tables can be synchronized. Other object types are not supported.
Automatic adjustment of synchronization objects is not supported. If you rename a synchronized table and the new name is not included in the list of synchronization objects, data from that table is no longer synchronized to the destination Kafka cluster. To continue synchronizing the renamed table, you must use Reselect Objects to add it to the synchronization objects. For more information, see Add synchronization objects.

Supported synchronization topologies

One-way one-to-one synchronization
One-way one-to-many synchronization
One-way many-to-one synchronization
One-way cascade synchronization

Before you begin

Before you configure the synchronization task, you must create a database account on the self-managed MySQL database and configure binary logging.
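The preparation can be sketched as a small helper that prints the SQL to run on the source database. This is an illustration under stated assumptions: the account name, host pattern, and schema are hypothetical, and the permissions are the ones listed for the Database Account parameter later in this topic. The expected binary logging values (`log_bin=ON`, `binlog_format=ROW`, `binlog_row_image=FULL`) are assumptions based on what log-based replication tools commonly require; this topic does not list them, so confirm them against the DTS preparation documentation.

```python
def build_account_statements(user, host, password, schema):
    """Return SQL that creates the account and grants the required permissions.

    Values are interpolated without escaping, so pass trusted input only.
    """
    account = f"'{user}'@'{host}'"
    return [
        f"CREATE USER {account} IDENTIFIED BY '{password}';",
        # SELECT and SHOW VIEW can be scoped to the schema being synchronized.
        f"GRANT SELECT, SHOW VIEW ON `{schema}`.* TO {account};",
        # Replication privileges are global in MySQL and must be granted ON *.*.
        f"GRANT REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO {account};",
    ]


# Assumed target values (not stated in this topic). binlog_row_image only
# exists in MySQL 5.6 and later, so drop it for 5.1 and 5.5.
EXPECTED_BINLOG = {
    "log_bin": "ON",
    "binlog_format": "ROW",
    "binlog_row_image": "FULL",
}


def binlog_problems(variables):
    """Compare SHOW VARIABLES output (name -> value) with the assumed targets."""
    problems = []
    for name, wanted in EXPECTED_BINLOG.items():
        actual = str(variables.get(name, "<unset>")).upper()
        if actual != wanted:
            problems.append(f"{name} is {actual}, expected {wanted}")
    return problems


if __name__ == "__main__":
    for statement in build_account_statements("dts_sync", "%", "change-me", "shop"):
        print(statement)
```

To use `binlog_problems`, collect the rows returned by `SHOW VARIABLES LIKE 'log_bin'` and `SHOW VARIABLES LIKE 'binlog%'` into a dictionary and pass it in; an empty list means the assumed settings are all in place.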

Procedure

Purchase a data synchronization task. For more information, see Purchase procedure.

Note When you purchase the task, set the source instance to MySQL, the destination instance to Kafka, and the synchronization topology to One-way Synchronization.
Log on to the DTS console.

Note
If you are automatically redirected to the Data Management (DMS) console, you can click the icon in the lower-right corner and then click to return to the classic DTS console.
In the left-side navigation pane, click Data Synchronization.
At the top of the Synchronization Tasks page, select the region where your destination instance is located.
Find the data synchronization task that you purchased and click Configure Task.

Configure the source and destination instances.

Section	Parameter	Description
None	Synchronization Task Name	The task name that DTS automatically generates. We recommend that you specify a descriptive name that makes it easy to identify the task. You do not need to use a unique task name.
Source Instance Details	Instance Type	Select User-Created Database Connected over Express Connect, VPN Gateway, or Smart Access Gateway.
	Instance Region	The source region that you selected on the buy page. The value of this parameter cannot be changed.
	Peer VPC	The ID of the VPC that is connected to the self-managed MySQL database.
	Database Type	The value is fixed to MySQL and cannot be changed.
	IP Address	The server IP address of the self-managed MySQL database.
	Port Number	The service port number of the self-managed MySQL database. Default value: 3306.
	Database Account	The account of the self-managed MySQL database. The account must have the SELECT permission on the required objects and the REPLICATION CLIENT, REPLICATION SLAVE, and SHOW VIEW permissions.
	Database Password	The password of the database account.
Destination Instance Details	Instance Type	Select the instance type based on where your Kafka cluster is deployed. This topic uses User-Created Database in ECS Instance as an example. Note If you select other instance types, you must perform additional preparations. For more information, see Preparation overview.
	Instance Region	The destination region that you selected on the buy page. The value of this parameter cannot be changed.
	ECS Instance ID	The ID of the Elastic Compute Service (ECS) instance on which the Kafka cluster is deployed.
	Database Type	Select Kafka.
	Port Number	The service port number of the Kafka cluster. Default value: 9092.
	Database Account	The username that is used to log on to the Kafka cluster. If no authentication is enabled for the Kafka cluster, you do not need to enter the username.
	Database Password	The password of the username. If no authentication is enabled for the Kafka cluster, you do not need to enter the password.
	Topic	Click Get Topic List and select a topic name from the drop-down list.
	Kafka Version	The version of the destination Kafka cluster.
	Encryption	Select Non-encrypted or SCRAM-SHA-256 based on your requirements.
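Before you enter the destination parameters in the console, it can help to sanity-check them locally. The sketch below is hypothetical and does not call any DTS API: the class name, field names, and the rule that SCRAM-SHA-256 needs credentials are assumptions derived from the parameter descriptions in the table above.

```python
from dataclasses import dataclass
from typing import List

# The two options offered by the Encryption parameter.
ENCRYPTION_OPTIONS = ("Non-encrypted", "SCRAM-SHA-256")


@dataclass
class KafkaDestination:
    ecs_instance_id: str
    topic: str
    kafka_version: str
    port: int = 9092          # default service port from the table
    username: str = ""        # leave empty if authentication is disabled
    password: str = ""
    encryption: str = "Non-encrypted"

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means no issue was found."""
        found = []
        if not 1 <= self.port <= 65535:
            found.append("port must be between 1 and 65535")
        if bool(self.username) != bool(self.password):
            found.append("username and password must be set together")
        if self.encryption not in ENCRYPTION_OPTIONS:
            found.append("unknown encryption option: " + self.encryption)
        if self.encryption == "SCRAM-SHA-256" and not self.username:
            found.append("SCRAM-SHA-256 requires a username and password")
        if not self.topic:
            found.append("topic is required")
        return found


if __name__ == "__main__":
    dest = KafkaDestination("i-example", "orders", "2.7.0")
    print(dest.problems())
```

The check is deliberately offline: it catches inconsistent input, but only the DTS precheck confirms that the cluster is actually reachable.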

In the lower-right corner of the page, click Set Whitelist and Next.

How the CIDR blocks of DTS servers are added depends on where the source or destination database is deployed:
Alibaba Cloud database instance, such as an ApsaraDB RDS for MySQL or ApsaraDB for MongoDB instance: DTS automatically adds the CIDR blocks of DTS servers to the IP address whitelist of the instance.
Self-managed database hosted on an Elastic Compute Service (ECS) instance: DTS automatically adds the CIDR blocks of DTS servers to the security group rules of the ECS instance, and you must make sure that the ECS instance can access the database. If the self-managed database is hosted on multiple ECS instances, you must manually add the CIDR blocks of DTS servers to the security group rules of each ECS instance.
Self-managed database that is deployed in a data center or provided by a third-party cloud service provider: you must manually add the CIDR blocks of DTS servers to the IP address whitelist of the database to allow DTS to access the database.
For more information, see Whitelist DTS server IP addresses.
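When you add CIDR blocks manually, you can verify that every required block is covered by your existing whitelist or security group rules. The following sketch uses only the Python standard library. The CIDR values are placeholder documentation ranges, not real DTS server addresses; take the actual list from Whitelist DTS server IP addresses.

```python
import ipaddress


def uncovered_blocks(required, allowed):
    """Return the required CIDR blocks that no allowed block fully contains."""
    allowed_nets = [ipaddress.ip_network(cidr) for cidr in allowed]
    missing = []
    for cidr in required:
        net = ipaddress.ip_network(cidr)
        covered = any(
            # subnet_of() raises TypeError across IP versions, so compare versions first.
            net.version == candidate.version and net.subnet_of(candidate)
            for candidate in allowed_nets
        )
        if not covered:
            missing.append(cidr)
    return missing


if __name__ == "__main__":
    # Placeholder values for illustration only.
    required = ["192.0.2.0/25", "198.51.100.0/24"]
    allowed = ["192.0.2.0/24"]
    print(uncovered_blocks(required, allowed))
```

An empty result means every required block is already allowed; any block that is returned still has to be added to the whitelist or security group.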

Warning
Adding the public IP address blocks of the DTS service, either automatically or manually, may pose security risks. By using this product, you acknowledge that you understand and accept the potential security risks and that you must implement basic security measures. These measures include, but are not limited to, strengthening password security, limiting the ports open to each CIDR block, using authentication for internal API calls, and regularly checking and restricting unnecessary CIDR blocks. Alternatively, you can connect through a private network using a leased line, VPN Gateway, or Smart Access Gateway.

Configure the synchronization objects.

Parameter	Description
Data Format in Kafka	Data synchronized to the Kafka cluster is stored in Avro or Canal JSON format. For more information, see Data formats in a message queue.
Policy for shipping data to Kafka partitions	Select the policy that meets your business requirements. For a detailed description, see Policies for synchronizing data to Kafka partitions.
Synchronization objects	In the Source Objects box, select the objects to synchronize (tables are the finest granularity), and then click the icon to move them to the Selected box. Note DTS automatically maps the table name to the topic that you selected when you configured the destination instance. To change the destination topic for a table, use the object name mapping feature. For more information, see Set object names in the destination instance.
Object name mapping	Change the names of synchronized objects in the destination instance. For more information, see Map databases, tables, and columns.
Retry time for failed connections	If DTS cannot connect to the source or destination instance, it retries for 720 minutes (12 hours) by default. You can also specify a custom retry duration. If DTS reconnects to the source or destination instance within the specified duration, the synchronization task automatically resumes. Otherwise, the task fails. Note You are billed for task run time during connection retries. Customize the retry duration based on your business needs, or release the DTS instance as soon as the source and destination instances are released.
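To illustrate what a downstream consumer does with the Canal JSON option, the sketch below parses one message and yields a record per changed row. The field names (`data`, `old`, `database`, `table`, `type`, `isDdl`, `pkNames`) follow the layout of the open-source Canal JSON format and are an assumption here; the sample message is invented, and the exact fields that DTS writes are described in Data formats in a message queue.

```python
import json

# Invented sample message in the assumed Canal JSON layout.
SAMPLE = """{
  "data": [{"id": "1", "name": "alice"}],
  "old": null,
  "database": "shop",
  "table": "customers",
  "type": "INSERT",
  "isDdl": false,
  "pkNames": ["id"],
  "es": 1700000000000,
  "ts": 1700000000123
}"""


def iter_changes(raw):
    """Yield one dict per changed row in a Canal JSON message (str or bytes)."""
    msg = json.loads(raw)
    if msg.get("isDdl"):
        return  # DDL messages carry no row images
    rows = msg.get("data") or []
    # "old" holds the previous values of changed columns for UPDATE messages.
    olds = msg.get("old") or [None] * len(rows)
    pk_names = msg.get("pkNames") or []
    for row, old in zip(rows, olds):
        yield {
            "op": msg["type"],
            "table": msg["database"] + "." + msg["table"],
            "key": tuple(row.get(name) for name in pk_names),
            "after": row,
            "changed_from": old,
        }


if __name__ == "__main__":
    for change in iter_changes(SAMPLE):
        print(change["op"], change["table"], change["key"])
```

In a real consumer, `raw` would be the value of each record read from the destination topic; the Kafka client itself is left out so that the parsing logic can be run on its own.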

After you complete the preceding configurations, click Next in the lower-right corner of the page.

Configure advanced settings for initial synchronization.

Parameter	Description
Initial synchronization	By default, both Initial Schema Synchronization and Initial Full Data Synchronization are selected. Before synchronizing incremental data, DTS synchronizes the schema and existing data of the selected objects to the destination.
Filter options	By default, Ignore DDL in incremental synchronization phase is selected. This means that DTS does not synchronize DDL operations performed on the source database during incremental data synchronization.

After completing the preceding configurations, click Precheck and Start in the lower-right corner of the page.
Note
- A precheck runs before the synchronization task starts. The task can start only after the precheck passes.
- If the precheck fails, click the icon next to the failed item to view the details.
  
  You can fix the issues based on the cause and run the precheck again.
  
  If you do not need to fix the items that triggered warnings, you can click Ignore or Ignore Warnings and Rerun Precheck to skip the warnings and run the precheck again.
After the Precheck dialog box shows Precheck Passed, close the Precheck dialog box. The data synchronization task starts.
On the Data Synchronization page, the task list displays key information, including Instance ID/Task Name, Status (for example, Synchronizing), Synchronization Overview (including latency and speed), Billing Method, and Synchronization Topology. The Actions column provides options such as Pause Synchronization, Convert to Subscription, and Upgrade.