Synchronize data from a PolarDB for MySQL cluster to a SelectDB instance for big data analytics

Data Transmission Service (DTS) synchronizes data from a PolarDB for MySQL cluster to a SelectDB instance, enabling real-time analytics on your operational data. SelectDB delivers sub-second query responses at massive scale, handles up to 10,000 point queries per second, and supports high-throughput complex analytics.

Prerequisites

Before you begin, make sure you have:

A SelectDB instance with available storage space larger than the storage used by the source PolarDB for MySQL cluster. To create one, see Create an instance.
Binary logging enabled on the source PolarDB for MySQL cluster, with the loose_polar_log_bin parameter set to ON. Otherwise, the precheck returns errors and the DTS task cannot be started. See Enable binary logging and Modify parameters.
Binary log retention period set to at least 3 days (7 days recommended) on the source cluster. With a shorter retention period, data inconsistency or loss may occur in exceptional circumstances, and the reliability and performance stated in the DTS Service Level Agreement (SLA) may not be guaranteed. See Modify the retention period.
Database accounts with the required permissions on both the source and destination databases. See Permissions required for database accounts.

Enabling binary logging on a PolarDB for MySQL cluster incurs storage charges for binary log space.

Limitations

Source database

Tables with primary keys or unique constraints: the values in the key fields must be unique; otherwise, the destination may contain duplicate data.
Tables without primary keys or unique constraints: select Schema Synchronization for Synchronization Types and duplicate for Engine in the Configurations for Databases, Tables, and Columns step.
When selecting tables (not entire databases) as synchronization objects, a single task supports up to 1,000 tables. For more than 1,000 tables, configure multiple tasks or synchronize at the database level.
Do not perform Data Definition Language (DDL) operations that change schemas during initial schema synchronization or initial full data synchronization. During full data synchronization, DTS queries the source database, which creates metadata locks that may block DDL operations.
Data changes not recorded in binary logs — such as data restored from a physical backup or generated by cascade operations — are not synchronized to the destination. If this occurs, remove the affected database or table from the synchronization objects and add it back. See Modify synchronization objects.
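
As an illustration of the 1,000-table limit above, the following minimal sketch splits a list of table names into batches that each fit in a single task. The function name plan_tasks and the sample table names are hypothetical and are not part of DTS; the sketch only plans the batches and does not create any tasks.

```python
from typing import List

MAX_TABLES_PER_TASK = 1000  # per-task table limit stated in the limitations above


def plan_tasks(tables: List[str], limit: int = MAX_TABLES_PER_TASK) -> List[List[str]]:
    """Split table names into consecutive batches of at most `limit` tables."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return [tables[i:i + limit] for i in range(0, len(tables), limit)]


if __name__ == "__main__":
    # Hypothetical table names, used only for illustration.
    tables = [f"orders_{n:04d}" for n in range(2500)]
    batches = plan_tasks(tables)
    print([len(batch) for batch in batches])  # [1000, 1000, 500]
```

If the number of batches grows large, synchronizing at the database level, as noted above, is likely simpler than maintaining many tasks.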

Unsupported objects and operations

Read-only nodes of the source PolarDB for MySQL cluster are not synchronized.
OSS external tables from the source cluster are not synchronized.
INDEX, PARTITION, VIEW, PROCEDURE, FUNCTION, TRIGGER, and FK are not synchronized.
Primary/secondary switchover is not supported during full synchronization. If a switchover occurs, reconfigure the task.
A single DDL statement that modifies multiple columns is not supported, and neither are consecutive DDL statements on the same table.
Online DDL changes using pt-online-schema-change are not supported. If such changes exist on the source, data loss may occur or the synchronization instance may fail.

Destination constraints

Only the Unique Key model and the Duplicate Key model are supported in the destination SelectDB instance.
Database and table names in SelectDB must start with a letter. Use the object name mapping feature to rename any that do not.
If a database, table, or column name contains Chinese characters, use the object name mapping feature to rename it (for example, rename it to an English equivalent). Otherwise, the task may fail.
Do not create clusters in the destination SelectDB instance during synchronization. If the task fails because of this, restart the data synchronization instance to resume.
Do not add backend nodes to the destination SelectDB instance during synchronization. If the task fails because of this, restart the data synchronization instance to resume.
Only the bucket_count parameter can be specified in the Selected Objects section. The value must be a positive integer. Default: auto.
In multi-table merge scenarios where data from multiple source tables is synchronized to the same destination table, all source tables must have the same schema. Mismatched schemas may cause data inconsistency or task failure.
In PolarDB for MySQL, M in VARCHAR(M) represents character length. In SelectDB, N in VARCHAR(N) represents byte length. If you are not using DTS schema synchronization, set SelectDB VARCHAR field lengths to four times the corresponding PolarDB for MySQL VARCHAR field lengths.
When using DMS or gh-ost to perform online DDL changes on the source, DTS synchronizes only the original DDL to the destination. This may cause table locks at the destination.
If data is written to the destination database from sources other than DTS during synchronization, data inconsistency may occur.
DTS executes CREATE DATABASE IF NOT EXISTS `test` in the source database at scheduled intervals to advance the binary log position.
During incremental synchronization, DTS uses a batch strategy that writes data for each synchronization object at most once every 5 seconds. This results in a normal synchronization latency of less than 10 seconds. To reduce this latency, adjust the selectdb.reservoir.timeout.milliseconds parameter in the DTS console. The allowed range is 1,000–10,000 milliseconds. Lower values increase write frequency, which may increase the load and response time (RT) of the destination.
If a DTS instance fails, the DTS helpdesk attempts recovery within 8 hours. During recovery, the instance may be restarted or its parameters adjusted. For parameters that may be modified, see Modify instance parameters.
Before synchronizing data, evaluate the performance of both databases and run synchronization during off-peak hours. Initial full data synchronization runs concurrent INSERT operations, which increases the load on both databases and causes table fragmentation in the destination, resulting in a larger tablespace than in the source.
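
The naming constraints in this list (database and table names must start with a letter, and names containing Chinese characters should be renamed) can be checked before you configure the task. The following is a minimal sketch, not a DTS feature. The function name needs_mapping is hypothetical, "letter" is assumed to mean an ASCII letter, and Chinese characters are approximated by the basic CJK Unified Ideographs range.

```python
import re

# Assumption: the basic CJK Unified Ideographs block approximates "Chinese characters".
_CJK = re.compile(r"[\u4e00-\u9fff]")


def needs_mapping(name: str, kind: str) -> bool:
    """Return True if `name` should be renamed with the object name mapping feature.

    `kind` is "database", "table", or "column".
    """
    if _CJK.search(name):
        return True
    if kind in ("database", "table"):
        first = name[:1]
        # Assumption: "letter" means an ASCII letter.
        return not (first.isascii() and first.isalpha())
    return False


if __name__ == "__main__":
    # Hypothetical object names, used only for illustration.
    for name, kind in [("orders", "table"), ("_tmp", "table"), ("1db", "database")]:
        print(name, kind, needs_mapping(name, kind))
```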

Unique Key model

All unique keys in the destination table must exist in the source table and be included in the synchronization objects. Otherwise, data inconsistency may occur.

Duplicate Key model

Duplicate data may appear in the destination if any of the following occurs:

A retry operation in a data synchronization instance
Two or more Data Manipulation Language (DML) operations on the same row after the synchronization instance starts

To deduplicate data, use the additional columns _is_deleted, _version, and _record_id. For column details, see Additional column information.

For Duplicate Key model tables, DTS converts UPDATE and DELETE statements to INSERT statements.
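
One possible way to deduplicate such rows is sketched below on in-memory data. It keeps, for each key, the row with the highest (_version, _record_id) pair and then drops rows whose _is_deleted value is 1. The ordering rule, the function name latest_rows, and the sample rows are assumptions made for illustration; this is not a documented DTS or SelectDB procedure, and in practice the same logic would more likely be written as a query in SelectDB.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def latest_rows(rows: Iterable[Row], key_columns: Tuple[str, ...]) -> List[Row]:
    """Keep the newest row per key, then drop rows marked as deleted."""
    latest: Dict[tuple, Row] = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        current = latest.get(key)
        # Assumption: a larger (_version, _record_id) pair means a newer change.
        order = (row["_version"], row["_record_id"])
        if current is None or order > (current["_version"], current["_record_id"]):
            latest[key] = row
    return [r for r in latest.values() if r["_is_deleted"] == 0]


if __name__ == "__main__":
    # Hypothetical rows: id 1 is updated after full sync, id 2 is deleted.
    rows = [
        {"id": 1, "name": "a", "_is_deleted": 0, "_version": 0, "_record_id": 0},
        {"id": 1, "name": "b", "_is_deleted": 0, "_version": 1700000000, "_record_id": 10},
        {"id": 2, "name": "c", "_is_deleted": 0, "_version": 0, "_record_id": 0},
        {"id": 2, "name": "c", "_is_deleted": 1, "_version": 1700000005, "_record_id": 11},
    ]
    print(latest_rows(rows, ("id",)))
```

In this example only the updated row for id 1 remains, because the newest row for id 2 is a DELETE marker.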

Billing

Synchronization type	Task configuration fee
Schema synchronization and full data synchronization	Free of charge
Incremental data synchronization	Charged. See Billing overview.

SQL operations that support incremental synchronization

Operation type	SQL statement
DML	INSERT, UPDATE, DELETE
DDL	ADD COLUMN, MODIFY COLUMN, CHANGE COLUMN, DROP COLUMN, DROP TABLE, TRUNCATE TABLE, RENAME TABLE

Important

The RENAME TABLE operation may cause data inconsistency. If you rename a table during synchronization and only the table (not the entire database) was selected as the synchronization object, the data of the renamed table is not synchronized. To prevent this, select the database as the synchronization object, and make sure that the databases containing the table before and after the rename are both included in the synchronization objects.

Permissions required for database accounts

Database	Required permissions	How to configure
Source: PolarDB for MySQL	Read and write permissions on the objects to be synchronized	Create and manage a database account and Manage the password of a database account
Destination: SelectDB	USAGE_PRIV on the cluster; SELECT_PRIV, LOAD_PRIV, ALTER_PRIV, CREATE_PRIV, and DROP_PRIV on the database	Cluster Permission Management and Basic Permission Management

Create a data synchronization task

Step 1: Go to the Data Synchronization page

Use one of the following methods to open the Data Synchronization page.

DTS console

Log on to the DTS console.
In the left-side navigation pane, click Data Synchronization.
In the upper-left corner, select the region where the data synchronization instance resides.

Data Management (DMS) console

The exact navigation path may vary based on the DMS console mode and layout. See Simple mode and Customize the layout and style of the DMS console.

Log on to the DMS console.
In the top navigation bar, move the pointer over Data + AI and choose DTS (DTS) > Data Synchronization.
From the drop-down list to the right of Data Synchronization Tasks, select the region where the instance resides.

Step 2: Configure source and destination databases

Click Create Task.

Configure the task name and source database.

Parameter	Description
Task Name	DTS generates a name automatically. Specify a descriptive name to make the task easy to identify. The name does not need to be unique.
Select Existing Connection (source)	If the source instance is registered with DTS, select it from the list. DTS automatically populates the connection parameters. Otherwise, configure the parameters manually. See Manage database connections.
Database Type	Select PolarDB for MySQL.
Access Method	Select Alibaba Cloud Instance.
Instance Region	Select the region where the source PolarDB for MySQL cluster resides.
Replicate Data Across Alibaba Cloud Accounts	Select No to synchronize within the same Alibaba Cloud account.
PolarDB Cluster ID	Select the ID of the source PolarDB for MySQL cluster.
Database Account	Enter the account for the source cluster. See Permissions required for database accounts.
Database Password	Enter the password for the database account.
Encryption	Select an encryption method as needed. For Secure Sockets Layer (SSL) encryption, see Set SSL encryption.

Configure the destination database.

Parameter	Description
Select Existing Connection (destination)	If the destination instance is registered with DTS, select it from the list. DTS automatically populates the connection parameters. Otherwise, configure the parameters manually. See Manage database connections.
Database Type	Select SelectDB.
Access Method	Select Alibaba Cloud Instance.
Instance Region	Select the region where the destination SelectDB instance resides.
Replicate Data Across Alibaba Cloud Accounts	Select No to synchronize within the same Alibaba Cloud account.
Instance ID	Select the ID of the destination SelectDB instance.
Database Account	Enter the account for the destination instance. See Permissions required for database accounts.
Database Password	Enter the password for the database account.

Click Test Connectivity and Proceed.
DTS server CIDR blocks must be added to the security settings of both the source and destination databases. DTS can add them automatically, or you can add them manually. See Add DTS server IP addresses to a whitelist. If the Access Method of the source or destination database is not Alibaba Cloud Instance, click Test Connectivity in the CIDR Blocks of DTS Servers dialog box.

Step 3: Configure synchronization objects

In the Configure Objects step, set the following parameters.

Parameter	Description
Synchronization Types	By default, Incremental Data Synchronization is selected. Also select Schema Synchronization and Full Data Synchronization. After the precheck, DTS synchronizes historical data from the source to the destination as the basis for subsequent incremental synchronization. Important: When synchronizing from PolarDB for MySQL to SelectDB, data types are converted. If you do not select Schema Synchronization, create Unique or Duplicate model tables with appropriate schemas in the destination SelectDB instance in advance. See Data type mappings, Additional column information, and Data models.
Processing Mode of Conflicting Tables	Precheck and Report Errors: checks for tables with the same name in the destination. If a conflict exists, an error is reported during the precheck and the task does not start. To resolve conflicts without deleting or renaming the destination table, use the object name mapping feature. See Map database, table, and column names. Ignore Errors and Proceed: skips the check. Warning: This option may cause data inconsistency. If schemas match, DTS overwrites destination records with the same primary key or unique key. If schemas differ, data initialization may fail or only some columns may be synchronized.
Capitalization of Object Names in Destination Instance	Controls the capitalization of database, table, and column names in the destination. Default: DTS default policy. See Specify the capitalization of object names in the destination instance.
Source Objects	Select one or more objects and click the icon to move them to Selected Objects. You can select databases or tables.
Selected Objects	To rename a synchronization object, right-click it. See Map the name of a single object. If Schema Synchronization is selected, right-click a table in Selected Objects, set Enable Parameter Settings to Yes, specify a value for `bucket_count`, and click OK. To filter SQL operations or rows for a specific object, right-click it and configure the options. For row filtering, see Specify filter conditions. Note: Renaming an object with the object name mapping feature may cause dependent objects to fail synchronization.

Click Next: Advanced Settings and configure the following parameters.

Parameter	Description
Dedicated Cluster for Task Scheduling	By default, the task is scheduled to the shared cluster. For higher stability, purchase a dedicated cluster. See What is a DTS dedicated cluster.
Retry Time for Failed Connections	How long DTS keeps retrying if the connection to the source or destination database fails after the task starts. Valid range: 10–1,440 minutes. Default: 720 minutes. Set to a value greater than 30 minutes. If DTS reconnects within the retry period, the task resumes. Otherwise, the task fails. Note: If multiple tasks share the same source or destination database, the shortest retry time takes precedence. DTS charges apply during the retry period.
Retry Time for Other Issues	How long DTS keeps retrying failed DDL or DML operations. Valid range: 1–1,440 minutes. Default: 10 minutes. Set to a value greater than 10 minutes. This value must be less than Retry Time for Failed Connections.
Enable Throttling for Full Data Synchronization	Limits read/write resource usage during full data synchronization. When enabled, configure Queries per second (QPS) to the source database, RPS of Full Data Migration, and Data migration speed for full migration (MB/s). Available only if Full Data Synchronization is selected.
Enable Throttling for Incremental Data Synchronization	Limits resource usage during incremental synchronization. When enabled, configure RPS of Incremental Data Synchronization and Data synchronization speed for incremental synchronization (MB/s).
Environment Tag	Optionally select a tag to identify the instance.
Whether to delete SQL operations on heartbeat tables of forward and reverse tasks	Controls whether DTS writes heartbeat table SQL operations to the source database. Yes: does not write heartbeat operations (a latency may be displayed for the DTS instance). No: writes heartbeat operations (may affect features such as physical backup and cloning of the source database).
Configure ETL	Enables the extract, transform, and load (ETL) feature. See What is ETL? Yes: configure ETL with data processing statements. See Configure ETL in a data migration or data synchronization task. No: skip ETL configuration.
Monitoring and Alerting	Configures alerts for task failures or synchronization latency exceeding a threshold. Yes: configure the alert threshold and notification settings. See Configure monitoring and alerting when you create a DTS task. No: alerting is disabled.
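
The two retry settings in this table interact: each has its own valid range, and Retry Time for Other Issues must be less than Retry Time for Failed Connections. The following minimal sketch checks a pair of values against those rules before you enter them in the console. The function name validate_retry_settings is hypothetical and is not part of any DTS API.

```python
from typing import List


def validate_retry_settings(connection_minutes: int, other_minutes: int) -> List[str]:
    """Return a list of problems; an empty list means both values are acceptable."""
    problems: List[str] = []
    if not 10 <= connection_minutes <= 1440:
        problems.append("Retry Time for Failed Connections must be 10-1,440 minutes")
    if not 1 <= other_minutes <= 1440:
        problems.append("Retry Time for Other Issues must be 1-1,440 minutes")
    if other_minutes >= connection_minutes:
        problems.append(
            "Retry Time for Other Issues must be less than "
            "Retry Time for Failed Connections"
        )
    return problems


if __name__ == "__main__":
    print(validate_retry_settings(720, 10))  # defaults from the table: []
    print(validate_retry_settings(30, 30))   # one problem: values are equal
```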

(Optional) Click Next: Configure Database and Table Fields. In the dialog box, specify Primary Key Column, Distribution Key, and Engine for the tables to synchronize.
This step is available only if Schema Synchronization is selected. Set Definition Status to All to view and modify all tables. You can select multiple columns for Primary Key Column. One or more of those columns can be used for Distribution Key. For tables without primary keys or unique constraints, select duplicate for Engine. Otherwise, the task may fail or data loss may occur.

Step 4: Run the precheck

Click Next: Save Task Settings and Precheck. To preview the API parameters for this task configuration, hover over the button and click Preview OpenAPI parameters.
Wait for the precheck to complete.
- If the precheck fails, click View Details next to each failed item, resolve the issue, then click Precheck Again.
- If an alert is triggered:
  - For alerts that cannot be ignored, click View Details, resolve the issue, and run the precheck again.
  - For alerts that can be ignored, click Confirm Alert Details, click Ignore in the dialog box, click OK, then click Precheck Again. Ignoring an alert may result in data inconsistency.
The task cannot start until it passes the precheck.

Step 5: Purchase and start the instance

Wait until Success Rate reaches 100%, then click Next: Purchase Instance.

On the buy page, configure the following parameters.

Parameter	Description
Billing Method	Subscription: pay upfront for a fixed period. More cost-effective for long-term use. Pay-as-you-go: billed hourly. Suitable for short-term use. Release the instance when it is no longer needed to stop charges.
Resource Group Settings	The resource group for the instance. Default: default resource group. See What is Resource Management?
Instance Class	Select a class based on your synchronization speed requirements. See Instance classes of data synchronization instances.
Subscription Duration	Available only for the Subscription billing method. Options: 1–9 months, 1 year, 2 years, 3 years, or 5 years.

Read and select Data Transmission Service (Pay-as-you-go) Service Terms.
Click Buy and Start, then click OK in the dialog box.

The task appears in the task list. You can monitor its progress there.

Data type mappings

Category	PolarDB for MySQL type	SelectDB type	Notes
Numeric	TINYINT	TINYINT
	TINYINT UNSIGNED	SMALLINT
	SMALLINT	SMALLINT
	SMALLINT UNSIGNED	INT
	MEDIUMINT	INT
	MEDIUMINT UNSIGNED	BIGINT
	INT	INT
	INT UNSIGNED	BIGINT
	BIGINT	BIGINT
	BIGINT UNSIGNED	LARGEINT
	BIT(M)	INT
Decimal	DECIMAL	DECIMAL	The zerofill attribute is not supported.
	NUMERIC	DECIMAL
	FLOAT	FLOAT
	DOUBLE	DOUBLE
	BOOL, BOOLEAN	BOOLEAN
Date and time	DATE	DATEV2
	DATETIME[(fsp)]	DATETIMEV2
	TIMESTAMP[(fsp)]	DATETIMEV2
	TIME[(fsp)]	VARCHAR
	YEAR[(4)]	INT
String	CHAR, VARCHAR	VARCHAR	To prevent data loss, CHAR and VARCHAR(n) values are converted to VARCHAR(4*n) in SelectDB. If no length is specified, the default VARCHAR(65533) is used. If the length exceeds 65,533, the value is converted to STRING.
	BINARY, VARBINARY	STRING
	TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT	STRING
	TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB	STRING
	ENUM	STRING
	SET	STRING
	JSON	STRING
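
If you create destination tables manually instead of using Schema Synchronization, the CHAR and VARCHAR rule in the table above can be expressed as a small helper. This is a sketch under one stated assumption: "the length exceeds 65,533" is read as referring to the converted length 4*n. The function name map_char_type is hypothetical.

```python
from typing import Optional

MAX_VARCHAR = 65533  # default and maximum VARCHAR length given in the table above


def map_char_type(n: Optional[int]) -> str:
    """Map a PolarDB for MySQL CHAR/VARCHAR(n) length to a SelectDB type."""
    if n is None:
        # No length specified: the default VARCHAR(65533) is used.
        return f"VARCHAR({MAX_VARCHAR})"
    converted = 4 * n  # character length to byte length, per the 4x rule
    if converted > MAX_VARCHAR:
        # Assumption: the limit applies to the converted length.
        return "STRING"
    return f"VARCHAR({converted})"


if __name__ == "__main__":
    print(map_char_type(50))     # VARCHAR(200)
    print(map_char_type(None))   # VARCHAR(65533)
    print(map_char_type(20000))  # STRING
```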

Additional column information

When synchronizing to a Duplicate Key model table, DTS automatically adds the following columns or requires you to add them manually.

Name	Data type	Description
`_is_deleted`	Int	Indicates whether the row is deleted. INSERT and UPDATE operations set this to 0. DELETE operations set this to 1.
`_version`	Bigint	For full synchronization data: 0. For incremental synchronization data: the UNIX timestamp (in seconds) from the source binary log.
`_record_id`	Bigint	For full synchronization data: 0. For incremental synchronization data: the unique record ID from the incremental log. The value is unique and increments monotonically.