Data Transmission Service (DTS) lets you continuously sync data from a PolarDB for MySQL cluster to a self-managed Doris database for large-scale analytics. This topic walks through the setup using a Doris database deployed on an Elastic Compute Service (ECS) instance.
Prerequisites
Before you begin, make sure you have:
A destination Doris database with available storage space larger than the storage used by the source PolarDB for MySQL cluster
Binary logging enabled on the source PolarDB for MySQL cluster, with the `loose_polar_log_bin` parameter set to ON (see Enable binary logging and Modify parameters)
A binary log retention period of at least three days (seven days recommended)
Database accounts with the required permissions (see Permissions required)
Enabling binary logging on a PolarDB for MySQL cluster incurs storage charges for binary log files.
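To verify the source side, you can check the effective binary logging status with a standard MySQL statement. This assumes `loose_polar_log_bin` has already been set to ON in the console and the cluster has restarted:

```sql
-- Expect Value = ON once binary logging is in effect on the source cluster.
SHOW VARIABLES LIKE 'log_bin';
```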
For supported source and destination database versions, see Overview of data synchronization solutions.
Permissions required
| Database | Required permissions | Reference |
|---|---|---|
| Source PolarDB for MySQL cluster | Read and write permissions on the objects to be synchronized | Create and manage a database account |
| Destination Doris database | USAGE_PRIV and the following: SELECT_PRIV, LOAD_PRIV, ALTER_PRIV, CREATE_PRIV, DROP_PRIV | Authentication and authorization |
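The following statements sketch how such accounts might be granted these permissions. The account names, host, and database scope are placeholders, and the accounts are assumed to already exist (PolarDB accounts are typically created in the console):

```sql
-- Source (PolarDB for MySQL): read and write on the objects to synchronize.
-- 'dts_source' and mydb are placeholder names.
GRANT SELECT, INSERT, UPDATE, DELETE ON mydb.* TO 'dts_source'@'%';

-- Destination (Doris): the privileges listed above ('dts_dest' is a placeholder).
GRANT SELECT_PRIV, LOAD_PRIV, ALTER_PRIV, CREATE_PRIV, DROP_PRIV
ON *.* TO 'dts_dest'@'%';
-- USAGE_PRIV is granted at the resource level in Doris, for example:
-- GRANT USAGE_PRIV ON RESOURCE '<resource_name>' TO 'dts_dest'@'%';
```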
Billing
| Synchronization type | Cost |
|---|---|
| Schema synchronization and full data synchronization | Free |
| Incremental data synchronization | Charged. See Billing overview |
Supported SQL operations for incremental synchronization
| Operation type | Statements |
|---|---|
| DML (Data Manipulation Language) | INSERT, UPDATE, DELETE |
| DDL (Data Definition Language) | ADD COLUMN, MODIFY COLUMN, CHANGE COLUMN, DROP COLUMN, DROP TABLE, TRUNCATE TABLE, RENAME TABLE |
The RENAME TABLE operation can cause data inconsistency. If you rename a table that is selected as a synchronization object, its data stops syncing to the destination. To avoid this, select the entire database as the synchronization object instead of individual tables, and make sure that both the original and renamed table belong to databases included in the synchronization scope.
Limitations
Source database
Tables with PRIMARY KEY or UNIQUE constraints: all fields must be unique, or the destination may contain duplicate data.
Tables without PRIMARY KEY or UNIQUE constraints: select Schema Synchronization for Synchronization Types and duplicate for Engine in the Configurations for Databases, Tables, and Columns step.
If you select tables (not databases) as synchronization objects and need to rename tables or columns in the destination, a single task supports up to 1,000 tables. Exceeding this limit causes a request error. Configure multiple tasks or sync the entire database instead.
Binary log requirements: `loose_polar_log_bin` must be set to ON, and the retention period must be at least three days (seven days recommended).
Do not run DDL operations that change database or table schemas during initial schema synchronization or initial full data synchronization. DTS queries the source database during full sync, which creates metadata locks that may block DDL operations.
Changes from operations not recorded in binary logs—such as data restored from physical backups or generated by cascade operations—are not synced to the destination. If this occurs, you can remove and re-add the affected database or table from the synchronization objects if your business permits. See Modify synchronization objects.
General limitations
DTS does not sync read-only nodes of the source PolarDB for MySQL cluster.
DTS does not sync Object Storage Service (OSS) external tables from the source.
Primary/standby switchover is not supported during initial full data synchronization. If a switchover occurs, reconfigure the synchronization task.
Only tables using the Unique Key Model or Duplicate Key Model in Doris are supported (for a sample pre-created table definition, see the sketch after this list). When using the Duplicate Key Model, DTS converts UPDATE and DELETE statements to INSERT statements. Duplicate data may appear if any of the following occur:
A retry operation runs on the data synchronization instance
The instance is restarted
Two or more DML operations are applied to the same row after the instance starts
Use the additional columns `_is_deleted`, `_version`, and `_record_id` to deduplicate if needed (see Additional columns added by DTS).
Only the `bucket_count` parameter can be specified in the Selected Objects section. The value must be a positive integer. Default: auto.
Do not create clusters in the destination Doris database during synchronization. If this happens, the task fails. Restart the data synchronization instance to resume.
Doris only supports database and table names that start with a letter. Use the object name mapping feature to rename any object whose name starts with a non-letter character.
If a database, table, or column name contains Chinese characters, use object name mapping to rename it (for example, to an English name). Otherwise, the task may fail.
A single DDL statement cannot modify multiple columns, and you cannot run DDL statements on the same table consecutively.
Do not add backend (BE) nodes to the destination Doris database during synchronization. If this happens, the task fails. Restart the data synchronization instance to resume.
In multi-table merge scenarios (multiple source tables syncing to one destination table), the schemas of all source tables must be identical. Otherwise, data inconsistency or task failure may occur.
In PolarDB for MySQL, `VARCHAR(M)` measures M in characters. In Doris, `VARCHAR(N)` measures N in bytes. If you are not using schema synchronization, set the Doris VARCHAR length to four times the corresponding MySQL VARCHAR length to avoid data loss (see the sketch after this list).
When using DMS or gh-ost to perform online DDL changes on the source, DTS syncs only the original DDL to the destination. Online DDL changes made with pt-online-schema-change are not supported and may cause data loss or synchronization failure.
During initial full data synchronization, DTS uses read and write resources of both databases. Evaluate the performance impact beforehand and schedule synchronization during off-peak hours, when CPU usage on both ends is below 30%.
Concurrent INSERT operations during initial full data synchronization cause table fragmentation in the destination. After full sync completes, destination tables may occupy more storage than the source.
Do not use pt-online-schema-change or similar tools for online DDL operations on synchronization objects during synchronization. This causes sync failure.
If data is written to the destination database from sources other than DTS during synchronization, data inconsistency may occur between source and destination.
DTS periodically executes CREATE DATABASE IF NOT EXISTS `test` on the source to advance the binary log position.
During incremental synchronization, DTS uses a batch strategy and writes to each synchronization object at most once every five seconds. This may introduce a normal synchronization latency of less than 10 seconds. To reduce latency, adjust the `selectdb.reservoir.timeout.milliseconds` parameter (range: 1,000–10,000 milliseconds) in the DTS console. Shorter batching intervals increase write frequency, which may raise the destination's load and response time.
If an instance fails, DTS support attempts recovery within eight hours. During recovery, the instance may be restarted or its parameters adjusted. Only DTS instance parameters are modified; database parameters remain unchanged.
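As referenced above, here is a minimal sketch of a pre-created destination table, assuming a hypothetical `orders` table: it uses the Unique Key Model, sizes a VARCHAR column at four times its MySQL character length, and sets an explicit bucket count. Adapt names, types, and properties to your schema and cluster.

```sql
-- Minimal sketch of a pre-created Doris destination table (all names hypothetical).
-- Source column in PolarDB for MySQL: customer_name VARCHAR(64)  -- 64 characters
CREATE TABLE orders (
    order_id      BIGINT,
    customer_name VARCHAR(256),  -- 64 characters x 4 bytes per UTF-8 character
    amount        DECIMAL(10, 2)
)
UNIQUE KEY (order_id)                        -- Unique Key Model
DISTRIBUTED BY HASH (order_id) BUCKETS 10    -- explicit bucket count
PROPERTIES ("replication_num" = "1");        -- single replica; adjust for your cluster
```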
Create a data synchronization task
Step 1: Go to the data synchronization page
Use one of the following methods:
DTS console
Log on to the DTS console.
In the left-side navigation pane, click Data Synchronization.
In the upper-left corner, select the region where the data synchronization task resides.
DMS console
Steps may vary based on the DMS console mode and layout. See Simple mode and Customize the layout and style of the DMS console.
Log on to the DMS console.
In the top navigation bar, move the pointer over Data + AI and choose DTS (DTS) > Data Synchronization.
From the drop-down list next to Data Synchronization Tasks, select the region where the instance resides.
Step 2: Configure source and destination databases
Click Create Task.
Configure the Task Name. DTS generates a name automatically. Specify a descriptive name to make the task easy to identify—a unique name is not required.
Configure the source and destination database connections using the parameters below.
Source database (PolarDB for MySQL)
| Parameter | Value |
|---|---|
| Select Existing Connection | Select a registered instance from the list to auto-populate the following fields, or configure them manually |
| Database Type | PolarDB for MySQL |
| Access Method | Alibaba Cloud Instance |
| Instance Region | The region where the source PolarDB for MySQL cluster resides |
| Replicate Data Across Alibaba Cloud Accounts | No (this example uses the current account) |
| PolarDB Cluster ID | The ID of the source PolarDB for MySQL cluster |
| Database Account | The account for the source cluster (see Permissions required) |
| Database Password | The password for the database account |
| Encryption | Select as needed. For SSL configuration, see Configure SSL encryption |
Destination database (Doris)
| Parameter | Value |
|---|---|
| Select Existing Connection | Select a registered instance from the list to auto-populate the following fields, or configure them manually |
| Database Type | Doris |
| Access Method | Self-managed Database on ECS (this example). For other access methods, see Preparations |
| Instance Region | The region where the destination Doris database resides |
| ECS Instance ID | The ID of the ECS instance running the Doris database. If BE or frontend (FE) nodes are on separate ECS instances, add the CIDR blocks of DTS servers to each instance's security rules |
| Port Number | The service port of the destination Doris database. Default: 9030 |
| Database Account | The account for the destination database (see Permissions required) |
| Database Password | The password for the database account |
Click Test Connectivity and Proceed. In the CIDR Blocks of DTS Servers dialog box, click Test Connectivity.
Make sure that the CIDR blocks of DTS servers are added to the security settings of both source and destination databases. See Add DTS server IP addresses to a whitelist.
Step 3: Select synchronization objects
In the Configure Objects step, configure the following:
| Parameter | Description |
|---|---|
| Synchronization Types | Select Schema Synchronization, Full Data Synchronization, and Incremental Data Synchronization. Full sync establishes the baseline data before incremental sync begins. If you skip schema synchronization, create the destination tables using the Unique Key Model or Duplicate Key Model in advance. See Data type mappings, Additional columns added by DTS, and Unique Key Model |
| Processing Mode of Conflicting Tables | Precheck and Report Errors: reports an error if a table with the same name exists in the destination. Ignore Errors and Proceed: skips the check. If table schemas differ, data initialization may fail or only partial columns sync. If schemas are the same, source data overwrites destination records with matching primary or unique keys |
| Capitalization of Object Names in Destination Instance | Defaults to DTS default policy. Adjust as needed. See Specify the capitalization of object names |
| Source Objects | Select databases or tables to sync, then click the right arrow icon to add them to Selected Objects |
| Selected Objects | Right-click an object to rename it, filter data with WHERE conditions, or select specific SQL operations. If schema synchronization is enabled, right-click a table and configure bucket_count in Parameter Settings |
Click Next: Advanced Settings and configure the following:
| Parameter | Description |
|---|---|
| Dedicated Cluster for Task Scheduling | By default, DTS uses the shared cluster. For higher stability, purchase a dedicated cluster. See What is a DTS dedicated cluster |
| Retry Time for Failed Connections | How long DTS retries after a connection failure. Range: 10–1,440 minutes. Default: 720. Set to at least 30. If reconnection succeeds within this window, sync resumes; otherwise, the task fails. The shortest value across tasks sharing the same source or destination applies |
| Retry Time for Other Issues | How long DTS retries after DDL or DML failures. Range: 1–1,440 minutes. Default: 10. Set to at least 10. Must be less than Retry Time for Failed Connections |
| Enable Throttling for Full Data Synchronization | Throttle full sync by setting QPS (queries per second) to the source, RPS (rows per second) of full migration, and data migration speed. Available only when Full Data Synchronization is selected |
| Enable Throttling for Incremental Data Synchronization | Throttle incremental sync by setting RPS and data sync speed |
| Whether to delete SQL operations on heartbeat tables of forward and reverse tasks | Yes: does not write heartbeat SQL to the source (latency may be displayed for the instance). No: writes heartbeat SQL (may affect physical backup and cloning of the source) |
| Environment Tag | Optional. Tag the instance for environment identification |
| Configure ETL | Enable extract, transform, and load (ETL) to apply data transformations. See What is ETL? and Configure ETL |
| Monitoring and Alerting | Configure alerts for task failures or latency exceeding a threshold. See Configure monitoring and alerting |
(Optional) Click Next: Configure Database and Table Fields. Specify Primary Key Column, Distribution Key, and Engine for each table.
This step is available only when Schema Synchronization is selected. Set Definition Status to All to view all tables.
Primary Key Column supports multiple columns. Distribution Key must be a subset of the primary key columns.
For tables without primary keys or UNIQUE constraints, select duplicate for Engine. Otherwise, the task may fail or data loss may occur.
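If you skip schema synchronization and create destination tables yourself, a source table without a PRIMARY KEY or UNIQUE constraint maps naturally to the Duplicate Key Model. A minimal sketch, assuming a hypothetical `event_log` table:

```sql
-- Minimal sketch of a Duplicate Key Model table for a source table without
-- a PRIMARY KEY or UNIQUE constraint (all names hypothetical).
CREATE TABLE event_log (
    event_time DATETIME,
    event_type VARCHAR(128),
    payload    STRING
)
DUPLICATE KEY (event_time, event_type)       -- sort columns; duplicate rows are kept
DISTRIBUTED BY HASH (event_type) BUCKETS 10
PROPERTIES ("replication_num" = "1");        -- single replica; adjust for your cluster
```

When DTS writes to a Duplicate Key Model table, UPDATE and DELETE statements arrive as INSERT statements, and DTS adds the `_is_deleted`, `_version`, and `_record_id` columns described in Additional columns added by DTS.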
Step 4: Run the precheck
Click Next: Save Task Settings and Precheck.
DTS runs a precheck before starting the task. The task starts only after passing all checks.
If any check fails, click View Details to see the cause, fix the issue, and rerun the precheck.
If a warning appears: if the item cannot be ignored, fix the issue and run the precheck again; if the item can be safely ignored, click Confirm Alert Details > Ignore > OK > Precheck Again. Ignoring alerts may cause data inconsistency.
To preview the OpenAPI parameters for this task configuration, move the pointer over Next: Save Task Settings and Precheck and click Preview OpenAPI parameters.
Step 5: Purchase an instance
Wait until Success Rate reaches 100%, then click Next: Purchase Instance.
On the purchase page, configure the following:
| Parameter | Description |
|---|---|
| Billing Method | Subscription: pay upfront for a fixed term, more cost-effective for long-term use. Pay-as-you-go: billed hourly, suitable for short-term use. Release the instance when no longer needed to stop charges |
| Resource Group Settings | The resource group for the instance. Default: default resource group. See What is Resource Management? |
| Instance Class | Select based on your throughput requirements. See Instance classes of data synchronization instances |
| Subscription Duration | Available for subscription billing. Options: 1–9 months, 1, 2, 3, or 5 years |
Read and select Data Transmission Service (Pay-as-you-go) Service Terms.
Click Buy and Start, then click OK in the dialog box.
The task appears in the task list. Monitor its progress from there.
Data type mappings
The following table shows how PolarDB for MySQL data types map to Doris data types.
| Category | PolarDB for MySQL | Doris |
|---|---|---|
| Numeric | TINYINT | TINYINT |
| | TINYINT UNSIGNED | SMALLINT |
| | SMALLINT | SMALLINT |
| | SMALLINT UNSIGNED | INT |
| | MEDIUMINT | INT |
| | MEDIUMINT UNSIGNED | BIGINT |
| | INT | INT |
| | INT UNSIGNED | BIGINT |
| | BIGINT | BIGINT |
| | BIGINT UNSIGNED | LARGEINT |
| | BIT(M) | INT |
| | DECIMAL | DECIMAL (ZEROFILL not supported) |
| | NUMERIC | DECIMAL |
| | FLOAT | FLOAT |
| | DOUBLE | DOUBLE |
| | BOOL / BOOLEAN | BOOLEAN |
| Date and time | DATE | DATEV2 |
| | DATETIME[(fsp)] | DATETIMEV2 |
| | TIMESTAMP[(fsp)] | DATETIMEV2 |
| | TIME[(fsp)] | VARCHAR |
| | YEAR[(4)] | INT |
| String | CHAR / VARCHAR | VARCHAR |
| | BINARY / VARBINARY | STRING |
| | TINYTEXT / TEXT / MEDIUMTEXT / LONGTEXT | STRING |
| | TINYBLOB / BLOB / MEDIUMBLOB / LONGBLOB | STRING |
| | ENUM | STRING |
| | SET | STRING |
| | JSON | STRING |
CHAR and VARCHAR conversion notes:
CHAR and VARCHAR(n) values are converted to `VARCHAR(4*n)` to avoid data loss. This accounts for the difference in how PolarDB for MySQL (character length) and Doris (byte length) define VARCHAR size.
If no length is specified, the default is `VARCHAR(65533)`.
Values exceeding 65,533 characters are converted to the STRING type.
If you are not using schema synchronization, set the Doris VARCHAR length to four times the corresponding MySQL VARCHAR length, as in the example below.
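To make the arithmetic concrete, here is how the conversion works out for two hypothetical source columns (a hedged sketch; all names are placeholders):

```sql
-- title VARCHAR(100)   -> 100 x 4 = 400 bytes        -> Doris VARCHAR(400)
-- body  VARCHAR(30000) -> 30000 x 4 = 120,000 bytes  -> exceeds 65,533, so STRING
CREATE TABLE articles (
    article_id BIGINT,
    title      VARCHAR(400),
    body       STRING
)
UNIQUE KEY (article_id)
DISTRIBUTED BY HASH (article_id) BUCKETS 10
PROPERTIES ("replication_num" = "1");
```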
Additional columns added by DTS
When syncing to a table using the Duplicate Key Model, DTS automatically adds the following columns to the destination table:
| Column | Data type | Default | Description |
|---|---|---|---|
| _is_deleted | INT | 0 | Deletion flag: 0 for INSERT and UPDATE operations, 1 for DELETE operations |
| _version | BIGINT | 0 | 0 during full sync; the corresponding binary log timestamp (in seconds) during incremental sync |
| _record_id | BIGINT | 0 | 0 during full sync; the unique, incremental record ID from the incremental log during incremental sync |
Use these columns to deduplicate data when the destination table uses the Duplicate Key Model.
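For example, a query along these lines can reconstruct the latest state of each row. A hedged sketch, assuming a hypothetical `orders` table keyed by `order_id`:

```sql
-- Keep only the most recent record per key (highest _version, then highest
-- _record_id), and drop rows whose latest operation was a DELETE.
SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (
               PARTITION BY order_id
               ORDER BY _version DESC, _record_id DESC
           ) AS rn
    FROM orders t
) latest
WHERE rn = 1
  AND _is_deleted = 0;
```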