
Data Transmission Service: Synchronize PolarDB for MySQL to Elasticsearch

Last Updated: Mar 28, 2026

Use Data Transmission Service (DTS) to continuously synchronize data from a PolarDB for MySQL cluster to an Elasticsearch instance. DTS handles schema synchronization, full data load, and ongoing incremental replication in a single task.

Quick start

Before you configure the task, complete these three steps:

  1. Prepare the source PolarDB for MySQL cluster: Enable binary logging, set loose_polar_log_bin to ON, and retain binary logs for at least 3 days (7 days recommended).

  2. Prepare the destination Elasticsearch instance: Create an instance with storage space larger than the source cluster. Development and test specifications are not supported.

  3. Prepare database accounts: Grant read permissions on the source cluster, and have the elastic login credentials ready for the destination instance.

Prerequisites

Before you begin, ensure that you have:

  • A destination Elasticsearch instance in the same region as your synchronization task, with storage space larger than the source PolarDB for MySQL cluster. See Create an Alibaba Cloud Elasticsearch instance.

  • Binary logging enabled on the source PolarDB for MySQL cluster, with the loose_polar_log_bin parameter set to ON. See Enable binary logging and Set cluster and node parameters.

  • Binary logs retained for at least 3 days (7 days recommended). Retaining logs for less than the required period can cause task failures, data inconsistency, or data loss—issues not covered by the DTS Service-Level Agreement (SLA).

  • Database accounts with the required permissions. See Database account permissions.

Enabling binary logging on a PolarDB for MySQL cluster consumes storage space and incurs fees.
For supported source and destination database versions, see Synchronization overview. Different Elasticsearch instance specifications support different storage capacities.
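As a sanity check before configuring the task, the binlog requirements above can be verified programmatically. A minimal sketch, assuming the variable values have already been fetched from the cluster (for example, with SHOW VARIABLES); the function name and the binlog_retention_days key are hypothetical helpers, not a DTS API:

```python
def check_source_prerequisites(variables):
    """Return a list of problems with the source-cluster settings.

    An empty list means the binlog requirements described above are met.
    The dictionary keys are illustrative; fetch the real values from the
    cluster's parameter settings.
    """
    problems = []
    # loose_polar_log_bin must be ON, or the precheck fails.
    if str(variables.get("loose_polar_log_bin", "OFF")).upper() != "ON":
        problems.append("loose_polar_log_bin must be ON")
    # Binary logs must be retained for at least 3 days (7 recommended).
    if float(variables.get("binlog_retention_days", 0)) < 3:
        problems.append("retain binary logs for at least 3 days (7 recommended)")
    return problems
```

Running the check with both settings missing reports both problems; a compliant cluster yields an empty list.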

Billing

Synchronization type | Pricing
Schema synchronization and full data synchronization | Free of charge
Incremental data synchronization | Charged. See Billing overview.

Limitations

Review the following limitations before configuring the task.

Source database

  • Tables must have a primary key or a UNIQUE constraint with unique field values. Without this, duplicate data may appear in the destination.

  • For table-level synchronization with column mapping, a single task supports a maximum of 1,000 tables. If you exceed this limit, split the tables across multiple tasks, or configure the task to synchronize the entire database instead.

  • Do not run DDL operations that change database or table schemas during schema synchronization or full data synchronization—the task will fail.

    During full data synchronization, DTS queries the source database, creating metadata locks that may block DDL operations on the source.
  • Binary logging must be enabled with loose_polar_log_bin set to ON. If not, the precheck fails and the DTS instance cannot start.

Other limitations

  • Synchronization from a read-only node of the source PolarDB for MySQL cluster is not supported.

  • OSS foreign tables from the source cluster cannot be synchronized.

  • DTS does not support synchronizing INDEX, PARTITION, VIEW, PROCEDURE, FUNCTION, TRIGGER, or foreign key (FK) objects.

  • Synchronizing to Elasticsearch indexes that contain parent-child relationships or Join field types may cause task errors or query failures in the destination.

  • Primary/secondary failover of the database instance is not supported during initial full data synchronization. If a failover occurs, reconfigure the synchronization task promptly.

  • To add columns to source tables, first update the corresponding mapping in the Elasticsearch instance, then run the DDL on the source, and finally pause and restart the synchronization task.

  • PolarDB for MySQL and Elasticsearch support different data types, so one-to-one type mapping is not possible. During schema initialization, DTS maps types based on what the destination supports. See Data type mappings for schema initialization.

  • Run the synchronization during off-peak hours. Full data synchronization consumes read and write resources on both databases, which may increase load.

  • Full data synchronization runs concurrent INSERT operations, which causes fragmentation in destination tables. As a result, the tablespace of the destination instance is larger than that of the source after full data synchronization completes.

  • For table-level synchronization, do not use tools such as pt-online-schema-change for online DDL operations on synchronized objects in the source—the task will fail.

  • For table-level synchronization, if no data other than DTS writes goes to the destination, use Data Management (DMS) for online DDL operations. See Change schemas without locking tables.

  • If data other than DTS writes goes to the destination during synchronization, data inconsistency between source and destination may occur.

  • If a task fails, DTS support staff will attempt to restore it within 8 hours. During restoration, they may restart the task or adjust DTS task parameters (not database parameters). Parameters that may be adjusted are listed in Modify instance parameters.

DTS periodically executes the CREATE DATABASE IF NOT EXISTS `test` command on the source database to advance the binary log offset.

Supported SQL operations

Operation type | SQL operations
DML | INSERT, UPDATE, DELETE

The UPDATE statement cannot be used to remove fields.
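The note on UPDATE can be modeled: DTS applies an UPDATE to the destination as a partial document change, so fields can be overwritten or added but never dropped from the document. A pure-Python sketch of that behavior (illustrative only, not DTS code):

```python
def apply_update(document, changed_columns):
    """Model how an UPDATE reaches the Elasticsearch document:
    changed columns overwrite or add fields, but no existing field
    is ever removed from the document."""
    merged = dict(document)
    merged.update(changed_columns)  # overwrite or add, never delete
    return merged
```

For example, updating only the name column leaves every other field of the document in place.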

Database account permissions

Database | Required permissions | How to create and grant
Source PolarDB for MySQL cluster | Read permissions on the objects to synchronize | See Create an account and Modify account permissions
Destination Elasticsearch instance | Login name (default: elastic) and password set when the instance was created | N/A

Data type mappings

Because PolarDB for MySQL and Elasticsearch support different data types, DTS maps types based on what the destination Elasticsearch instance supports during schema initialization. See Data type mappings for initial schema synchronization.

DTS does not set the dynamic mapping parameter during schema migration. The behavior depends on your Elasticsearch instance settings. If source data is in JSON format, ensure that values for the same key have the same data type across all rows in a table; otherwise, DTS may report synchronization errors. See dynamic.
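To catch the JSON inconsistency described above before it breaks the task, the type of each key's value can be compared across rows. A hypothetical pre-flight check in Python (the row and column shapes are assumptions, not a DTS interface):

```python
import json

def check_json_column_types(rows, column):
    """For a JSON-typed column, verify that every key holds the same
    data type across all rows of a table. With dynamic mapping, a type
    mismatch can make DTS report synchronization errors."""
    seen = {}       # key -> type name from the first row that used it
    conflicts = []
    for row in rows:
        for key, value in json.loads(row[column]).items():
            t = type(value).__name__
            if seen.setdefault(key, t) != t:
                conflicts.append((key, seen[key], t))
    return conflicts
```

A table where "price" is sometimes a number and sometimes a string would be flagged, while consistent rows pass cleanly.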

The following table shows how Elasticsearch concepts map to relational database concepts:

Elasticsearch | Relational database
Index | Database
Type | Table
Document | Row
Field | Column
Mapping | Database schema

Create a synchronization task

Step 1: Open the data synchronization page

Open the data synchronization task list in one of the following ways:

DTS console

  1. Log on to the DTS console.

  2. In the navigation pane on the left, click Data Synchronization.

  3. In the upper-left corner of the page, select the region where the synchronization instance is located.

DMS console

Note

The actual steps may vary depending on the mode and layout of the DMS console. For more information, see Simple mode console and Customize the layout and style of the DMS console.

  1. Log on to the DMS console.

  2. In the top menu bar, choose Data + AI > DTS (DTS) > Data Synchronization.

  3. To the right of Data Synchronization Tasks, select the region of the synchronization instance.

Step 2: Configure source and destination databases

  1. Click Create Task.

  2. Configure the source and destination databases using the following settings.

General

Parameter | Description
Task Name | DTS automatically generates a name. Specify a descriptive name for easy identification. The name does not need to be unique.

Source database

Parameter | Description
Select Existing Connection | Select a registered database instance from the drop-down list to auto-fill the connection details. If no registered instance exists or you prefer not to use one, configure the connection details manually. Note: In the DMS console, this field is labeled Select a DMS database instance.
Database Type | Select PolarDB for MySQL.
Access Method | Select Alibaba Cloud Instance.
Instance Region | Select the region where the source PolarDB for MySQL cluster resides.
Replicate Data Across Alibaba Cloud Accounts | Select No if the source cluster belongs to the current Alibaba Cloud account.
PolarDB Cluster ID | Select the ID of the source PolarDB for MySQL cluster.
Database Account | Enter the database account for the source cluster. See Database account permissions.
Database Password | Enter the password for the database account.
Encryption | Select as needed. See Configure SSL encryption.

Destination database

Parameter | Description
Select Existing Connection | Select a registered database instance from the drop-down list to auto-fill the connection details. If no registered instance exists or you prefer not to use one, configure the connection details manually. Note: In the DMS console, this field is labeled Select a DMS database instance.
Database Type | Select Elasticsearch.
Access Method | Select Alibaba Cloud Instance.
Instance Region | Select the region where the destination Elasticsearch instance resides.
Type | Select Cluster or Serverless as needed.
Instance ID | Select the ID of the destination Elasticsearch instance.
Database Account | Enter the default login name elastic.
Database Password | Enter the password for the elastic account.
Encryption | Select HTTP or HTTPS as needed.
  3. Click Test Connectivity and Proceed at the bottom of the page.

Add the CIDR blocks of DTS servers to the security settings of both the source and destination databases to allow access. See Add the IP address whitelist of DTS servers. If the source or destination is a self-managed database (where Access Method is not Alibaba Cloud Instance), also click Test Connectivity in the CIDR Blocks of DTS Servers dialog box.

Step 3: Configure task objects

On the Configure Objects page, specify the objects to synchronize.

Parameter | Description
Synchronization Types | DTS always selects Incremental Data Synchronization. Also select Schema Synchronization and Full Data Synchronization (selected by default). After the precheck, DTS initializes the destination cluster with the full data of the selected source objects as the baseline for incremental synchronization.
Index Name | Table Name: the index name in the destination matches the table name. Database Name_Table Name: the index name is the database name, an underscore (_), and the table name concatenated. This setting applies to all tables.
Processing Mode of Conflicting Tables | Precheck and Report Errors: checks for tables with the same name in the destination. If found, the precheck reports an error and the task does not start. If you cannot delete or rename the conflicting table, map it to a different name. See Database Table Column Name Mapping. Ignore Errors and Proceed: skips the same-name check. Warning: This option may cause data inconsistency. During full data synchronization, DTS skips source records that conflict with destination records. During incremental synchronization, DTS overwrites destination records. If table schemas are inconsistent, initialization may fail or result in partial synchronization.
Capitalization of Object Names in Destination Instance | Set the case policy for database, table, and column names in the destination. The default is DTS default policy. See Case policy for destination object names.
Source Objects | In the Source Objects box, click the objects to synchronize, then click the arrow to move them to the Selected Objects box. Select objects at the database or table level.
Selected Objects | To modify the index name, type name, field name, or filter condition for a table, right-click the table in the Selected Objects area. See Map database and table column names and Set filter conditions. Only underscores (_) are allowed as special characters in index and type names.
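The Index Name policy and the underscore-only rule for index and type names can be sketched together. A hypothetical helper (the function name and the policy strings are illustrative, not DTS parameters):

```python
import re

def index_name(database, table, policy="table"):
    """Derive the destination index name under the two Index Name
    policies described above: "table" uses the table name as-is,
    "db_table" concatenates database name, underscore, table name."""
    name = table if policy == "table" else f"{database}_{table}"
    # Only underscores are allowed as special characters in
    # index and type names.
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError(f"invalid index name: {name!r}")
    return name
```

A table named order-items would be rejected under either policy until it is mapped to an underscore-only name.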

Step 4: Configure advanced settings

Click Next: Advanced Settings and configure the following:

Parameter | Description
Dedicated cluster for task scheduling | By default, DTS uses a shared cluster. For greater task stability, purchase a dedicated cluster. See What is a DTS dedicated cluster?
Retry time for failed connections | If the connection to the source or destination fails after the task starts, DTS immediately retries. The default retry duration is 720 minutes. Set a value from 10 to 1,440 minutes; 30 minutes or more is recommended. If the connection is restored within this period, the task resumes. Note: If multiple DTS instances share a source or destination, DTS applies the shortest configured retry duration across all instances. DTS charges for task runtime during connection retries.
Retry time for other issues | If a non-connection issue occurs (for example, a DDL or DML execution error), DTS immediately retries. The default is 10 minutes. Set a value from 1 to 1,440 minutes; 10 minutes or more is recommended. Important: This value must be less than the retry time for failed connections.
Enable throttling for full data synchronization | Limit the synchronization rate to reduce pressure on the destination by setting Queries per second (QPS) to the source database, RPS of Full Data Migration, and Data migration speed for full migration (MB/s). Available only when Full Data Synchronization is selected. See also Adjust the rate of full data synchronization.
Enable throttling for incremental data synchronization | Limit the incremental synchronization rate by setting RPS of Incremental Data Synchronization and Data synchronization speed for incremental synchronization (MB/s).
Environment Tag | Select an environment label to identify the instance, if needed.
Shard Configuration | Set the number of primary and replica shards for the index, based on the maximum shard configuration in the destination Elasticsearch instance.
String Index | How strings are indexed in the destination: analyzed (analyze first, then index; also select an analyzer, see Analyzers), not analyzed (index the raw value directly), or no (do not index).
Time Zone | When synchronizing DATETIME or TIMESTAMP data types to Elasticsearch, select the time zone to use. Note: If time zone information is not needed, pre-configure the document type for this data in the destination instance.
DOCID | No configuration required. DOCID defaults to the table's primary key. If no primary key exists, Elasticsearch auto-generates the ID.
Whether to delete SQL operations on heartbeat tables | Yes: DTS does not write heartbeat SQL to the source. The DTS instance may display latency. No: DTS writes heartbeat SQL to the source, which may interfere with operations such as physical backups and cloning.
Configure ETL | Choose whether to enable extract, transform, and load (ETL). Yes: enables ETL; enter data processing statements in the code editor. See Configure ETL in a data migration or data synchronization task. No: disables ETL. See What is ETL?
Monitoring and Alerting | No: no alerts configured. Yes: configure the alert threshold and notification contacts. See Configure monitoring and alerting during task configuration.
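The connection-retry behavior described above can be sketched as a retry loop bounded by a time window. A simplified, hypothetical model (DTS's actual scheduler is not public; the interval parameter is an assumption for illustration):

```python
import time

def retry_within_window(operation, window_minutes=720, interval_seconds=1.0):
    """Keep retrying a failed operation until it succeeds or the retry
    window elapses (default 720 minutes, matching the default retry
    time for failed connections). If the window is exhausted, the last
    error propagates and the task fails."""
    deadline = time.monotonic() + window_minutes * 60
    while True:
        try:
            return operation()
        except ConnectionError:
            if time.monotonic() >= deadline:
                raise  # retry window exhausted
            time.sleep(interval_seconds)
```

If the connection is restored within the window, the loop returns normally and the task resumes; otherwise the failure surfaces, mirroring how the DTS task ends after the retry time runs out.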

Step 5: Configure database and table fields

Click Next: Configure Database and Table Fields to set the routing strategy and document ID for tables in the destination Elasticsearch instance.

Set Definition Status to All to view and edit all tables.
Parameter | Description
Set _routing | Yes: define a custom column for routing documents to specific shards. See _routing. No: route using _id. If the destination Elasticsearch instance is version 7.x, select No.
_routing Column | Select the column to use for routing. Required only when Set _routing is Yes.
Value of _id | Select the column to use as the document ID.
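Elasticsearch routes each document to a primary shard by hashing the _routing value (the document _id when Set _routing is No) modulo the number of primary shards. A sketch of that formula; Elasticsearch actually uses a murmur3 hash internally, so the CRC32 used here is only a stand-in for illustration:

```python
import zlib

def shard_for(routing_value, number_of_primary_shards):
    """Pick the primary shard for a document: hash(_routing) modulo
    the number of primary shards. CRC32 stands in for Elasticsearch's
    murmur3; the point is that equal routing values always land on
    the same shard."""
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards
```

With a custom _routing column, every row sharing that column's value is stored on the same shard, which is why the column choice matters for query locality.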

Step 6: Save the task and run the precheck

  • To view the API parameters for this configuration, hover over Next: Save Task Settings and Precheck and click Preview OpenAPI parameters.

  • Click Next: Save Task Settings and Precheck to save the task and start the precheck.

DTS runs a precheck before starting the task. The task starts only if the precheck passes.
If the precheck fails, click View Details next to the failed item, fix the issue as prompted, and rerun the precheck.
For non-ignorable warnings, click View Details, fix the issue, and rerun the precheck.
For ignorable warnings, click Confirm Alert Details > Ignore > OK, then click Precheck Again. Ignoring precheck warnings may cause data inconsistency. Proceed with caution.

Step 7: Purchase the instance

  1. When the Success Rate reaches 100%, click Next: Purchase Instance.

  2. On the Purchase page, configure the instance.

Category | Parameter | Description
New instance class | Billing method | Subscription: pay upfront for a set duration. Cost-effective for long-term, continuous tasks. Pay-as-you-go: billed hourly for actual usage. Ideal for short-term or test tasks; release the instance at any time to stop charges.
New instance class | Resource group settings | The resource group for the instance. Defaults to default resource group. See What is Resource Management?
New instance class | Instance class | Different specifications affect the synchronization rate. Select based on your requirements. See Data synchronization link specifications.
New instance class | Subscription duration | In subscription mode, select the duration and quantity. Monthly options: 1–9 months. Yearly options: 1, 2, 3, or 5 years. Appears only when Billing method is Subscription.
  3. Select the checkbox for Data Transmission Service (Pay-as-you-go) Service Terms.

  4. Click Buy and Start, then click OK.

Monitor the task progress on the data synchronization page.