Real-time full-database synchronization tasks involve multiple stages, including schema migration, full initialization, and incremental synchronization. These tasks often manage numerous tables over long periods, making operations complex. This document describes core operations for these tasks, such as starting and stopping, modifying configurations, and setting up alerts. It also provides troubleshooting and tuning recommendations to help you manage your synchronization tasks efficiently.
Prerequisites
Before you perform operations or troubleshoot issues, ensure that the following conditions are met:
Permission requirements
- Source account: The account must have permissions to read metadata, such as databases, tables, columns, primary keys, and indexes. It also requires permissions to read the source change log, such as Binlog, WAL (Write-Ahead Logging), or Oplog.
- Target account: The account must have permissions to create tables, alter tables, and write data.
Network connectivity
The resource group must have network connectivity to both the source and the target. Network issues can cause schema migration to fail, full initialization to get stuck, or incremental synchronization to be interrupted.
Source log retention
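Confirm that the source retains its change log (such as Binlog, WAL, or Oplog) long enough to cover full initialization, incremental catch-up, and any planned downtime. Insufficient log retention is the most common reason a task fails to resume from its original offset after being stopped.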
Data source and channel compatibility
Synchronization capabilities (such as full load, incremental, and DDL) vary by data source. The configurable options on the UI page represent the supported features. Verify that the versions of your source and target are supported by the channel.
Scope and applicability
Before troubleshooting a real-time full-database synchronization task, review the task type, channel capabilities, and key troubleshooting areas to confirm that this document applies to your scenario.
| Criteria | Applicable scope | Not applicable / key focus |
| --- | --- | --- |
| Task type | Real-time Change Data Capture (CDC) for multiple tables or a full database. | This document does not apply to single-table real-time synchronization. For troubleshooting single-table tasks, see Operations and Tuning for Single-Table Real-Time Synchronization. |
| Typical channels | Database source to Hologres, MaxCompute, ADB (AnalyticDB), Doris, StarRocks, SelectDB, Kafka, DLF (Data Lake Formation), Lindorm, Elasticsearch, or OSS (Object Storage Service). | Not applicable to other channel types. |
| Troubleshooting focus | Log retention, offset, primary key, DDL, source load, target write performance, partitions, commits, small files, rate limiting, and backlog. | In addition to task status, use metrics and logs for fine-grained troubleshooting. |
Real-time full-database synchronization relies on the source to continuously provide a change log (such as Binlog, WAL, or Oplog). The target must support the current write mode, primary key or unique key semantics, and necessary schema changes. If these conditions are not met, the task may not run correctly or guarantee data consistency.
Task stages
A real-time full-database synchronization task typically includes the following stages, each with a different operational focus.
| Stage | Description | Operational focus |
| --- | --- | --- |
| Schema migration | Reads database, table, and column information from the source and creates or updates table structures on the target. | Source metadata permissions, target table creation permissions, data type mapping, table name mapping rules. |
| Full initialization | Reads historical data from the source and writes it to the target to backfill data that existed before the task started. | Split key, full load concurrency, source connection count, resource group, target write capacity, full and incremental catch-up. |
| Incremental synchronization | Continuously consumes source changes and writes them to the target. | Offset lag, read and write throughput, failover, checkpoint, DDL events, source log retention. |
A real-time full-database synchronization task usually completes schema migration and full initialization first, and then continuously processes incremental synchronization.
Whether the incremental data generated during full initialization can be synchronized depends on source log retention and on the real-time pipeline's ability to catch up. Therefore, when you start, stop, or rerun a task, or add tables to it, monitor both Full Initialization Progress and Real-time Latency.
Operations
Starting and stopping tasks
After starting a task, verify that it is running correctly by checking the following in order:
- Verify that schema migration is complete and the table structures have been created in the target.
- Observe the full initialization to ensure it is reading and writing data normally. The full read rate and write rate should be stable, and the number of completed tables should increase continuously.
- Confirm that incremental synchronization is running normally. The real-time latency should be within a reasonable range, with no frequent failovers and successful checkpoint commits.
Before stopping a task, verify that the source log retention period is long enough to cover the downtime. A resumed task continues from its last saved offset. If the task is stopped for too long, the source Binlog, WAL, or message logs may be purged, and the task can no longer resume from that offset. If the offset has expired, you must assess the risks of resetting the offset or re-initializing the task. For more information, see "The task cannot resume from its previous offset after being stopped" in the Frequently asked questions section.
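The retention check before a planned stop comes down to simple arithmetic. The following is a minimal sketch rather than a product API: the function name, its parameters, and the 20% safety margin are assumptions chosen for illustration.

```python
def retention_covers_downtime(retention_hours: float,
                              planned_downtime_hours: float,
                              current_lag_hours: float = 0.0,
                              safety_margin: float = 0.2) -> bool:
    """Return True if the source log retention is expected to cover a planned stop.

    The oldest log position the task still needs is its saved offset, which is
    already `current_lag_hours` behind when the task stops. The safety margin
    (an assumed 20% by default) leaves room for the stop running longer than planned.
    """
    if min(retention_hours, planned_downtime_hours, current_lag_hours) < 0:
        raise ValueError("durations must be non-negative")
    required_hours = (planned_downtime_hours + current_lag_hours) * (1 + safety_margin)
    return retention_hours >= required_hours


# Example: 72 hours of retention, a 48-hour stop, and a task that is 2 hours behind.
print(retention_covers_downtime(72, 48, current_lag_hours=2))
# Example: 24 hours of retention cannot cover a 48-hour stop.
print(retention_covers_downtime(24, 48))
```

If the check fails, either extend the retention period on the source or shorten the planned downtime before you stop the task.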
Modifying configurations
For a running real-time full-database synchronization task, you might need to add or delete tables, adjust table mappings, or modify resources for full or incremental workloads. After modifying the configuration, submit and apply the updates by following the on-screen prompts.
| Change type | Risks | Recommendations |
| --- | --- | --- |
| Add table | Requires re-reading metadata and, depending on channel capabilities, running schema migration and full initialization before starting incremental synchronization for the new table. | Confirm that the new tables match the selection rules, then refresh table mappings. Monitor the full initialization progress, incremental onboarding status, and source log retention for the new tables. |
| Delete table | Removing tables from a running task can impact existing table mappings, target data, and downstream dependencies. | Before deleting, confirm that the business no longer depends on the table. In exceptional cases, create a new task to handle the new scope. |
| Modify mapping rules | May cause target table name conflicts, missing columns, partition changes, or changes in how existing data is handled. | Before you submit, check target table names, column types, primary keys, partitions, additional columns, and existing data on the target. |
| Adjust resources | Resources, concurrency, and connection counts can differ between full initialization and incremental synchronization. Improper adjustments can increase the load on the source or target. | Adjust resources incrementally. After each change, monitor the full initialization rate, real-time latency, failover, checkpoint, and resource utilization. |
How configuration changes take effect:
- Refreshing table mappings and adding new tables usually do not require pausing the task.
- Adjusting resource specifications, such as CU (Compute Unit), may require restarting the task or waiting for the next checkpoint to take effect. Follow the on-screen prompts.
- We recommend modifying mapping rules while the task is paused to avoid impacting in-flight data.
- If a configuration change fails, revert to the previous configuration and resubmit.
Alert configuration
For real-time full-database synchronization tasks, we recommend configuring alerts for at least the following events:
| Alert type | Use cases | Description |
| --- | --- | --- |
| Abnormal task status | All tasks | Triggers an alert immediately if a task fails or exits unexpectedly. |
| Business latency | All tasks | Triggers an alert when real-time latency exceeds the business-acceptable threshold. |
| Failover | All tasks | Frequent failovers usually indicate a problem that requires manual intervention. |
| Resource utilization | Resource-constrained scenarios | Triggers an alert when CPU, memory, or network utilization is excessively high. |
| DDL notification | Channels that process DDL events | DDL events can affect the target schema and should be monitored. |
| Backlog | Kafka, DataHub, or LogHub sources | Monitor the backlog for partitions, shards, or topics by using the source console or task metrics. |
For log-based sources like MySQL and PostgreSQL, you also need to monitor the retention period of logs such as Binlog and WAL to prevent task recovery failures due to log expiration.
For detailed steps on configuring alert rules, see Common Alerting Rules.
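To make the latency and failover rows above concrete, the following sketch shows the kind of condition such rules evaluate. It is illustrative only: the function names, the 30-minute window, the failover limit, and the three-consecutive-samples rule are assumptions, not the product's alerting logic or defaults.

```python
from datetime import datetime, timedelta


def failover_alert(failover_times, now, window=timedelta(minutes=30), max_failovers=3):
    """Return True when more than `max_failovers` failovers fall in the trailing window."""
    recent = [t for t in failover_times if now - window <= t <= now]
    return len(recent) > max_failovers


def latency_alert(latency_samples_s, threshold_s, consecutive=3):
    """Return True when the last `consecutive` latency samples all exceed the threshold.

    Requiring several consecutive breaches avoids alerting on a single spike.
    """
    tail = latency_samples_s[-consecutive:]
    return len(tail) == consecutive and all(s > threshold_s for s in tail)


now = datetime(2024, 1, 1, 12, 0)
failovers = [now - timedelta(minutes=m) for m in (1, 5, 10, 20, 90)]
print(failover_alert(failovers, now))            # four failovers within 30 minutes
print(latency_alert([10, 400, 420, 450], 300))   # three consecutive breaches of 300 s
```

Choose the thresholds from what the business can tolerate, not from the values used in this sketch.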
Troubleshooting methods
Schema migration failures
Schema migration failures are often caused by permissions, metadata reads, data types, or table creation on the target. Troubleshoot in the following order:
- Check if the source account has permissions to read metadata, including databases, tables, columns, primary keys, and indexes.
- Verify the network connectivity between the task's resource group and both the source and target.
- Check if the target account has permissions to create tables, alter tables, and write data.
- Check if mapping rules for table, database, or schema names generate duplicate or invalid names.
- Check if data types, primary keys, partition columns, and additional columns are compatible with the target.
When selecting tables with regular expressions or in bulk, we recommend testing with a small number of tables before expanding the synchronization scope.
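Before you expand the scope, you can preview what a selection pattern and a naming rule would produce. The sketch below uses Python's `re` module on a hypothetical table list. The mapping rule (strip a numeric shard suffix, then add an `ods_` prefix) is an assumption for illustration, and the product's own pattern syntax may differ from Python's.

```python
import re
from collections import Counter


def preview_selection(table_names, pattern, target_prefix="ods_"):
    """Return (matched tables, target names produced by more than one source table)."""
    selector = re.compile(pattern)
    # fullmatch avoids selecting tables that merely contain the pattern.
    matched = [name for name in table_names if selector.fullmatch(name)]
    # Assumed mapping rule: drop a trailing numeric shard suffix and add a prefix.
    targets = [target_prefix + re.sub(r"_\d+$", "", name) for name in matched]
    duplicates = sorted(t for t, count in Counter(targets).items() if count > 1)
    return matched, duplicates


tables = ["orders_01", "orders_02", "orders_archive", "users"]
matched, duplicates = preview_selection(tables, r"orders_\d+")
print(matched)      # tables the pattern would select
print(duplicates)   # target names shared by several source tables
```

Whether a shared target name is intended (for example, merging sharded tables into one target) or a conflict depends on your mapping design. The point is to see it before the task runs.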
Slow or failed full initialization
If full initialization is slow or fails, first determine if the task is stuck during resource initialization, reading from the source, writing to the target, or waiting to catch up with incremental changes. When troubleshooting, check metrics like the full read rate, write rate, number of completed tables, remaining shards, source connection count, and target write latency.
| Symptom | Possible cause | Recommendation |
| --- | --- | --- |
| Full initialization task does not start for a long time. | Resource group queuing, resource initialization failure, or connectivity issues with the source or target. | Check the resource group status and network connectivity. Verify that the source and target account permissions are correct. |
| Low full read rate. | Unevenly distributed split key, source SQL queries not using indexes, high source load, or insufficient connections or quota. | Check the split key and indexes, and moderately adjust the full load concurrency. If the source load is high, apply rate limiting or run the task during off-peak hours. |
| Low full write rate. | Insufficient target write capacity, poorly designed partitions, or unsuitable batch write parameters. | Check the target load, write QPS, batch commit latency, and number of partitions. |
| Slow incremental catch-up after full initialization completes. | A large volume of changes occurred at the source during full initialization, and the real-time pipeline needs to process the log backlog. | Check if real-time latency is continuously decreasing and ensure the source log retention period is sufficient. |
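For the last row of the table, a rough estimate shows whether the backlog can drain before the source logs expire. This is a back-of-the-envelope sketch under the assumption that read and write rates stay constant. The function names and sample figures are hypothetical.

```python
def estimate_catch_up_hours(backlog_records: float,
                            write_rate_per_s: float,
                            source_change_rate_per_s: float):
    """Estimate hours until the backlog is drained, or None if it never drains.

    The backlog only shrinks when the pipeline writes faster than the source
    produces new changes, so the net rate is the difference of the two.
    """
    net_rate = write_rate_per_s - source_change_rate_per_s
    if net_rate <= 0:
        return None
    return backlog_records / net_rate / 3600


def catches_up_before_expiry(catch_up_hours, remaining_retention_hours: float) -> bool:
    """Return True if the estimated catch-up finishes before the needed logs are purged."""
    return catch_up_hours is not None and catch_up_hours < remaining_retention_hours


# 36 million pending changes, 5,000 writes/s, 3,000 new source changes/s.
hours = estimate_catch_up_hours(36_000_000, 5000, 3000)
print(hours)                                   # 36,000,000 / 2,000 / 3,600 hours
print(catches_up_before_expiry(hours, 24))
```

A result of `None` means latency will keep growing at the current rates, which calls for tuning the writer or reducing source load rather than waiting.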
High real-time latency
When real-time latency increases, first determine if the task is still in the full-initialization catch-up phase. Then, identify whether the bottleneck is at the reader, in the processing pipeline, or at the writer.
| Symptom | Possible cause | Recommendation |
| --- | --- | --- |
| Latency remains high after full initialization is complete. | Large backlog of incremental changes accumulated during full initialization; slow reading of source logs; slow catch-up writing to the target. | Monitor the incremental read rate, write rate, and source log retention time to confirm that latency is continuously decreasing. |
| High reader wait time. | Sudden increase in source change volume, large transactions, source log backlog, or partition/shard skew. | Check for source write spikes, log growth, and partition or shard distribution. |
| High writer wait time. | Slow target write performance, rate limiting, insufficient connections, or too many dynamic partitions. | Check the target resources, write QPS, batch commit latency, and partition design. |
| Frequent failover. | Insufficient memory, external service instability, checkpoint failures, or DDL processing errors. | Examine logs from before and after the failover. Analyze memory, resource utilization, and checkpoint metrics to resolve the issue. |
| Latency increases after a DDL event. | Time-consuming DDL processing or a failure to alter the schema on the target. | Review the DDL event and target permissions to confirm that the DDL handling policy is as expected. |
If the source is Kafka, DataHub, or LogHub, a single partition or shard can typically only be consumed by one concurrent process. If data is concentrated in a few partitions, increasing the overall task concurrency may not be effective.
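The effect of partition skew can be quantified from per-partition backlog figures. The sketch below is a simplified model, assuming one consumer per partition as described above. The function names and the sample backlog numbers are hypothetical.

```python
def effective_parallelism(task_concurrency: int, partition_count: int) -> int:
    """Concurrency beyond the partition count cannot be used for reading."""
    return min(task_concurrency, partition_count)


def skew_ratio(backlog_by_partition: dict) -> float:
    """Largest partition backlog divided by the mean backlog (1.0 means perfectly even)."""
    values = list(backlog_by_partition.values())
    if not values:
        raise ValueError("at least one partition is required")
    mean = sum(values) / len(values)
    return max(values) / mean if mean else 1.0


# Pending messages per partition, as reported by the source console or task metrics.
backlog = {0: 900_000, 1: 50_000, 2: 30_000, 3: 20_000}
print(effective_parallelism(task_concurrency=16, partition_count=len(backlog)))
print(round(skew_ratio(backlog), 2))
```

With a high skew ratio, the hot partition determines the overall latency, so repartitioning the source or improving target write capacity helps more than raising concurrency.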
New tables not synchronizing
If a new table is not synchronizing, check the following in order:
- Check whether the new table matches the current database and table selection rules.
- Check whether the table mapping was successfully refreshed.
- Check whether the table has been created, or its schema migrated, on the target.
- Check whether the task supports adding tables dynamically at runtime.
- Check the run details for the schema migration, full initialization, or real-time events of the corresponding table.
For channels that do not support dynamically adding tables, you must modify the configuration and republish, or create a new task to handle the new tables. If the new tables require historical data, you must confirm whether the channel will perform a full initialization for them. If this feature is not supported, use a data backfill or a separate full synchronization capability.
Tuning recommendations
| Tuning item | Scenario | Recommendation |
| --- | --- | --- |
| Full load resource specifications / Compute Unit (CU) | Full initialization is queued, run speed is low, or resource group utilization is high. | Gradually increase full load resources or run tasks during off-peak hours. Monitor the full load read rate, write rate, and source load. |
| Full load concurrency and split key | The full load phase runs, but the overall speed is slow. | Choose a split key that is evenly distributed and has an index. Increase concurrency moderately. If the source load is high, reduce concurrency or apply rate limiting. |
| Source connections / Quota | Full initialization reports errors related to connections, Quota, or rate limiting. | Reduce the concurrency of tasks from the same source. Or, increase the number of connections and the Quota if the source capacity allows. |
| Incremental resource specifications / CU | CPU, memory, network, or resource group utilization is high. | Gradually increase resources. Observe if latency, failover, and checkpoint performance improve. |
| Incremental concurrency | Multiple tables, partitions, or shards have sufficient parallelism. | First, confirm that there are no single-table hotspots or single-shard bottlenecks. Then, increase concurrency. |
| Checkpoint / Flush interval | The target commits frequently, or the overhead for batch commits is high. | Slightly increase the interval and observe the throughput and data visibility latency. Do not increase it too much at once. |
| Target batch write parameters | The target wait time is high. | Adjust batch, flush, commit, or connection pool parameters based on the target product's limitations. |
| Dynamic partition granularity | The target has too many partitions or high flush pressure. | Prioritize adjusting the partition granularity. Avoid using high-cardinality fields for partitioning, such as second-level timestamps, order IDs, or user IDs. |
Full initialization and incremental synchronization have different tuning goals. Full initialization focuses on completing historical data writes stably. Incremental synchronization focuses on keeping up with source changes so that latency stays low. After tuning, observe the performance for at least one stable window. Judging the effect based only on the short-term throughput after a restart can be misleading.
For recommended resource settings, see Recommended CUs for data integration. Adjust the settings as needed.
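The dynamic partition granularity recommendation is easier to see with numbers. The sketch below counts how many target partitions the same synthetic events would create at different granularities. The event data and the `strftime` partition formats are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def count_partitions(event_times, partition_format: str) -> int:
    """Count the distinct partition values produced by formatting each event time."""
    return len({t.strftime(partition_format) for t in event_times})


# Synthetic workload: one change event every 7 seconds, 1,000 events in total.
start = datetime(2024, 1, 1)
events = [start + timedelta(seconds=7 * i) for i in range(1000)]

print(count_partitions(events, "%Y%m%d%H%M%S"))  # second-level: one partition per event
print(count_partitions(events, "%Y%m%d%H"))      # hour-level
print(count_partitions(events, "%Y%m%d"))        # day-level
```

Second-level partitioning creates roughly as many partitions as there are events, which is why high-cardinality partition fields increase flush and commit pressure on the target.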
Frequently asked questions
The following questions are based on troubleshooting history for real-time full-database synchronization tasks and are categorized by task phase. When troubleshooting, first confirm the task's current phase. Then, cross-verify using task events, run logs, metrics, source log retention, and target results.
Issues with schema migration
| Issue | Key areas to check | Recommended action |
| --- | --- | --- |
| Table mapping is slow to refresh, times out, or tables are not selectable | Resource group connectivity, source metadata permissions, number of databases and tables, number of fields, source object types, and data source cache | First, narrow the scope of databases and tables to test. Confirm that the account has permissions to read databases, tables, fields, primary keys, and indexes. If an object is not selectable, its availability is determined by the supported scope of the current channel. |
Issues with full initialization
| Issue | Key areas to check | Recommended action |
| --- | --- | --- |
| Full initialization is stuck or fails to start | Resource group queue, source or target connectivity, data source quota, resources for the full initialization phase, and table creation results at the target | First, confirm that schema migration is complete. Then, check if the full sub-tasks have started. If the quota or number of connections is insufficient, reduce concurrency or increase the quota for the source or target. |
Issues with incremental synchronization
| Issue | Key areas to check | Recommended action |
| --- | --- | --- |
| Real-time latency remains high after full initialization is complete | Incremental data accumulated during full initialization, source log read speed, target write speed, checkpoints, and failover | Observe if the latency continuously decreases. If it does not, check the read offset, write wait time, checkpoint failures, and rate limiting at the target. |
| The task cannot resume from its previous offset after being stopped | Check if the retention period for Binlog, Write-Ahead Logging (WAL), message logs, or consumption offsets has expired. Check if the source instance was rebuilt or if its logs were cleared. | Before stopping a task, confirm that the log retention period covers the expected downtime. If an offset is unavailable, you typically need to reset it or re-initialize the task. Assess the risks of data duplication and loss before proceeding. |
| The task fails or latency increases after a Data Definition Language (DDL) operation | The types of DDL operations the source can generate, the DDL handling actions the target supports, and the task's DDL strategy and permissions | Review the DDL events and the results of the schema change at the target. Do not ignore unsupported DDL operations. First, confirm their impact on the target schema and data consistency. |
| Target data is incorrect after DELETE or UPDATE operations | Whether the target has a primary key or unique key to locate records. Whether the primary key mapping is consistent. Whether the write mode supports updates and deletes. | Check the source primary key, target primary key, table mapping, and dirty data logs. If the target has no valid primary key or the mapping is inconsistent, UPDATE and DELETE operations may not be applied to the correct target records. |
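The last row can be illustrated with a toy in-memory model of how change events are applied by key. This is a conceptual sketch, not how any particular target implements writes. The event format, column names, and function name are hypothetical.

```python
def apply_changes(rows, events, key_column):
    """Apply change events to an in-memory table that locates records by `key_column`."""
    table = {row[key_column]: dict(row) for row in rows}
    for op, record in events:
        key = record[key_column]
        if op in ("INSERT", "UPDATE"):
            table[key] = dict(record)      # upsert: overwrite the record found by key
        elif op == "DELETE":
            table.pop(key, None)           # delete only the record found by key
    return sorted(table.values(), key=lambda r: str(r[key_column]))


initial = [{"id": 1, "email": "a@x", "name": "Ann"}]
# The source (primary key: id) changes the email of record 1.
events = [("UPDATE", {"id": 1, "email": "ann@x", "name": "Ann"})]

consistent = apply_changes(initial, events, key_column="id")
mismatched = apply_changes(initial, events, key_column="email")
print(len(consistent))   # the update replaced the existing record
print(len(mismatched))   # the update could not find the old record and added a second one
```

When the target key matches the source primary key, the update lands on the existing record. With a mismatched key, the stale record stays behind next to the new one, which is the kind of inconsistency described in the row above.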
Issues with message sources
| Issue | Key areas to check | Recommended action |
| --- | --- | --- |
| High latency from message sources such as Kafka, DataHub, and LogHub | Check for hotspots in partitions, shards, or topics. Check if consumption concurrency exceeds the number of partitions that can be consumed in parallel. Check if the consumption offset is lagging significantly. | First, check for bottlenecks in a single partition or shard. If hotspots are concentrated, increasing total concurrency may not be effective. Adjust the source partitions or the target's write capability instead. |
Issues with writing to the target
| Issue | Key areas to check | Recommended action |
| --- | --- | --- |
| Slow writes to targets such as MaxCompute due to an excessive number of partitions | Granularity of dynamic partition fields, number of partitions within a single checkpoint, Tunnel or commit time, and target rate limiting | Reduce the partition granularity or avoid using high-cardinality fields for partitioning. Then, adjust batch commits, partition cache, and resource specifications based on the target's capabilities. |
Issues with data consistency
| Issue | Key areas to check | Recommended action |
| --- | --- | --- |
| Historical data is not synchronized after a new table is added | Check if the new table matches the selection rules. Check if the table mapping was refreshed successfully. Check if the channel supports full initialization for new tables. Check if the new table is configured for incremental synchronization only. | If you need historical data, confirm that full initialization is enabled or has been triggered for the new table. If this is not supported, use a data backfill process or a separate full synchronization task to populate the historical data. |
| Data becomes inconsistent after a table is deleted during runtime or at the source | Check if the deleted table is still matched by the task's rules. Check how the DDL strategy handles DROP TABLE. Check for any downstream dependencies on the target table. | Before removing a table during runtime, confirm that no business processes depend on it. Deleting a table at the source does not automatically clean it up at the target. If needed, manage the target table separately according to your data governance policies. |
| An abnormal amount of dirty data is generated | Check if the task is configured to tolerate dirty data. Check for any changes to field types, lengths, primary keys, non-null constraints, or target write limits. | Do not simply increase the dirty data threshold to allow the task to continue. First, determine if the dirty data will cause missing data or field anomalies at the target. Then, decide whether to fix the data, adjust the mapping, or temporarily tolerate the errors. |
PostgreSQL-specific issues
| Issue | Key areas to check | Recommended action |
| --- | --- | --- |
| WAL files accumulate at the PostgreSQL source | Replication slot lag, consumption offset, checkpoints, and whether the task is committing offsets correctly | If the task is consuming data normally but the WAL size does not decrease, check the replication slot lag and offset commit status. This helps prevent the source disk from being filled by WAL files. |
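Replication slot lag is the distance between the current WAL position and the position a slot still holds. The sketch below shows a query you could run on the source (the `pg_replication_slots` view and the functions used exist in PostgreSQL 10 and later) and a small helper that performs the same subtraction on LSN strings. The helper names and sample LSN values are hypothetical.

```python
# Reference query for the PostgreSQL source (version 10 or later); not executed here.
SLOT_LAG_SQL = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
"""


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers: the high and low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def slot_lag_bytes(current_wal_lsn: str, slot_lsn: str) -> int:
    """Bytes of WAL between the current write position and the slot position."""
    return lsn_to_int(current_wal_lsn) - lsn_to_int(slot_lsn)


# Sample values: the slot is 32 bytes behind the current WAL position.
print(slot_lag_bytes("1/10", "0/FFFFFFF0"))
```

A lag that keeps growing while the task reports normal consumption suggests that offsets are not being committed back to the slot, which matches the symptom described above.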
High-risk operation checklist
Before removing tables, adding a large number of tables, restarting a task, rerunning a full initialization, or making major parameter adjustments, confirm each item on this checklist. If a check fails, take the recommended action before you proceed.
- Is the source log retention period long enough for full initialization and incremental catch-up?
- If the target has existing data or downstream dependencies, is a clear overwrite, append, or cleanup strategy in place before you rerun the full initialization?
- Do the new tables require historical data, and does the current channel support full initialization for them?
- Are there any unprocessed Data Definition Language (DDL) statements or frequent failovers?
- Do alerts cover task status, full initialization exceptions, business latency, failover, and DDL?