ApsaraDB RDS: Causes and solutions for synchronization latency in RDS for MySQL read-only instances

Last Updated: Apr 01, 2026

Read-only RDS instances use MySQL's native log-based asynchronous or semi-synchronous replication to stay in sync with the primary instance. This mechanism can introduce a lag, which may cause temporary data inconsistencies or exhaust the storage capacity of the read-only instance as binary logs accumulate.

If the primary RDS instance generates a large number of logs, its read-only RDS instances may be locked.

Use this document to interpret replication latency metrics, identify the root cause, and resolve the issue.

How replication latency is measured

The Seconds_Behind_Master field in SHOW SLAVE STATUS \G output reports the replication latency in seconds.

Latency formula: Latency = Current time − Point in time when the transaction being applied on the read-only instance was committed on the primary RDS instance

Run the following statement on the read-only instance to check its current replication status:

SHOW SLAVE STATUS \G

Key fields to examine:

Field | What to look for
Slave_IO_Running | Should be Yes. No indicates that replication is broken.
Slave_SQL_Running | Should be Yes. No indicates that replication is broken.
Seconds_Behind_Master | Replication latency in seconds. 0 means the read-only instance has caught up. NULL means replication is not active.
Exec_Master_Log_Pos | If this value stays unchanged while Seconds_Behind_Master keeps rising, the SQL thread is stuck on a large transaction or DDL operation.
Last_IO_Error / Last_SQL_Error | Error messages for the I/O thread or SQL thread, if any.

Latency <= 1 second: when to ignore it

A reported latency of 1 second or less typically does not indicate a real problem. Two common causes explain why:

Monitoring granularity. If the monitoring time range is set to more than 6 minutes, the default monitoring granularity can be as coarse as 30 seconds. The displayed value is an average over that interval, not an instantaneous reading. To get a 1-second granularity reading, go to the Performance Trends tab in the ApsaraDB RDS console and set the monitoring time range to less than 6 minutes.

Cross-second sampling. Sampling points are always rounded down to the nearest second. If a transaction starts in one second and completes in the next, the latency calculation reports 1 second regardless of the actual sub-second lag. The table below illustrates this behavior:

Transaction | Commit time on primary | Commit time on read-only | Current time | Reported latency
Trx1 | 00:00:00.30 | 00:00:00.50 | 00:00:00.35 | 0(0.35) - 0(0.30) = 0 s
Trx1 | 00:00:00.30 | 00:00:00.50 | 00:00:00.45 | 0(0.45) - 0(0.30) = 0 s
Trx2 | 00:00:00.90 | 00:00:01.10 | 00:00:00.95 | 0(0.95) - 0(0.90) = 0 s
Trx2 | 00:00:00.90 | 00:00:01.10 | 00:00:01.05 | 1(1.05) - 0(0.90) = 1 s

In the Reported latency column, each timestamp is truncated to a whole second before subtraction; the value in parentheses is the untruncated time.

Set the alert threshold for read latency to greater than 1 second to avoid false alarms.

Latency > 1 second: identify and fix the cause

A sustained latency of more than 1 second requires investigation. The following sections cover the five most common causes, how to confirm each one, and how to resolve it.

Important

Before you modify instance configurations or data, such as changing instance specifications, create a snapshot or enable log backup to protect your data. If you granted permissions on sensitive information, such as usernames and passwords, or submitted such information in the Alibaba Cloud Management Console, we recommend that you change that information at the earliest opportunity.

Cause 1: Read-only instance specifications are too low

What happens. Replication uses an I/O thread (to fetch binary logs from the primary) and an SQL thread (to apply them locally). Both threads compete for I/O resources, including input/output operations per second (IOPS). If the read-only instance cannot sustain the required IOPS, it falls behind.

How to confirm. In the ApsaraDB RDS console, go to Monitoring and Alerts for the read-only instance. Check whether CPU utilization, memory usage, I/O bandwidth, or IOPS are consistently hitting their limits. You can also view the instance type in the Configuration Information section of the Basic Information page. For more information, see Instance types for read-only ApsaraDB RDS for MySQL instances (x86) and Instance types for read-only ApsaraDB RDS for MySQL instances (ARM).

How to fix. Upgrade the read-only instance to specifications equal to or higher than those of the primary instance. For more information, see Change the specifications of an ApsaraDB RDS for MySQL instance.

Cause 2: High transactions per second (TPS) on the primary instance

What happens. A single SQL thread applies all replicated transactions on the read-only instance serially. If the primary instance processes a high volume of concurrent writes, the read-only instance cannot keep up.

How to confirm. Go to the Dashboard page of the primary or read-only instance and check the TPS metric. For more information, see Use the dashboard feature for an ApsaraDB RDS for MySQL instance.

How to fix. Optimize or split large transactions on the primary instance to reduce per-transaction write volume and shorten the time each transaction holds the SQL thread.
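For example, a bulk copy that would otherwise run as a single huge transaction can be committed in bounded ranges. The following is a minimal sketch; the table names and ID ranges are hypothetical placeholders:

-- Hypothetical tables: copy rows in bounded ID ranges so that each
-- transaction stays small and releases the SQL thread quickly.
INSERT INTO orders_archive SELECT * FROM orders WHERE id BETWEEN 1 AND 100000;
INSERT INTO orders_archive SELECT * FROM orders WHERE id BETWEEN 100001 AND 200000;
-- Continue in further ranges until the full data set is covered.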

Cause 3: Large transactions on the primary instance

What happens. Operations such as UPDATE, DELETE, INSERT...SELECT, and REPLACE...SELECT on large data volumes generate large binary log entries. The read-only instance must spend the same amount of time applying the transaction as the primary did executing it. For example, a deletion that takes 80 seconds on the primary also takes 80 seconds to replicate.

Additionally, while concurrent transactions on multiple tables are supported, only one thread replicates transactions on any given table at a time, which can compound the delay.

How to confirm. Run SHOW SLAVE STATUS \G on the read-only instance and watch for Seconds_Behind_Master continuously increasing while Exec_Master_Log_Pos stays the same. Then run SHOW PROCESSLIST to identify the specific SQL thread. If binary logging is set to ROW format, check whether any binary log file exceeds the max_binlog_size threshold:

SHOW BINARY LOGS;

If File_size exceeds max_binlog_size, large transactions are likely the cause.
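To see the threshold configured on your instance, you can query the variable directly:

SHOW VARIABLES LIKE 'max_binlog_size';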

Also check for metadata lock (MDL) contention by examining the output of SHOW SLAVE STATUS \G for lock wait states.
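If your instance runs MySQL 5.7 or later and the MDL instrument is enabled in performance_schema, a sketch like the following can also list pending metadata lock waits directly:

-- Requires the metadata lock instrument to be enabled, for example:
-- UPDATE performance_schema.setup_instruments SET ENABLED = 'YES'
--   WHERE NAME = 'wait/lock/metadata/sql/mdl';
SELECT object_type, object_schema, object_name, lock_type, lock_status
FROM performance_schema.metadata_locks
WHERE lock_status = 'PENDING';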

How to fix. Split large transactions into smaller batches. For example, add a WHERE clause with a row limit to DELETE statements so each batch processes a manageable number of rows, allowing the read-only instance to keep pace.
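A minimal sketch of such a batch, assuming a hypothetical table order_history with an indexed created_at column:

-- Hypothetical table and column names; repeat until zero rows are affected.
-- With statement-based binary logging, add an ORDER BY on an indexed
-- column to keep the statement deterministic.
DELETE FROM order_history
WHERE created_at < '2024-01-01'
LIMIT 10000;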

Cause 4: Long-running DDL on the primary instance

What happens. Replication applies Data Definition Language (DDL) statements serially. A DDL operation on a large table, such as CREATE INDEX, ALTER TABLE ... ADD COLUMN, or REPAIR TABLE, can block all subsequent replication until it completes. Slow queries that generate large numbers of temporary tables compound the issue by consuming disk I/O.

Separately, ongoing queries or open transactions on the read-only instance can create MDL contention that blocks DDL statements arriving from the primary.

How to confirm.

  • Check whether binary log files are accumulating without being truncated, which indicates a DDL backlog.

  • Run the following statement on the read-only instance to check for lock contention:

    SHOW ENGINE INNODB STATUS \G
  • Run the following statement to identify tables currently in use:

    SHOW OPEN TABLES;

    Tables with In_use of 1 or more are in use and may be blocking replication.

  • Review slow query logs for DDL-related operations such as OPTIMIZE, ALTER, REPAIR, and CREATE; a query sketch follows this list. For more information, see Slow query log analysis.
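If log_output on the instance includes TABLE, slow queries are recorded in mysql.slow_log, and a sketch like the following can surface recent DDL statements. This assumes TABLE logging is enabled; with FILE logging, inspect the slow log file instead:

SELECT start_time, query_time,
       LEFT(CONVERT(sql_text USING utf8mb4), 100) AS sql_snippet
FROM mysql.slow_log
WHERE CONVERT(sql_text USING utf8mb4) REGEXP '^(ALTER|OPTIMIZE|REPAIR|CREATE)'
ORDER BY start_time DESC
LIMIT 20;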

How to fix.

If a DDL statement from the primary is blocked by a metadata lock on the read-only instance:

  1. Run SHOW PROCESSLIST; on the read-only instance to find the SQL thread that is in the Waiting for table metadata lock state.

  2. Terminate the session holding the lock to resume replication, as in the sketch below. For more information, see Use DMS to release metadata locks.
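A sketch of the two steps; 12345 is a hypothetical placeholder for the ID of the session that actually holds the lock:

-- 1. Find the session whose State is "Waiting for table metadata lock"
--    and the long-running session that blocks it.
SHOW PROCESSLIST;
-- 2. Terminate the blocking session (12345 is a hypothetical ID).
KILL 12345;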

Cause 5: Table with a unique index but no primary key

What happens. When a table has no primary key and only a unique index that contains NULL values, the replication SQL thread uses the unique index's execution plan instead of the optimizer's preferred plan, even if a WHERE clause would otherwise allow a more efficient lookup. This forces full table scans during replication, which significantly increases logical reads and slows down the SQL thread.

How to confirm. Run the following query on the primary instance to identify tables that have no primary key:

-- List user tables that have no PRIMARY KEY constraint.
SELECT tab.table_schema AS database_name, tab.table_name
FROM information_schema.tables tab
LEFT JOIN information_schema.table_constraints tco
  ON tab.table_schema = tco.table_schema
  AND tab.table_name = tco.table_name
  AND tco.constraint_type = 'PRIMARY KEY'
WHERE tco.constraint_type IS NULL
  AND tab.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
  AND tab.table_type = 'BASE TABLE'
ORDER BY tab.table_schema, tab.table_name;

For tables returned by the query, check information_schema.statistics to confirm whether any unique index allows NULL values in its key columns.
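A minimal check that lists unique indexes whose key columns are nullable; whether a given table actually stores NULLs still requires querying its data:

SELECT table_schema, table_name, index_name, column_name
FROM information_schema.statistics
WHERE non_unique = 0
  AND nullable = 'YES'
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');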

How to fix. Add an explicit primary key to any table that lacks one.
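For example, an auto-increment surrogate key can be added to a table that lacks one; the table name below is hypothetical:

-- Hypothetical table name; adds a surrogate primary key.
ALTER TABLE order_history
  ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;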

FAQ

Why does a 1-second latency appear on some read-only instances but not others?

Each read-only instance starts its metric sampling at a slightly different time. If a sampling point happens to cross a second boundary, a 1-second latency is reported; if it does not, the latency reports as 0 seconds. This is a measurement artifact, not a real difference in replication performance. For details, see Latency <= 1 second: when to ignore it.

Does a 1-second latency mean my instance is affected?

No. A reported 1-second latency does not mean actual lag occurred. The value reflects calculation rounding rather than a genuine synchronization delay. Set the read latency threshold for your proxy or alert to greater than 1 second to filter out these false positives.

What do I do when the ReplicationInterrupted error is displayed, or when a Slave_SQL_Running or Slave_IO_Running alert is triggered?

Check the following in order:

  1. Storage capacity. If the read-only instance runs out of storage, it cannot write incoming binary logs. Check the storage usage in the Usage Statistics section of the Basic Information page.

  2. Replication latency. If the latency continuously exceeds 5 seconds within a 5-minute window, an alert is triggered. High write volume or large transactions on the primary are the most likely cause. To view the current latency, go to Monitoring and Alerts > Standard Monitoring and check the Replication Latency of Secondary Instances(second) metric.

  3. Slow query logs. Slow queries on the read-only instance degrade SQL thread performance and can amplify replication lag. Check the logs under Logs > Slow Query Logs, or go to Autonomy Services > Slow Query Logs.

If none of the above applies, no action is required: the system automatically detects and resolves the replication interruption.