ApsaraDB RDS: Causes and solutions for synchronization latency in RDS for MySQL read-only instances

Last Updated: Apr 01, 2026

Read-only RDS instances use MySQL's native log-based asynchronous or semi-synchronous replication to stay in sync with the primary instance. This mechanism can introduce a lag, which may cause temporary data inconsistencies or exhaust the storage capacity of the read-only instance as binary logs accumulate.

If the primary RDS instance generates a large number of logs, its read-only RDS instances may be locked.

Use this document to interpret replication latency metrics, identify the root cause, and resolve the issue.

How replication latency is measured

The Seconds_Behind_Master field in SHOW SLAVE STATUS \G output reports the replication latency in seconds.

Latency formula: Latency = Current time − Point in time when the transaction being applied on the read-only instance was committed on the primary RDS instance

Run the following statement on the read-only instance to check its current replication status:

SHOW SLAVE STATUS \G

Key fields to examine:

Field | What to look for
Slave_IO_Running | Should be Yes. No indicates that replication is broken.
Slave_SQL_Running | Should be Yes. No indicates that replication is broken.
Seconds_Behind_Master | Replication latency in seconds. 0 means the read-only instance has caught up. NULL means replication is not active.
Exec_Master_Log_Pos | If this value stays unchanged while Seconds_Behind_Master keeps rising, the SQL thread is stuck on a large transaction or DDL operation.
Last_IO_Error / Last_SQL_Error | Error messages for the I/O thread or SQL thread, if any.

Latency <= 1 second: when to ignore it

A reported latency of 1 second or less typically does not indicate a real problem. Two common causes explain why:

Monitoring granularity. If the monitoring time range is set to more than 6 minutes, the default monitoring granularity can be as coarse as 30 seconds. The displayed value is an average over that interval, not an instantaneous reading. To get a 1-second granularity reading, go to the Performance Trends tab in the ApsaraDB RDS console and set the monitoring time range to less than 6 minutes.

Cross-second sampling. Sampling points are always rounded down to the nearest second. If a transaction starts in one second and completes in the next, the latency calculation reports 1 second regardless of the actual sub-second lag. The table below illustrates this behavior:

Transaction | Commit time on primary | Commit time on read-only | Current time | Reported latency
Trx1 | 00:00:00.30 | 00:00:00.50 | 00:00:00.35 | 0(0.35) - 0(0.30) = 0 s
Trx1 | 00:00:00.30 | 00:00:00.50 | 00:00:00.45 | 0(0.45) - 0(0.30) = 0 s
Trx2 | 00:00:00.90 | 00:00:01.10 | 00:00:00.95 | 0(0.95) - 0(0.90) = 0 s
Trx2 | 00:00:00.90 | 00:00:01.10 | 00:00:01.05 | 1(1.05) - 0(0.90) = 1 s

In the Reported latency column, each timestamp is truncated to a whole second before subtraction; the value in parentheses is the untruncated time.

Set the alert threshold for read latency to greater than 1 second to avoid false alarms.

Latency > 1 second: identify and fix the cause

A sustained latency of more than 1 second requires investigation. The following sections cover the five most common causes, how to confirm each one, and how to resolve it.

Important

Before you modify instance configurations or data, such as changing instance specifications, create a snapshot or enable log backup to protect your data. If you granted permissions on sensitive information, such as usernames and passwords, or submitted such information in the Alibaba Cloud Management Console, we recommend that you change that information at the earliest opportunity.

Cause 1: Read-only instance specifications are too low

What happens. Replication uses an I/O thread (to fetch binary logs from the primary) and an SQL thread (to apply them locally). Both threads compete for I/O resources, including input/output operations per second (IOPS). If the read-only instance cannot sustain the required IOPS, it falls behind.

How to confirm. In the ApsaraDB RDS console, go to Monitoring and Alerts for the read-only instance. Check whether CPU utilization, memory usage, I/O bandwidth, or IOPS are consistently hitting their limits. You can also view the instance type in the Configuration Information section of the Basic Information page. For more information, see Instance types for read-only ApsaraDB RDS for MySQL instances (x86) and Instance types for read-only ApsaraDB RDS for MySQL instances (ARM).

How to fix. Upgrade the read-only instance to specifications equal to or higher than those of the primary instance. For more information, see Change the specifications of an ApsaraDB RDS for MySQL instance.

Cause 2: High transactions per second (TPS) on the primary instance

What happens. A single SQL thread applies all replicated transactions on the read-only instance serially. If the primary instance processes a high volume of concurrent writes, the read-only instance cannot keep up.

How to confirm. Go to the Dashboard page of the primary or read-only instance and check the TPS metric. For more information, see Use the dashboard feature for an ApsaraDB RDS for MySQL instance.

How to fix. Optimize or split large transactions on the primary instance to reduce per-transaction write volume and shorten the time each transaction holds the SQL thread.
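For example, a bulk copy that would otherwise run as a single huge transaction can be committed in bounded ranges. The following is a minimal sketch; the table names and ID ranges are hypothetical placeholders:

-- Hypothetical tables: copy rows in bounded ID ranges so that each
-- transaction stays small and releases the SQL thread quickly.
INSERT INTO orders_archive SELECT * FROM orders WHERE id BETWEEN 1 AND 100000;
INSERT INTO orders_archive SELECT * FROM orders WHERE id BETWEEN 100001 AND 200000;
-- Continue in further ranges until the full data set is covered.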

Cause 3: Large transactions on the primary instance

What happens. Operations such as UPDATE, DELETE, INSERT...SELECT, and REPLACE...SELECT on large data volumes generate large binary log entries. The read-only instance must spend the same amount of time applying the transaction as the primary did executing it. For example, a deletion that takes 80 seconds on the primary also takes 80 seconds to replicate.

Additionally, while concurrent transactions on multiple tables are supported, only one thread replicates transactions on any given table at a time, which can compound the delay.

How to confirm. Run SHOW SLAVE STATUS \G on the read-only instance and watch for Seconds_Behind_Master continuously increasing while Exec_Master_Log_Pos stays the same. Then run SHOW PROCESSLIST to identify the specific SQL thread. If binary logging is set to ROW format, check whether any binary log file exceeds the max_binlog_size threshold:

SHOW BINARY LOGS;

If File_size exceeds max_binlog_size, large transactions are likely the cause.
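To see the threshold configured on your instance, you can query the variable directly:

SHOW VARIABLES LIKE 'max_binlog_size';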

Also check for metadata lock (MDL) contention by examining the output of SHOW SLAVE STATUS \G for lock wait states.
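If your instance runs MySQL 5.7 or later and the MDL instrument is enabled in performance_schema, a sketch like the following can also list pending metadata lock waits directly:

-- Requires the metadata lock instrument to be enabled, for example:
-- UPDATE performance_schema.setup_instruments SET ENABLED = 'YES'
--   WHERE NAME = 'wait/lock/metadata/sql/mdl';
SELECT object_type, object_schema, object_name, lock_type, lock_status
FROM performance_schema.metadata_locks
WHERE lock_status = 'PENDING';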

How to fix. Split large transactions into smaller batches. For example, add a WHERE clause with a row limit to DELETE statements so each batch processes a manageable number of rows, allowing the read-only instance to keep pace.
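A minimal sketch of such a batch, assuming a hypothetical table order_history with an indexed created_at column:

-- Hypothetical table and column names; repeat until zero rows are affected.
-- With statement-based binary logging, add an ORDER BY on an indexed
-- column to keep the statement deterministic.
DELETE FROM order_history
WHERE created_at < '2024-01-01'
LIMIT 10000;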

Cause 4: Long-running DDL on the primary instance

What happens. Replication applies Data Definition Language (DDL) statements serially. A DDL operation on a large table, such as CREATE INDEX, ALTER TABLE ... ADD COLUMN, or REPAIR TABLE, can block all subsequent replication until it completes. Slow queries that generate large numbers of temporary tables compound the issue by consuming disk I/O.

Separately, ongoing queries or open transactions on the read-only instance can create MDL contention that blocks DDL statements arriving from the primary.

How to confirm.

  • Check whether binary log files are accumulating without being truncated, which indicates a DDL backlog.

  • Run the following statement on the read-only instance to check for lock contention:

    SHOW ENGINE INNODB STATUS \G
  • Run the following statement to identify tables currently in use:

    SHOW OPEN TABLES;

    Tables with In_use of 1 or more are in use and may be blocking replication.

  • Review slow query logs for DDL-related operations such as OPTIMIZE, ALTER, REPAIR, and CREATE; a query sketch follows this list. For more information, see Slow query log analysis.
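If log_output on the instance includes TABLE, slow queries are recorded in mysql.slow_log, and a sketch like the following can surface recent DDL statements. This assumes TABLE logging is enabled; with FILE logging, inspect the slow log file instead:

SELECT start_time, query_time,
       LEFT(CONVERT(sql_text USING utf8mb4), 100) AS sql_snippet
FROM mysql.slow_log
WHERE CONVERT(sql_text USING utf8mb4) REGEXP '^(ALTER|OPTIMIZE|REPAIR|CREATE)'
ORDER BY start_time DESC
LIMIT 20;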

How to fix.

If a DDL statement from the primary is blocked by a metadata lock on the read-only instance:

  1. Run SHOW PROCESSLIST; on the read-only instance to find the SQL thread that is in the Waiting for table metadata lock state.

  2. Terminate the session holding the lock to resume replication, as in the sketch below. For more information, see Use DMS to release metadata locks.
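A sketch of the two steps; 12345 is a hypothetical placeholder for the ID of the session that actually holds the lock:

-- 1. Find the session whose State is "Waiting for table metadata lock"
--    and the long-running session that blocks it.
SHOW PROCESSLIST;
-- 2. Terminate the blocking session (12345 is a hypothetical ID).
KILL 12345;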

Cause 5: Table with a unique index but no primary key

What happens. When a table has no primary key and only a unique index that contains NULL values, the replication SQL thread uses the unique index's execution plan instead of the optimizer's preferred plan, even if a WHERE clause would otherwise allow a more efficient lookup. This forces full table scans during replication, which significantly increases logical reads and slows down the SQL thread.

How to confirm. Run the following query on the primary instance to identify tables that have no primary key:

-- List user tables that have no PRIMARY KEY constraint.
SELECT tab.table_schema AS database_name, tab.table_name
FROM information_schema.tables tab
LEFT JOIN information_schema.table_constraints tco
  ON tab.table_schema = tco.table_schema
  AND tab.table_name = tco.table_name
  AND tco.constraint_type = 'PRIMARY KEY'
WHERE tco.constraint_type IS NULL
  AND tab.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
  AND tab.table_type = 'BASE TABLE'
ORDER BY tab.table_schema, tab.table_name;

For tables returned by the query, check information_schema.statistics to confirm whether any unique index allows NULL values in its key columns.
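A minimal check that lists unique indexes whose key columns are nullable; whether a given table actually stores NULLs still requires querying its data:

SELECT table_schema, table_name, index_name, column_name
FROM information_schema.statistics
WHERE non_unique = 0
  AND nullable = 'YES'
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');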

How to fix. Add an explicit primary key to any table that lacks one.
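For example, an auto-increment surrogate key can be added to a table that lacks one; the table name below is hypothetical:

-- Hypothetical table name; adds a surrogate primary key.
ALTER TABLE order_history
  ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;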

FAQ

Why does a 1-second latency appear on some read-only instances but not others?

Each read-only instance starts its metric sampling at a slightly different time. If a sampling point happens to cross a second boundary, a 1-second latency is reported; if it does not, the latency reports as 0 seconds. This is a measurement artifact, not a real difference in replication performance. For details, see Latency <= 1 second: when to ignore it.

Does a 1-second latency mean my instance is affected?

No. A reported 1-second latency does not mean actual lag occurred. The value reflects calculation rounding rather than a genuine synchronization delay. Set the read latency threshold for your proxy or alert to greater than 1 second to filter out these false positives.

What do I do when the ReplicationInterrupted error is displayed, or when a Slave_SQL_Running or Slave_IO_Running alert is triggered?

Check the following in order:

  1. Storage capacity. If the read-only instance runs out of storage, it cannot write incoming binary logs. Check the storage usage in the Usage Statistics section of the Basic Information page.

  2. Replication latency. If the latency continuously exceeds 5 seconds within a 5-minute window, an alert is triggered. High write volume or large transactions on the primary are the most likely cause. To view the current latency, go to Monitoring and Alerts > Standard Monitoring and check the Replication Latency of Secondary Instances(second) metric.

  3. Slow query logs. Slow queries on the read-only instance degrade SQL thread performance and can amplify replication lag. Check the logs under Logs > Slow Query Logs, or go to Autonomy Services > Slow Query Logs.

If none of the above applies, no action is required: the system automatically detects and resolves the replication interruption.