Best practices for RTO optimization in AliSQL - ApsaraDB RDS

The recovery time objective (RTO) is the amount of time that is required to recover from a failure. It is a core metric that is used to measure the high availability of a database system. AliSQL, developed based on MySQL Community Edition, significantly optimizes the crash recovery process, accelerates instance startup, reduces the RTO, and ensures rapid service recovery.

Optimization overview

Optimization item	Optimized object or process	Optimization effectiveness
Massive table startup optimization	Tablespace check	The time for a normal instance startup is reduced by 73%. The time to recover from downtime is reduced by more than 95%. The memory usage during a startup is reduced by more than 81%.
Buffer pool initialization acceleration	Parallel buffer pool initialization	The time to initialize a buffer pool is reduced by more than 80%.
Parallel redo log application	Redo log scanning and application	The time to apply redo logs is reduced by more than 80%.
Fast transaction recovery	Undo log record structure	The time to recover large transactions with 1 million records is approximately 0 seconds, which can be ignored.
Asynchronous transaction rollback	Uncommitted transaction rollback	The time to roll back large transactions with 1 million records is approximately 0 seconds, which can be ignored.
Optimization of general query log recovery	Corrupted data scanning and locating	The time to recover a general query log table of 10 GB is approximately 0 seconds, which can be ignored.

Massive table startup optimization

Tablespace checking is fundamental to crash recovery. If a large number of tables, such as 1 million tables, exist in your RDS instance, the instance startup and crash recovery processes slow down and a large number of memory resources are occupied. AliSQL has optimized core processes, such as file scanning, data dictionary reading, and table object construction. This significantly reduces the time that is required to scan tables in scenarios in which a large number of tables are involved. The following table describes the optimization effectiveness.

Scenario	MySQL Community Edition (before optimization)	AliSQL (after optimization)	Improvement
Normal startup time	90.7 seconds	24.1 seconds	Reduced by more than 73%
Crash recovery time	521.7 seconds	25 seconds	Reduced by more than 95%
Memory usage during startup	High	Reduced by more than 81%	Significantly optimized

How to enable

Minor engine version: If your RDS instance runs MySQL 8.0, the minor engine version must be later than or equal to 20240731.
Key parameter: loose_fetch_raw_tablespace_at_startup. This parameter is enabled by default. If this parameter is enabled, only necessary metadata is loaded during the instance startup, and the loading of redundant information is delayed. This reduces memory usage and startup time.

Buffer pool initialization acceleration

The buffer pool is the core memory structure of InnoDB. To initialize the buffer pool, memory allocation and management structure construction are required. MySQL Community Edition supports parallel buffer pool initialization at the instance level. However, concurrency bottlenecks still exist in lock structures and global statistics updates. AliSQL optimizes the parallel buffer pool initialization, which eliminates bottlenecks in the parallel initialization process and significantly reduces the time for the buffer pool initialization of instances with large memory capacity. The following table describes the optimization effectiveness.

Scenario	MySQL Community Edition (before optimization)	AliSQL (after optimization)	Improvement
Time to initialize a buffer pool of 512 GB	19.7 seconds	3.8 seconds	Reduced by more than 80%

How to enable

Minor engine version:
- If your RDS instance runs MySQL 8.0, the minor engine version must be later than or equal to 20220530.
- If your RDS instance runs MySQL 5.7, the minor engine version must be later than or equal to 20230531.
Key parameter: innodb_buffer_pool_init_optimize. This parameter is enabled by default. If this parameter is enabled, the system automatically optimizes the parallel buffer pool initialization to eliminate lock and synchronization bottlenecks.

Parallel redo log application

Redo logs are write-ahead logging (WAL) logs in MySQL databases to ensure data consistency. Redo log scanning and application are critical steps during crash recovery. MySQL Community Edition only supports single-threaded scanning and serialized application. As a result, several minutes are required to recover a large number of redo logs, such as 4 GB of redo logs. AliSQL introduces a pipeline mechanism to parallelize redo log scanning and application, which significantly improves the recovery speed. The following table describes the optimization effectiveness.

Scenario	MySQL Community Edition (before optimization)	AliSQL (after optimization)	Improvement
Time to apply active redo logs of 4 GB	118.7 seconds	22.6 seconds	Reduced by more than 80%
Application speed	Baseline speed	Increased by more than 500%	Significantly improved

How to enable

Minor engine version: If your RDS instance runs MySQL 8.0, the minor engine version must be later than or equal to 20241231.
Key parameter: innodb_parallel_redo_threads. Default value: 1. This parameter is disabled by default. You must manually set the innodb_parallel_redo_threads parameter to a value greater than 2, such as 4. This parameter specifies the number of threads for parallel redo Log application. You can configure this parameter to accelerate redo log scanning and application.

Fast transaction recovery

During crash recovery, after redo logs are applied, MySQL needs to use undo logs to recover table locks for active transactions. MySQL Community Edition needs to scan all undo log records row by row to obtain information about the tables that are managed by the transactions. As a result, the time that is required to recover large transactions, such as transactions with millions of records, is significantly increased. AliSQL extends the structure of undo log records to implement fast undo log locating and scanning. This significantly reduces the time to recover transactions. The following table describes the optimization effectiveness.

Scenario	MySQL Community Edition (before optimization)	AliSQL (after optimization)	Improvement
Time to recover large transactions with 1 million records	6.8 seconds	Approximately 0 seconds, which can be ignored	Completely optimized

How to enable

Minor engine version:
- If your RDS instance runs MySQL 8.0, the minor engine version must be later than or equal to 20230630.
- If your RDS instance runs MySQL 5.7, the minor engine version must be later than or equal to 20230831.
Key parameter: innodb_trx_resurrect_table_lock_accelerate. Default value: OFF. This parameter is disabled by default. You must manually set the innodb_trx_resurrect_table_lock_accelerate parameter to ON. If this parameter is enabled, the system enables the fast undo log scanning mechanism to accelerate large transaction recovery.

Asynchronous transaction rollback

MySQL uses internal XA transactions to ensure consistency between binary logs and InnoDB. To commit a transaction, the system first executes the InnoDB prepare phase and keeps the transaction prepared, writes the prepared transaction to binary logs, and then commit the transaction. During crash recovery, the commit or rollback of uncommitted transactions is determined by binary logs. Transactions in the prepared state are considered uncommitted transactions. In MySQL Community Edition, uncommitted transaction rollback is a synchronous operation. This causes the time to recover large transactions, such as transactions with millions of records, to be significantly increased. AliSQL changes uncommitted transaction rollback to asynchronous, which does not block instance startup and allows the instance to quickly recover. The following table describes the optimization effectiveness.

Scenario	MySQL Community Edition (before optimization)	AliSQL (after optimization)	Improvement
Time to roll back large transactions with 1 million records	100 seconds	Approximately 0 seconds, which can be ignored	Completely optimized

How to enable

Minor engine version:
- If your RDS instance runs MySQL 8.0, the minor engine version must be later than or equal to 20220530.
- If your RDS instance runs MySQL 5.7, the minor engine version must be later than or equal to 20230531.
Key parameter: async_binlog_recovery. This parameter is enabled by default. If this parameter is enabled, the system enables the mechanism to asynchronously roll back uncommitted transactions to accelerate instance startup.

Optimization of general query log recovery

In TABLE mode, MySQL general query logs record user operations in a CSV table. If corruption is detected on this table during crash recovery, MySQL Community Edition performs a full table scan to locate and rebuild the corrupted data. This causes the time to recover large data volumes, such as 10 GB, to be significantly increased. AliSQL uses techniques such as reverse scanning and file truncation to quickly recover the corrupted log table, which significantly reduces the instance startup delay and accelerates instance recovery. The following table describes the optimization effectiveness.

Scenario	MySQL Community Edition (before optimization)	AliSQL (after optimization)	Improvement
Time to recover a general query log table of 10 GB	270 seconds	Approximately 0 seconds, which can be ignored	Completely optimized

How to enable

Minor engine version:
- If your RDS instance runs MySQL 8.0, the minor engine version must be later than or equal to 20241130.
- If your RDS instance runs MySQL 5.7, the minor engine version must be later than or equal to 20241130.
Key parameter: reverse_repair_general_log_on_boot. This parameter is enabled by default. If this parameter is enabled, the system enables reverse scanning and file truncation to accelerate the recovery of general query log tables.

Other optimizations

AliSQL also optimizes the process to verify data in the doublewrite buffer. We recommend that you update the minor engine version to the latest version.