By Baotiao
A core concept of a cloud-native database is the decoupling of compute and storage. This decoupling divides the database system into two independent layers that can be scaled separately: the compute layer, responsible for query and transaction processing, and the storage layer, which manages the persistence of logs and data pages. In the paper "CloudJump: Optimizing Cloud Database For Cloud Storage," we analyzed the challenges that the change of medium poses to database design once the storage layer switches from local storage to cloud storage under a compute-storage-separation architecture, and we proposed a series of optimization frameworks to address these challenges and exploit the advantages of the separation. Once compute and storage are separated, a natural and very important next step is to allow multiple compute nodes to share the same remote storage. This brings a series of advantages to the database, such as efficient compute elasticity, fast and atomic node switchover, and lower primary/secondary latency. As with compute-storage separation itself, shared storage also brings new challenges and opportunities to databases. To this end, in the CloudJump II paper, we analyze in detail the problems faced by shared storage databases, propose MVD (Multi-Version Data) technology to address them, and explore further advantages of the shared storage architecture.
This paper has been published in SIGMOD 2025. Interested readers can download and read it: "CloudJump II: Optimizing Cloud Databases for Shared Storage".
After a cloud database decouples the compute layer from the storage layer, it is a natural choice to allow multiple compute nodes to access the same cloud storage. As shown in the figure, multiple compute nodes, including one read/write Leader (RW) node and multiple read-only Follower (RO) nodes, connect over the network to a storage cluster composed of many data nodes. Compared with traditional primary/secondary databases or "shared-nothing" databases represented by Google Spanner, the defining feature of this architecture is that it completely eliminates data copying between nodes, which yields some obvious advantages, such as:
• During normal operation, a Follower node does not need to pull logs from the Leader node and replay all of the Leader's operations in full. Instead, it only needs to synchronize metadata, including the write log offset, and maintain and update some in-memory state. Therefore, very low (ms-level) primary/secondary latency can be achieved with very low network I/O overhead.
• When compute nodes are added, no data copying is required. A new node directly attaches to the same storage data and provides services after the necessary in-memory state initialization. Thus, fast elasticity independent of the data volume is obtained.
Although different compute nodes can share the same storage data, the memory status of different nodes is still independent. The most direct example is the page cache Buffer Pool maintained by each node in memory. The read-only node still needs to obtain the latest updates on the read/write node, including those in the page cache, through log replay, and use this to update its own memory status. However, unlike the traditional primary/secondary structure, different nodes of a shared storage database require a completely consistent physical data view. This is not satisfied by the traditional replication method based on logical logs such as binary logging. Instead, a replication method based on physical logs is required, such as Redo logs in MySQL. We call this replication method physical replication.

As the earliest industrial-grade shared storage database product employing physical replication as discussed above, Aurora, driven by the goal of minimizing network I/O overhead to the extreme, further customized the storage layer into a dedicated Page service. It offers compute nodes the ability to write Redo and to read specified versions of pages, while there is no page-write traffic at all from the compute layer to the storage layer; instead, the storage layer independently replays Redo logs to advance page versions. Because the modification volume in Redo is usually far smaller than the size of a page, this method significantly reduces network write traffic. Aurora's approach has profoundly influenced subsequent cloud-native shared storage database products, including Microsoft's Socrates and similar products from major cloud vendors.
However, this method deeply customizes the storage layer's services: it pushes more database complexity into the storage layer implementation, limits the storage layer's ability to benefit from the standard cloud storage services that major cloud vendors keep optimizing, increases the complexity of the storage layer, and enlarges the fault radius. We therefore propose a different exploration: based on the CloudJump framework, we build the shared storage layer of cloud databases on standard cloud storage services. This solution enhances the flexibility and scalability of storage solutions and better meets the dynamic needs of cloud-native applications. With standardized components, CloudJump can build high-quality cloud-native database services on multiple cloud platforms, better support the independent evolution of the storage layer, and better enjoy the storage layer upgrade dividends brought by new hardware and new architectures.

Taking the shared storage implementation of Alibaba Cloud PolarDB as an example, we can understand this mechanism more clearly. PolarDB supports a Leader-Follower model on a single shared dataset, with one read/write node (RW) and multiple read-only nodes (RO). What shared storage provides to compute nodes is a distributed storage service exposing a standard file system API. When a write operation is executed, the RW node generates Redo log records and writes them to the shared storage. Each log record is identified by a Log Sequence Number (LSN), corresponding to a specific version of the database. At the same time, the RW notifies all RO nodes of the latest Redo log LSN over the network. RO nodes replay these logs to synchronize the latest updates, including data pages in the Buffer Pool, transaction state, and various in-memory cache structures. This synchronization mechanism is called Active Log Update Chasing. In addition, a page not previously in the Buffer Pool is loaded from shared storage when first accessed, via the Passive On-Demand Access mechanism, and brought up to the latest state the RO requires by replaying Redo logs.
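The two mechanisms above can be sketched in a few lines of Python. This is an illustrative toy, not PolarDB's actual implementation; all names (`RedoRecord`, `ReadOnlyNode`, `chase`) are our own:

```python
# Minimal sketch of Active Log Update Chasing on an RO node.
# All names and structures here are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class RedoRecord:
    lsn: int        # log sequence number identifying this change
    page_id: int    # the single page this record modifies
    payload: str    # opaque description of the change


class ReadOnlyNode:
    def __init__(self):
        self.applied_lsn = 0    # how far this RO has replayed
        self.buffer_pool = {}   # page_id -> list of applied payloads

    def chase(self, redo_stream, target_lsn):
        """Replay redo up to the LSN the RW announced over the network."""
        for rec in redo_stream:
            if rec.lsn > target_lsn:
                break
            # Only pages already cached are updated; absent pages are
            # caught up lazily on first access (Passive On-Demand Access).
            if rec.page_id in self.buffer_pool:
                self.buffer_pool[rec.page_id].append(rec.payload)
            self.applied_lsn = rec.lsn
```

Pages absent from the buffer pool are deliberately skipped here; they are only caught up on first access, which is precisely what opens the door to the consistency problem discussed below.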
As introduced above, in this shared storage database adopting the Leader-Follower model, the in-memory data of an RO node is updated by asynchronously replaying Redo logs, while the data in shared storage is updated by dirty page flushing on the RW. A data consistency problem therefore arises on the RO. Take the following B+Tree node split scenario as an example:

As shown in Figure (a), this is the initial state of a page in the Buffer Pool of both the RW and the RO. At this point, inserting 90 causes Page 8 to split, generating a new Page 9 that takes over some elements previously on Page 8, 97 among them, as shown in Figure (b). The RW then flushes Page 8, so Page 8 on shared storage is overwritten by the latest version. Because replication is asynchronous, the Buffer Pool on the RO still reflects the pre-split state. Now a request searches for element 97. Naturally, the RO locates Page 8 via the B+Tree. Because Page 8 is not in memory, the RO reads it from shared storage and sees the post-split state of Page 8, which no longer contains element 97. Note that this situation is completely incorrect, not a normal eventually-consistent state. Our analysis is that, although the RO is allowed to lag due to asynchronous replication, it should still see a complete historical state at its own position, that is, Page 8 still holding element 97 before the split occurred.
The key to solving this problem is to let the RO obtain a data version that is exactly what it needs: consistent, possibly lagging behind the RW. For pages cached in the RO's Buffer Pool, the active log update chasing mechanism keeps them at a correct position as the RO's own offset advances and Redo is applied. The troublesome part lies in pages that were not previously in RO memory, such as Page 8 in the figure above. To handle them, it must be ensured that when the RO accesses such a page, 1) it can obtain from shared storage a page version no newer than the RO's current LSN offset, and 2) it can obtain all Redo logs for that page after this older page version, reaching the required version by applying those logs. A feasible solution is to add the following constraints to the compute nodes:

In the Buffer Pool of the RW node, a modified page (dirty page) maintains the LSN of its earliest modification (oldest_modification_lsn) and the LSN of its latest modification (newest_modification_lsn). When the RW writes a dirty page to shared storage, it must ensure that the page's newest_modification_lsn does not exceed the log LSN currently applied by any RO node (newest_applied_lsn). The purpose is to prevent any RO node from reading a "future" data page that is too far ahead. With this guarantee, the split Page 8 in the example above will not be written to shared storage by the RW and seen by the RO.
When the RO reads a data page, it must process all Redo logs related to that page in its log parse buffer and apply these modifications to the page obtained from shared storage, ensuring the page is brought up to the required state. To guarantee the logs are sufficient, the RO needs to retain all Redo logs after the minimum oldest_modification_lsn among pages in the RW's current Buffer Pool.
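A minimal sketch of the first constraint, assuming each dirty page tracks its modification LSN range and each RO reports its applied LSN (function and tuple layout are our own illustration):

```python
# Hypothetical sketch of the RW-side flush constraint described above:
# a dirty page may be written to shared storage only if no RO would then
# see a "future" version it cannot reconcile.

def flushable_pages(dirty_pages, ro_applied_lsns):
    """dirty_pages: list of (page_id, oldest_modification_lsn,
    newest_modification_lsn).
    ro_applied_lsns: each RO node's newest_applied_lsn."""
    min_applied = min(ro_applied_lsns)  # the slowest RO gates flushing
    return [p for p in dirty_pages if p[2] <= min_applied]
```

The `min()` over RO applied LSNs is exactly what lets the slowest RO throttle RW flushing, which is the performance drawback of the dual-constraint scheme.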
However, this dual-constraint solution to the consistency problem of reading data pages from shared storage is not ideal, because it hurts both performance and flexibility. First, because RW dirty page flushing is throttled by the replication lag of the slowest RO, a large number of pages in the RW buffer may not be flushed in time, degrading buffer efficiency and hurting normal read/write performance. Second, frequently modified pages keep a high newest_modification_lsn and are therefore hard to flush; such pages tend to hold a wide [oldest_modification_lsn, newest_modification_lsn] range, which forces the RO to retain more Redo logs in its log parse buffer, increasing RO memory usage or even overflowing it.
The root cause of the above problems is that page modifications on shared storage force complete data overwrite, because multi-versioning is not common under cloud storage or POSIX protocols. CloudJump solves this problem through the Integration of MVD (Multi-Version Data), enabling read and write operations of multiple valid versions within the compute node, thereby overcoming the limitation of single version overwrite.

CloudJump integrates the MVD module between the storage engine and the storage layer within the compute node. On the Leader node (RW), Redo logs are organized during generation into a page-indexed Redo hash, shown as the Sequence Redo Hash; when dirty pages require it, entries are integrated into the Page Redo Hash, and the Flush Pool manages the write-back of dirty pages. Unlike the Buffer Pool, the Flush Pool does not keep entire page contents; it maintains only each page's incremental Redo in the Page Redo Hash. When a page undergoes dirty page flushing, conditions such as its modification volume determine whether it enters the Flush Pool for caching. On the Follower node (RO), during active log replay, as its own offset advances it maintains the latest segment as an In-memory Redo Hash, plus a Persistent Redo Hash that indexes the required Redo logs by page, so that the RO node can obtain the precise page version it needs by applying Redo.
The MVD engine provides the ability to access a page at any LSN within the range from the GC version to the latest version. This key capability supports an architecture with multiple database nodes that have different requirements for page versions, and it allows a shared storage database to gain further advantages over traditional databases, which we describe later. This log-centric design rests on two properties guaranteed by Redo logs, completeness and locality:
• Completeness: The Redo log contains all information about database modifications.
• Locality: Each Redo log involves only a single page. Therefore, the usage procedure can focus only on a single page, improving efficiency and accuracy.
A critical procedure in MVD is: when a data page is requested, all Redo logs missing from that page must be retrieved. A mechanism is therefore needed to categorize Redo logs by page, called the Log Index. Maintaining the Log Index while the database runs is not easy. The main challenges are: 1) Redo log generation itself is extremely optimized; modern databases further boost write efficiency through multi-threading, lock-free structures, and sharding. 2) Modifications to the same page are scattered widely across the Redo log files, which causes performance degradation and metadata expansion when maintaining the Log Index. 3) Redo files are generated sequentially and continuously; we cannot know which pages future modifications will touch, so maintaining a global Log Index up front is unrealistic. 4) Overall database resources are limited and precious. How, then, can Log Index generation keep pace with rapidly generated Redo logs without significantly increasing CPU and IO consumption?

To address these issues, we adopt a log index generation method based on asynchronous segmented sorting (batching). This method retains the standard Redo log write flow and uses the Redo Buffer to temporarily hold the latest Redo log segments. An asynchronous parsing thread then reads these logs, parses them, and generates log index segments (Ranges) organized by page. The log index ranges for all pages within a batch constitute the Sequence Redo Hash mentioned above. When the accumulated log indexes reach a certain amount (a batch), they are flushed to persistent storage in bulk. Balancing the memory usage of the Sequence Redo Hash, the IO overhead of writing log indexes to disk, and the degree of page aggregation within a batch, a practical batch size is, for example, 100 MB. When the log index is persisted, it is written to ib_parsedata in an append-only manner, and an in-memory header recording each page and the offset of its corresponding log range is updated; this header is periodically written out to the ib_parsemeta file. Because the In-memory Redo Hash is maintained on the RO node as the active offset advances, only Redo logs beyond this range need to be retrieved from the Persistent Redo Hash. Log index creation is therefore allowed to lag by 500 MB to 1 GB of log, which leaves significant room for IO merging during Log Index generation. Testing shows the overhead of this method is minimal: roughly 3% to 5% CPU for parsing and maintaining the in-memory Redo Hash, about 100 MB of memory for caching the current Sequence Redo Hash batch, and the IO of append-only Log Index writes.
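The batching scheme can be sketched as follows. This is a simplification under assumed in-memory layouts: the real system batches by bytes (around 100 MB) rather than record count, and ib_parsedata/ib_parsemeta are files on shared storage, not Python lists:

```python
# Sketch of asynchronous, batched log-index generation: redo index entries
# are buffered, and once a batch fills, a page-sorted segment (Range) is
# appended to a persistent store. The file names follow the article; the
# structures are our illustrative stand-ins.

class LogIndexWriter:
    def __init__(self, batch_limit):
        self.batch_limit = batch_limit  # record count here; bytes in reality
        self.batch = []                 # pending (page_id, lsn, file_offset)
        self.ib_parsedata = []          # append-only page-sorted segments
        self.ib_parsemeta = {}          # page_id -> ids of segments touching it

    def append(self, lsn, page_id, file_offset):
        self.batch.append((page_id, lsn, file_offset))
        if len(self.batch) >= self.batch_limit:
            self.flush_batch()

    def flush_batch(self):
        if not self.batch:
            return
        segment = sorted(self.batch)    # page-sorted within the batch
        seg_id = len(self.ib_parsedata)
        self.ib_parsedata.append(segment)
        # Update the header so readers can locate each page's ranges.
        for page_id, _, _ in segment:
            refs = self.ib_parsemeta.setdefault(page_id, [])
            if not refs or refs[-1] != seg_id:
                refs.append(seg_id)
        self.batch = []
```

Sorting only within a batch, rather than globally, is what keeps the writer append-only and the lag bounded; readers pay for this with a per-page lookup across segment ids in the header.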
During operation, the RO node continuously reads Redo logs from shared storage, parses them, and updates the pages already in its Buffer Pool along with various in-memory states. In the process, a page-organized in-memory log index (the In-memory Redo Hash in the figure) is generated synchronously. If a user request accesses a new page whose version in shared storage has fallen behind, the system needs not only the In-memory Redo Hash but also logs loaded via the Persistent Redo Hash, backed by the indexes maintained in the ib_parsedata and ib_parsemeta files. With Redo records arranged by page, they can be applied to the page in order to recover it to the target version. The Log Index thus shields the RO node from memory expansion problems, effectively resolving constraint 2 and, by maintaining the optimal apply LSN, indirectly resolving constraint 1.
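The on-demand page recovery primitive reduces to a short loop, sketched here with a stand-in `apply` callback in place of the engine's real redo-apply routine:

```python
# Sketch of the RO read path: a page whose on-storage version lags the
# RO's view is caught up by applying its redo records, located via the
# page-organized log index. Names are illustrative.

def recover_page(base_page, base_lsn, target_lsn, page_log_index, apply):
    """page_log_index: [(lsn, redo)] for this page, sorted by LSN.
    Applies only the records in the missing (base_lsn, target_lsn] interval."""
    page = base_page
    for lsn, redo in page_log_index:
        if base_lsn < lsn <= target_lsn:
            page = apply(page, redo)
    return page
```

The same primitive serves every consumer described later, from Write Elision to single-page crash recovery and Backtrack: only the source of `base_page` and the choice of `target_lsn` differ.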
By maintaining log indexes, including memory and persistence, the MVD engine enables the DB to possess the capability to: obtain any version of a Page online at any time by using an older Page version and subsequent Redo logs. This not only resolves the issue of primary-secondary consistency under the shared storage architecture but also allows for the evolution of more DB capabilities.
In a Write-Ahead Logging (WAL)-based database engine architecture, modifications to data pages typically generate short Redo records and cause the page to be marked as a "dirty page" in the Buffer Pool. Because the Buffer Pool capacity is limited, when free space is insufficient, strategies such as Least Recently Used (LRU) must be used to select pages to be evicted. If the evicted page is a dirty page, it must first be written to shared storage, thereby triggering a page-sized write IO operation. In scenarios where the data volume is much larger than the Buffer Pool capacity, this type of event becomes very frequent: once a page is loaded into the Buffer Pool and undergoes minor modifications, it may quickly be evicted and written to disk. This phenomenon, where a series of minor modifications triggers a large number of write IO operations, not only causes a waste of IO resources but may also become a bottleneck for database performance, which is a typical IO-Bound situation.

The Write Elision mechanism introduced by MVD provides a brand-new option: skipping the dirty page flush when a page is evicted, thereby avoiding page-sized IO operations. Subsequent access to that page retrieves the necessary Redo logs via the Log Index and applies the changes. As shown in the figure above, after a page to be evicted from the Buffer Pool is selected by policies such as LRU, the system enters the selection flow of the multi-version Write Elision policy. This flow evaluates multiple factors, including the current user load, the degree of the page's modification, and memory usage. For a page selected for Write Elision, its corresponding Redo logs are fetched from the Sequence Redo Hash via the Log Index, organized into the Page Redo Hash, and managed by the Flush Pool, so the page skips the current flushing round. Pages not selected are flushed to disk in the traditional manner. On a subsequent access to an elided page, once the read IO completes, the corresponding entries are fetched from the Page Redo Hash in the Flush Pool and the relevant Redo logs are applied to reconstruct the complete page content. Pages in the Flush Pool are eventually written to persistent storage by the regular dirty page flushing mechanism, or by the Write Elision background thread once it periodically finds their flushing conditions met; afterward, these pages are removed from the Page Redo Hash.
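A toy version of the eviction decision might look like this. The size-ratio threshold and the structures are assumptions for illustration only; the actual policy also weighs user load and memory pressure:

```python
# Illustrative Write Elision decision on eviction: pages with only small
# modifications park their incremental redo in a flush pool instead of
# triggering a page-sized write IO. Names and threshold are assumptions.

def evict(page_id, redo_records, page_size, flush_pool, storage_writes,
          elision_ratio=0.25):
    """Returns 'elided' or 'flushed'.
    redo_records: this page's pending redo payloads (strings here).
    flush_pool: page_id -> accumulated redo (the Page Redo Hash stand-in).
    storage_writes: log of page ids flushed the traditional way."""
    redo_bytes = sum(len(r) for r in redo_records)
    if redo_records and redo_bytes < elision_ratio * page_size:
        # Keep only the incremental redo; skip the page write this round.
        flush_pool.setdefault(page_id, []).extend(redo_records)
        return "elided"
    storage_writes.append(page_id)  # traditional dirty-page flush
    return "flushed"
```

Repeatedly elided pages accumulate redo in the pool, so several small writes collapse into one eventual page flush, which is the IO-aggregation premise stated below.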

The core premise of Write Elision is that aggregating multiple IO requests for the same page in the Flush Pool improves overall efficiency, while managing these pages in the Flush Pool, outside the Buffer Pool, prevents excessive cache occupation. Moreover, Write Elision gives more freedom in the timing of dirty page flushing, which also alleviates the inter-node flushing constraint 1 mentioned above. The experiment results in the figure above show that the improvement is more pronounced when the data volume is large relative to the Buffer Pool (e.g., 300 GB vs 30 GB) or when the Redo volume of a single modification is small relative to the page size (Multi-Index vs Sysbench).
Fault recovery is a key feature of a database system, reverting the database via logs to its state before a fault occurred. This capability is crucial not only for recovering from faults but also for various management operations throughout the product life cycle, especially major changes that require restarting the database. Recovery speed matters because it directly determines when users can access the database again. Taking PolarDB as an example, the UNDO phase runs asynchronously after the service restarts and does not prolong startup; the longest part of the recovery procedure is therefore scanning and applying Redo. This procedure is roughly as follows:
1) Scan the Redo Log sequentially starting from the Checkpoint position until the last complete mtr is found.
2) During the scan, all encountered Redo Logs are continuously parsed, and Redo records are maintained in an in-memory Hash Map ordered by Page.
3) When the scan ends or the Hash Map occupies too much memory, a Page Apply pass is triggered: the Redo records held in the Hash Map are replayed onto the page content to obtain the updated page versions. Mainly three factors make the duration of this procedure hard to control.
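The three steps above can be condensed into a toy routine (names and types are ours, with `apply` standing in for the engine's redo-apply logic):

```python
# Toy version of traditional WAL recovery: scan redo from the checkpoint,
# group records by page in a hash map, then apply each page's records.

def traditional_recover(redo_log, checkpoint_lsn, pages, apply):
    """redo_log: [(lsn, page_id, redo)] in LSN order.
    pages: page_id -> on-storage content (or None)."""
    hash_map = {}
    for lsn, page_id, redo in redo_log:              # 1) sequential scan
        if lsn > checkpoint_lsn:
            hash_map.setdefault(page_id, []).append(redo)  # 2) group by page
    for page_id, redos in hash_map.items():          # 3) page apply pass
        for redo in redos:
            pages[page_id] = apply(pages.get(page_id), redo)
    return pages
```

Note that service cannot start until the final apply loop finishes for every touched page; the MVD flow below exists to break exactly that dependency.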

The single-page recovery capability introduced in MVD allows the time-consuming Redo phase to be postponed until after the service is back online. By exploiting the page-oriented nature of Redo logs and the high throughput of distributed shared storage, downtime can be significantly reduced, and the remaining recovery work proceeds in the background. The improved flow consists of the following steps.
1) Start scanning from the log index generation position instead of the Checkpoint.
2) Read directly from ib_parsemeta to identify which Pages participated in the Redo logs after the checkpoint, mark them as "Register Pages," and postpone the actual recovery of these Pages to be performed asynchronously after the instance provides services.
3) The instance starts serving; IO triggered by user requests, or background batch-recovery job threads, drives the actual recovery of individual pages. A page that completes recovery is removed from the Register Pages. The table above shows how this MVD recovery policy significantly advances the time at which the instance becomes available: the process shrinks from scanning, parsing, and applying the complete Redo log to scanning only a small segment of it. In addition, MVD proposes segmented recovery to make the most of limited memory for accelerating background page application.
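The deferred, access-driven recovery can be sketched as follows (illustrative names; `apply` again stands in for redo application):

```python
# Sketch of MVD's deferred recovery: pages touched after the checkpoint
# are only *registered* at startup; each is actually recovered on first
# access (or by a background job), then removed from the register set.

class LazyRecovery:
    def __init__(self, register_pages, log_index, apply):
        self.register_pages = set(register_pages)  # read from ib_parsemeta
        self.log_index = log_index                 # page_id -> [redo, ...]
        self.apply = apply

    def read_page(self, page_id, base_page):
        if page_id in self.register_pages:
            # Replay this page's redo on demand, then unregister it.
            for redo in self.log_index.get(page_id, []):
                base_page = self.apply(base_page, redo)
            self.register_pages.discard(page_id)
        return base_page
```

Startup cost becomes proportional to reading the register set, not to replaying the whole log; the replay cost is amortized over subsequent page reads and the background job.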

As shown in the graph above, MVD's fast recovery policy significantly reduces the time during which the instance is unavailable, even in scenarios with CPU or IO bottlenecks.
In high-pressure scenarios, the number of RO (read-only) nodes is usually increased quickly to share the load and improve horizontal read scalability. Fast horizontal scaling is a major advantage of the shared storage architecture. When an RO node first joins the cluster, it needs to replay the Redo logs after the Checkpoint to obtain the updates that exist in RW memory but not yet in shared storage. In practice, an advancement of the RW Checkpoint offset can be triggered to prevent the RO from replaying too many logs and bloating its memory. Either way, the time for the RO to join the cluster grows, hurting elasticity. After introducing MVD, this problem changes fundamentally. A new RO node connecting to the RW no longer needs to wait for Buffer Pool pages to be purged; it can start replication directly from the log index position, and any missing Redo logs can be retrieved from the log index on demand. In practice, this offset can be kept very close to the latest write offset, significantly improving scale-out efficiency and system stability. The following compares the efficiency of an RO joining the cluster and serving traffic before and after introducing MVD.

During the use of a database, restoring or retrieving data is a very common business request, used to deal with situations such as operational mistakes or data faults. The speed of backup restore directly determines how quickly these usually urgent requests are satisfied. The backup-restore procedure copies a full historical backup and then applies incremental Redo on top of it to reach the instance state at a specified point in time. A shared storage database already has an obvious advantage here because it does not require copying the full historical backup. However, if the volume of Redo between the backup point and the target restore point is large, the restore time is still considerable. As mentioned for fault recovery, the pattern of repeatedly reading and writing the same page under IO bottlenecks becomes even more pronounced during instance restore, because the total Redo volume is larger. The essence of the problem is that restore proceeds in Redo-generation order. The core idea of One-Pass Restore proposed in MVD is to recover in page order instead of Redo order: as the name implies, each page undergoes exactly one read IO and one write IO during the entire procedure, while all its required Redo logs are applied.
In the One-Pass restore procedure, all the Redo content of a page is accessed through the multi-version log index. Since Redo logs of different pages are interleaved within the continuous Redo files, and the log index is only segment-sorted, extra IO amplification when accessing Redo logs or log indexes must be avoided. We therefore implemented a log merge policy comprising the following three layers:
The ib_parsedata file may contain multiple segments; each segment is internally sorted by page, but the segments are not globally contiguous. The first step of One-Pass Restore is to merge these segments to achieve global page ordering within the file.

The backup-restore policy of One-Pass Restore achieves significant improvements over traditional Redo-order restoration because: 1) each page is read and written only once, eliminating IO amplification; 2) with a global view of pages, situations such as page reuse and file deletion are detected in advance, avoiding unnecessary restoration; 3) parallelism bottlenecks are eliminated, and each stage can fully exploit the huge IO bandwidth brought by the separation of storage and compute. This comprehensive policy effectively mitigates IO amplification in IO-constrained scenarios, ensuring an efficient and simple data restoration process.
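Merging page-sorted segments into one globally ordered stream is a classic k-way merge; a minimal sketch (with our own tuple layout) using Python's `heapq.merge`:

```python
# The first One-Pass Restore step sketched: merge several page-sorted index
# segments into one globally page-ordered stream, so that each page's redo
# can be consumed, and the page restored, with one read and one write.

import heapq

def merge_segments(segments):
    """segments: lists of (page_id, lsn, redo), each sorted by
    (page_id, lsn). Returns one globally sorted list."""
    # heapq.merge streams the inputs, so memory stays proportional to the
    # number of segments rather than to the total index size.
    return list(heapq.merge(*segments))
```

Because tuples compare field by field, records for the same page from different segments also come out in LSN order, which is exactly the order they must be applied in.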
However, this policy, which restores to a new instance and waits for completion before serving, still appears slow in some scenarios. For this, the MVD engine also supports Backtrack: prioritize restoring service, defer the time-consuming page processing to the background, and accept a short period of degraded performance after recovery. With Backtrack enabled, users can issue a "Backtrack to UNIX timestamp" command from the console to recover to a specified point in time. The instance then restarts, reflects the state of the target point in time after the restart, and completes the backtracking of page states via the Log Index as real user requests arrive.
In the work of CloudJump II, we analyzed cloud-native databases as they move further from compute-storage separation to shared storage: alongside gains in scalability and security, the consistency problem between primary and secondary nodes must be faced. Unlike the Page Server approach adopted by many earlier cloud databases, CloudJump proposes a more general path: advancing the shared storage architecture on standard cloud storage services by introducing the MVD engine at the compute layer, which handles primary-secondary consistency and enhances scalability, recovery, and data persistence capabilities without a custom storage layer.
For more Content, you can refer to the paper: "CloudJump II: Optimizing Cloud Databases for Shared Storage".