An In-depth Analysis of Atomicity in PolarDB

By Youxi

1. Preface

In a database system, the query optimizer and transaction system are two of its major bedrocks. These components are so important that a large number of data structures, mechanisms, and features in the whole database architecture design are built around them. The query optimizer is responsible for speeding up queries and organizing the underlying data system more efficiently. The transaction system is responsible for storing data safely, stably, and persistently, while providing logical implementation for users' read and write concurrency.

In this blog, we explore the database transaction system and analyze atomicity in the PolarDB transaction system. Because the system is intricate, we present the topic in two parts. Part one focuses on the concept of atomicity itself, while part two focuses on the implementation of atomicity in PolarDB.

2. FAQs

Before we start, consider a few important questions. You may have asked them when you first learned about databases, and they may have been answered simply with "write-ahead log" or "crash recovery mechanism". We want to dive deep into the implementation and internal principles of these mechanisms.

How exactly is database atomicity guaranteed? What special data structures are used? Why use them?
Why can the data I write successfully be guaranteed against loss?
Why can I completely recover the data that I have submitted logically after a database crash?
What is logically submitted data? Which step is considered a real submission?

3. Background

3.1 The Position of Atomicity in ACID

Since the famous ACID (atomicity, consistency, isolation, durability) features were put forward (originally written into the SQL92 standard), they have been discussed frequently. These four features roughly summarize people's core demands on databases. Atomicity is the first feature discussed in this article. We first look at the position of atomicity within ACID.

Here is my understanding of the relationship between atomicity, consistency, isolation, and durability in a database. I think the ACID features of a database can be defined from two perspectives. The AID (atomicity, isolation, durability) features are defined from the perspective of the transaction itself, while C (consistency) is defined from the perspective of the user.

The following is my understanding of each feature:

Atomicity: A transaction either executes successfully as a whole or fails as a whole, that is, all or nothing. This feature can be defined on the smallest transaction model: suppose there is a single transaction, and a set of mechanisms lets us either really submit it or roll it back. Then our goal is achieved, and users simply submit through our system. The focus of atomicity is not on whether the transaction itself succeeds or fails, but on ensuring that the transaction system accepts only two states, success or failure, and that supporting strategies keep the physical and logical results of either outcome consistent. Because atomicity is defined on the smallest transaction unit, it is the cornerstone of the entire transaction system.
Durability: It means that once a transaction is submitted, it can be permanently stored in the database. The scope and perspective of durability are almost consistent with atomicity. This makes the two closely linked with each other in concept and implementation. Both ensure data consistency and recoverability to some extent, and the boundary is the moment when the transaction is submitted. For example, suppose the current state of data is T and transaction A tries to update the data's state to T + 1. If transaction A fails, then the database state returns to T, which is guaranteed by atomicity. If transaction A is successfully submitted, then the moment the transaction state becomes T + 1 is guaranteed by atomicity. Once the transaction state becomes T + 1 and the transaction is successfully submitted, the transaction has ended and atomicity no longer exists. This T + 1 state is guaranteed by durability. From this perspective, it can be inferred that atomicity guarantees the crash recovery of data before transaction submission, while durability guarantees crash recovery after transaction submission.
Isolation: It is also a mechanism defined at the transaction level. It provides a certain degree of isolation guarantee for transaction concurrency. The essence of isolation is to prevent transaction concurrency from causing inconsistent states. As it is not the focus of this article, it will not be talked about in detail here.
Consistency: It means that the database must remain consistent after one or more transactions. If we understand it from the perspective of the transaction, ensuring AID can ensure that transactions are serializable, recoverable, and have atomicity. But is transaction state consistency the true consistency? If AID is destroyed, C will be destroyed. But when AID is guaranteed, will C be guaranteed? If the answer is yes, then this concept loses its meaning. We can guarantee AID to ensure that transactions are consistent, but can we prove that the consistency of transactions necessarily ensures the consistency of data? Besides, the concept of data consistency is hard to accurately define through transactions, but it is easy to define at the user level. Data consistency means that users believe that the state of the data in the database at any time meets their business logic. For example, bank deposits cannot be negative as the user defines a non-negative constraint. I think this is a smart mechanism created by concept designers, who regard consistency as a higher-order goal.

This article mainly focuses on atomicity, and the discussion of crash recovery may touch on durability. Isolation and consistency are not discussed. In the visibility section, we assume that the database provides complete isolation, that is, the serializable isolation level.

3.2 The Inherent Requirements of Atomicity

We have talked a lot about the database's transaction features, so let's move on to our topic: atomicity. We use the preceding example again to explain it. Suppose the current state of the database is T, and we now want to update the data state to T + 1 through transaction A. Let's take a look at atomicity in this process.

If we want to ensure that this transaction is atomic, we can define three requirements. Only when the following three requirements are met can we say that the transaction is atomic:

1) The database has a time point when the transaction is successfully submitted.
2) Transactions (or snapshots obtained) started before this time point should only see the T state. Transactions (or snapshots obtained) started after this time point should only see the T + 1 state.
3) For crashes before this time point, the database should be able to return to the T state; and for crashes after this time point, the database should be able to return to the T + 1 state.

Note that we have not defined this time point, and we are not even sure whether the time point in requirement 2 and the one in requirement 3 are the same. What we can be sure of is that such a time point exists. Otherwise, we cannot say the transaction is atomic: atomicity requires a definite time point that separates submission from rollback. In addition, from the description above, we can infer that the time point in requirement 2 can be defined as an atomicity point. The transaction's changes are invisible to us before the atomicity point and visible only after it. Thus, for other transactions in the database, the atomicity point is the time point when the transaction is submitted. The point in requirement 3 can be regarded as a durability point because it matches how durability defines crash recovery. That is, as far as durability is concerned, the transaction has been submitted once that point is passed.

4. Discussion of the Atomicity Scheme

4.1 Start with Two Simple Schemes

First of all, let's talk about atomicity from two simple schemes. This is to explain why the data structures we will introduce in each step are essential to realizing atomicity.

Simple Direct IO

Suppose there is such a database: each user operation writes data directly to disk. We call this method Simple Direct IO, where "simple" means that we do not record any logs, only the data itself. Suppose that the initial data version is T. If the database crashes after we have inserted only part of the data, a T + 0.5 version of the data page is left on disk, and we have no way to roll back or to continue the remaining operations. This case clearly violates atomicity because the resulting state is neither submitted nor rolled back but an intermediate state. So, this is a failed attempt.

Simple Buffer IO

We have another scheme called Simple Buffer IO. Again, we have no logs, but we add a new data structure called the "shared buffer pool". Each time we write a data page, we do not write the data directly to disk but to the shared buffer pool. This has obvious advantages. First, read and write efficiency is greatly improved: we do not have to wait for the data page to be actually written to disk before doing other things, because the write can proceed asynchronously. Second, if the transaction rolls back or the database crashes before the transaction is submitted, we only need to discard the data in the shared buffer pool. Only when the transaction is successfully submitted is the data actually written to disk. It then seems that we have met the requirements in terms of visibility and crash recovery.

However, there is still a tricky problem in the above scheme: data persistence is not as simple as we think. For example, suppose there are 10 dirty pages in the shared buffer pool. We can use storage technology to ensure that the flush of a single page is atomic, but the database may crash at any time while these 10 pages are being flushed. So no matter when we decide to write the data, if the machine crashes during this process, a T + 0.5 version of the data may be left on disk. Even after restarting, we can neither redo nor roll back.
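
The failure described above can be reproduced with a few lines of Python. This is only a minimal sketch: the disk is modeled as a dictionary, and the `crash_after` parameter and the `CrashDuringFlush` exception are hypothetical names introduced for illustration.

```python
class CrashDuringFlush(Exception):
    """Simulated machine crash in the middle of a multi-page flush."""


def flush_dirty_pages(disk, dirty_pages, crash_after=None):
    """Write dirty pages to disk one by one; each single-page write is atomic.

    crash_after is a hypothetical knob: crash once that many pages are written.
    """
    for written, (page_no, content) in enumerate(sorted(dirty_pages.items())):
        if crash_after is not None and written == crash_after:
            raise CrashDuringFlush(f"crashed after {written} pages")
        disk[page_no] = content  # atomic for this one page only


# Ten pages at version T on disk, ten dirty pages at version T+1 in the pool.
disk = {n: "T" for n in range(10)}
buffer_pool = {n: "T+1" for n in range(10)}

try:
    flush_dirty_pages(disk, buffer_pool, crash_after=4)
except CrashDuringFlush:
    pass

# The disk now holds a mix of old and new pages: the "T + 0.5" state.
versions_on_disk = sorted(set(disk.values()))
print(versions_on_disk)
```

Without a log there is nothing on disk that says which pages belong to the unfinished flush, so neither redo nor rollback is possible.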

The above two examples seem to show that the database cannot ensure data consistency without relying on other structures (another popular scheme is Shadow Paging of SQLite database, which is not discussed here). Therefore, if we want to solve these problems, we need to introduce another important data structure: data log.

4.2 Pre-write Log + Buffer IO

Scheme overview

Based on Buffer IO, we introduced data logs to tackle data inconsistency.

The data cache works as before, except that we additionally record a log entry in the xlog buffer before writing the data. These xlog entries are ordered, and their sequence numbers are called LSNs (log sequence numbers). Each data page records the LSN of the latest log entry that updated it. This ensures the consistency of logs and data.

Assume that the log we introduce matches the data version exactly and that the log is persisted before the data. Then we can recover the data from this consistent log whenever the database crashes, which solves the crash problem mentioned earlier. Regardless of whether the crash occurs before or after transaction submission, we can recover the correct version of the data through log playback. Thus, the atomicity of crash recovery is realized, and we can implement visibility through multi-version snapshots. However, it is not easy to ensure that the log is consistent with the data. Let's take a look at how to ensure this and how to recover the data after a crash.

Transaction Submission and Control of Dirty Page Flush

The purpose of the Write Ahead Log (WAL) is to ensure data recoverability. To ensure consistency between the WAL and the data, before a data page in the cache is persisted to disk, the WAL records corresponding to that page must be persisted to disk first. This is the essence of controlling dirty page flushes.
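
The flush rule can be sketched as follows. This is a toy model under stated assumptions, not PolarDB code: `MiniWal`, `flush_page`, and the dictionary-based page are hypothetical names, and "flushing" is reduced to advancing a counter.

```python
class MiniWal:
    """Toy WAL: records get increasing LSNs; flushed_lsn marks what is on disk."""

    def __init__(self):
        self.records = []      # in-memory xlog buffer
        self.flushed_lsn = 0   # highest LSN already persisted

    def append(self, payload):
        self.records.append(payload)
        return len(self.records)  # LSN of the new record (1-based)

    def flush(self, up_to_lsn):
        self.flushed_lsn = max(self.flushed_lsn, up_to_lsn)


def flush_page(page, wal, disk):
    """Persist a dirty page, forcing its WAL to disk first (the WAL rule)."""
    if wal.flushed_lsn < page["lsn"]:
        wal.flush(page["lsn"])
    assert wal.flushed_lsn >= page["lsn"]
    disk[page["no"]] = dict(page)


wal = MiniWal()
page = {"no": 7, "lsn": 0, "rows": []}

# Modify the page in memory and stamp it with the LSN of its log record.
page["rows"].append("A")
page["lsn"] = wal.append(("insert", 7, "A"))

disk = {}
flush_page(page, wal, disk)
print(wal.flushed_lsn, disk[7]["lsn"])
```

The LSN stored on the page is what makes the check possible: the buffer manager only needs to compare it with the flushed position of the log.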

1) A background process in the database called checkpoint periodically performs checkpoint operations. When a checkpoint is performed, a checkpoint log is written to the xlog. This checkpoint log contains the current REDO point. The checkpoint ensures that all current dirty data have been flushed to disk.
2) During the first insert operation, the target page cannot be found in shared memory, so it is loaded into shared memory from disk. Then the data of this insert (A) is written, and an xlog record of the data write is inserted into the xlog buffer, which advances the LSN recorded for this table's page from LSN0 to LSN1.
3) At the time of transaction submission, the transaction writes a transaction submission log. Then all WAL records that this transaction has written to the WAL buffer pool are flushed to disk.
4) Then the second piece of data B is inserted. An xlog record of the data write is inserted into the xlog buffer, and the LSN recorded for this table's page advances from LSN1 to LSN2.
5) Same operation as 3).

After that, if the database runs normally, the bgwriter/checkpoint processes asynchronously flush the data pages to disk. If the database crashes, the data can be recovered in the shared buffer pool through log playback and then written to disk asynchronously. This works because the data logs and the transaction submission logs of both A and B have already been flushed to disk.
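
The sequence above can be simulated in a few lines. The sketch below is a simplification with hypothetical names (`run_until_crash`, `replay`): pages and log records are plain dictionaries, and commit records also consume an LSN here, so the inserts of A and B land on LSN 1 and LSN 3 rather than LSN1 and LSN2.

```python
def run_until_crash():
    """Insert A and B in two transactions, then crash before any page flush."""
    wal_on_disk = []   # WAL records that reached persistent storage
    wal_buffer = []    # WAL records still in memory
    page_in_memory = {"lsn": 0, "rows": []}
    lsn = 0
    for xid, row in ((1, "A"), (2, "B")):
        lsn += 1
        page_in_memory["rows"].append(row)
        page_in_memory["lsn"] = lsn
        wal_buffer.append({"lsn": lsn, "xid": xid, "kind": "insert", "row": row})
        lsn += 1
        wal_buffer.append({"lsn": lsn, "xid": xid, "kind": "commit"})
        # Submission: force every buffered record of the transaction to disk.
        wal_on_disk.extend(wal_buffer)
        wal_buffer.clear()
    # Crash: page_in_memory is lost; only the old page on disk and the WAL survive.
    page_on_disk = {"lsn": 0, "rows": []}
    return page_on_disk, wal_on_disk


def replay(page, wal_records):
    """Redo data records newer than the page and rebuild the set of submitted xids."""
    submitted = set()
    for rec in wal_records:
        if rec["kind"] == "commit":
            submitted.add(rec["xid"])
        elif rec["lsn"] > page["lsn"]:
            page["rows"].append(rec["row"])
            page["lsn"] = rec["lsn"]
    return page, submitted


page, wal = run_until_crash()
recovered_page, submitted = replay(page, wal)
print(recovered_page, submitted)
```

Although no data page was ever flushed, both rows and both submission states are rebuilt from the log alone.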

fullpage Mechanism Guarantees Recoverability

WAL-based recovery seems to be perfect, but unfortunately, the scheme just described still has a flaw. Suppose the database crashes while a bgwriter process is writing data asynchronously. Only part of a dirty page may have been written, leaving a bad page on disk. (A PolarDB data page is 8 KB. In extreme cases, a disk that writes in 4 KB units may persist only half of a page.) WAL cannot replay data onto a bad page, so another mechanism is needed to ensure that the database can find the original data in such extreme cases. This important mechanism is fullpage.

When a data page is modified for the first time after a checkpoint operation, PolarDB writes the modified data together with the entire data page to the WAL buffer and then flushes it to disk. A WAL record that contains the entire data page is called a backup block. The backup block enables WAL to replay complete data pages under any circumstances. The following steps describe a complete process.

1) Checkpoint operation.
2) During the first insert operation, the target page cannot be found in shared memory, so it is loaded into shared memory from disk. Then the data of this insert is written. Here the process differs from the previous section: the WAL record with sequence number LSN1 also carries the entire page, marked LSN0, as read from disk, and PolarDB writes this record to the WAL buffer pool.
3) The transaction is submitted, and the entire WAL record is forcibly flushed to the WAL area on disk.
4) Same operation as described in the previous section.
5) Same operation as described in the previous section.

At this point, if the database crashes and encounters a bad page when it is restarted for recovery, the correct data can be replayed step by step, starting from the original version of the page recorded in the WAL.
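
The decision of when to attach a full page can be sketched like this. It is an illustrative model only: `log_insert`, the `backup_block` key, and the comparison against `checkpoint_lsn` are assumptions made for the example, following the description above in which the page as read from disk (LSN0) is attached to the first record after the checkpoint.

```python
import copy


def log_insert(page, row, next_lsn, checkpoint_lsn):
    """Build the WAL record for an insert and apply the insert to the page.

    If the page has not been modified since the latest checkpoint, the record
    also carries a copy of the whole page as it was before this change.
    """
    record = {"lsn": next_lsn, "row": row, "backup_block": None}
    if page["lsn"] <= checkpoint_lsn:
        record["backup_block"] = copy.deepcopy(page)
    page["rows"].append(row)
    page["lsn"] = next_lsn
    return record


page = {"lsn": 0, "rows": []}   # page as loaded from disk after the checkpoint
checkpoint_lsn = 0

rec_a = log_insert(page, "A", next_lsn=1, checkpoint_lsn=checkpoint_lsn)
rec_b = log_insert(page, "B", next_lsn=2, checkpoint_lsn=checkpoint_lsn)

print(rec_a["backup_block"])   # full page attached to the first record
print(rec_b["backup_block"])   # later records carry only the change
```

Only the first record after the checkpoint pays the cost of a full page, which keeps the extra log volume bounded by the checkpoint interval.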

Crash Recovery Mechanism Based on WAL

After learning the first two sections, we can continue to talk about how the data is replayed if the database crashes. Here we demonstrate the playback of a data page that was damaged during a write.

1) When the database replays the WAL record of writing data A, it reads the TABLE A page from disk. This WAL record is a backup log because, after a checkpoint operation, the first WAL record replayed for each page is a backup log.
2) When this log is replayed, it follows a special playback rule: it always overwrites the original page with its own page and updates the LSN of the original page to the LSN of this page. (To ensure data consistency, normal playback applies a WAL record to a page only if the record's LSN is larger than the page's LSN.) In this example, thanks to the backup block, the badly written page is successfully restored.
3) PolarDB replays subsequent logs based on the normal playback rules.

After the data playback is successful, the data in the shared buffer pool can be asynchronously flushed to the disk to replace the previously damaged data.
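
The two playback rules can be written down as a small function. This is a hedged sketch: `replay_record` and the dictionary layout of pages and records are hypothetical, and the torn page is modeled as a page whose header already claims LSN 2 while its contents are garbage.

```python
import copy


def replay_record(page, record):
    """Apply one WAL record to a page following the two playback rules."""
    if record["backup_block"] is not None:
        # Special rule: overwrite unconditionally, whatever the page contains.
        page = copy.deepcopy(record["backup_block"])
    if record["lsn"] > page["lsn"]:
        # Normal rule: apply only records newer than the page.
        page["rows"].append(record["row"])
        page["lsn"] = record["lsn"]
    return page


wal = [
    {"lsn": 1, "row": "A", "backup_block": {"lsn": 0, "rows": []}},
    {"lsn": 2, "row": "B", "backup_block": None},
]

torn_page = {"lsn": 2, "rows": ["A", "<torn>"]}

page = torn_page
for record in wal:
    page = replay_record(page, record)
print(page)

# Without a backup block, the normal rule alone would skip the record
# and leave the damaged contents in place.
still_torn = replay_record({"lsn": 2, "rows": ["A", "<torn>"]}, wal[1])
print(still_torn)
```

The unconditional overwrite is what breaks the dependency on the damaged page: after it, the LSN comparison operates on known-good contents again.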

We have devoted a large part of this article to how the database achieves crash recovery through the write-ahead log. This seems to explain the meaning of the durability point. Next, we will talk about the visibility issue.

4.3 Visibility Mechanism

Our description of atomicity involves the concept of visibility. Visibility is implemented by a complex set of MVCC mechanisms in PolarDB, most of which belong to the category of isolation. A brief description of visibility will be given here, while a more detailed description will be continued in an article on isolation.

Transaction Tuple

The first thing to talk about is the transaction tuple. It is the smallest unit of data, where the data actually reside. Here we only need to focus on a few fields.

t_xmin: the ID of the transaction that generated the data.
t_xmax: the ID of the transaction that modified the data (the ID of the transaction that deleted or locked the data).
t_cid: the ordinal number of the operation on the tuple in the same transaction.
t_ctid: a pointer consisting of a block number and an offset, pointing to the latest version of the data.
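
As a rough illustration, the four fields can be modeled with a small data class. The `TupleHeader` class and the `update` helper are hypothetical and only mimic how an update stamps the old version and chains it to the new one; the transaction IDs are borrowed from the example later in this article.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TupleHeader:
    t_xmin: int                        # transaction that created this version
    t_xmax: int = 0                    # transaction that deleted/locked it; 0 = none
    t_cid: int = 0                     # ordinal of the operation inside the transaction
    t_ctid: Tuple[int, int] = (0, 0)   # pointer to the newest version (pair of ints)


def update(old, new_xmin, new_location):
    """Model an UPDATE: stamp the old version and chain it to the new one."""
    old.t_xmax = new_xmin
    old.t_ctid = new_location
    return TupleHeader(t_xmin=new_xmin, t_ctid=new_location)


tom = TupleHeader(t_xmin=1184, t_ctid=(0, 1))
susan = update(tom, new_xmin=1187, new_location=(0, 2))

print((tom.t_xmin, tom.t_xmax), tom.t_ctid)
print((susan.t_xmin, susan.t_xmax), susan.t_ctid)
```

The pair (t_xmin, t_xmax) is what the visibility rules inspect, while t_ctid lets a reader follow the chain from an old version to the current one.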

Snapshot

The second thing to talk about is the snapshot. A snapshot records the state of the transactions in a database at a certain time point.

All we need to know about snapshot is that it can be used to obtain the state of all possible transactions in the database at a certain time point from procArray.

Current Transaction State

The third point is the current transaction state. It refers to the mechanism in the database that determines the running state of the transaction. In a concurrent environment, it is very important to determine the transaction state we see.

When the transaction state of a tuple is checked, three data structures may be involved: t_infomask, procArray, and clog.

infomask: flag bits in the tuple header that cache the running state of the tuple's xmin and xmax transactions. It can be regarded as an asynchronous cache layer over clog that accelerates the lookup of transaction state. It is set asynchronously: the tuples touched by a transaction are not all updated immediately when the transaction is submitted. Instead, the state is set the first time the tuple is examined under a snapshot that is new enough to see the update.
procArray snapshot: the transaction state in the snapshot. The acquisition of the snapshot is actually the acquisition of the state of all transactions in the database at this moment in procArray. Once the snapshot is acquired, its state is constant unless it is acquired again (whether the acquisition content in the same transaction changes depends on the transaction isolation level).
clog: the actual state of the transaction. It is divided into two parts: clog buffer and clog file. clog buffer records the state of all transactions in real time.

In a visibility judgment process, the access order is infomask -> snapshot -> clog, and the decisive order is snapshot-> clog -> infomask.

infomask is the most easily obtained information because it is recorded in the tuple header. Under some conditions, visibility can be determined from infomask alone, without consulting the other data structures. The snapshot has the highest decision-making power and finally determines whether the xmin and xmax transactions are still running. clog assists in the visibility judgment and is used to set the value of infomask. For example, if the xmin transaction is judged to have been submitted according to both the snapshot and clog, t_infomask is set to submitted. However, if the transaction is finished according to the snapshot but not submitted according to clog, the system concludes that a crash or rollback has occurred and marks the transaction as invalid in infomask.
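
The interplay of the three structures can be sketched for the xmin side only. This is a simplified model, not the real visibility routine: the flag values `XMIN_COMMITTED` and `XMIN_INVALID`, the function `xmin_is_visible`, and the representation of the snapshot as a set of running transaction IDs are all assumptions made for illustration, and the xmax side and transaction ID ranges are ignored.

```python
XMIN_COMMITTED = 0x1   # hypothetical hint bits, not PolarDB's real values
XMIN_INVALID = 0x2


def xmin_is_visible(tup, snapshot_running, clog):
    """Decide whether the inserting transaction of `tup` counts as submitted.

    tup: dict with "xmin" and "infomask"
    snapshot_running: xids still in progress when the snapshot was taken
    clog: dict mapping xid -> "committed" or "aborted" (missing = nothing recorded)
    """
    if tup["infomask"] & XMIN_INVALID:
        return False                        # cached: inserter aborted or crashed
    if tup["xmin"] in snapshot_running:
        return False                        # the snapshot has the final say
    if tup["infomask"] & XMIN_COMMITTED:
        return True                         # cached answer, clog not consulted
    if clog.get(tup["xmin"]) == "committed":
        tup["infomask"] |= XMIN_COMMITTED   # cache the clog answer in the tuple
        return True
    tup["infomask"] |= XMIN_INVALID         # finished but never submitted
    return False


t = {"xmin": 1184, "infomask": 0}
print(xmin_is_visible(t, {1184}, {}))                    # still running: invisible
print(xmin_is_visible(t, set(), {1184: "committed"}))    # submitted: visible
print(t["infomask"] == XMIN_COMMITTED)                   # hint bit now cached

lost = {"xmin": 1190, "infomask": 0}
print(xmin_is_visible(lost, set(), {}))                  # crashed or rolled back
```

Note that the snapshot check comes before the cached hint is trusted: a transaction that is submitted in clog can still be invisible to an older snapshot.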

Transaction Snapshot Visibility

After introducing tuples and snapshots, we can move on to the topic of snapshot visibility. The visibility in PolarDB has a complex definition system. It is defined by many combinations of information. But the most direct ones are snapshots and tuple headers. The following illustrates the visibility of tuple headers and snapshots through an example of data insertion and update.

Isolation is not discussed in this article, and we assume that the isolation level is serializable.

Snapshot1 moment: Transactions 1184 and 1187 have not started, and there is no record in the tuple. The student table is empty. The data that can be obtained through Snapshot1 is empty. We record this version as T.
Snapshot1-Snapshot2 moment: A snapshot taken at this moment is still equivalent to Snapshot1, so the data version we see should still be T.
Snapshot2 moment: At this time, transaction 1184 has ended and 1187 has not started. Therefore, the modification made by 1184 is visible to the user, while that of 1187 is still invisible. Specifically, the tuple with header (1184, 0) is visible, so what we see is the data version containing Tom. We record this version as T + 1.
Snapshot2-Snapshot3 moment: A snapshot taken at this moment is still equivalent to Snapshot2, so the data version we see should still be T + 1.
Snapshot3 moment: At this moment, transactions 1184 and 1187 have both ended and both are visible. As a result, the tuples with headers (1184, 1187) and (1187, 1187) are not visible, while (1187, 0) (namely Susan) is visible. We record this version as T + 2.
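
The three snapshot moments can be replayed against the final physical tuples with a short sketch. The representation is an assumption made for illustration: a snapshot is reduced to the set of transactions it considers submitted, `visible_rows` is a hypothetical helper, and the name of the intermediate row is invented because the original example does not give it.

```python
# (row, t_xmin, t_xmax) for every tuple version left in the student table.
TUPLES = [
    ("Tom", 1184, 1187),
    ("(intermediate)", 1187, 1187),   # hypothetical name for the middle version
    ("Susan", 1187, 0),
]


def visible_rows(submitted_in_snapshot):
    """Return the rows a snapshot can see, given the xids it treats as submitted."""
    rows = []
    for name, xmin, xmax in TUPLES:
        created = xmin in submitted_in_snapshot
        deleted = xmax != 0 and xmax in submitted_in_snapshot
        if created and not deleted:
            rows.append(name)
    return rows


snapshot1 = set()             # neither transaction has ended
snapshot2 = {1184}            # 1184 ended, 1187 not yet
snapshot3 = {1184, 1187}      # both ended

print(visible_rows(snapshot1))   # version T
print(visible_rows(snapshot2))   # version T + 1
print(visible_rows(snapshot3))   # version T + 2
```

The same physical tuples yield three different logical versions, which is exactly why the snapshot, not the tuple contents, decides what a transaction sees.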

From the above analysis, we can draw a simple conclusion: the visibility of data depends on when the snapshot is taken. The different visible versions mentioned in the discussion of atomicity are simply the results of different snapshots. The snapshot determines whether an executing transaction counts as submitted. This kind of submission has nothing to do with the submission flag in the tuple header or even with the submission record in clog. We can use this method to keep the snapshot we get consistent with transaction submission.

Visibility in Transaction Atomicity

We have briefly talked about the visibility of PolarDB snapshots already, and here we will briefly introduce the specific implementation issues when transactions are submitted.

The core idea of our visibility mechanism design is: transactions should only see the version of the data they should see. How should we define "should see"? Here is a simple example: if a tuple's xmin transaction is not submitted, it is very likely to be invisible to other transactions. If the xmin transaction of a tuple has been submitted, it is likely to be visible to other transactions. How to know whether this xmin has been submitted or not? As mentioned above, we decide through snapshots, so the key mechanism of transaction submission is the update mechanism of the new snapshot.

When a transaction is submitted, visibility involves two important data structures: clog buffer and procArray. The relationship between the two has been explained above. They play a role in judging the visibility of transactions. Of course, procArray plays a decisive role. This is because the acquisition of snapshots is a process of traversing ProcArray.

Actually, in the third step, the transaction submission information is written to the clog buffer. At this time, the transaction is marked as submitted in clog, but it is not yet actually submitted. Then the transaction marks itself in ProcArray as submitted. Only in this step does the transaction complete the actual submission: snapshots obtained after this time point see the new data version.

5. Implementation of Atomicity in PolarDB

After learning the crash recovery and visibility theory of PolarDB, we know that PolarDB uses a write-ahead log plus Buffer IO scheme to ensure consistent crash recovery and consistent visibility of transactions, thus achieving atomicity. Now let's move on to the most important aspects of transaction submission to find out what exactly the atomicity point we mentioned earlier refers to.

5.1 Consistent Transaction Crash Recovery: Durability Point

Simply put, there are four operations in the transaction submission that are most important to transaction atomicity. In this section, we talk about the first two operations.

1) Write the commit log of the transaction (the Commit WAL record).
2) Forcibly flush all WAL records of this transaction and persist them to storage.

Taking the completion of this xlog (WAL) flush as the reference point, let's consider two scenarios:

If the transaction crashes or rolls back before this point, the Commit log definitely is not flushed regardless of whether the data log is flushed or not. Due to the sequential nature of WAL, the Commit log definitely is the last to be persisted to disk. At this point, if we replay the data, we will find that transactions that lack Commit logs cannot be marked as submitted. Data of this state is invisible according to visibility. These data will later be regarded as dirty data and cleaned up. So we can conclude that if it crashes before this point, the transaction is not submitted. The database is actually restored to state T.
If it crashes or rolls back after this point, no matter at which step it crashes or rolls back, we can be sure that the Commit log is flushed to disk. Once the Commit log is flushed to the disk, the data written by this transaction can certainly be played back and marked as submitted. Then this data is visible. This transaction has been submitted and the database has been restored to T + 1.

This seems to show that point 2 is the critical point of crash recovery: it decides whether database crash recovery returns to the T or the T + 1 state. So what should we call this point? Let's recall the concept of durability: once a transaction is submitted, its modifications are permanently retained in the database. The two are consistent in essence, so we call point 2 the durability point.

Another point about xlog disk flushing is that flushing and playback are atomic at the level of a single WAL record. The CRC in the WAL record header provides a validity check for that record. If a WAL record is damaged during the disk write, its contents are treated as invalid. This ensures that data is never partially replayed.
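
The idea of a checksum guarding a log record can be sketched with Python's standard `zlib.crc32`. This only illustrates the principle: the record layout (a 4-byte CRC followed by the payload) and the helper names are assumptions, and the CRC variant used here is not necessarily the one PolarDB uses.

```python
import zlib


def encode_record(payload: bytes) -> bytes:
    """Prefix the payload with its CRC, as a stand-in for a record header."""
    crc = zlib.crc32(payload)
    return crc.to_bytes(4, "big") + payload


def is_valid(record: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the stored one."""
    stored = int.from_bytes(record[:4], "big")
    return zlib.crc32(record[4:]) == stored


good = encode_record(b"commit xid=1184")

# Simulate damage during the disk write by flipping one bit of the last byte.
damaged = good[:-1] + bytes([good[-1] ^ 0x01])

print(is_valid(good))
print(is_valid(damaged))
```

During playback, a record that fails this check is treated as if it had never been written, so a half-written Commit record cannot make a transaction appear submitted.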

5.2 Consistent Visibility of Transactions: Atomicity Point

Let's continue to look at operations 3 and 4:

3) Write this transaction submission to the clog buffer.
4) Write the results of this transaction submission to ProcArray.

Operation 3 records the current state of the transaction in the clog buffer, which can be regarded as a layer of log cache. Operation 4 writes the submission to ProcArray, which is a very important step. As we know, a snapshot determines transaction state through ProcArray, so this step determines the transaction state that other transactions see.

If the transaction crashes or rolls back before operation 4, the data version seen by all other transactions in the database is T, which means the transaction has not actually been submitted. This follows from the decision chain: visibility -> snapshot -> ProcArray.

After operation 4, the transaction is submitted from the viewpoint of all observers because every snapshot obtained after this time point sees data version T + 1.

From this perspective, operation 4 fully fits the meaning of an atomic operation because its execution decides whether the transaction is successfully submitted. The transaction can always be rolled back before operation 4 because no other transaction has seen its T + 1 state. After operation 4, however, the transaction must not be rolled back. Otherwise, other transactions that have already read version T + 1 would see inconsistent data. The concept of atomicity is that a transaction is either successfully submitted or rolled back when submission fails. Since rollback is not allowed after operation 4, operation 4 can be seen as the sign of successful transaction submission.

In summary, we can define operation 4 as the atomicity point of the transaction.

5.3 Durability Point and Atomicity Point

Requirements for Atomicity and Durability

Again, let's look at the concept of atomicity and durability:

Atomicity: A transaction either succeeds or fails.
Durability: Once a transaction is successfully executed, it can be permanently stored in the database.

We mark operation 4 as the atomicity point because, at the moment of operation 4, all observers objectively consider the transaction submitted: the version seen by snapshots advances from T to T + 1, and the transaction can no longer be rolled back. Does atomicity then stop applying once the transaction is submitted? I think so. Atomicity only guarantees data consistency up to the moment the transaction is successfully submitted, and it makes no sense to talk about atomicity after the transaction has ended. Therefore, atomicity ensures the visibility and recoverability of transactions before the atomicity point.

We mark point 2 as the durability point because, according to durability, a transaction is retained permanently once it succeeds. Based on the analysis above, that moment is point 2. Therefore, durability must be guaranteed from point 2 onward.

How to Understand the Two Points

After explaining points 2 and 4, we can finally define the two most important concepts involved in transaction submission and answer one of the opening questions: at what moment is the transaction actually submitted? The answer has two parts. After the durability point, the transaction can be completely restored. After the atomicity point, the transaction is regarded as actually submitted by other transactions. But the two are not independent. How can we understand this?

I think this is a compromise of atomicity. As long as the order of the two points can make the data in different states consistent, then it meets our definition of atomicity.

If crash or rollback occurs before the durability point, the transaction fails, and the data version before the crash or after recovery is T.
If a crash occurs between the durability point and the atomicity point, the visible version of the data before the crash is T, which means that all transactions in the database see version T. During crash recovery, however, the data is replayed to T + 1. So after restarting the database, we find that snapshots obtained before the crash saw data version T, while snapshots obtained after the crash see T + 1. It seems that the transaction was implicitly submitted. However, this does not violate data consistency.
If the crash occurs after the atomicity point, the transaction has been submitted, and other transactions see data version T + 1 both before and after the crash.
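
The three cases can be checked with a small simulation of the four submission operations. This is a sketch under simplifying assumptions: `commit_with_crash` is a hypothetical function, the in-memory structures are plain Python containers, and the data WAL of transaction A is assumed to reach disk together with its commit record.

```python
def commit_with_crash(crash_before):
    """Run submission operations 1-4 of transaction A, crashing before operation
    `crash_before` (use 5 for "no crash during submission").

    Returns (version other transactions saw just before the crash,
             version the database holds after crash recovery).
    """
    wal_buffer, wal_on_disk = [], []
    clog_buffer = set()
    running_in_procarray = {"A"}

    for operation in (1, 2, 3, 4):
        if operation == crash_before:
            break
        if operation == 1:
            wal_buffer.append("commit A")
        elif operation == 2:                  # durability point
            wal_on_disk.extend(wal_buffer)
            wal_buffer.clear()
        elif operation == 3:
            clog_buffer.add("A")
        else:                                 # atomicity point
            running_in_procarray.discard("A")

    seen_before_crash = "T" if "A" in running_in_procarray else "T+1"
    # Recovery only has the WAL on disk; all memory structures are gone.
    after_recovery = "T+1" if "commit A" in wal_on_disk else "T"
    return seen_before_crash, after_recovery


for crash_before in (1, 2, 3, 4, 5):
    print(crash_before, commit_with_crash(crash_before))
```

In every case the version after recovery is the same as or newer than the version observed before the crash, which is the property the ordering of the two points is meant to preserve.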

Finally, let's consider why the two points are not merged. The operation at the durability point is the disk flush of the WAL, which involves disk I/O. What the atomicity point does is write to ProcArray, which requires a heavily contended global lock on ProcArray. We can consider it a high-frequency shared memory write. Both affect the efficiency of database transactions. If the two were bound together into one atomic operation, the combined wait would be very long and could greatly reduce transaction throughput. From this perspective, the two are separated for efficiency.

Can the order of the two be reversed?

Obviously not. As the diagram above shows, a reversed order would leave a segment in the middle of this time period that does not meet the requirements of atomicity or durability.

Specifically, suppose the atomicity point came before the durability point, and consider what happens when a crash occurs between the two. Other transactions would see version T + 1 of the data before the crash and version T after the crash. This behavior of "seeing future data" is obviously not allowed.
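
The difference between the two orders can be checked with a few lines of code. This is a minimal sketch, assuming version T is 0 and that recovery yields T + 1 exactly when the commit record reached the disk. The function names are hypothetical.

```python
def crash_between_points(first_step):
    """Crash after only the first of the two points has been passed.

    first_step is "durability" (the real order) or "atomicity"
    (the hypothetical reversed order).
    """
    T = 0
    commit_record_on_disk = False
    visible = T
    if first_step == "durability":
        commit_record_on_disk = True   # xlog flushed, not yet visible
    elif first_step == "atomicity":
        visible = T + 1                # visible, but nothing on disk yet
    else:
        raise ValueError(first_step)
    seen_before_crash = visible
    seen_after_recovery = T + 1 if commit_record_on_disk else T
    return seen_before_crash, seen_after_recovery


def is_consistent(seen_before_crash, seen_after_recovery):
    # A version that readers have already seen must never disappear.
    return seen_after_recovery >= seen_before_crash


print(crash_between_points("durability"))  # real order
print(crash_between_points("atomicity"))   # reversed order
```

With the real order the version only moves forward across the crash. With the reversed order a reader could observe T + 1 and then find T after the restart.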

How Can We Define A Real Submission?

The real submission is atomicity point submission.

The real sign of submission is the moment when the data version is upgraded from T to T + 1. This moment is the atomicity point. Before this point, the data version seen by other transactions is T, so the transaction is not really submitted. After this point, the transaction can no longer be rolled back. This is enough to show that the atomicity point is the real submission point of the transaction.

Other Operations

Finally, let's focus on operations 1 and 3:

Operation 1 writes the WAL commit log to the xlog buffer. It is not critical to transaction submission, because a log that is written to the buffer but not flushed to disk is still useless after a crash.
Operation 3 marks the transaction as submitted in the clog buffer. It is not critical either. If the database runs normally, it does not affect the visibility of the transaction snapshot. If the database crashes, the clog state can be played back from the Commit or Abort log in xlog, regardless of whether the clog itself was persisted.
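
To illustrate why the clog state does not need to be persisted at submission time, the sketch below rebuilds a toy clog purely from commit and abort records. It assumes a simplified xlog made of (record type, transaction ID) pairs. The record names and the rebuild_clog helper are hypothetical, not PolarDB's actual structures.

```python
IN_PROGRESS, COMMITTED, ABORTED = "in_progress", "committed", "aborted"


def rebuild_clog(xlog_on_disk):
    """Replay the flushed xlog and return the transaction status map."""
    clog = {}
    for record_type, xid in xlog_on_disk:
        if record_type == "commit":
            clog[xid] = COMMITTED
        elif record_type == "abort":
            clog[xid] = ABORTED
        else:
            # Data records only tell us that the transaction existed.
            clog.setdefault(xid, IN_PROGRESS)
    return clog


xlog = [
    ("insert", 100), ("commit", 100),  # commit record was flushed
    ("insert", 101), ("abort", 101),   # abort record was flushed
    ("insert", 102),                   # crashed before any commit record
]
print(rebuild_clog(xlog))
```

Transaction 102 has no commit record on disk, so replay never marks it as submitted, which matches the behavior described for a crash before the durability point.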

6. Atomicity Processes of PolarDB

6.1 Transaction Submission

In this section, let's go back to the transaction submission functions and take a look at the positions of these operations in the function call stack.

A transaction submission process involves transactions with the transaction ID. Transactions without ID do not engage in this process. A transaction without transaction ID is very likely a read-only operation, which does not affect data consistency in the database.
Strict mode will be enabled before submitting xlog. In this mode, any error will be fatal. The database will crash and restart once an error occurs.
In synchronous mode, the xlog flush and the CLOG write are performed synchronously. In asynchronous mode, the xlog flush is not guaranteed to have completed, so data may be lost if a crash occurs.
There is a key operation between operations 3 and 4: the replication wait. At this time, the xlog has already been flushed locally, but the transaction is not actually submitted yet. In synchronous mode, the primary database waits for the standby database to finish flushing that xlog to its disk before proceeding with the next step.
In the write ProcArray operation, the transaction is actually submitted and can no longer be rolled back.
In the resource cleanup stage, the remaining work no longer affects the outcome of the transaction.
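
The order of these steps can be summarized in a short sketch. It only lists the steps described above as strings and does not reproduce the real function call stack. The function name, its parameters, and the step labels are assumptions made for illustration.

```python
def commit_steps(xid, synchronous_commit=True, has_sync_standby=False):
    """Return the ordered steps of a transaction submission in this toy model."""
    if xid is None:
        # No transaction ID (very likely read-only): nothing to submit.
        return []
    steps = [
        "enter strict mode",
        "1: write commit record to xlog buffer",
    ]
    if synchronous_commit:
        steps.append("2: flush xlog to disk (durability point)")
    else:
        steps.append("2: flush not waited for (asynchronous mode)")
    steps.append("3: mark committed in clog buffer")
    if synchronous_commit and has_sync_standby:
        steps.append("replication wait: standby flushes the xlog")
    steps.append("4: write ProcArray (atomicity point)")
    steps.append("resource cleanup")
    return steps


for step in commit_steps(42, has_sync_standby=True):
    print(step)
```

Note that in asynchronous mode the sketch reaches the atomicity point without passing a durability point first, which is why a crash in that mode may lose data.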

6.2 Transaction Rollback

If a transaction without a transaction ID is rolled back, the rollback process is skipped directly.
Before the rollback, the database first determines whether the transaction has already been submitted. This determination is made based on the CLOG. How can a transaction be rolled back when it is already submitted? This is the state between operations 3 and 4 that we mentioned above. If the CLOG has recorded the submission, the rollback cannot proceed: the database fails, crashes, and restarts.
The rollback also writes a rollback log to the xlog, but this log is flushed to disk asynchronously. Even if the rollback log never reaches the disk, the transaction's data remains invisible.
When the rollback is written to ProcArray, the transaction is actually rolled back (in fact, this state change does not affect other transactions, because the data version they obtain is T both before and after it).
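
The rollback path can be sketched in the same style. Again, this is only an illustration of the steps listed above. The DatabasePanic exception stands in for the crash and restart, and all names are hypothetical.

```python
class DatabasePanic(Exception):
    """Stands in for the crash-and-restart described in the text."""


def rollback_steps(xid, clog):
    """Return the ordered steps of a rollback in this toy model."""
    if xid is None:
        # No transaction ID: the rollback is skipped directly.
        return []
    if clog.get(xid) == "committed":
        # State between operations 3 and 4: too late to roll back.
        raise DatabasePanic(f"transaction {xid} is already marked committed in clog")
    return [
        "write rollback record to xlog buffer (flushed asynchronously)",
        "write ProcArray (rollback takes effect, others still see T)",
    ]


print(rollback_steps(7, clog={}))
try:
    rollback_steps(8, clog={8: "committed"})
except DatabasePanic as exc:
    print("panic:", exc)
```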

7. Summary and Outlook

Finally, let's summarize this article series, which focused on the question of "how to achieve transaction atomicity". We discussed the database crash recovery feature and transaction visibility to explain the underlying principles by which PolarDB achieves atomicity. While introducing the principle of pre-write log + buffer I/O, we also covered the shared buffer, WAL log, clog, and ProcArray. These data structures are all important to atomicity. Around the transaction, the various modules of the database are cleverly integrated, making full use of computer resources such as disk, cache, and I/O to form a complete database system.

We might think of other models in computer science, such as the OSI network model. In it, the TCP protocol in the transport layer provides reliable communication over an unreliable channel. Database transactions implement a similar idea: storing data reliably on top of an unreliable operating system (which may crash at any time) and unreliable disk storage (which cannot write a large amount of data atomically). This simple but important idea can be said to be the cornerstone of the database system. It is so important that most of the core data structures in the entire database are related to it. Perhaps, as databases develop, future technology will produce more advanced database architectures. But we must not forget that atomicity and durability should still be at the core of database design.

8. Final Thoughts

Alright, that concludes our discussion of atomicity. Finally, for our fellow database enthusiasts, we leave you with several questions related to the points mentioned in this article.

How do you understand the atomicity point and durability point of transaction submission?
What is the relationship between the atomicity of a single transaction and the atomicity of multiple transactions? Are crash recovery and visibility integrated?
Asynchronous submission exists in PolarDB, that is, transaction submission does not wait for the xlog to be flushed to disk. Which features of transactions may be violated in this mode? Does it go against atomicity and durability?

References:

https://www.interdb.jp/pg/