Point-in-Time Recovery for PolarDB-X Operator: Leveraging Two Heartbeat Transactions

This article explains how the PolarDB-X Operator achieves globally consistent point-in-time recovery (PITR) for XA and TSO transactions and introduces a recovery method based on two heartbeat transactions.

By Busu

1. Introduction

Database recovery methods include backup set recovery and point-in-time recovery (PITR). Backup set recovery restores directly from a saved data backup set, so it can only return the database to the fixed state captured when the backup was taken. PITR uses both the data backup and the log backup of the database. It first restores the database from the data backup, which records a log point. It then downloads the logs from that log point up to the specified time and plays them back, so the database can be restored to the specified time.

In a standalone MySQL database, if data backup sets and binlogs exist, you can use the XtraBackup tool to restore the backup sets and then use the mysqlbinlog tool to play back the binlogs. As a distributed database, PolarDB-X stores data through components such as the Global Metadata Service (GMS) and the Data Node (DN). Compute Nodes (CN) are stateless. When PolarDB-X performs recovery at a specified point in time, the general idea is to restore both GMS and DN to a certain point in time. However, in a distributed database, there are distributed transactions, and branch transactions will fall on different data nodes. If each data node is restored independently, the atomicity of distributed transactions cannot be guaranteed.

The previous article Safeguarding Your Data with PolarDB-X: Backup and Restoration explains how to perform globally consistent PITR when the transaction policy is a TSO transaction. This article describes how the PolarDB-X Operator achieves globally consistent PITR when the transaction policy is an XA or a TSO transaction and proposes a recovery scheme based on two heartbeat transactions.

2. Scheme Comparison

The previous two articles, Safeguarding Your Data with PolarDB-X: Backup and Restoration and Interpretation of Global Binlog and Backup and Restoration Capabilities of PolarDB-X 2.0, describe two existing schemes for PITR in PolarDB-X, referred to as PITR based on TSO and PITR based on global binlogs. The following table compares them with the scheme based on two heartbeats proposed in this article:

Scheme Name	Whether the business side must use TSO transactions	Full backup set	Log recovery volume	Whether it relies on the stability of CDC components	Playback efficiency	Whether it requires a heartbeat transaction
PITR based on TSO	Yes. TSO transactions are based on XA transactions and add a global timestamp to achieve the repeatable read isolation level. Their performance is lower than that of XA transactions.	Data node physical backup	Small. Full backup set end point + incremental logs	No	High. You can use SQL threads on the data node to play back binlogs. Each data node is played back concurrently	No
PITR based on global binlogs	No	Data node physical backup or PolarDB-X logical backup	Large. Full backup set start point + incremental logs	Yes	Low. You need a third-party tool to convert the global binlog into SQL and execute it on the compute node.	Yes
PITR based on two heartbeats	No	Data node physical backup	Small. Full backup set end point + incremental logs	No	High. You can use SQL threads on the data node to play back binlogs. Each data node is played back concurrently	Yes

The PITR scheme based on global binlogs is significantly different from the other two schemes. It is less dependent on the internal implementation of PolarDB-X but consumes a large amount of computing resources. In most cases, it can be used to synchronize data from PolarDB-X to another system.

The PITR scheme based on two heartbeats introduces continuous heartbeat transactions to remove the limitation of the TSO-based scheme, which requires the business side to use TSO transactions.

3. Scheme Interpretation

We propose a scheme based on two continuous heartbeat transactions to achieve globally consistent binlog point pruning, which only requires the PolarDB-X transaction policy to be an XA transaction or a TSO transaction.

This scheme requires a broadcast table. For example, the table creation statements are:

CREATE TABLE `__heartbeat__` (
        `id` bigint(20) NOT NULL AUTO_INCREMENT BY GROUP,
        `sname` varchar(10) DEFAULT NULL,
        `gmt_modified` datetime(3) DEFAULT NULL,
        PRIMARY KEY (`id`)
) ENGINE = InnoDB AUTO_INCREMENT = 2 DEFAULT CHARSET = utf8mb4  broadcast;

By updating a record in this table at regular intervals (the period in the PolarDB-X Operator is 1 second, which ensures that the error of the recovery point in time does not exceed 1 second), distributed heartbeat transactions can be generated, for example:

set drds_transaction_policy='TSO';
replace into binlogcut.`__heartbeat__`(id, sname, gmt_modified) values(1, 'binlogcut', now());
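
The periodic heartbeat can be sketched as a small single-threaded loop. This is only an illustration, not the Operator's actual implementation: `run_heartbeat`, `execute`, `beats`, and `sleep` are hypothetical names, and `execute` stands for whatever function sends one SQL statement to a PolarDB-X compute node.

```python
import time
from typing import Callable

# Statement from the article; `binlogcut` is the schema used in its example.
HEARTBEAT_SQL = (
    "replace into binlogcut.`__heartbeat__`(id, sname, gmt_modified) "
    "values(1, 'binlogcut', now())"
)


def run_heartbeat(execute: Callable[[str], None],
                  beats: int,
                  interval_s: float = 1.0,
                  sleep: Callable[[float], None] = time.sleep) -> None:
    """Issue `beats` heartbeat transactions, one every `interval_s` seconds.

    `execute` is a hypothetical callable that runs one SQL statement on a
    single connection, so all heartbeats come from one thread and always
    touch the same row (id = 1) of the broadcast table.
    """
    execute("set drds_transaction_policy='TSO'")
    for i in range(beats):
        execute(HEARTBEAT_SQL)
        if i + 1 < beats:
            sleep(interval_s)  # 1 second bounds the recovery-time error
```

With a real MySQL-protocol client, `execute` would typically be the `execute` method of a cursor opened on the compute node; passing a list's `append` method instead is enough to inspect the statements without a database.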

Next, we explain how to use two heartbeat distributed transactions to determine the consistency point. Let T_m be a distributed transaction, with its Prepare Event on the binlog denoted as P_m and its Commit Event as C_m, and let T_n^h and T_{n+1}^h be two consecutive heartbeat transactions.

We can obtain the following known information:

Events in a single binlog are written in chronological order. For example, if the file offset of Event 1 is smaller than that of Event 2, Event 1 occurred earlier than Event 2.
According to the XA protocol, the Commit can be executed only after all branch transactions have been prepared. Therefore, in the binlog of any data node, the Prepare Event of a transaction is earlier than its Commit Event.
Heartbeat transactions are written by a single thread to the same record in the broadcast table. Because of row locks, consecutive heartbeat transactions also appear consecutively in the binlog of each data node: between the Prepare Event and the Commit Event of one heartbeat transaction, no Prepare Event or Commit Event of another heartbeat transaction is successfully executed.

From the first two facts above, if transactions T₁ and T₂ produce the event flow C₁P₂ in some binlog, the Commit time of T₁ is earlier than the Prepare time of T₂. Therefore: Prepare time of transaction T₁ < Commit time of transaction T₁ < Prepare time of transaction T₂ < Commit time of transaction T₂, that is:

From the macro perspective: Prepare time of transaction T₁ < Commit time of transaction T₂
From the micro perspective: P₁ precedes C₂ on the binlog stream on each data node
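
The inference above can be checked on a toy example. The sketch below models a binlog as a list of `(kind, txn_id)` pairs in file-offset order; this representation and the names `position` and `committed_before_prepare` are assumptions made for illustration, not PolarDB-X data structures.

```python
from typing import List, Tuple

Event = Tuple[str, str]  # (kind, txn_id); kind is "P" (Prepare) or "C" (Commit)


def position(binlog: List[Event], kind: str, txn: str) -> int:
    """Return the index of the given event in the binlog, or -1 if absent."""
    for i, event in enumerate(binlog):
        if event == (kind, txn):
            return i
    return -1


def committed_before_prepare(binlog: List[Event], t1: str, t2: str) -> bool:
    """True if the binlog contains the event flow C(t1) ... P(t2)."""
    c1 = position(binlog, "C", t1)
    p2 = position(binlog, "P", t2)
    return c1 != -1 and p2 != -1 and c1 < p2


# DN1 shows the flow C1 P2, so T1 committed before T2 was prepared.
dn1 = [("P", "T1"), ("C", "T1"), ("P", "T2"), ("C", "T2")]
# DN2 interleaves the two transactions, yet P1 still precedes C2 there.
dn2 = [("P", "T1"), ("P", "T2"), ("C", "T1"), ("C", "T2")]

if committed_before_prepare(dn1, "T1", "T2"):
    for binlog in (dn1, dn2):
        assert position(binlog, "P", "T1") < position(binlog, "C", "T2")
```

Observing C₁P₂ on a single data node is enough to conclude that P₁ precedes C₂ on every data node, even on one like `dn2` where C₁P₂ itself does not appear.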

First, try pruning the binlog of each data node at the end position of the Prepare Event of the heartbeat transaction T₁^h:

After pruning according to the above figure, we can analyze where the P_n and C_n events of a distributed transaction T_n fall on the different data nodes:

Case 1

T_n completes the commit before the heartbeat transaction T₁^h is initiated, and both P_n and C_n fall before P₁^h. Obviously, the current pruning line will not destroy the transaction characteristics of T_n.

Case 2

T_n has not decided to commit before the heartbeat transaction is initiated and has not issued a commit. If this is the case, we only need to roll back the hanging transaction after recovery.

Case 3

T_n has decided to commit before the heartbeat transaction is initiated and has issued a commit, but has not yet completed the commit of all branches. In the binlog of a data node, there is an event flow of... P_n... C_n... P₁^h... C₁^h... According to the above inference, we can conclude that P_n precedes C₁^h in the binlog of each data node. If P_n occurs before P₁^h, we only need to commit the hanging transaction after recovery.

Question 1: P_n occurs after P₁^h:

P_n may appear in the following locations:

The problem here is that the branch transaction of T_n on DN3 cannot be committed, because its P_n lies after the pruning position. Therefore, the pruning position on DN3 needs to be moved to the end position of the P_n event. Moving it may bring a new committable transaction T_m inside the pruning position, so all of its P_m events must be retained as well. T_m meets the following conditions: T_m Prepare time < T_m Commit time < T_n Commit time < T₁^h Commit time, so T_m Prepare time < T₁^h Commit time and P_m precedes C₁^h. This process of pruning point expansion therefore does not continue indefinitely but ends before C₁^h.
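
The pruning point expansion can be sketched as a fixed-point iteration. This is a simplified model under stated assumptions, not the Operator's code: binlogs are lists of `(kind, txn_id)` pairs, a cut is the number of leading events kept on a node, and `expand_cuts`, `dn1`, `dn2`, `dn3`, `H1`, `Tn`, and `Tm` are illustrative names.

```python
from typing import Dict, List, Set, Tuple

Event = Tuple[str, str]  # (kind, txn_id); "P" = Prepare, "C" = Commit


def expand_cuts(binlogs: Dict[str, List[Event]],
                cuts: Dict[str, int]) -> Tuple[Dict[str, int], Set[str]]:
    """Grow per-node cuts until every committable transaction is fully prepared.

    cuts[node] is the number of leading events kept on that node. A
    transaction is committable once one of its Commit Events is kept.
    """
    cuts = dict(cuts)
    while True:
        committable = {txn
                       for node, log in binlogs.items()
                       for kind, txn in log[:cuts[node]] if kind == "C"}
        changed = False
        for node, log in binlogs.items():
            for i, (kind, txn) in enumerate(log):
                if kind == "P" and txn in committable and i >= cuts[node]:
                    cuts[node] = i + 1  # keep up to the end of this Prepare Event
                    changed = True
        if not changed:
            return cuts, committable


# H1 is the heartbeat; Tn committed on dn1 before P(H1) but is prepared
# after P(H1) on dn3, and extending dn3 pulls in C(Tm).
binlogs = {
    "dn1": [("P", "Tn"), ("C", "Tn"), ("P", "H1"), ("C", "H1")],
    "dn2": [("P", "Tn"), ("P", "H1"), ("P", "Tm"), ("C", "Tn"), ("C", "Tm"), ("C", "H1")],
    "dn3": [("P", "H1"), ("P", "Tm"), ("C", "Tm"), ("P", "Tn"), ("C", "Tn"), ("C", "H1")],
}
initial = {"dn1": 3, "dn2": 2, "dn3": 1}  # end of P(H1) on each node
final_cuts, committable = expand_cuts(binlogs, initial)
```

In this example the cut on dn3 first moves past P(Tn), which keeps C(Tm) and in turn forces the cut on dn2 past P(Tm). The iteration then stops, and every final cut still lies before C(H1).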

Question 2: How do we determine that T_n issued a commit before the heartbeat transaction T₁^h?

If we traverse all binlogs of each data node before P₁^h, we can definitely find this Commit Event. In the production environment, the amount of binlog data generated every day is huge, and binlogs will have a retention period, so it is not feasible to traverse all binlogs.

In the commit phase of the two-phase commit protocol, once the commit has been decided, it must be carried out even if an exception occurs. In practice, the decision to commit is recorded by writing a transaction log stating that the distributed transaction has been decided to commit. After the database recovers to a healthy state and finds an uncommitted branch transaction, it queries the transaction log to judge whether the transaction should finally be committed or rolled back.

When a C_n appears before P₁^h on some data node, it follows that: the time to record the transaction log < the time to commit the branch transaction < the Prepare time of the heartbeat transaction < the Commit time of the heartbeat transaction. In other words, the transaction log is recorded before the heartbeat transaction commits. During pruning, we can therefore keep all P_n events together with the transaction log, which requires pruning to C₁^h. After the recovery is completed, we can query the transaction log to determine whether to roll back or commit the transaction.

Introducing the Second Heartbeat Transaction

As shown in the figure above, we introduce the second heartbeat transaction T₂^h to solve Questions 1 and 2 at the same time. We analyze the interval [P₁^h start position, C₂^h end position] of each data node binlog and set P₂^h as the initial pruning position:

C₁^h precedes P₂^h, so Question 2 (the case in which the Commit Event appears before P₁^h) is solved: pruning at P₂^h already retains everything up to C₁^h. Transactions whose Commit Event does not appear in the interval [P₁^h start position, P₂^h end position] can be processed correctly.
If the Commit Event of a transaction appears in the interval [P₁^h start position, P₂^h end position], the iteration method in Question 1 can be used to obtain a final pruning position before C₂^h.

The iteration process records the committable transaction ID, which facilitates the commit of the corresponding hanging transaction after recovery.

For hanging transactions that are not in the recorded list of committable transactions, we query the transaction log to determine whether they need to be committed; the remaining branches are rolled back.
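
The post-recovery decision for hanging transactions can be sketched as follows. The shape of the transaction log (a mapping from transaction ID to a recorded decision) and the names `resolve_hanging`, `committable`, and `txn_log` are assumptions for illustration; the real transaction log format is not described in this article.

```python
from typing import Dict, Iterable, Set


def resolve_hanging(hanging: Iterable[str],
                    committable: Set[str],
                    txn_log: Dict[str, str]) -> Dict[str, str]:
    """Decide COMMIT or ROLLBACK for each prepared but unfinished transaction.

    committable: transaction IDs recorded during the pruning iteration.
    txn_log: hypothetical view of the transaction log, mapping a
             transaction ID to "COMMIT" when a commit decision was recorded.
    """
    decisions = {}
    for txn in hanging:
        if txn in committable:
            decisions[txn] = "COMMIT"    # recorded while pruning
        elif txn_log.get(txn) == "COMMIT":
            decisions[txn] = "COMMIT"    # commit decision found in the log
        else:
            decisions[txn] = "ROLLBACK"  # no commit decision was recorded
    return decisions


decisions = resolve_hanging(
    hanging=["T1", "T2", "T3"],
    committable={"T1"},
    txn_log={"T2": "COMMIT"},
)
```

Here T1 is committed because it was recorded during pruning, T2 because the transaction log holds its commit decision, and T3 is rolled back.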

Consistent Backup and PITR

After a physical backup is initiated for a PolarDB-X data node, the completion time of the physical backup varies for each data node. Therefore, a globally consistent backup set = physical backup + incremental logs used to restore data to a globally consistent state.

How to Obtain the Consistency Point of an Incremental Log?

Lock mode: After completing the physical backup of all data nodes, the database is locked to prevent transaction commits, and the binlog point of the data node is recorded.

Lock-free mode: After completing the physical backup of all data nodes, a portion of incremental logs is retained. You can restore data to a globally consistent location by PITR.

With a consistent backup set, users can quickly restore a globally consistent database instance by restoring the physical backups and playing back only as many incremental logs as necessary.

Consistent Recovery of Metadata and Data Files

This scheme ensures that all data nodes are restored to a globally consistent state. However, the metadata on GMS does not participate in distributed transaction coordination. If the machine clocks are not synchronized (for example, the time on the GMS machine node is one second slower than the time on the data node machine), the restored GMS metadata may contain a column in a table that the restored data node does not have.

In general, during the backup process, operations that collaboratively modify data on the data node and GMS, such as concurrent DDL operations and the addition and deletion of data nodes, may cause inconsistency between the restored metadata and business data. Therefore, we have two optimization schemes for this problem:

GMS nodes are also managed by the distributed transaction TSO to ensure that the TSO-based recovery policy uses a consistent timestamp instead of the physical time of each host.
After recovery, query the GMS metadata to check whether such operations have been performed in the last period of time (which can be set to a variable, 10 seconds or 1 minute). If so, give a prompt to advise the user to adjust the recovery time to avoid such problems.
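
The second option can be sketched as a simple time-window check. The list of `(timestamp, description)` pairs stands in for whatever the GMS metadata query returns; that shape and the name `risky_operations` are assumptions, not the actual GMS schema.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def risky_operations(ops: List[Tuple[datetime, str]],
                     recovery_time: datetime,
                     window: timedelta = timedelta(seconds=10)) -> List[str]:
    """Return operations that ran within `window` before the recovery time.

    ops: hypothetical (timestamp, description) records of DDL operations
         or data node changes read from the GMS metadata.
    """
    start = recovery_time - window
    return [name for ts, name in ops if start <= ts <= recovery_time]


ops = [
    (datetime(2023, 1, 1, 12, 0, 0), "ALTER TABLE t ADD COLUMN c"),
    (datetime(2023, 1, 1, 12, 0, 55), "add data node"),
]
found = risky_operations(ops, recovery_time=datetime(2023, 1, 1, 12, 1, 0))
if found:
    print("Consider adjusting the recovery time; recent operations:", found)
```

A non-empty result is the trigger for prompting the user to choose a different recovery time.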

4. Summary

This article describes how to use two consecutive distributed heartbeat transactions and the ordering dependency between Prepare Events and Commit Events to obtain a consistent binlog pruning point. During recovery, we play back the logs up to the pruning point, commit the hanging transactions recorded during pruning, and commit or roll back the remaining hanging transactions according to the transaction log.
