This topic describes key technologies of the multi-coordinator architecture, including distributed transaction processing, global deadlock detection, support for DDL statements and distributed table-level locks, and fault tolerance and high availability.

Distributed transaction processing

  • AnalyticDB for PostgreSQL distributed transactions

    AnalyticDB for PostgreSQL runs the two-phase commit (2PC) protocol to implement distributed transactions and uses distributed snapshots to ensure data consistency between coordinator nodes and compute nodes.

    The primary coordinator node initiates a distributed transaction and runs the 2PC protocol to commit the transaction on compute nodes. The 2PC protocol divides the commit process into two phases: Prepare and Commit/Abort. The transaction is considered committed only after all participating compute nodes have committed it. If a compute node fails during the Prepare phase, the entire transaction is aborted. If a compute node fails during the Commit phase, the primary coordinator node, which has already written logs during the Prepare phase, retries the commit. If a transaction involves only a single compute node, the system improves performance by using the one-phase commit (1PC) protocol, which combines the Prepare and Commit phases. In this case, the participating compute node ensures the atomicity of the transaction.
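
    The following sketch illustrates the commit decision described above. It is a minimal illustration only, not the actual AnalyticDB for PostgreSQL implementation; the prepare_on_node, commit_on_node, and abort_on_node helpers are hypothetical stand-ins for the dispatch of requests to compute nodes.

    #include <stdbool.h>

    /* Hypothetical per-node operations; in the real system these requests
     * are dispatched to compute nodes over libpq. */
    bool prepare_on_node(int node);   /* Phase 1: prepare the transaction. */
    bool commit_on_node(int node);    /* Phase 2: commit the prepared transaction. */
    void abort_on_node(int node);     /* Phase 2: roll back the transaction. */

    /* Commits a distributed transaction that involves nnodes compute nodes.
     * Returns true if the transaction ends up committed. */
    bool
    commit_distributed_txn(const int *nodes, int nnodes)
    {
        /* 1PC shortcut: a single participant commits atomically on its own,
         * so the Prepare phase is skipped. */
        if (nnodes == 1)
            return commit_on_node(nodes[0]);

        /* Phase 1: Prepare. Any failure aborts the whole transaction. */
        for (int i = 0; i < nnodes; i++)
        {
            if (!prepare_on_node(nodes[i]))
            {
                for (int j = 0; j < nnodes; j++)
                    abort_on_node(nodes[j]);
                return false;
            }
        }

        /* Phase 2: Commit. The decision was logged during the Prepare phase,
         * so a failed commit is retried instead of aborted (retry and
         * recovery details omitted). */
        for (int i = 0; i < nnodes; i++)
            while (!commit_on_node(nodes[i]))
                ;
        return true;
    }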

    The global transaction manager (GTM) on the primary coordinator node maintains the status of all distributed transactions. For each transaction, the system generates a distributed transaction ID, assigns a timestamp, and records the corresponding status information. When the system takes a snapshot, it creates a distributed snapshot and embeds it in the local snapshot that it obtains. Each distributed snapshot records the following core information:

    typedef struct DistributedSnapshot
    {
        DistributedTransactionTimeStamp distribTransactionTimeStamp; /* The timestamp that identifies the generation of the distributed transaction manager. */
        DistributedTransactionId xminAllDistributedSnapshots; /* The lowest distributed XID among all active distributed snapshots. */
        DistributedSnapshotId distribSnapshotId; /* The ID of this distributed snapshot. */

        DistributedTransactionId xmin; /* Tuples are visible if XID is less than xmin. */
        DistributedTransactionId xmax;  /* Tuples are invisible if XID is greater than or equal to xmax. */

        int32 count; /* The number of distributed transactions in the inProgressXidArray array. */
        DistributedTransactionId *inProgressXidArray; /* The array of distributed transactions that are being executed. */
    } DistributedSnapshot;

    When a query is executed, the primary coordinator node serializes the distributed transaction information and snapshots, and then uses the libpq protocol to send them to compute nodes. The compute nodes deserialize the information and use the distributed transactions and snapshots to determine the visibility of queried tuples. All compute nodes that participate in the query use the same distributed transactions and snapshots to determine the visibility of tuples, which ensures data consistency within the instance. AnalyticDB for PostgreSQL also caches the mapping between local transaction IDs (XIDs) and distributed global transaction IDs (GXIDs) to speed up lookups between the two.

    Note AnalyticDB for PostgreSQL saves the commit logs of global transactions to determine whether a transaction has been committed.
    Figure: AnalyticDB for PostgreSQL distributed transactions (example)

    In the preceding figure, Txn A inserts a row, and Txn B updates that row. Based on the multiversion concurrency control (MVCC) mechanism of PostgreSQL, the heap table stores two versions of the row. Txn B changes the xmax value of the original tuple from 0 to its own XID (4). Txn C and Txn D use their own distributed snapshots to determine visibility based on the following rules:

    • If the GXID value is less than distributedSnapshot->xmin, tuples are visible.
    • If the GXID value is greater than or equal to distributedSnapshot->xmax, tuples are invisible.
    • If distributedSnapshot->inProgressXidArray contains the GXID value, tuples are invisible.
    • If distributedSnapshot->inProgressXidArray does not contain the GXID value, tuples are visible.

    If distributed snapshots cannot be used or are not required to determine visibility, local snapshots are used to determine visibility in the same way as in PostgreSQL.

    Based on the preceding rules, when Txn C queries the table and reads the two tuple versions, it finds the two GXID values (100 and 105) that correspond to them by using the mapping between XIDs and GXIDs. The update made by Txn B is invisible to Txn C, so the query result of Txn C is foo. The update is visible to Txn D, so the query result of Txn D is bar.
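
    The visibility rules can be sketched in code as follows. This is a simplified illustration based on the DistributedSnapshot structure above, not the actual visibility routine, which also consults the commit logs of global transactions and can fall back to local snapshots.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t DistributedTransactionId;

    /* Abridged form of the DistributedSnapshot structure shown earlier. */
    typedef struct
    {
        DistributedTransactionId xmin;               /* Visible if GXID < xmin. */
        DistributedTransactionId xmax;               /* Invisible if GXID >= xmax. */
        int32_t count;                               /* Entries in inProgressXidArray. */
        DistributedTransactionId *inProgressXidArray;
    } DistributedSnapshot;

    /* Returns true if the effects of the transaction identified by gxid are
     * visible to the given distributed snapshot. */
    bool
    gxid_visible(const DistributedSnapshot *snap, DistributedTransactionId gxid)
    {
        if (gxid < snap->xmin)
            return true;    /* Finished before the snapshot was taken. */
        if (gxid >= snap->xmax)
            return false;   /* Started after the snapshot was taken. */

        for (int32_t i = 0; i < snap->count; i++)
        {
            if (snap->inProgressXidArray[i] == gxid)
                return false;   /* Still running when the snapshot was taken. */
        }
        return true;            /* In range and already finished: visible. */
    }

    Applied to the example, Txn C's snapshot still treats GXID 105 (Txn B) as in progress, for example because 105 is listed in its inProgressXidArray, so gxid_visible returns false for the updated tuple and Txn C reads foo. Txn D's snapshot no longer treats GXID 105 as in progress, so Txn D reads bar.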

  • Multi-coordinator distributed transactions

    Multi-coordinator distributed transactions are an enhancement of the original distributed transactions. The preceding figure shows how the primary and secondary coordinator nodes communicate with each other. The postmaster process is a daemon. On the primary coordinator node, backend processes communicate with the GTM server by using shared memory. However, secondary coordinator nodes cannot use shared memory to communicate with the GTM server on the primary coordinator node. Therefore, AnalyticDB for PostgreSQL establishes a channel between the primary and secondary coordinator nodes and runs a GTM protocol over the channel.

    To reduce connections between the primary and secondary coordinator nodes and improve networking efficiency, AnalyticDB for PostgreSQL uses a GTM proxy to process the GTM requests of multiple backend processes on a secondary coordinator node. The following section describes how multi-coordinator distributed transactions are implemented from the aspects of GTM protocol, GTM proxy, and distributed transaction recovery:

    • GTM protocol

      The GTM protocol is used to process transactions between the primary and secondary coordinator nodes. The following table describes the core messages of the GTM protocol.

      Core protocol message   Description
      GET_GXID                 Assigns a GXID.
      SNAPSHOT_GET             Obtains a distributed snapshot.
      TXN_BEGIN                Creates a transaction.
      TXN_BEGIN_GETGXID        Creates a transaction and assigns a GXID.
      TXN_PREPARE              Completes the Prepare phase of the 2PC protocol for a specific transaction.
      TXN_COMMIT               Commits a specific transaction.
      TXN_ROLLBACK             Rolls back a specific transaction.
      TXN_GET_STATUS           Obtains the status information of all transactions of a specific coordinator node.
      GET_GTM_READY            Checks whether the GTM server can process normal transaction requests.
      SET_GTM_READY            Enables the GTM server to process normal transaction requests.
      TXN_COMMIT_RECOVERY      Commits a specific transaction during the coordinator node recovery phase.
      TXN_ROLLBACK_RECOVERY    Rolls back a specific transaction during the coordinator node recovery phase.
      CLEANUP_MASTER_TRANS     Clears the remaining transactions of coordinator nodes after recovery is complete.

      The preceding messages are used to exchange GXIDs and snapshots and begin, prepare, commit, and abort transactions. To reduce cross-node communication costs and meet requirements of online analytical processing (OLAP) users, AnalyticDB for PostgreSQL provides the following consistency options for you to balance performance with consistency:

      • Session consistency: Consistency guarantees are provided within the same session, including monotonic read consistency, monotonic write consistency, read-your-writes consistency, and write-after-read consistency.
      • Strong consistency: Linearizable consistency. After an operation is complete, it is visible to all sessions. The strong consistency mode is customized and simplified based on other consistency models.

      The following list describes the transaction policies and ACID guarantees of the two modes:

      • Session consistency (for higher performance):
        • Across compute nodes: 2PC with shared local snapshots. Atomicity (A): ensured. Isolation level (I): RU and RC (intra-session). Durability (D): ensured.
        • Within a single compute node: 1PC with shared local snapshots. Atomicity (A): ensured. Isolation level (I): RC. Durability (D): ensured.
      • Strong consistency (for higher ACID):
        • Across compute nodes: 2PC with distributed snapshots. Atomicity (A): ensured. Isolation level (I): RC and RR. Durability (D): ensured.
        • Within a single compute node: 1PC with distributed snapshots. Atomicity (A): ensured. Isolation level (I): RC and RR. Durability (D): ensured.

      If you require high performance and can accept weaker consistency guarantees, you can select the session consistency mode. Compared with the strong consistency mode, the protocol messages are greatly simplified in session consistency mode: only the GET_GXID and GET_GXID_MULTI messages are retained.

      In session consistency mode, a secondary coordinator node needs only to obtain the GXID from the primary coordinator node. Then, the secondary coordinator node can use local snapshots, retries, and global deadlock detection to independently process transactions. This significantly simplifies the protocol messages between the primary and secondary coordinator nodes and improves performance.

      Core protocol message   Description
      GET_GXID                 Assigns a GXID.
      GET_GXID_MULTI           Assigns GXIDs in batches.
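
      For illustration, a GTM proxy in session consistency mode might cache batch-assigned GXIDs as sketched below. The GxidCache structure and the gtm_request_gxids wrapper around GET_GXID_MULTI are hypothetical names used for this example, not the actual implementation.

      #include <stdint.h>

      typedef uint64_t DistributedTransactionId;

      #define GXID_BATCH_SIZE 64

      /* Hypothetical per-proxy cache of GXIDs obtained from the primary
       * coordinator node. */
      typedef struct
      {
          DistributedTransactionId gxids[GXID_BATCH_SIZE];
          int next;   /* Index of the next unused GXID. */
          int count;  /* Number of GXIDs currently cached. */
      } GxidCache;

      /* Hypothetical wrapper around one GET_GXID_MULTI round trip; returns
       * the number of GXIDs written to out. */
      int gtm_request_gxids(DistributedTransactionId *out, int n);

      /* Hands out one GXID to a local backend process and refills the cache
       * from the primary coordinator node only when it runs dry. Error
       * handling is omitted. */
      DistributedTransactionId
      next_gxid(GxidCache *cache)
      {
          if (cache->next >= cache->count)
          {
              cache->count = gtm_request_gxids(cache->gxids, GXID_BATCH_SIZE);
              cache->next = 0;
          }
          return cache->gxids[cache->next++];
      }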
    • GTM proxy

      In the multi-coordinator architecture, GTM proxies serve as subprocesses of the postmaster process. GTM proxies have the following benefits:

      • No additional roles are required, which makes management easier.
      • GTM proxies and backend processes inherently trust each other, so no additional authentication is required between them.
      • GTM proxies communicate with backend processes by using the shared memory. Compared with TCP loopback, this method is more efficient because less memory needs to be copied and no network stack overheads are incurred.

      Each GTM proxy establishes a connection with the GTM server and helps forward GTM requests of multiple backend processes to the GTM server. GTM proxies also perform the following request optimization operations:

      • Share snapshots between backend processes to reduce the number of snapshot requests.
      • Combine and batch process concurrent GTM requests from backend processes.
      • Batch obtain GXIDs (in session consistency mode).

      GTM proxies are key elements to reduce connections between the primary and secondary coordinator nodes and improve networking efficiency. If you use the strong consistency mode, AnalyticDB for PostgreSQL enables the GTM proxy on each secondary coordinator node to process requests between multiple backend processes and the GTM server. This further reduces the workloads of the GTM server.
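
      The snapshot-sharing optimization can be pictured as follows. This sketch uses a mutex and a hypothetical gtm_fetch_snapshot helper for clarity; the real proxy communicates with backend processes through shared memory, and the generation-based freshness check shown here is an assumption made for illustration.

      #include <pthread.h>
      #include <stdint.h>

      typedef struct { uint64_t xmin, xmax; } DistributedSnapshot;   /* Abridged. */

      /* Hypothetical helper: one SNAPSHOT_GET round trip to the GTM server. */
      DistributedSnapshot gtm_fetch_snapshot(void);

      static pthread_mutex_t snap_lock = PTHREAD_MUTEX_INITIALIZER;
      static DistributedSnapshot cached_snapshot;
      static uint64_t cached_generation;

      /* Serves a snapshot request from a backend process. Concurrent callers
       * within the same generation share a single SNAPSHOT_GET round trip
       * instead of each contacting the GTM server. */
      DistributedSnapshot
      proxy_get_snapshot(uint64_t caller_generation)
      {
          pthread_mutex_lock(&snap_lock);
          if (caller_generation > cached_generation)
          {
              cached_snapshot = gtm_fetch_snapshot();
              cached_generation = caller_generation;
          }
          DistributedSnapshot snap = cached_snapshot;
          pthread_mutex_unlock(&snap_lock);
          return snap;
      }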

    • Distributed transaction recovery

      Distributed transactions need to be recovered in a variety of scenarios such as system restart, coordinator node restart, and failover between the primary and standby coordinator nodes. In the single-coordinator architecture, distributed transaction recovery falls into the following steps:

      1. The primary coordinator node searches transaction logs (xlogs) for all transactions that have been prepared but have not been committed.
      2. The primary coordinator node requests all compute nodes to commit all transactions that need to be committed.
      3. The primary coordinator node collects and terminates all transactions that have not been committed by a compute node and are not included in the to-be-committed transaction list of the primary coordinator node.

      With the multi-coordinator architecture, AnalyticDB for PostgreSQL must also perform the following operations:

      • Recover the transactions initiated by secondary coordinator nodes.
      • Recover or clear the transactions that failed during the Prepare phase on compute nodes and secondary coordinator nodes when a primary or secondary coordinator node restarts.

      To perform these operations, AnalyticDB for PostgreSQL enhances the 2PC process and the distributed transaction recovery process.

      AnalyticDB for PostgreSQL also enhances the transaction recovery process when both the primary and secondary coordinator nodes restart. Compared with the recovery process when a single primary coordinator node restarts, this enhancement can differentiate the distributed transactions initiated by each coordinator node by combining a MasterID with the original GXID.
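
      The recovery flow described above can be summarized in the following sketch. The helper functions and the fixed-size transaction array are hypothetical; they stand in for the xlog scan and for the dispatch of commit and abort requests.

      #include <stdint.h>

      typedef uint64_t DistributedTransactionId;

      /* A transaction that was prepared but not yet committed, as found in
       * the xlogs during recovery. In the multi-coordinator architecture the
       * GXID is qualified with the ID of the coordinator that initiated it. */
      typedef struct
      {
          int                      master_id;
          DistributedTransactionId gxid;
      } PreparedTxn;

      /* Hypothetical helpers standing in for the xlog scan and the dispatch
       * of commit and abort requests. */
      int  scan_prepared_txns(PreparedTxn *out, int max);                /* Step 1 */
      void commit_prepared_on_compute_nodes(const PreparedTxn *txn);     /* Step 2 */
      void abort_orphaned_prepared_txns(const PreparedTxn *keep, int n); /* Step 3 */

      /* Recovers distributed transactions for one coordinator node after a
       * restart or failover. */
      void
      recover_distributed_txns(int my_master_id)
      {
          PreparedTxn txns[1024];
          int n = scan_prepared_txns(txns, 1024);

          /* Re-issue the Commit phase for every transaction this coordinator
           * node had already decided to commit. */
          for (int i = 0; i < n; i++)
          {
              if (txns[i].master_id == my_master_id)
                  commit_prepared_on_compute_nodes(&txns[i]);
          }

          /* Transactions left prepared on compute nodes or secondary
           * coordinator nodes that are not in the to-be-committed list were
           * never decided, so they are rolled back. */
          abort_orphaned_prepared_txns(txns, n);
      }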

Global deadlock detection

In AnalyticDB for PostgreSQL V4.3, UPDATE and DELETE operations take table-level write locks to prevent global deadlocks. This method limits the performance of concurrent updates. Therefore, AnalyticDB for PostgreSQL V6.0 introduces global deadlock detection. The global deadlock detector (GDD) process collects and analyzes the lock wait information in an instance. If the GDD process finds a deadlock, it terminates the process that caused the deadlock to break the deadlock. Because table-level write locks are no longer required, this improves the performance of simple query, insert, delete, and update operations in high-concurrency scenarios.

AnalyticDB for PostgreSQL V6.0 uses the following procedure to implement global deadlock detection:

  • The GDD process runs on the primary coordinator node.
  • The GDD process obtains GXIDs of distributed transactions and their wait relationships from all compute nodes on a regular basis.
  • The GDD process checks whether a loop exists based on all the transaction wait relationships. If a loop exists, the GDD process rolls back one of the transactions to break the deadlock, as sketched at the end of this section.

In the multi-coordinator architecture, AnalyticDB for PostgreSQL implements the following two remote procedure calls, Get_gxids and Cancel_deadlock_txn:

  • Get_gxids: obtains the GXID list from each secondary coordinator node to determine the coordinator node that initiates the deadlocking transaction.
  • Cancel_deadlock_txn: requests the coordinator node that initiates the deadlocking transaction to roll back the transaction.
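
The loop-detection step can be sketched as a depth-first search over a wait-for matrix, as shown below. The actual GDD algorithm is more elaborate; this sketch only illustrates the basic idea, and the data structures are hypothetical.

    #include <stdbool.h>

    #define MAX_TXNS 128

    /* waits_for[i][j] is true if distributed transaction i is blocked on a
     * lock held by distributed transaction j. The GDD process builds this
     * information from the wait relationships collected from all compute
     * nodes. */
    static bool waits_for[MAX_TXNS][MAX_TXNS];

    static bool
    has_cycle_from(int txn, int ntxns, bool *on_path, bool *done)
    {
        if (on_path[txn])
            return true;        /* Reached a transaction already on the path: deadlock. */
        if (done[txn])
            return false;

        on_path[txn] = true;
        for (int next = 0; next < ntxns; next++)
            if (waits_for[txn][next] && has_cycle_from(next, ntxns, on_path, done))
                return true;
        on_path[txn] = false;
        done[txn] = true;
        return false;
    }

    /* Returns true if the global wait-for graph contains a loop, in which
     * case the GDD process rolls back one of the transactions in the loop
     * (victim selection omitted here). */
    bool
    global_deadlock_exists(int ntxns)
    {
        bool on_path[MAX_TXNS] = {false};
        bool done[MAX_TXNS] = {false};

        for (int t = 0; t < ntxns; t++)
            if (!done[t] && has_cycle_from(t, ntxns, on_path, done))
                return true;
        return false;
    }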

Support for DDL statements

In earlier versions, AnalyticDB for PostgreSQL used the 2PC method to execute DDL statements on the primary coordinator node and synchronize catalog information to compute nodes. In the multi-coordinator architecture, AnalyticDB for PostgreSQL enhances the 2PC method to also synchronize catalog information to secondary coordinator nodes.

Secondary coordinator nodes can also process DDL statements. AnalyticDB for PostgreSQL uses a proxy in each secondary coordinator node to forward DDL requests to the primary coordinator node. The following figure shows the procedure.

Figure: Support for DDL statements
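
As a rough illustration of the forwarding step, a secondary coordinator node could pass a DDL statement to the primary coordinator node over a libpq connection, as sketched below. The connection string and function name are illustrative assumptions, not the actual proxy code.

    #include <libpq-fe.h>
    #include <stdio.h>

    /* Forwards one DDL statement from a secondary coordinator node to the
     * primary coordinator node, which then runs the usual 2PC-based catalog
     * dispatch. The connection string is illustrative only. */
    int
    forward_ddl_to_primary(const char *ddl)
    {
        PGconn *conn = PQconnectdb("host=primary-coordinator dbname=postgres");
        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "cannot reach primary coordinator: %s", PQerrorMessage(conn));
            PQfinish(conn);
            return -1;
        }

        PGresult *res = PQexec(conn, ddl);
        int ok = (PQresultStatus(res) == PGRES_COMMAND_OK) ? 0 : -1;
        if (ok != 0)
            fprintf(stderr, "DDL failed on primary coordinator: %s", PQerrorMessage(conn));

        PQclear(res);
        PQfinish(conn);
        return ok;
    }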

Support for distributed table-level locks

In databases, locks are typically used to coordinate concurrent access to table data. AnalyticDB for PostgreSQL uses lock modes that are compatible with those of PostgreSQL. The following table describes these lock modes and their compatibility.

Requested lock mode             Current lock mode
                                AS    RS    RE    SUE   S     SRE   E     AE
ACCESS SHARE (AS)               √     √     √     √     √     √     √     ×
ROW SHARE (RS)                  √     √     √     √     √     √     ×     ×
ROW EXCLUSIVE (RE)              √     √     √     √     ×     ×     ×     ×
SHARE UPDATE EXCLUSIVE (SUE)    √     √     √     ×     ×     ×     ×     ×
SHARE (S)                       √     √     ×     ×     √     ×     ×     ×
SHARE ROW EXCLUSIVE (SRE)       √     √     ×     ×     ×     ×     ×     ×
EXCLUSIVE (E)                   √     ×     ×     ×     ×     ×     ×     ×
ACCESS EXCLUSIVE (AE)           ×     ×     ×     ×     ×     ×     ×     ×
Note
  • √ indicates that the two lock modes can be held on the same table at the same time.
  • × indicates that the two lock modes cannot be held on the same table at the same time.
  • AS, RS, RE, SUE, S, SRE, E, and AE are abbreviations of the lock modes listed in the first column.

Based on enhancements and adaptations to the original table-level lock rules, AnalyticDB for PostgreSQL defines the following distributed table-level lock rules for the primary and secondary coordinator nodes in the multi-coordinator architecture.

  • When a process on a coordinator node requests a lock at levels 1 to 3 (ACCESS SHARE to ROW EXCLUSIVE):
    • The table-level lock is requested on that node.
    • The table-level lock is requested on all compute nodes.
    • The table-level lock is released from all nodes at the end of the transaction.
  • When a process on the primary coordinator node requests a lock at levels 4 to 8 (SHARE UPDATE EXCLUSIVE to ACCESS EXCLUSIVE):
    • The table-level lock is requested on that node.
    • The table-level lock is requested on all secondary coordinator nodes.
    • The table-level lock is requested on all compute nodes.
    • The table-level lock is released from all nodes at the end of the transaction.
  • When a process on a secondary coordinator node requests a lock at levels 4 to 8 (SHARE UPDATE EXCLUSIVE to ACCESS EXCLUSIVE):
    • The table-level lock is requested on the primary coordinator node.
    • The table-level lock is requested on that node.
    • The table-level lock is requested on all the other secondary coordinator nodes.
    • The table-level lock is requested on all compute nodes.
    • The table-level lock is released from all nodes at the end of the transaction.

AnalyticDB for PostgreSQL can respond to each table-level lock request on a coordinator or compute node to ensure compatibility with the original table-level lock rules.
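
The acquisition order in the preceding rules can be summarized in the following sketch. The node-group helpers are hypothetical; they stand in for dispatching the lock request to the corresponding nodes.

    typedef enum { PRIMARY_COORDINATOR, SECONDARY_COORDINATOR } NodeRole;

    /* Hypothetical helpers that take a table-level lock of the given level
     * (1 = ACCESS SHARE ... 8 = ACCESS EXCLUSIVE) on a group of nodes. */
    void lock_on_local_node(int level);
    void lock_on_primary_coordinator(int level);
    void lock_on_other_secondary_coordinators(int level);
    void lock_on_all_compute_nodes(int level);

    /* Requests a distributed table-level lock following the rules above.
     * All locks are released together at the end of the transaction. */
    void
    acquire_distributed_table_lock(NodeRole role, int level)
    {
        if (level <= 3)
        {
            /* Levels 1 to 3: lock locally, then on all compute nodes. */
            lock_on_local_node(level);
            lock_on_all_compute_nodes(level);
            return;
        }

        /* Levels 4 to 8: coordinator nodes must also be covered. A secondary
         * coordinator node locks the primary coordinator node first, as
         * required by the rules above. */
        if (role == SECONDARY_COORDINATOR)
            lock_on_primary_coordinator(level);
        lock_on_local_node(level);
        lock_on_other_secondary_coordinators(level);
        lock_on_all_compute_nodes(level);
    }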

Fault tolerance and high availability

AnalyticDB for PostgreSQL enhances its original fault tolerance and high availability capabilities for secondary coordinator nodes in the multi-coordinator architecture. If a secondary coordinator node fails, the control system can identify and repair the fault in real time.

AnalyticDB for PostgreSQL implements fault tolerance and high availability by using the following replication and monitoring processes:

  • The standby coordinator node provides a replica for the primary coordinator node, and secondary compute nodes provide replicas for primary compute nodes based on PostgreSQL stream replication.
  • The fault tolerance service (FTS) monitors the health status of coordinator and compute nodes. When a primary compute node fails, FTS performs a failover from the primary compute node to its secondary compute node.
Figure: Fault tolerance and high availability

Fault tolerance and high availability are implemented based on the following steps:

  1. Step A: The primary coordinator node synchronizes data to the standby coordinator by using stream replication.
  2. Step B: Each primary compute node synchronizes data to its secondary compute node by using stream replication.
  3. Step C: The FTS probe process establishes a connection from the primary coordinator node to primary compute nodes.
  4. Step D: The FTS probe process establishes a connection from the primary coordinator node to the secondary coordinator node.
  5. Step E: After the primary coordinator node restarts, the FTS probe process reports the status of all coordinator nodes to the GTM server.
  6. Step F: The FTS probe process establishes a connection from the secondary coordinator node to the primary coordinator node and then obtains and saves the latest instance configuration and status information on the secondary coordinator node.
  7. Step G: If the FTS probe process fails to connect from the secondary coordinator node to the primary coordinator node, the FTS probe process establishes a connection from the secondary coordinator node to the standby coordinator node. If the connection succeeds, the standby coordinator node takes over as the new primary coordinator node. Otherwise, the FTS probe process continues attempting to establish a connection to the original primary coordinator node.
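
Steps F and G can be pictured as one iteration of a probe loop on the secondary coordinator node, as sketched below with hypothetical helper functions; the actual FTS probe process handles many more cases.

    #include <stdbool.h>

    /* Hypothetical probe and control helpers used by the FTS probe process. */
    bool probe_primary_coordinator(void);   /* Step F: can the primary coordinator be reached? */
    bool probe_standby_coordinator(void);   /* Step G: can the standby coordinator be reached? */
    void refresh_config_from_primary(void); /* Pull the latest instance configuration and status. */
    void promote_standby_coordinator(void); /* Standby takes over as the new primary coordinator. */

    /* One iteration of the FTS probe loop on a secondary coordinator node. */
    void
    fts_probe_once(void)
    {
        if (probe_primary_coordinator())
        {
            /* Step F: the primary coordinator is healthy; cache the latest
             * configuration and status information locally. */
            refresh_config_from_primary();
            return;
        }

        /* Step G: the primary coordinator is unreachable; try the standby. */
        if (probe_standby_coordinator())
            promote_standby_coordinator();
        /* Otherwise keep retrying the original primary coordinator on the
         * next iteration. */
    }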