ApsaraDB RDS: Introduction to the MGR mode

Last Updated: Dec 20, 2023

MySQL group replication (MGR) is a distributed replication mode that is provided by MySQL based on the existing binary logging mechanism. The MGR mode is implemented by using the Paxos protocol. ApsaraDB RDS for MySQL instances that run RDS Cluster Edition support MGR. This topic describes the advantages and implementation of the MGR mode. This topic also describes the optimizations that are made by AliSQL to improve the stability of the MGR mode.

Advantages

The following table compares MGR, semi-synchronous replication, and asynchronous replication in terms of data reliability, data consistency, and global transaction consistency.

| Item | MGR | Semi-synchronous replication | Asynchronous replication |
| --- | --- | --- | --- |
| Data reliability | ★★★★★ | ★★★ |  |
| Data consistency between the primary and secondary nodes | Ensured | Not ensured | Not ensured |
| Global transaction consistency | Supported | Not supported | Not supported |

High data reliability

The MGR mode uses the majority rule of the Paxos protocol to ensure high data reliability. The majority rule specifies that a transaction can be committed on a node of an RDS cluster only after a majority of nodes in the RDS cluster receive the binary logs of the transaction. This prevents data loss when a minority of nodes in your RDS cluster become faulty.

For example, assume that an RDS cluster contains five nodes. Three nodes have received the binary logs of a transaction, two nodes have not, and two of the five nodes become faulty.

  • If the faulty nodes are among the nodes that have received the binary logs, at least one node that has received the binary logs is still running normally.

  • If the faulty nodes are among the nodes that have not received the binary logs, all three nodes that have received the binary logs are still running normally.

In both cases, the binary logs of the transaction are retained on at least one available node, so the transaction is not lost.

Note
  • Majority: specifies more than half of the nodes in an RDS cluster.

  • Minority: specifies less than half of the nodes in an RDS cluster.

Strong data consistency

In MGR mode, the binary logs of a transaction are sent to the secondary nodes in an RDS cluster before the transaction is written to the binary log file and committed on the primary node. If the primary node becomes faulty, the amount of data on the faulty primary node is no more than the amount of data on the newly elected primary node after the faulty primary node is restarted. When the faulty primary node is restarted, it automatically rejoins the RDS cluster, and the missing binary logs are automatically synchronized to the restarted node to ensure data consistency between the primary and secondary nodes.

In primary/secondary replication mode, transactions are written to binary log files and then transmitted to secondary nodes. If the primary node becomes faulty after the transactions are written to binary logs but before they are transmitted to secondary nodes, the amount of data on the faulty primary node after it is restarted is greater than the amount of data on the secondary nodes. In this case, when one of the secondary nodes is elected as the new primary node, the amount of data on the new primary node is less than the amount of data on the original primary node. This causes data inconsistency between the primary and secondary nodes.

Global transaction consistency

The MGR mode provides strong global consistency for read and write operations among nodes. You can use the group_replication_consistency parameter to specify the consistency levels for read and write operations based on your business requirements.

  • Strong read consistency for secondary nodes: You can set the session-level group_replication_consistency parameter to BEFORE on a secondary node of your RDS cluster. In this case, if you run Query A on the secondary node, Query A is run only after all transactions that were committed on the primary node before Query A starts are applied on the secondary node. This way, the data that is read from the secondary node is consistent with the data on the primary node.

  • Strong write consistency for the primary node: You can set the session-level group_replication_consistency parameter to AFTER on the primary node of your RDS cluster. In this case, when you commit a write transaction, the commit returns only after the transaction is applied on all nodes in the RDS cluster. An example of both settings is provided after this list.
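The following statements are a minimal example of how these session-level settings look in community MySQL 8.0. The orders table is a hypothetical table that is used only for illustration, and the parameters that you can modify on an ApsaraDB RDS instance may be restricted.

```sql
-- On a secondary node: the query waits until all transactions that were
-- committed on the primary node before the query starts are applied locally.
SET @@SESSION.group_replication_consistency = 'BEFORE';
SELECT * FROM orders WHERE order_id = 100;

-- On the primary node: the commit returns only after the changes of the
-- transaction have been applied on all nodes in the cluster.
SET @@SESSION.group_replication_consistency = 'AFTER';
UPDATE orders SET status = 'paid' WHERE order_id = 100;
```

Community MySQL also supports other values for this parameter, such as EVENTUAL (the default) and BEFORE_AND_AFTER.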

Deployment methods

(Figure: deployment methods of an RDS cluster in MGR mode)

An RDS cluster that uses the MGR mode supports the single leader mode and the multiple leader mode.

  • Multiple leader

    All nodes in your RDS cluster can process read and write requests. The multiple leader mode is used to increase the write capability of your RDS cluster. The multiple leader mode adopts the multi-point data writing strategy of the Paxos protocol, which ensures that all nodes receive data in the same order, and performs row-level conflict detection to implement multi-point writes.

    However, the stability of an RDS cluster in multiple leader mode is poor. Even when a majority of nodes in your RDS cluster are still available, a fault on a single node affects the availability of the RDS cluster for a short period of time.

  • Single leader

    Only one node in your RDS cluster can process write requests. Other nodes in the RDS cluster process only read requests. The single leader mode adopts the single leader replication strategy of the Paxos protocol. This increases the read capability, improves the data reliability, and maintains the high availability of your RDS cluster.

    • When a majority of nodes in your RDS cluster are available, if a secondary node in the RDS cluster becomes faulty, the availability of the RDS cluster is not affected.

    • If the primary node is faulty, the RDS cluster automatically completes a primary/secondary switchover based on the Paxos protocol while strong data consistency in the RDS cluster is ensured.

    ApsaraDB RDS for MySQL allows you to create an RDS cluster that uses the MGR mode in single leader mode. In the RDS cluster, read-only nodes are optimized to increase the performance of the RDS cluster and ensure high data reliability and strong data consistency.
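In community MySQL, the deployment mode of an MGR cluster is reflected by the following system variables. This is a minimal example for reference; the variables that can be viewed or modified on an ApsaraDB RDS instance may be restricted.

```sql
-- ON indicates single leader (single-primary) mode, OFF indicates multiple
-- leader mode.
SELECT @@GLOBAL.group_replication_single_primary_mode;

-- ON only in multiple leader mode, where additional row-level conflict checks
-- are enforced for writes on every node.
SELECT @@GLOBAL.group_replication_enforce_update_everywhere_checks;
```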

Architecture

The MGR architecture adds the following layers below the server layer and replication layer of MySQL:

  • Group replication logic layer: This layer is added under the server layer of standalone MySQL and connected to the group communication system (GCS) layer by using a hook. This layer can send and receive transactions to and from the GCS layer and play back the transactions.

  • GCS layer: This layer works with the XCom layer to enable the communication between the group replication logic layer and the cluster. The GCS layer also detects faults and manages cluster members.

  • XCom layer: This layer is developed based on the Paxos protocol and works with the GCS layer to enable the communication between the group replication logic layer and the cluster. The XCom layer ensures that all nodes receive data in the same global order and supports changes to the roles of cluster members. When a majority of nodes in the cluster are available, the XCom layer prevents data loss.
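The membership and fault-detection information that is maintained at these layers can be observed through the performance_schema tables of community MySQL 8.0. The following query is provided as a reference sketch; the output on an ApsaraDB RDS instance may differ.

```sql
-- Each row represents one cluster member. MEMBER_STATE reflects fault
-- detection (for example, ONLINE, RECOVERING, or UNREACHABLE), and
-- MEMBER_ROLE shows whether the node is currently PRIMARY or SECONDARY.
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
FROM performance_schema.replication_group_members;
```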

XCom layer

Paxos protocol

The following list describes the functionalities of the Paxos protocol in MGR mode:

  • Ensures that all nodes in a cluster receive the binary logs of a transaction in the same order, which is essential to multiple leader mode.

  • Ensures that a transaction can be committed only after a majority of nodes in the cluster receive the binary logs of the transaction, which is essential to data reliability. When a majority of nodes in the cluster are available, no data is lost.

In the basic Paxos protocol, locks are used to achieve consistency in the data sending order between nodes. This method is inefficient and causes unbalanced loads among nodes. The XCom layer of MySQL relies on the Mencius protocol, which is a Paxos-based variant. The Mencius protocol does not rely on a single fixed leader; instead, leadership rotates among nodes in a round-robin manner. This helps effectively balance loads among nodes.

Implementation of multiple leader mode

(Figure: data sending order of multiple Paxos groups in multiple leader mode)

Multiple leader mode is implemented based on the Mencius protocol. In the Mencius protocol, each node in a cluster proactively establishes connections to other nodes and leads its own single-leader Paxos group. If a cluster contains n nodes, n Paxos groups that do not interfere with each other are formed. In each Paxos group, only the leader node can send data, and the data is successfully sent when a majority of the other nodes receive it. When a node receives data from a client, the node, as the leader of its own Paxos group, sends the data to the other nodes in the group in a serial manner. This ensures the same data sending order within each Paxos group.

When data is sent from multiple Paxos groups, a round-robin mechanism is used to ensure the data sending order. After receiving data from multiple Paxos groups, the XCom layer sends the data to the group replication logic layer in a specified order. The preceding figure shows that data must be sent to the group replication logic layer in the order of (1,1), (1,2), and (1,3).

If the data of the nodes that are listed after a node has been received by a majority of nodes in the cluster and the node itself has no data to send, the node broadcasts the noop state to notify other nodes to skip it. A node can send data only after the node that is listed before it has sent its data or broadcast the noop state. When jitters or faults occur on a node, the node can neither send data nor broadcast the noop state. In this case, the nodes that are listed after the faulty node cannot send data, and the cluster becomes unavailable. This is a critical defect of multiple leader mode.

Note

In the preceding figure, (m,n) indicates that Group n sends its mth data record. For example, (2,1) indicates that Group 1 sends its second data record.

Implementation of single leader mode

The defect of multiple leader mode can be mitigated, but cannot be eliminated. Therefore, MySQL provides single leader mode for the MGR mode. This helps prevent a cluster from becoming unavailable when a minority of nodes in the cluster are faulty.

(Figure: XCom architecture in single leader mode)

The preceding figure shows the XCom architecture in single leader mode in which only one node can process write requests. Therefore, you need to activate only one Paxos group. The receiver automatically ignores other Paxos groups when data is polled. This way, the Paxos protocol can be used to send data without affecting the availability of the cluster as long as a majority of nodes in the cluster are available.

(Figure: data sending by secondary nodes in single leader mode)

In single leader mode, secondary nodes in a cluster do not send transactions. They send only information about cluster management. Before a secondary node sends data, the secondary node must request a location for data sending from the primary node, and then send the data to all nodes in the cluster. For example, <3,1> in the preceding figure shows a location for data sending. Although this process has a high latency and low efficiency, the performance of the cluster is not affected because information about cluster management is sent at a relatively low frequency.

Group replication logic layer

The group replication logic layer sends and receives transactions to and from a cluster, and plays back the transactions. The following list describes how the group replication logic layer works on the primary and secondary nodes:

  • Primary node: Before a transaction is committed on the primary node, the group replication logic layer sends the binary logs of the transaction to the XCom layer and then to other nodes in the cluster. After a majority of nodes in the cluster receive the transaction, conflict detection is performed. If the transaction passes the conflict detection, the transaction is written to the binary log file and committed on the primary node. If the transaction fails the conflict detection, the transaction is rolled back.

  • Secondary node: After a majority of nodes in the cluster receive the transaction, the XCom layer sends the transaction to the group replication logic layer to perform conflict detection. If the transaction passes the conflict detection, the transaction is written to the relay log and then applied by the applier thread. If the transaction fails the conflict detection, the data of the transaction is discarded.
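On a secondary node, the relay log apply step that is described above runs on the group_replication_applier channel and can be observed in performance_schema of community MySQL 8.0. The following query is a reference sketch; the columns that are available may vary by MySQL version.

```sql
-- Shows the applier workers of the group replication channel on a secondary
-- node, including the last transaction that each worker has applied.
SELECT WORKER_ID, SERVICE_STATE, LAST_APPLIED_TRANSACTION
FROM performance_schema.replication_applier_status_by_worker
WHERE CHANNEL_NAME = 'group_replication_applier';
```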

Conflict detection

  • Scenarios

    In MGR mode, transactions that are initiated on different nodes are eventually applied on the same nodes in some scenarios. Therefore, conflict detection must be performed on the transactions of the nodes to ensure that the modifications of these transactions do not conflict with each other. The following list describes the scenarios:

    • In multiple leader mode, conflict detection is required for all write operations.

    • In single leader mode, if a primary/secondary switchover is performed and write transactions are executed before the relay logs of the original primary node are applied on the new primary node, conflict detection is also required.

  • Implementation

    In MGR mode, row-level conflict detection is performed based on the hash value of the primary key of a data row. Each node maintains an array of transaction authentication information, which is a hash array. In the hash array, the key is the hash value of a data row, and the value is the union of the global transaction identifier (GTID) of the latest transaction that modifies the data row and the gtid_executed value that is obtained on the source node before that transaction is committed. gtid_executed specifies the GTID set of all transactions that have been committed on the source node.

    Before a transaction is committed on the source node, the system sends the data that is modified by the transaction and gtid_executed on the source node to other nodes in the cluster. gtid_executed includes the transactions that are committed on the source node before the current transaction is committed. gtid_executed is hereinafter referred to as the commitment set.

    All nodes including the source node in the cluster use the hash values of all data rows that are modified by the current transaction as keys, read the values of the keys from the authentication information array, and put these values into a GTID set. The GTID set specifies the transactions that must be committed before the current transaction is committed. This GTID set is hereinafter referred to as the dependency set.

    Before the current transaction is committed on the source node or written to the relay logs on other nodes, the system compares the commitment set with the dependency set:

    • If the commitment set contains the dependency set, all transactions that modified the data rows have been committed. In this case, the current transaction passes the conflict detection. The system writes the current transaction to the binary log and commits the transaction on the source node, and writes the current transaction to the relay logs on the other nodes.

    • If the commitment set does not contain the dependency set, some transactions that modified the data rows have not been committed. In this case, the current transaction fails the conflict detection. The system rolls back the current transaction on the source node and discards the relay log data on the other nodes.

    Redundant data in the authentication information array must be deleted at the earliest opportunity to reduce storage usage. After a transaction has been applied on all nodes in a cluster, the transaction can no longer conflict with new transactions. In this case, all data rows that are modified by the transaction can be deleted from the authentication information array. In MGR mode, the data of transactions that have been applied on all nodes is deleted every 60 seconds.
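The containment check that is described above can be illustrated with the GTID_SUBSET() function of MySQL. The GTID sets in the following statements are hypothetical values that are used only to show how the comparison works; they are not produced by the preceding steps.

```sql
-- Returns 1 (pass): the dependency set (first argument) is contained in the
-- commitment set (second argument), so every transaction that previously
-- modified the rows has already been committed.
SELECT GTID_SUBSET('3E11FA47-71CA-11E1-9E33-C80AA9429562:1-3',
                   '3E11FA47-71CA-11E1-9E33-C80AA9429562:1-5') AS pass;

-- Returns 0 (fail): the dependency set contains a transaction that is not in
-- the commitment set, so the current transaction fails the conflict detection.
SELECT GTID_SUBSET('3E11FA47-71CA-11E1-9E33-C80AA9429562:1-7',
                   '3E11FA47-71CA-11E1-9E33-C80AA9429562:1-5') AS pass;
```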

Optimizations made by AliSQL on the stability of the MGR mode

The stability of MGR is greatly improved when single leader mode is used. However, stability issues still exist in some scenarios. When a secondary node has a high latency, a large number of transactions cannot be applied in a timely manner. As a result, a large amount of authentication information is accumulated and the following issues may occur:

  • A large amount of memory is occupied, and an out-of-memory (OOM) error may occur.

  • The overheads for deleting the accumulated authentication information are high, which affects the stability of the cluster.

The following list describes how AliSQL optimizes the deletion of the accumulated authentication information on the primary and secondary nodes:

  • Primary node: The authentication information array is not used on the primary node in any case. Therefore, the authentication information array can be removed from the primary node to eliminate its negative impact on the resources and stability of the primary node.

  • Secondary node: Authentication information must be retained on a secondary node only when the group_replication_consistency parameter is set to EVENTUAL. If the parameter is set to EVENTUAL, a secondary node starts to serve requests immediately after it is elected as the new primary node, without waiting for its relay logs to be fully applied. This may cause conflicts between data operations, and this setting is not commonly used. If the group_replication_consistency parameter is not allowed to be set to EVENTUAL, the amount of authentication information that is retained on the secondary node is reduced. This reduces the memory consumption of the secondary node and improves the stability of the cluster.
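The amount of accumulated authentication information can be monitored in community MySQL through performance_schema. The following query is a reference sketch; the figures that are reported on an AliSQL-based instance may differ because of the optimizations described above.

```sql
-- COUNT_TRANSACTIONS_ROWS_VALIDATING approximates the size of the
-- authentication information array, COUNT_TRANSACTIONS_IN_QUEUE shows
-- transactions that are waiting for conflict detection, and
-- COUNT_CONFLICTS_DETECTED counts transactions that failed the detection.
SELECT MEMBER_ID,
       COUNT_TRANSACTIONS_IN_QUEUE,
       COUNT_TRANSACTIONS_ROWS_VALIDATING,
       COUNT_CONFLICTS_DETECTED
FROM performance_schema.replication_group_member_stats;
```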