Cloud Database Open Source Release: PolarDB Three-Node High Availability Features and Key Technologies

Introduction:

At the Alibaba Cloud Open Source PolarDB Enterprise Architecture Conference on March 2, Alibaba Cloud database technology expert Meng Borong gave a talk on the theme of "High Availability of PolarDB Three Nodes". The three-node high-availability feature provides PolarDB with financial-grade strong consistency and highly reliable cross-data-center replication. Based on a distributed consensus protocol, it synchronizes the database's physical logs, fails over automatically, and loses no data after any single node failure. This article introduces the features and key technologies of PolarDB's three-node high availability.

Live review video: https://developer.aliyun.com/topic/PolarDB_release
PDF download: https://developer.aliyun.com/topic/download?id=8346

The following is organized from the video of the conference talk:

The three-node high-availability feature of PolarDB for PostgreSQL combines physical replication with a consensus protocol to provide PolarDB with financial-grade strong consistency and highly reliable cross-data-center replication.

PostgreSQL's native streaming replication supports three synchronization modes: asynchronous, synchronous, and quorum-based.
The main goal of synchronous replication is to ensure that no data is lost, but it also brings three problems:

① It cannot meet availability requirements. If the standby fails or the network link jitters, the availability of the primary is affected, which is unacceptable in a production environment. It also cannot fully guarantee that no data is lost. Synchronous replication prevents data loss by not letting a transaction commit on the primary until its WAL has been fully persisted on the standby. In some extreme cases, however, the primary may already have written the WAL locally and then restart while still waiting for the standby to synchronize it; during the restart, WAL replay does not wait for the standby to persist the log. So after replay completes, the transaction is already visible on the primary even though the standby may never have persisted it.

② It has no built-in automatic failover. Capabilities such as automatic switchover and availability detection all depend on external components.

③ The old primary may not be able to rejoin the cluster directly after it recovers. For example, the WAL of a transaction may already be persisted on the primary while the standby has not yet received or persisted it. If the primary then fails and the standby is promoted, the old primary, once restarted, holds surplus WAL from before the failure and cannot simply pull logs from the new primary; it can only rejoin the cluster after external tools reconcile its logs for consistency.

Compared with synchronous replication, asynchronous replication has better performance and higher availability, because a standby failure or network jitter does not affect the primary. Its biggest problem, however, is data loss: data visible on the primary may not exist on the standby after a switchover. It also lacks automatic failover and failure detection, and the old primary cannot automatically rejoin the cluster after a switchover.

Quorum replication uses a majority-write scheme and may look like it guarantees no data loss, but it does not address how to elect a new primary when the current one fails, how to reconcile logs when they diverge across nodes, or how to keep the cluster state eventually consistent during membership changes. Quorum replication offers no complete answer to these problems, so in essence PG's quorum replication is not a complete, lossless high-availability solution.
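As a rough illustration of the behavioral difference between the three modes described above, the sketch below models when the primary may acknowledge a commit. It is a simplified model for illustration only (the function name and the per-standby flush flags are assumptions), not PostgreSQL or PolarDB code.

```python
# Minimal model: when a primary may acknowledge a commit under each replication mode.
# "standby_flushed" marks whether each standby has persisted the transaction's WAL.

def can_commit(mode: str, standby_flushed: list[bool], quorum: int = 1) -> bool:
    if mode == "async":
        # Asynchronous: acknowledge immediately, regardless of standby state.
        return True
    if mode == "sync":
        # Synchronous: wait until the designated standby has flushed the WAL.
        return standby_flushed[0]
    if mode == "quorum":
        # Quorum: wait until at least `quorum` of the standbys have flushed it.
        return sum(standby_flushed) >= quorum
    raise ValueError(mode)

# Example: the first standby has not flushed yet, the second has.
print(can_commit("async",  [False, False]))           # True  -> committed data can be lost
print(can_commit("sync",   [False, True]))            # False -> primary waits, availability risk
print(can_commit("quorum", [False, True], quorum=1))  # True  -> any one standby is enough
```

None of the three modes, by itself, provides automatic failure detection, leader election, or log reconciliation, which is what the consensus-based design below adds.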

Our solution introduces X-Paxos, Alibaba's in-house consensus protocol, to coordinate physical replication. X-Paxos has run stably in Alibaba's internal systems and Alibaba Cloud products for a long time, so its stability is proven. Its consensus algorithm is similar to other mainstream consensus protocols.
The overall high-availability solution is a single-writer, multi-reader cluster. The Leader node is the single write point and serves both reads and writes; after generating WAL it ships the logs to the other nodes. Follower nodes receive WAL from the Leader and replay it, serving read-only traffic.

Its main capabilities include the following three aspects:

① Strong data consistency across the cluster, that is, RPO = 0. A log record is considered committed at the cluster level once its WAL has been successfully written on a majority of nodes. After a failure, Follower nodes automatically align their logs with the Leader.

② Automatic failover. As long as more than half of the nodes in the high-availability cluster are alive, the cluster can keep serving normally, so the failure of a minority of nodes, or their loss of network connectivity, does not affect the cluster's service capability.

When the Leader fails or loses connectivity to the majority, a new leader election is triggered automatically and the new Leader takes over read and write services. Follower nodes then automatically synchronize WAL from the new Leader and align their logs with it; if a Follower holds more logs than the new Leader, its WAL is automatically re-aligned to match the new Leader.

③ Online cluster changes. Nodes can be added and removed online, switchover can be triggered manually, and roles can be changed, for example demoting a Leader to a Follower. In addition, every node can be given an election weight, and nodes with higher weights are preferred when a new leader is elected (see the sketch below). Cluster change operations do not affect normal business traffic, and this is guaranteed by the consensus protocol: the cluster configuration eventually converges to a consistent state, and an abnormal event during a configuration change cannot leave the configuration inconsistent.
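The majority rule and the election-weight preference can be pictured with a small conceptual sketch. The node names, the weight values, and the tie-breaking order (log progress first, then weight) are assumptions for illustration; the real X-Paxos election is more involved.

```python
# Conceptual sketch: majority-based availability and weight-biased leader preference.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    alive: bool
    election_weight: int   # higher weight -> preferred as the new leader
    last_log_index: int    # how far this node's log has advanced

def cluster_available(nodes: list[Node]) -> bool:
    # The cluster keeps serving as long as a majority of nodes survive.
    return sum(n.alive for n in nodes) * 2 > len(nodes)

def preferred_leader(nodes: list[Node]) -> Node:
    # Among live nodes, prefer the most up-to-date log, then the higher weight.
    candidates = [n for n in nodes if n.alive]
    return max(candidates, key=lambda n: (n.last_log_index, n.election_weight))

nodes = [
    Node("node1", alive=False, election_weight=9, last_log_index=120),  # old leader, down
    Node("node2", alive=True,  election_weight=5, last_log_index=120),
    Node("node3", alive=True,  election_weight=1, last_log_index=120),
]
print(cluster_available(nodes))      # True: 2 of 3 nodes are alive
print(preferred_leader(nodes).name)  # node2: same log progress, higher election weight
```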

The three-node high-availability feature adds a new role: the Learner node. A Learner has no vote in the majority, but it can serve read-only requests. Its log synchronization progress is independent of the Leader and does not affect the Leader. Its main uses are as follows:

① As an intermediate state when adding a node. A newly added node usually lags far behind; if it joined the majority directly, it would hold back majority commits. It therefore first joins the cluster as a Learner to synchronize data, and once its data has largely caught up with the Leader, it is promoted to a Follower (see the sketch after this list).

② As a remote disaster-recovery node. It does not affect the availability of the primary, and after a Leader switch it automatically synchronizes logs from the new Leader without external intervention.
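A minimal sketch of that add-node workflow is shown below. The class, the method names, and the catch-up threshold are assumptions for illustration, not PolarDB's actual management interface.

```python
# Sketch: a new node joins as a Learner and is promoted to Follower once it catches up.

CATCH_UP_THRESHOLD = 16 * 1024 * 1024   # assumed: promote when WAL lag drops below 16 MB

class Cluster:
    def __init__(self) -> None:
        self.followers: list[str] = []   # vote toward the commit majority
        self.learners: list[str] = []    # no vote, read-only service only

    def add_node(self, name: str) -> None:
        # Joining as a Learner keeps a lagging newcomer from slowing majority commits.
        self.learners.append(name)

    def maybe_promote(self, name: str, leader_lsn: int, learner_lsn: int) -> None:
        if name in self.learners and leader_lsn - learner_lsn < CATCH_UP_THRESHOLD:
            self.learners.remove(name)
            self.followers.append(name)  # from now on it counts toward the majority

cluster = Cluster()
cluster.add_node("node4")
cluster.maybe_promote("node4", leader_lsn=5_000_000_000, learner_lsn=4_999_000_000)
print(cluster.followers)   # ['node4'] -- a lag of ~1 MB is under the threshold
```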

In terms of deployment, the cluster supports cross-data-center and cross-region deployments, including three replicas in one data center, three replicas across three data centers in the same city, five replicas across three data centers in two regions, five replicas across three data centers in three regions, and so on. A Learner node can also be used for cross-region disaster recovery without affecting the availability of the Leader.

In addition, the solution is compatible with PG's native streaming replication and logical replication, so downstream consumers are not affected and uncommitted data never appears downstream.
As the introduction above shows, PolarDB's high-availability solution has to store at least three copies of the data, which raises storage costs. We provide two answers to this problem:

First, improve resource utilization. Follower nodes can serve as read-only nodes, increasing the read scalability of the whole cluster; in addition, cross-node parallel query is supported, which makes full use of the resources of every node.

Second, a log node is introduced to reduce resource usage. A log node stores no data; it keeps only real-time WAL logs and acts purely as one of the majority nodes for log persistence. It still has full log replication capability, remains compatible with native streaming replication and logical replication, and can serve as the source for downstream log consumption, which reduces the log-shipping pressure on the Leader. The network specification and other resources of a log node can be sized according to the needs of downstream log consumption.
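The way the different roles feed into the commit majority can be summarized in a short sketch. The role names and the simple voting rule mirror the description above (Leader, Follower, and log node vote; Learner does not); the code itself is only an illustrative model.

```python
# Sketch: which roles count toward the commit majority for WAL persistence.
# Leader and Follower store data and WAL; a log node stores only WAL; a Learner has no vote.

VOTING_ROLES = {"leader", "follower", "logger"}

def wal_persisted_on_majority(roles: dict[str, str], persisted: dict[str, bool]) -> bool:
    voters = [name for name, role in roles.items() if role in VOTING_ROLES]
    acks = sum(persisted.get(name, False) for name in voters)
    return acks * 2 > len(voters)

roles = {"n1": "leader", "n2": "follower", "n3": "logger", "n4": "learner"}
persisted = {"n1": True, "n3": True, "n4": True}     # follower n2 has not flushed yet
print(wal_persisted_on_majority(roles, persisted))   # True: leader + log node = 2 of 3 voters
```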

The basic principles of consensus-protocol replication cover three aspects:
① WAL logs are transferred and synchronized through native asynchronous streaming replication.
② The cluster's commit point is advanced by the consensus protocol.
③ For automatic failover, state changes at the database level are driven by the node's own state changes at the consensus-protocol level. For example, after a heartbeat timeout a leader may automatically step down.

In the concrete implementation, the Consensus Log is the carrier used to advance the commit point. A Consensus Log Entry is generated for each WAL write and records the end LSN of that WAL. A persistence dependency is then introduced to guarantee that whenever a Log Entry is persisted on a node, the WAL up to the corresponding LSN has already been persisted on that node.
With these two mechanisms in place, if the consensus protocol considers a Consensus Log entry committed, then the entry has been persisted on a majority of nodes, and the WAL up to the corresponding LSN must have been persisted there as well.

Take the figure above as an example. The Leader has persisted three WAL records. On Follower 1, the WAL for the third log entry has been persisted, but the matching Consensus Log entry has not, so from the consensus protocol's point of view that entry is not yet persisted there. On Follower 2, neither the third log entry nor its Consensus Log entry is persisted, and its WAL is only partially persisted: the corresponding WAL segment has not been fully flushed. Therefore, according to the consensus protocol, the log at LogIndex 2 has been successfully written on a majority of nodes, the current CommitIndex of the Consensus Log is 2, and the corresponding Commit LSN is 300.
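That commit point can be reproduced with a small sketch. The per-node persisted indexes and the Commit LSN of 300 come from the example above; the other end LSNs and all names are assumptions, and the code is only an illustrative model, not PolarDB's implementation.

```python
# Sketch: advance the cluster commit point from per-node persistence state.
# Each Consensus Log entry records the end LSN of the WAL it covers; thanks to the
# persistence dependency, a persisted entry implies the WAL up to that LSN is durable.

entries = {1: 100, 2: 300, 3: 500}   # consensus index -> end LSN (100 and 500 are assumed)

# Highest Consensus Log index durably persisted on each node.
persisted_index = {"leader": 3, "follower1": 2, "follower2": 2}

def commit_index(persisted: dict[str, int]) -> int:
    # An index is committed once a majority of nodes have persisted it:
    # sort descending and take the value at the majority position.
    ranked = sorted(persisted.values(), reverse=True)
    return ranked[len(persisted) // 2]

idx = commit_index(persisted_index)
print(idx, entries[idx])   # 2 300 -> CommitIndex = 2, Commit LSN = 300
```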
