It is a basic requirement for modern databases to store data and keep servicing when a part of the machines is hung up.
Master-slave architecture was used for earlier standalone databases. This kind of architecture has many unsolved defects. There are various optimization terms, including:
Master-slave architecture is the hardest hit area for CAP theory to make trade-offs, and consistency and availability are particularly contradictory. The balance between consistency and availability is impossible.
Best Practice: In this era, we should not choose a product with master-slave architecture if data is important to a certain extent.
In addition to coping with extensibility, distributed databases are designed to solve the problem of disaster recovery of the traditional master-slave structure. Paxos/Raft (including other variants) has become the mainstream choice. The common point is there are three copies of each data, and it can ensure the availability will not be affected in the case of one copy hanging, and there will be no inconsistency (split brain, lost data, lost updates, etc.).
This article is not intended to analyze the Paxos/Raft protocol because there are many such articles. However, we may ask a question. Can a database be secure, stable, and reliable as long as the Paxos/Raft protocol is used?
We will explore the following questions:
Let's start with the simplest example. Suppose there are the following database structures:
What is the availability in the case of having one node and two nodes respectively?
When one machine is hung, the availability is 100%. When two machines are hung, the availability is 0%, and all data are unavailable. If the two machines are destroyed, it means all data are lost.
Suppose we have six machines, and we want to deploy a database on these six machines to improve extensibility and disaster tolerance:
If we cut the data into 12 shards, there are 12*3=36 copies.
Let's do a mapping game and fill the copies into the six machines (the constraints above must be met, and each machine is assigned to six copies). How many different filling methods come to mind?
We evenly distribute 12 leader copies in the cluster, and each node is assigned two leader copies. Except for these two conditions, the scheduling between shards is unrelated.
This model is commonly used in distributed databases that use distributed KV + SQL.
As such, let's analyze what will happen if two machines are hung. For example, node 1 and node 2 are down at the same time. Since p1 and p6 have two copies on these two nodes, p1 and p6 are in the unavailable state.
Let's enumerate the possible scenarios:
A more intuitive fact is that, in this model, any two hung machines will cause some data to be unavailable.
The number of shards is not 12 but M, and the number of machines is not 6 but N. Let's do some simple computing.
1. If two machines are hung, the expected value of unavailable shards is .
Simple derivation process:
For a specific shard, the probability of hanging one copy is
, and the probability of hanging another copy is
Therefore, the probability of a specific shard hanging is
The number of hanging in M shards is simply multiplied by the probability of M.
If M=12 and N=6, the expected value of the number of unavailable shards is
6⋅12/（6⋅（6-1））=2.4, which is consistent with the specific table in our column above.
2. The probability that no unusable or lost data shards exist is
Simple derivation process:
For a particular shard, the probability that it is available is
The probability that all M shards are available is its M power.
If M=12, N=6, the fully available probability is
3. If the number of machines increases, the data volume increases, and the number of shards increases. M and N are proportional to each other. The number of copies on a single node is represented as m. Then,
The probability of full availability if hanging two machines at will is
Taking a cluster with N=50 nodes as an example, let's look at the probability relationship between the value of m and availability:
It can be found that availability will decrease rapidly as m increases. Availability is highest at 96% when m=1 (if each machine only has one copy).
Under what circumstances will m become larger?
If the size of a single shard is limited to a small size when designing a database, m is tended to be large. For example, limit a single shard to 96M. If there are 50 96Ms, they only have about 5G data, each node has only 100G data, and then there are massive 1000 shards.
In Mode 1, the copy distribution of each shard (or each Paxos protocol group) is not coordinated. If a single node has a slightly large number of shards, as long as two or more machines hang at the same time, some data will almost certainly be unavailable or lost.
Let's return to the example above. 2.4 of the 12 shards are unavailable, which seems to be acceptable because 33% of the machine is hung up after all. They only affect 20% of the business.
We need to make something clear: 20% of unavailability is only at the data level. For example, if we think of the business as a KV database, only 20% of the data is not available.
However, the availability of data and business are often not equivalent!
The first point is obvious. Behind the unavailability, one situation is data loss. For some businesses (such as bank account balances), no matter how low the loss rate is, it is unacceptable. For such businesses, the probability of data loss must be infinitely close to zero.
The second point is more subtle.
We know that most businesses do not directly access KV databases. In general, we will build a relational database on KV. The relational database has two more concepts than KV databases, which are tables and indexes. From a business perspective, a business request operates on multiple indexes of multiple tables.
For example, after a business receives an HTTP request, it performs the following operations.
begin; ## Query the user's balance based on the index on user_id, and this query requires a table return for the primary key (involving 2 keys). select * from user_info where user_id=xxx; ## Deduct inventory by item_id (involving 1 key). update items set count=count-1 where item_id=xxx ## Write a record to the order table (note that the order table has five secondary indexes, which means we need to write six keys). insert into orders xxx; commit;
In this HTTP request processing, 9 keys of 9 indexes (primary keys or secondary indexes) of three tables are involved. There is no correlation between these 9 keys, which means they are evenly distributed in the cluster.
Although the availability of each key (the shard where the key is located) is as high as 80%, the index is a terrible thing because it can magnify all our bad luck. Even though the 9 keys are lucky, the probability of all being available is only 0.8^9=13%.
From a business perspective, the success rate of an HTTP request is only 13%, and the failure rate is 87%. The unavailability the KV layer sees is 20% versus what the business sees, which is 87%.
Two conclusions are listed below:
In model 1, after the value of m rises, the availability will decline rapidly due to the exponential relationship. If m is only related to the number of nodes n and not to the number of single-node copies m, the availability can be maintained at a good number.
In Model 2, three nodes are grouped together. Each node in each group contains copies from the same shard, as shown in the preceding figure. p1-p6 and p7-p12 each belong to a group. Let's enumerate the possible scenarios.
Compute the probability that two machines hang and data is fully available or not lost (no unavailable or lost data shards exist):
The simple derivation process is listed below:
For a particular group, the probability that it is available is . The probability of N/3 available groups is the power of N/3.
We can see that as the number of machines increases, availability increases rapidly, exceeding 90% at 20 nodes.
Compared with model 1, in model 2, the scheduling between different shards (or different Paxos protocol groups) has a certain binding relationship, making them only linearly related to the number of nodes, which significantly improves the availability.
Note: The group mentioned here is not the Paxos group or Raft group. It is a kind of scheduling. An intuitive description of the relationship between node, group, Paxos/Raft group, and shard.
In databases with the concept of group, it has different names. For example, for PolarDB Distributed Edition, a group is called DN. Each DN has three nodes (three processes), each DN has a Paxos group, and each DN (Paxos group) has several shards.
Model 2 significantly reduces the risk of a single-shard failure. However, as we said above, multiple transactions, tables, and indexes in a business request can amplify this unavailability. For businesses with high stability requirements, distributed databases need to provide more capabilities to reduce such risks.
PolarDB Distributed Edition uses table group technology to resolve this issue. The core idea is to bind the data involved in the same service request together as much as possible and schedule it to a similar location so its availability is unaffected by the amount of data involved.
For example, if a service has the concept of user, we can use tools (such as partition tables, table groups, and local indexes) to bind multiple tables, records, and indexes involved in a user to the same DN. The tables include user details, operation logs, and address lists.
Self-built distributed databases on the cloud have some additional problems.
Typically, we purchase a batch of ECSs and deploy distributed databases on the ECSs.
A big risk here is that ECSs are virtualized (unless we purchase an ultra-large specification of an exclusive physical machine), and multiple ECSs will be virtualized on one physical machine. Ordinary cloud users cannot perceive the scheduling strategy of ECSs.
For example, the ECSs that contain multiple nodes or multiple copies of the same data may be located on the same physical machine.
The physical distribution of nodes cannot be perceived or controlled by self-built databases.
In addition, self-built databases are prone to the following problems:
The preceding issues significantly increase the risks and probability of failures of self-built distributed databases on the cloud.
Generally, there are several solutions:
a) Still unable to control the physical distribution in the same zone
b) Changes in response time due to multiple zones
b) High O&M Requirements
c) Still unable to control racks, switches, etc.
Distributed database services provided by cloud vendors can solve such problems. For example, the PolarDB distributed edition service on the cloud can do the following:
Such risks increase exponentially in self-built distributed databases on the cloud.
Model 1 is clean and easy to layer. Each shard can be scheduled freely without considering the location relationship between indexes. Therefore, Model 1 is especially suitable for toy-level projects in the database field. However, when used in relational databases, it is an empty shell because a large number of tables and secondary indexes will dramatically amplify this effect. In the face of multi-machine failure, the probability of no problem will rapidly drop to zero as the amount of data and the number of nodes grow.
Model 2 is more suitable for production-level databases. If you want to use a distributed database, it is recommended to carefully identify whether it has a scheduling strategy similar to Model 2.
Table groups, local indexes, and other capabilities are indispensable for production-level databases. Proper use of these capabilities can avoid exponentially amplifying unavailability. We recommend choosing a database with these capabilities for applications that require higher stability.
Now, we can answer the questions at the beginning of the article.
Stay tuned. We will bring more thoughts on the design and solution selection of production-level databases in subsequent articles.
ApsaraDB - October 19, 2020
Alibaba Developer - November 10, 2021
OceanBase - August 26, 2022
ApsaraDB - July 7, 2022
Alibaba Cloud Native Community - October 26, 2023
Alibaba Clouder - February 3, 2021
Designed to address database challenges such as ultra-high concurrency, massive data storage, and large table performance bottlenecks.Learn More
Alibaba Cloud PolarDB is a cloud-native relational database service that decouples computing resources from storage resourcesLearn More
Mitigate the scalability problem of single machine relational databases for large-scale online databases.Learn More
A ledger database that provides powerful data audit capabilities.Learn More
More Posts by ApsaraDB