About Database Kernel | What Happens to Distributed Databases When Two Machines Are Hung Up?

It is a basic requirement for modern databases to store data and keep servicing when a part of the machines is hung up.

Master-slave architecture was used for earlier standalone databases. This kind of architecture has many unsolved defects. There are various optimization terms, including:

Split-Brain: Conflicting data written by two nodes at the same time cannot be merged, and a part of the data must be lost. What should we do if we expect no split-brain? The answer is to sacrifice availability.
Synchronous Replication: When the slave machine is unavailable, is the write successful? Yes, but data may be lost. No, the slave machine is unavailable == The cluster is unavailable, sacrificing availability.
Asynchronous replication, which gives up consistency
The semi-sync belongs to master-slave architecture.
Businesses are fault-tolerant and make reconciliation and compensation plans for their business scenarios.

Master-slave architecture is the hardest hit area for CAP theory to make trade-offs, and consistency and availability are particularly contradictory. The balance between consistency and availability is impossible.

Best Practice: In this era, we should not choose a product with master-slave architecture if data is important to a certain extent.

In addition to coping with extensibility, distributed databases are designed to solve the problem of disaster recovery of the traditional master-slave structure. Paxos/Raft (including other variants) has become the mainstream choice. The common point is there are three copies of each data, and it can ensure the availability will not be affected in the case of one copy hanging, and there will be no inconsistency (split brain, lost data, lost updates, etc.).

This article is not intended to analyze the Paxos/Raft protocol because there are many such articles. However, we may ask a question. Can a database be secure, stable, and reliable as long as the Paxos/Raft protocol is used?

We will explore the following questions:

In addition to the protocol, what factors affect the availability of distributed databases?
How can we compute the availability of distributed databases with different architectures?
Is the availability of the KV layer equivalent to that of relational databases?
Is the availability of databases equivalent to that of applications?

Start with 1+1=2

Let's start with the simplest example. Suppose there are the following database structures:

What is the availability in the case of having one node and two nodes respectively?

When one machine is hung, the availability is 100%. When two machines are hung, the availability is 0%, and all data are unavailable. If the two machines are destroyed, it means all data are lost.

Scheduling Is a Mapping Game

Suppose we have six machines, and we want to deploy a database on these six machines to improve extensibility and disaster tolerance:

We will shard the data to improve the extensibility.
Each shard needs to have three copies.
Three copies of each shard need to be allocated to different machines to ensure disaster recovery.

If we cut the data into 12 shards, there are 12*3=36 copies.

Let's do a mapping game and fill the copies into the six machines (the constraints above must be met, and each machine is assigned to six copies). How many different filling methods come to mind?

Model 1: A Completely Random Scheduling

We evenly distribute 12 leader copies in the cluster, and each node is assigned two leader copies. Except for these two conditions, the scheduling between shards is unrelated.

This model is commonly used in distributed databases that use distributed KV + SQL.

As such, let's analyze what will happen if two machines are hung. For example, node 1 and node 2 are down at the same time. Since p1 and p6 have two copies on these two nodes, p1 and p6 are in the unavailable state.

Let's enumerate the possible scenarios:

A more intuitive fact is that, in this model, any two hung machines will cause some data to be unavailable.

The number of shards is not 12 but M, and the number of machines is not 6 but N. Let's do some simple computing.

1. If two machines are hung, the expected value of unavailable shards is .

Simple derivation process:

For a specific shard, the probability of hanging one copy is

, and the probability of hanging another copy is

Therefore, the probability of a specific shard hanging is

The number of hanging in M shards is simply multiplied by the probability of M.

If M=12 and N=6, the expected value of the number of unavailable shards is 6⋅12/（6⋅（6-1））=2.4, which is consistent with the specific table in our column above.

2. The probability that no unusable or lost data shards exist is

Simple derivation process:

For a particular shard, the probability that it is available is

The probability that all M shards are available is its M power.

If M=12, N=6, the fully available probability is

3. If the number of machines increases, the data volume increases, and the number of shards increases. M and N are proportional to each other. The number of copies on a single node is represented as m. Then,

The probability of full availability if hanging two machines at will is

Taking a cluster with N=50 nodes as an example, let's look at the probability relationship between the value of m and availability:

It can be found that availability will decrease rapidly as m increases. Availability is highest at 96% when m=1 (if each machine only has one copy).

Under what circumstances will m become larger?

If the size of a single shard is limited to a small size when designing a database, m is tended to be large. For example, limit a single shard to 96M. If there are 50 96Ms, they only have about 5G data, each node has only 100G data, and then there are massive 1000 shards.

In Mode 1, the copy distribution of each shard (or each Paxos protocol group) is not coordinated. If a single node has a slightly large number of shards, as long as two or more machines hang at the same time, some data will almost certainly be unavailable or lost.

Data Availability = Business Availability

Let's return to the example above. 2.4 of the 12 shards are unavailable, which seems to be acceptable because 33% of the machine is hung up after all. They only affect 20% of the business.

We need to make something clear: 20% of unavailability is only at the data level. For example, if we think of the business as a KV database, only 20% of the data is not available.

However, the availability of data and business are often not equivalent!

The first point is obvious. Behind the unavailability, one situation is data loss. For some businesses (such as bank account balances), no matter how low the loss rate is, it is unacceptable. For such businesses, the probability of data loss must be infinitely close to zero.

The second point is more subtle.

We know that most businesses do not directly access KV databases. In general, we will build a relational database on KV. The relational database has two more concepts than KV databases, which are tables and indexes. From a business perspective, a business request operates on multiple indexes of multiple tables.

For example, after a business receives an HTTP request, it performs the following operations.

begin;
## Query the user's balance based on the index on user_id, and this query requires a table return for the primary key (involving 2 keys).
select * from user_info where user_id=xxx;
## Deduct inventory by item_id (involving 1 key).
update items set count=count-1 where item_id=xxx
## Write a record to the order table (note that the order table has five secondary indexes, which means we need to write six keys).
insert into orders xxx;
commit;

In this HTTP request processing, 9 keys of 9 indexes (primary keys or secondary indexes) of three tables are involved. There is no correlation between these 9 keys, which means they are evenly distributed in the cluster.

Although the availability of each key (the shard where the key is located) is as high as 80%, the index is a terrible thing because it can magnify all our bad luck. Even though the 9 keys are lucky, the probability of all being available is only 0.8^9=13%.

From a business perspective, the success rate of an HTTP request is only 13%, and the failure rate is 87%. The unavailability the KV layer sees is 20% versus what the business sees, which is 87%.

Two conclusions are listed below:

Introducing relational models (such as tables and indexes) can exponentially amplify the unavailability of the KV layer.
In general, multiple tables, indexes, and keys are operated in a single request, and the unavailability of the KV layer is exponentially increased.

Model 2: Scheduling with Certain Binding Relationships between Shards

In model 1, after the value of m rises, the availability will decline rapidly due to the exponential relationship. If m is only related to the number of nodes n and not to the number of single-node copies m, the availability can be maintained at a good number.

In Model 2, three nodes are grouped together. Each node in each group contains copies from the same shard, as shown in the preceding figure. p1-p6 and p7-p12 each belong to a group. Let's enumerate the possible scenarios.

Compute the probability that two machines hang and data is fully available or not lost (no unavailable or lost data shards exist): 17_

The simple derivation process is listed below:

For a particular group, the probability that it is available is 17__ . The probability of N/3 available groups is the power of N/3.

We can see that as the number of machines increases, availability increases rapidly, exceeding 90% at 20 nodes.

Compared with model 1, in model 2, the scheduling between different shards (or different Paxos protocol groups) has a certain binding relationship, making them only linearly related to the number of nodes, which significantly improves the availability.

Note: The group mentioned here is not the Paxos group or Raft group. It is a kind of scheduling. An intuitive description of the relationship between node, group, Paxos/Raft group, and shard.

Node and group: 3 nodes to N groups, N>=1. That means a group contains 3 nodes, and a node may belong to multiple groups.
Group and Paxos/Raft group: 1 group to N, N>=1
Paxos/Raft group and shard: 1 PaxosRaft group to N shards in most cases. 1 Paxos/Raft group to 1 shard in many databases.

In databases with the concept of group, it has different names. For example, for PolarDB Distributed Edition, a group is called DN. Each DN has three nodes (three processes), each DN has a Paxos group, and each DN (Paxos group) has several shards.

Table Group & Local Index

Model 2 significantly reduces the risk of a single-shard failure. However, as we said above, multiple transactions, tables, and indexes in a business request can amplify this unavailability. For businesses with high stability requirements, distributed databases need to provide more capabilities to reduce such risks.

PolarDB Distributed Edition uses table group technology to resolve this issue. The core idea is to bind the data involved in the same service request together as much as possible and schedule it to a similar location so its availability is unaffected by the amount of data involved.

For example, if a service has the concept of user, we can use tools (such as partition tables, table groups, and local indexes) to bind multiple tables, records, and indexes involved in a user to the same DN. The tables include user details, operation logs, and address lists.

Self-Built vs. Cloud Services

Self-built distributed databases on the cloud have some additional problems.

Typically, we purchase a batch of ECSs and deploy distributed databases on the ECSs.

A big risk here is that ECSs are virtualized (unless we purchase an ultra-large specification of an exclusive physical machine), and multiple ECSs will be virtualized on one physical machine. Ordinary cloud users cannot perceive the scheduling strategy of ECSs.

For example, the ECSs that contain multiple nodes or multiple copies of the same data may be located on the same physical machine.

The physical distribution of nodes cannot be perceived or controlled by self-built databases.

In addition, self-built databases are prone to the following problems:

IO and network bandwidth of multiple database nodes on the same physical machine compete with each other.
The purchased ECS with the same CPU memory specification may correspond to different CPU models and have different performances.

The preceding issues significantly increase the risks and probability of failures of self-built distributed databases on the cloud.

Generally, there are several solutions:

Three Available Zone Deployments: ECSs in different zones must be on different physical machines. We can ensure the three copies of the same shard are not on the same physical machine using the locality and other capabilities within the database. The disadvantages are listed below:

a) Still unable to control the physical distribution in the same zone

b) Changes in response time due to multiple zones

Purchase ECSs with Independent Physical Machines: We can manage the virtualization, isolation, and node distribution of physical servers by ourselves. The disadvantages are listed below:

a) Expensive

b) High O&M Requirements

c) Still unable to control racks, switches, etc.

Distributed database services provided by cloud vendors can solve such problems. For example, the PolarDB distributed edition service on the cloud can do the following:

All nodes in the same cluster must be on different physical machines and racks.
All nodes in the same cluster use the same CPU of physical machines. Self-built databases on the cloud face more stability risks.

Such risks increase exponentially in self-built distributed databases on the cloud.

Summary

Model 1 is clean and easy to layer. Each shard can be scheduled freely without considering the location relationship between indexes. Therefore, Model 1 is especially suitable for toy-level projects in the database field. However, when used in relational databases, it is an empty shell because a large number of tables and secondary indexes will dramatically amplify this effect. In the face of multi-machine failure, the probability of no problem will rapidly drop to zero as the amount of data and the number of nodes grow.

Model 2 is more suitable for production-level databases. If you want to use a distributed database, it is recommended to carefully identify whether it has a scheduling strategy similar to Model 2.

Table groups, local indexes, and other capabilities are indispensable for production-level databases. Proper use of these capabilities can avoid exponentially amplifying unavailability. We recommend choosing a database with these capabilities for applications that require higher stability.

Now, we can answer the questions at the beginning of the article.

Paxos/Raft protocol is fundamental. There is no way of talking about data security if there are problems in the protocol.
Protocols only are far from enough. If there is no scheduling policy, table group, local index, or other capabilities with a binding relationship, the availability seen by the final business layer will be significantly affected.
The unavailability of the KV layer is exponentially amplified by the relational model and the logic of the business use of the database.

Stay tuned. We will bring more thoughts on the design and solution selection of production-level databases in subsequent articles.

Community

About Database Kernel | What Happens to Distributed Databases When Two Machines Are Hung Up?

Start with 1+1=2

Scheduling Is a Mapping Game

Model 1: A Completely Random Scheduling

Data Availability = Business Availability

Model 2: Scheduling with Certain Binding Relationships between Shards

Table Group & Local Index

Self-Built vs. Cloud Services

Summary

Read previous post:

Read next post:

ApsaraDB

You may also like

Comments

ApsaraDB

Related Products

PolarDB for Xscale

PolarDB for PostgreSQL

PolarDB for MySQL

LedgerDB