×
Community Blog The Issue of B-tree Height with Large Data Volumes in a MySQL Single Table

The Issue of B-tree Height with Large Data Volumes in a MySQL Single Table

This article introduces why MySQL B-tree height stays within 4–5 layers even with tens of billions of rows, debunking the myth that single tables must be sharded after 5 million rows.

By Baotiao

Some veteran Database Administrators (DBAs) may remember that in the early days, it was rumored that a single table in MySQL should not exceed 5 million rows. If a single table exceeded 5 million rows, sharding was required. Many DBAs are concerned that after a MySQL table becomes large, the B-tree height will become very large, thereby affecting instance performance.

In fact, a B-tree is a very flat tree. The vast majority of B-trees do not exceed 4 layers. Let us look at the actual situation.

We take the common Sysbench table as an example.

CREATE TABLE `sbtest1` (
  `id` int NOT NULL AUTO_INCREMENT,
  `k` int NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=10958 DEFAULT CHARSET=latin1

In InnoDB, there are mainly two types of pages: leaf pages and non-leaf pages.

The format of a Leaf Page is as follows. Each record is mainly composed of a Record Header and a Record Body. The Record Header is mainly used to coordinate with Data Dictionary (DD) information to interpret the Record Body. The Record Body is the main content of the record.

1
2

For a table such as Sysbench in a 16 KB page, the number of rows that can be stored in a Leaf Page is approximately:

(16 × 1024 - 200 (Length of some Page Headers, tail, and Directory slot)) / ((4 + 4 + 120 + 60) row data length + 5 (header of each row of data) + 6 (XID) + 7 (Roll Pointer)) = 78.5

The format of a Non-leaf Page is as follows:

3

Because the Sysbench primary key ID is an INT of 4 bytes, the number of rows that a 16 KB page can store is:

(16 × 1024 - 200) / (5 (Row data Header) + 4 (Cluster Key) + 4 (Child Page Number)) = 1233

Then, the calculation formula for different heights is as follows:

Height Non-leaf pages Leaf pages Rows Size
1 0 1 79 16KB
2 1 1233 97407 19MB
3 1234 1520289 120102831 23GB
4 1521523 1874516337 148086790623 27.9TB

You can see from the above that if it is a table such as Sysbench, the height of the B-tree will not exceed 4 layers even if the single table has 140 billion rows and the data size is 27.9 TB. Therefore, there is no need to worry about the increase in B-tree height after the data volume becomes large.

Here, if the primary key of Sysbench is BIGINT, which is 8 bytes, what is the situation?

The number of record rows that can be stored in a leaf page is:

(16 * 1024 - 200) / ((8 + 4 + 120 + 60) + 13) = 78.9

You can see that this leaf page record number does not change much.

The number of records that a non-leaf page can store changes slightly more:

(16 * 1024 - 200)/(5+8+4) = 952

Height Non-leaf pages Leaf pages Rows Size
1 0 1 79 16KB
2 1 952 75208 15MB
3 953 906304 71598016 13.8GB
4 907257 862801408 68161311232 12.8TB

You can see from the above that if the primary key of Sysbench is changed to BIGINT, a 4-layer B-tree can store 60 billion rows, which is approximately 12 TB of data.

If a table such as Sysbench is not representative, let us look at some more complex tables, such as the commonly used log tables in the Software as a Service (SaaS) scenarios within Polarbench (a tool used to simulate database usage scenarios in various industries).

CREATE TABLE `prefix_off_saas_log_10` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `saas_type` varchar(64) DEFAULT NULL,
  `saas_currency_code` varchar(3) DEFAULT NULL,
  `saas_amount` bigint(20) DEFAULT '0',
  `saas_direction` varchar(2) DEFAULT 'NA',
  `saas_status` varchar(64) DEFAULT NULL,
  `ewallet_ref` varchar(64) DEFAULT NULL,
  `merchant_ref` varchar(64) DEFAULT NULL,
  `third_party_ref` varchar(64) DEFAULT NULL,
  `created_date_time` datetime DEFAULT NULL,
  `updated_date_time` datetime DEFAULT NULL,
  `version` int(11) DEFAULT NULL,
  `saas_date_time` datetime DEFAULT NULL,
  `original_saas_ref` varchar(64) DEFAULT NULL,
  `source_of_fund` varchar(64) DEFAULT NULL,
  `external_saas_type` varchar(64) DEFAULT NULL,
  `user_id` varchar(64) DEFAULT NULL,
  `merchant_id` varchar(64) DEFAULT NULL,
  `merchant_id_ext` varchar(64) DEFAULT NULL,
  `mfg_no` varchar(64) DEFAULT NULL,
  `rfid_tag_no` varchar(64) DEFAULT NULL,
  `admin_fee` bigint(20) DEFAULT NULL,
  `ppu_type` varchar(64) DEFAULT NULL,
  PRIMARY KEY (`id`),
   KEY `saas_log_idx01` (`user_id`) USING BTREE,
  KEY `saas_log_idx02` (`saas_type`) USING BTREE,
  KEY `saas_log_idx03` (`saas_status`) USING BTREE,
  KEY `saas_log_idx04` (`merchant_ref`) USING BTREE,
  KEY `saas_log_idx05` (`third_party_ref`) USING BTREE,
  KEY `saas_log_idx08` (`mfg_no`) USING BTREE,
  KEY `saas_log_idx09` (`rfid_tag_no`) USING BTREE,
  KEY `saas_log_idx10` (`merchant_id`)
  ) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8

Because there are variable-length fields here, but most refs have values, we assume the case where VARCHAR fields are fully used.

When all these fields are added together and the Record Header information is additionally calculated, it is approximately 974 bytes.

Then, the number of records that a Leaf Page can store is (16 × 1024 - 200) / 974 = 16.6

For a Non-Leaf Page, as with the previous Sysbench BIGINT, the number of records that can be stored is 952.

Height Non-leaf pages Leaf pages Rows Size
1 0 1 16 16KB
2 1 952 15232 15MB
3 953 906304 14500864 13.8GB
4 907257 862801408 13804822528 12.8TB

You can see that even for a table where a single row is approximately 1 KB, if the primary key is still BIGINT, the height of the B-tree will definitely be within 4 layers if the data is within 10 TB. At the same time, within 4 layers, this table can store approximately 13.8 billion rows.

Therefore, scenarios where MySQL stores billions of rows work well.

MySQL still has a best practice: "It is not recommended to use Universally Unique Identifier (UUID) as the primary key." Let us examine why.

For example, regarding the preceding prefix_off_saas_log_10, if the primary key is changed to a 32-byte UUID, assuming the Leaf Page remains unchanged,

The record number stored in the Non-Leaf Page:

(16 * 1024 - 200)/(5+32+4) = 394

Height Non-leaf pages Leaf pages Rows Size
1 0 1 16 16KB
2 1 394 6304 6MB
3 395 155236 2483776 2GB
4 155631 61162984 978607744 981GB
5 61318615 24098215696 385571451136 386TB

As can be seen from the preceding Table, if UUID is used as the primary key, then for the same 4-layer B-tree, if BIGINT is used, 13.8 billion rows of data can be stored, whereas using UUID allows storing only 970 million rows of data.

However, even if UUID is incorrectly used as the primary key, the depth of the MySQL B-tree will not exceed 5 layers. 5 layers can store up to 380 billion rows, or 386 TB of data. In fact, this is impossible because a MySQL single table actually supports a maximum of 64 TB.

Overall, in MySQL, there is no need to worry that the increase in B-tree height will impact performance after the data volume increases. For Data within 10 TB, the B-tree height will definitely be within 4 layers. After exceeding 10 TB, it will also stay at 5 layers and will not go higher because a MySQL single table supports a maximum of 64 TB.

PolarDB supports many large table production instances. Large tables of 10+ TB are actually very common. I have also seen actual sharing from many Database Administrator (DBA) friends at major companies previously. In fact, if the business table schema design is reasonable, large tables are fully supported. There is no need to follow legacy guidelines that no longer apply.

0 1 0
Share on

ApsaraDB

594 posts | 182 followers

You may also like

Comments

ApsaraDB

594 posts | 182 followers

Related Products