This topic describes how PolarDB-X performs when it processes ultra-large transactions.

For distributed databases, an ultra-large transaction is a transaction that meets one or more of the following requirements:

  1. Data in multiple shards is updated in the transaction.
  2. A large amount of data is updated in the transaction.
  3. A large number of SQL statements are executed in the transaction.

This topic describes multiple test scenarios, each of which meets one or more of the preceding requirements, to evaluate the performance of PolarDB-X for ultra-large transactions.

Instance specification for the tests

The following table describes the PolarDB-X instance that is used for the tests in this topic:

PolarDB-X version: polarx-kernel_5.4.11-16301083_xcluster-20210805
Node specification: 4 CPU cores, 16 GB of memory
Number of nodes: 4

Create an Elastic Compute Service (ECS) instance that has 4 CPU cores and 16 GB of memory to perform the tests. The ECS instance and the PolarDB-X instance are located in the same CIDR block and connected by the same vSwitch.

Run the following command to create a table named tb:

CREATE TABLE `tb` (
    `id` bigint(20) NOT NULL,
    `c` longblob/char(255)/char(1),
    PRIMARY KEY (`id`)
);

This table contains two columns: the id column and the c column. The id column is the primary key of the tb table and is of the bigint type. The data type of the c column is longblob, char(255), or char(1), depending on the scenario. The table is split into the following numbers of shards in different scenarios:

  • One shard: The table contains only one shard. In this case, PolarDB-X optimizes the commit of the transaction into a one-phase commit.
  • Eight shards: The table is split into eight shards based on the id column, with two shards on each data node.
  • Sixteen shards: The table is split into sixteen shards based on the id column, with four shards on each data node.
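As an illustration of how the shard count determines which commits can take the one-phase path, the following Python sketch maps id values to shards with a simple modulo hash. The modulo function is an assumption for illustration only; the actual hash algorithm that PolarDB-X uses may differ.

```python
# Illustrative sketch: route rows to shards by hashing the id column.
# The modulo hash is an assumption; PolarDB-X's real hash may differ.

def shard_for_id(row_id: int, shard_count: int) -> int:
    """Map a primary-key value to one of `shard_count` shards."""
    return row_id % shard_count

def shards_touched(ids, shard_count):
    """Return the set of shards a transaction touches for the given ids."""
    return {shard_for_id(i, shard_count) for i in ids}

# A transaction that updates ids 0..7 on an 8-shard table touches every
# shard and therefore needs a distributed (two-phase) commit; a transaction
# that touches a single shard can use the one-phase commit optimization.
print(shards_touched(range(8), 8))  # all eight shards
print(shards_touched([0], 8))       # a single shard
```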

Scenario 1

In this scenario, a moderate number of SQL statements that involve a large amount of data are executed. The total amount of data written to the table is large.

Number of executed SQL statements: 2,048
Amount of data contained in each SQL statement: 256 KB to 8 MB
Amount of updated data: 512 MB to 16 GB
Number of updated data pieces: 2,048
Procedure

All SQL statements that are executed in this scenario are in the following format:

INSERT INTO `tb` VALUES (id, c)

The data type of the c column is longblob. The data pieces in the column range from 256 KB to 8 MB in size. Therefore, when 2,048 SQL statements are executed in the transaction, 512 MB to 16 GB of longblob data in total is written to the table. The actual amount of data written to the table is larger than the amount of the written longblob data because data in the id column, which is of the bigint type, is also written to the table.

Notice The size of a query request cannot exceed the limit of a compute node, which is specified by the MAX_ALLOWED_PACKET parameter. Therefore, the amount of data contained in each SQL statement cannot exceed the value of MAX_ALLOWED_PACKET. The default value of this parameter is 16 MB, and you can modify the value by using Parameter settings. If you connect to a PolarDB-X database by using JDBC and send a packet whose size in bytes is larger than the value of MAX_ALLOWED_PACKET, the "CommunicationsException: Communications link failure" error is reported.
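The packet limit can be checked on the client side before a statement is sent. The following Python sketch is a minimal illustration; the 1 KB overhead constant is an assumed placeholder for the non-payload bytes of the statement text, not a documented value.

```python
# Sketch of a client-side check that keeps each single-row INSERT below the
# MAX_ALLOWED_PACKET limit (16 MB by default). STATEMENT_OVERHEAD is an
# assumed allowance for the SQL text, the id value, and escaping.

MAX_ALLOWED_PACKET = 16 * 1024 * 1024  # default compute-node limit, in bytes
STATEMENT_OVERHEAD = 1024              # assumed non-payload bytes per statement

def fits_in_one_packet(blob_size: int) -> bool:
    """Check whether an INSERT carrying `blob_size` payload bytes is safe."""
    return blob_size + STATEMENT_OVERHEAD <= MAX_ALLOWED_PACKET

print(fits_in_one_packet(8 * 1024 * 1024))   # 8 MB payload fits
print(fits_in_one_packet(16 * 1024 * 1024))  # a full 16 MB payload does not
```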

For a non-sharded database, if the size of each data piece written to the c column is 1 MB, the following error is reported when the total amount of written data in the transaction exceeds 2 GB:

ERR-CODE: [TDDL-4614][ERR_EXECUTE_ON_MYSQL] Error occurs when execute on GROUP ...: Multi-statement transaction required more than 'max_binlog_cache_size' bytes of storage; increase this mysqld variable and try again 

In a transaction, the amount of data written to binlogs cannot exceed the limit of the data node, which is specified by the MAX_BINLOG_CACHE_SIZE parameter. For distributed transactions, each branch transaction runs in one shard, and the same per-shard limit applies. In this scenario, if all 2,048 SQL statements are INSERT statements, the amount of data written to each shard cannot exceed 2 GB. Therefore, the total amount of data written to the database in a transaction cannot exceed (number of shards x 2 GB). For example, for a database that has eight shards, the preceding error is reported if the total amount of data written in the transaction exceeds 16 GB. Note that this is only an upper bound: a transaction can still fail even if it writes less than (number of shards x 2 GB) of data, because unevenly distributed data can cause a single shard to exceed its 2 GB limit.

Notice The default value of the MAX_BINLOG_CACHE_SIZE parameter is 2147483648 (2 GB) and cannot be modified. Even if it could be modified, MySQL limits the value to at most 4 GB.
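The per-shard limit described above can be turned into a quick capacity estimate. The following Python sketch computes the upper bound for INSERT-only transactions; it is only an upper bound, because unevenly distributed data can exhaust a single shard's binlog cache first.

```python
# Back-of-the-envelope bound for INSERT-only transactions: each shard can
# absorb at most MAX_BINLOG_CACHE_SIZE bytes of binlog data, so the whole
# transaction is bounded by shard_count * MAX_BINLOG_CACHE_SIZE.

MAX_BINLOG_CACHE_SIZE = 2 * 1024 ** 3  # default 2 GB, per-shard limit

def insert_upper_bound(shard_count: int) -> int:
    """Max bytes an INSERT-only transaction can write across all shards."""
    return shard_count * MAX_BINLOG_CACHE_SIZE

print(insert_upper_bound(1) / 1024 ** 3)  # 2.0 GB for a non-sharded table
print(insert_upper_bound(8) / 1024 ** 3)  # 16.0 GB for eight shards
```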
Test conclusions
  • Regardless of whether a transaction is started, the amount of data contained in each SQL statement cannot exceed the value of MAX_ALLOWED_PACKET.
  • For each transaction, the amount of data written to binlogs when SQL statements are executed on each shard cannot exceed the limit of the data node. For example, when an INSERT statement is executed to write data to a shard, the data written to the shard is also recorded in binlogs. Therefore, to prevent the amount of data recorded in binlogs from exceeding the value of MAX_BINLOG_CACHE_SIZE, the data that is written to each shard in a transaction cannot exceed this value.
  • If you want to operate on a larger amount of data in a transaction, such as inserting more data, split the table into more shards.

Scenario 2

In this scenario, a small number of SQL statements are executed. Each statement contains a small amount of data, but the transaction updates a large amount of data in total.

Number of executed SQL statements: 1
Amount of data contained in each SQL statement: About 256 KB
Amount of updated data: 256 MB to 8 GB
Number of updated data pieces: 1,048,576 (2^20) to 33,554,432 (2^25)
Procedure

In this test, the data type of the c column is char(255). Import 33,554,432 (2^25) pieces of data into the table. Values in the id column are sequential numbers from 0 to 2^25 - 1. Each value in the c column is a string that consists of 255 consecutive "a" characters. Then, start a transaction and execute the following SQL statement:

UPDATE `tb` SET c = <a string of 255 consecutive "B" characters> WHERE id < ?

Assume that the ? in the statement indicates a value of x. After the SQL statement is executed, x pieces of data are updated, and 256 bytes of data are updated for each piece. Therefore, the total amount of updated data is x * 2^8 bytes. In this test, the value of x ranges from 1,048,576 (2^20) to 33,554,432 (2^25), and the total amount of updated data ranges from 256 MB to 8 GB.

For a non-sharded table, if the value of x is 4,194,304 (2^22), the following error is reported when the total amount of updated data in the transaction exceeds about 1 GB:

ERR-CODE: [TDDL-4614][ERR_EXECUTE_ON_MYSQL] Error occurs when execute on GROUP ...: Multi-statement transaction required more than 'max_binlog_cache_size' bytes of storage; increase this mysqld variable and try again 

When an UPDATE statement is executed to update data in a shard, both the data before the update and the data after the update are recorded in binlogs. Therefore, if you execute an UPDATE statement to update 256 bytes of data, at least 512 bytes of data are recorded in binlogs. For a non-sharded database, when 1 GB of data is updated, more than 2 GB of data is recorded in binlogs, and the preceding error is reported. Similarly, for a distributed transaction that contains only UPDATE statements, the amount of data updated in each shard cannot exceed 1 GB. For example, if the table is split into eight shards in this scenario, the preceding error is reported when 8 GB of data is updated in a transaction.
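The halving effect of UPDATE statements on the usable binlog cache can be sketched in the same way as in scenario 1. The bound below assumes that updated rows are evenly distributed across shards.

```python
# UPDATE statements log both the before image and the after image of each
# row, so the binlog cost is roughly double the amount of updated data.
# The usable capacity per shard is therefore about half the cache limit.

MAX_BINLOG_CACHE_SIZE = 2 * 1024 ** 3  # default 2 GB, per-shard limit

def update_upper_bound(shard_count: int) -> int:
    """Max bytes an UPDATE-only transaction can change across all shards."""
    return shard_count * MAX_BINLOG_CACHE_SIZE // 2

print(update_upper_bound(1) / 1024 ** 3)  # 1.0 GB for a non-sharded table
print(update_upper_bound(8) / 1024 ** 3)  # 8.0 GB for eight shards
```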

Test conclusions
  • Similar to the conclusion of scenario 1, for each transaction, the amount of data written to binlogs on each shard cannot exceed the limit of the data node, which is specified by the MAX_BINLOG_CACHE_SIZE parameter. When a transaction contains only UPDATE statements, both the data before and after each update is recorded in binlogs. In this case, the amount of data updated in each shard cannot exceed half of the value of the MAX_BINLOG_CACHE_SIZE parameter. In this scenario, if the table is not sharded, the amount of data updated in a transaction cannot exceed 1 GB. If the table is split into eight shards, the amount of data updated in a transaction cannot exceed 8 GB.
  • If you want to operate on a larger amount of data in a transaction, such as updating more data, split the table into more shards.

Scenario 3

In this scenario, a large number of SQL statements are executed. Each statement contains a small amount of data, and the transaction updates only a small amount of data in total.

Number of executed SQL statements: 64,000 to 1,280,000
Amount of data contained in each SQL statement: Several bytes
Amount of updated data: 32 bytes
Number of updated data pieces: 32
Procedure

In this test, the data type of the c column is char(1). Insert 32 pieces of data into the table. Values in the id column are sequential numbers from 0 to 31. Each value in the c column is the character "a". Then, start a transaction and run the following 64 statements x times. Each transaction executes 64x SQL statements in total.

UPDATE `tb` SET c = "b" where id = 0;
UPDATE `tb` SET c = "a" where id = 0;
UPDATE `tb` SET c = "b" where id = 1;
UPDATE `tb` SET c = "a" where id = 1;
...
UPDATE `tb` SET c = "b" where id = 31;
UPDATE `tb` SET c = "a" where id = 31;
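The 64 statements above can be generated programmatically. The following Python sketch yields the statement text for x rounds, which makes the 64x statement count explicit:

```python
# Generate the scenario-3 workload: each round toggles every row's c value
# to "b" and back to "a", which is 2 statements per row * 32 rows = 64
# statements per round, so x rounds produce 64x statements.

def scenario3_statements(rounds: int, row_count: int = 32):
    """Yield the UPDATE statements for `rounds` passes over `row_count` rows."""
    for _ in range(rounds):
        for row_id in range(row_count):
            yield f'UPDATE `tb` SET c = "b" WHERE id = {row_id};'
            yield f'UPDATE `tb` SET c = "a" WHERE id = {row_id};'

# 1,000 rounds produce 64,000 statements, the low end of this scenario;
# 20,000 rounds produce the 1,280,000-statement high end.
print(sum(1 for _ in scenario3_statements(1000)))  # 64000
```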
Test conclusions

In this scenario, the performance of the PolarDB-X instance is not bottlenecked even when 1.28 million SQL statements are executed in a transaction, regardless of the number of shards into which the table is split. Therefore, in similar scenarios, the performance of a PolarDB-X instance is more likely to be bottlenecked by the amount of data that is written, deleted, or updated than by the number of executed statements.

Scenario 4

This scenario is designed to test how the number of shards affects transaction performance.

Number of executed SQL statements: 1
Amount of data contained in each SQL statement: Several bytes
Amount of updated data: 8 KB to 8 MB
Number of updated data pieces: 8,192 (8K) to 8,388,608 (8M)
Number of shards: 1 to 2,048
Procedure

In this scenario, the data type of the c column is char(1). Insert x pieces of data into the table. Values in the id column are sequential numbers from 0 to x - 1. Each value in the c column is the character "a". Then, start a transaction and execute the following statement:

UPDATE `tb` SET c = "b" where id < x

Each time this SQL statement is executed in a transaction, x pieces of data are updated, and the total amount of updated data is x bytes because each piece is one byte.

Test conclusions
  • When the amount of updated data is 8 KB, the execution time of SQL statements is short because the amount of data is small. Therefore, the transaction commit time has a great impact on the transaction running time. In this case, the transaction running time increases as the number of shards increases. However, when the number of shards increases within a range from 1 to 64, the transaction running time does not increase at a linear rate with the number of shards. This is because when the number of shards increases within a small range, in addition to the transaction commit time, the execution time of DML statements still has an impact on the transaction running time. When the number of shards increases within a range from 1,024 to 2,048, the transaction running time is determined by the transaction commit time. In this case, compared with the transaction commit time, the execution time of DML statements has little impact on the transaction running time. Therefore, the transaction running time increases at a linear rate with the number of shards.
  • When the amount of updated data is 8 MB, the transaction running time does not increase with the number of shards. Instead, the transaction running time is determined by the execution time of DML statements, and the transaction commit time can be ignored.
  • We recommend that you split the table into more shards when you need to operate on a larger amount of data in a transaction. However, the transaction commit time increases with the number of shards. In transactions that operate on a small amount of data across a large number of shards, the transaction running time is determined by the transaction commit time. Therefore, to achieve better transaction performance, you can refer to the test conclusions provided in this topic and split tables into an appropriate number of shards based on your actual scenario.

Scenario 5

This scenario is designed to test how the workloads of compute nodes affect transaction performance.

Procedure

In this scenario, the logical table tb is split into 16 shards. In the table, the data type of the c column is char(255). Insert 67,108,864 (2^26) pieces of data into the table. Values in the id column are sequential numbers from 0 to 2^26 - 1, and the values in the c column are the same as those in the id column. Then, create another table named tb2 whose schema and data are the same as those of the table tb. The size of the table tb2 is about 16 GB, which is the same as that of the table tb. Create a temporary table named tmp whose schema is the same as that of the table tb. Then, start a transaction and execute the following statement:

INSERT INTO tmp
  SELECT tb.id, tb.c
  FROM tb, tb2
  WHERE tb.c = tb2.c AND tb.id > x AND tb2.id > x

You can execute the EXPLAIN statement to view the execution plan of this SQL statement:

LogicalInsert(table="tmp", columns=RecordType(BIGINT id, CHAR c))
  Project(id="id", c="c")
    HashJoin(condition="c = c", type="inner")
      Gather(concurrent=true)
        LogicalView(tables="[000000-000015].tb_VjuI", shardCount=16, sql="SELECT `id`, `c` FROM `tb` AS `tb` WHERE (`id` > ?)")
      Gather(concurrent=true)
        LogicalView(tables="[000000-000015].tb2_IfrZ", shardCount=16, sql="SELECT `c` FROM `tb2` AS `tb2` WHERE (`id` > ?)")

According to the execution plan, this statement selects the data whose id is larger than x from tb and tb2 and performs a hash join on the selected data on the compute nodes. You can configure the x parameter to determine the amount of data that is selected and joined on the compute nodes. If x is set to 0, almost all data in tb and tb2 is selected and joined. In this case, the compute nodes need to process 32 GB of data in total. For your reference, the memory of each compute node is 16 GB.
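A back-of-the-envelope estimate shows why x = 0 stresses the compute nodes. The following Python sketch assumes roughly 256 bytes per row (a bigint id plus a char(255) value); the exact per-row size, including storage overhead, is larger in practice.

```python
# Estimate the data volume the compute nodes must process for the join
# above. Both tb and tb2 hold 2^26 rows with ids 0..2^26-1, and the
# predicate id > x qualifies the rows with ids x+1..2^26-1 in each table.

ROW_BYTES = 256        # assumed bytes per row (bigint id + char(255) c)
TOTAL_ROWS = 2 ** 26   # rows in each of tb and tb2

def join_input_bytes(x: int) -> int:
    """Bytes scanned from tb and tb2 combined for the predicate id > x."""
    qualifying = max(TOTAL_ROWS - 1 - x, 0)
    return 2 * qualifying * ROW_BYTES

# With x = 0, nearly all rows of both tables qualify: close to 32 GB in
# total, which exceeds the 16 GB of memory on a single compute node.
print(join_input_bytes(0) / 1024 ** 3)
```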

Test conclusions

PolarDB-X delivers stable performance even when a single transaction that involves a large amount of data places high workloads on compute nodes, such as a transaction that involves complex join operations that must be performed on the compute nodes. Specifically, the transaction running time increases linearly with the amount of data to process, and no errors occur.

Summary

  • Regardless of whether a transaction is started, the amount of data contained in each SQL statement cannot exceed the value of the MAX_ALLOWED_PACKET parameter.
  • For each transaction, the amount of data written to binlogs when SQL statements are executed on each shard cannot exceed the limit of the data node, which is specified by the MAX_BINLOG_CACHE_SIZE parameter.
    • By default, the value of this parameter is 2 GB, and the maximum value is 4 GB due to the limits of MySQL. For example, for a transaction that contains only INSERT statements, the total amount of data written to the database cannot exceed (number of shards x value of MAX_BINLOG_CACHE_SIZE), because binlogs record each inserted value.
    • If you want to operate on a larger amount of data in a transaction, such as inserting, deleting, or updating more data, split the table into more shards.
  • When the amount of updated data is small, the number of shards has a great impact on the transaction running time. Otherwise, the number of shards has little impact on the transaction running time. Therefore, to achieve better transaction performance, you can refer to the test conclusions provided in this topic and split tables into an appropriate number of shards based on your actual scenario.
  • PolarDB-X delivers stable performance even when a single transaction that involves a large amount of data places high workloads on compute nodes.