×
Community Blog PolarDB-X Best Practice Series (6): How to Write Data in Batches

PolarDB-X Best Practice Series (6): How to Write Data in Batches

This article explains the importance and methods of performing batch operations.

Batch operations (including Batch Insert, Batch Update, and Batch Delete) are common techniques in databases with several main functions:

  1. They can effectively reduce the processing time of a single record (usually from milliseconds to 0.1 milliseconds). In applications that cannot improve the throughput by adding more concurrency (for example, some businesses require single-thread consumption of MQ), Batch operations can effectively improve the throughput.
  2. They can merge the commit of the transaction logs to reduce the load on the database.

However, for various reasons (mainly issues with the official MySQL driver), when using PolarDB-X and MySQL, there is often a situation where "I thought it was a Batch operation, but it still executed one by one."

This article will introduce how to achieve Batch operations in PolarDB-X (also applicable to MySQL).

Multi-statements and Batch

In PolarDB-X (MySQL), there are two ways to commit a bunch of data to the database simultaneously. The one written in multiple SQL statements is called "multi-statement," and the one written in a single SQL statement is called "Batch".

For example, this is a multi-statement (essentially multiple SQL statements):

INSERT INTO t1 (c1,c2) VALUES (1,1);INSERT INTO t1 (c1,c2) VALUES (2,2);

This is a Batch (essentially still a single SQL statement):

INSERT INTO t1 (c1,c2) VALUES (1,1),(2,2);

The differences are as follows:

  1. Each clause of a multi-statement must be executed in series (the standard is defined in this way, and dependencies are allowed between each clause). Only when one clause is executed can the next one be executed. Therefore, multiple statements can only save the network RTT between the client and the database (just one RTT is needed) but cannot save some overheads within the databases (for example, each clause of the write operations needs at least one RTT between the primary and secondary databases).
  2. Batch operations allow databases to write records in parallel (for PolarDB-X, each record is written in parallel; for MySQL, each is written serially). The entire Batch has only one RTT between the primary and secondary databases (between the leader and follower in PolarDB-X).

1
Typical Network Latency Between Components of PolarDB-X in Three Zones

Take a typical PolarDB-X deployed in three zones as an example. When inserting 1,000 records:

  1. If using multi-statements, only the network latency is considered. The theoretical time consumption is at least (0.1ms+(0.1ms+0.8ms+0.8ms+0.1ms)*1000+0.1ms)=1800.2ms.
  2. If using Batch, only the network latency is considered. The theoretical time consumption is at least (0.1ms+(0.1ms+0.8ms+0.8ms+0.1ms)+0.1ms)=2ms (of course, it will definitely be more than 2ms in practice, but will be much less than 1800ms).

Both modes seem to commit data to the database at once, but in fact, the effects are very different. We should use Batch operations as much as possible instead of multi-statements.

However, MySQL drivers (such as mysql-connector-java), ORM frameworks (such as MyBatis), and similar tools can easily, if not carefully managed, convert what you believe to be Batch operations into multi-statement executions (or even single-statement executions). Below, we will introduce the correct methods for performing Batch operations.

Batch Insert

The Batch Insert operation is straightforward and has only one format:

INSERT INTO t1 (c1,c2) VALUES (1,1),(2,2);

There are several ways to implement it in Java:

First, use the addBatch() and executeBatch() interfaces of PreparedStatement (note, not addBatch(String sql), which is a multi-statement):

Connection conn = xxx;
PreparedStatement pstmt = conn.prepareStatement("INSERT INTO t1 (c1,c2) VALUES (?,?)");

pstmt.setInt(1, 1);
pstmt.setInt(2, 1);
pstmt.addBatch();

pstmt.setInt(1, 2);
pstmt.setInt(2, 2);
pstmt.addBatch();

pstmt.executeBatch();

Does it seem simple? However, this method relies on mysql-connector-java to concatenate the SQL and parameters into a Batch format. This conversion only occurs if rewriteBatchedStatements=true (false by default) is set in the JDBC URL. Otherwise, mysql-connector-java will execute the two INSERT SQL statements as multi-statements.

Similarly, if frameworks such as MyBatis are used, and the SQL template defined in the mapper is in a similar format. In that case, this parameter also controls whether it executes as a Batch.

To avoid pitfalls, the second method is more recommended.

Second, applications (including MyBatis) directly construct the statement in Batch format. This way, the statement received by mysql-connector-java is already in Batch format, and it does not rely on the value of rewriteBatchedStatements. Take MyBatis as an example:

<insert id="batchInsert">
  INSERT INTO t1 (c1, c2) VALUES
  <foreach collection="list" item="item" separator=",">
     (#{item.c1}, #{item.c2})
  </foreach>
</insert>

Batch Update

The most common error is to use the addBatch() and executeBatch() interfaces of PreparedStatement, for example:

Connection conn = xxx;
PreparedStatement pstmt = conn.prepareStatement("UPDATE t1 SET c1 = ? WHERE id = ?");

pstmt.setInt(1, 1);
pstmt.setInt(2, 1);
pstmt.addBatch();

pstmt.setInt(1, 2);
pstmt.setInt(2, 2);
pstmt.addBatch();

pstmt.executeBatch();

mysql-connector-java does not support rewriting for UPDATE statements. Therefore, this form will definitely be executed as multi-statements (or even single statements: in the JDBC URL, if allowMultiQueries=true, it will be a multi-statement; if allowMultiQueries=false (false by default), it will be a single statement), and it is not a Batch operation.

Batch Update has two correct forms:

First, use CASE WHEN statements, for example:

UPDATE t1
  SET
  c1 = CASE id
    WHEN 1 THEN 1
    WHEN 2 THEN 2
  END,
  c2 = CASE id
    WHEN 1 THEN 1
    WHEN 2 THEN 2
  END
WHERE id IN (1,2)

Use MyBatis, for example:

<update id="batchUpdate">
  UPDATE t1
  SET
  c1 = CASE id
  <foreach collection="list" item="item" separator=" ">
    WHEN #{item.id} THEN #{item.c1}
  </foreach>
  END,
  c2 = CASE id
  <foreach collection="list" item="item" separator=" ">
    WHEN #{item.id} THEN #{item.c2}
  </foreach>
  END
  WHERE id IN
  <foreach collection="list" item="item" separator="," open="(" close=")">
    #{item.id}
  </foreach>
</update>

The problems with this way are:

o The statement becomes lengthy and unattractive.

o The IN operation that uses the primary key in the WHERE clause can easily lead to full table scans.

Second, use INSERT ... ON DUPLICATE UPDATE (also known as upsert), for example:

INSERT INTO t1 (id,c1) VALUES (1,1), (2,2) ON DUPLICATE KEY UPDATE c1 = VALUES(c1)

Use MyBatis, for example:

<update id="batchUpdate">
  INSERT INTO t1 (c1, c2) VALUES
  <foreach collection="list" item="item" separator=",">
     (#{item.c1}, #{item.c2})
  </foreach>
  ON DUPLICATE KEY UPDATE c1 = VALUES(c1)
</update>

This form is similar to Batch Insert.

Note:

o Columns that are not being updated do not need to be included in the VALUES or UPDATE clauses (even if they are not allowed to be NULL).

o For partitioned tables, include the partition key in the VALUES clause (though it is not required in the UPDATE clause).

o This method has certain limitations. It requires that the updated data must exist (if it does not exist, it will insert a new record, which may not be consistent with the business semantics).

o This form can also be implemented using rewriteBatchedStatements=true in Batch Insert, but it is not recommended.

This form of SQL is more refined and efficient.

Batch Delete

Batch Delete is relatively straightforward, with only one method:

DELETE a FROM t1 a FORCE INDEX (PRIMARY) WHERE id IN (1,2)

For partitioned tables, it is recommended to include the partition key:

DELETE a FROM t1 a FORCE INDEX (PRIMARY) WHERE (id,c2) IN ((1,1),(2,2))

Note the following points:

  1. Control the number of items in the IN clause. Due to the full table scan caused by too many items, it is easy to lock the table. It is recommended to keep it under 500.
  2. Add FORCE INDEX (PRIMARY).
  3. To enable the second point, an alias must be given to t1 due to a peculiar limitation in MySQL:

• "single-table delete" does not support FORCE INDEX; "mult-table delete" supports FORCE INDEX.

• As long as you add an alias to the table, the DELETE operation belongs to the "multi-table delete".

How to determine whether Batch execution is used

No matter which form is used, it is recommended to determine whether Batch execution is used through the following methods:

  1. View SQL audit logs.
  2. If SQL audit is not enabled, you can check the PROCESSLIST during business traffic (in PolarDB-X, you can execute: SHOW FULL PROCESSLIST WHERE INFO IS NOT NULL) to observe if the executed SQL is in batch form.
  3. If you have experience, you can judge by the response time. If a batch operation of 1,000 records is executed for several seconds, it is likely that the Batch is not executed.
0 1 0
Share on

ApsaraDB

501 posts | 149 followers

You may also like

Comments