What do I do if I fail to obtain the MDL to execute a DDL statement - PolarDB

This topic describes how to troubleshoot the error message "ERROR HY000: Fail to get MDL on replica during DDL synchronize" when you execute a DDL statement in PolarDB for MySQL.

Problem description

The following error message is returned when I execute a DDL statement in PolarDB for MySQL:

ERROR HY000: Fail to get MDL on replica during DDL synchronize

Cause

A read-only node of the PolarDB for MySQL cluster has ongoing queries or uncommitted transactions.

Solution

To resolve this issue, PolarDB for MySQL 8.0.1 and 8.0.2 support the preemptible DDL feature. For more information about the feature, see Preemptible DDL.
Commit (commit) or roll back (rollback) the transactions.
You can also set the polar_slave_work_on_nonblock_mdl_mode parameter to ON on the read-only node to prevent long-running transactions that are not committed from blocking DDL operations. For more information, see Prevent long-running transactions on read-only nodes from blocking DDL operations.
Execute the DDL statement.
- If the DDL statement is executed, the task ends.
- If the DDL statement fails to be executed, proceed to the next step.
If the Polar performance schema feature is enabled for your cluster, you can use the feature to query the MDL status in a specified table and then kill the threads related to uncommitted transactions on the read-only nodes. Perform the following operations:
Note
- You can execute the following SQL statement to check whether the Polar performance schema feature is enabled for the cluster:
```
SHOW VARIABLES LIKE 'polar_performance_schema';
```
- If the uncommitted transaction on the read-only node is very important, we recommend that you do not directly kill the transaction. Wait until the transaction is complete before you execute the DDL statement.
- Enable the Polar performance schema feature. For more information about how to enable the Polar performance schema feature, see Procedure.
- If the threads cannot be killed, the thread IDs do not exist, or the following error occurs. Submit a ticket to contact technical support.
```
ERROR 1094 (HY000): Unknown thread id: xxx
```
Use the hint to specify a read-only node on which you can execute the following SQL statement to query the MDL status on the specified table.
```
/*force_node='pi-bp10k7631d6k3****'*/ SELECT t.PROCESSLIST_ID, m.OBJECT_TYPE, m.OBJECT_SCHEMA, m.OBJECT_NAME, m.LOCK_TYPE, m.LOCK_DURATION, m.LOCK_STATUS FROM performance_schema.metadata_locks m LEFT JOIN performance_schema.threads t ON m.owner_thread_id=t.thread_id;
```
Sample result:
The result indicates that the test/t table holds the lock with LOCK_TYPE set to SHARED_READ because a large query or uncommitted transactions occur on the table. The test/t table also holds a lock in the PENGING state with LOCK_TYPE set to EXCLUSIVE. Use the hint to specify a read-only node on which you can execute the /*force_node='pi-bp10k7631d6k3****'*/ kill 14 statement to kill the thread.
If the Polar performance schema feature is disabled for the cluster, you can query the MDL status on the information_schema.innodb_trx table to determine whether any transactions are not committed. If yes, kill the threads. Perform the following operations:
Use the hint to specify a read-only node on which you can execute the following SQL statement to query the MDL status on the information_schema.innodb_trx table.
```
/*force_node='pi-bp10k7631d6k3****'*/ SELECT * FROM information_schema.innodb_trx\G
```
Sample result for the failed DDL execution caused by large queries:
The result indicates that the t1 table holds the MDL because a large query occurs on the t1 table. In this case, use the hint to specify a read-only node on which you can execute the /*force_node='pi-bp10k7631d6k3****'*/ kill 6 statement to kill the trx_mysql_thread_id thread.
Sample result for the failed DDL execution caused by long-running transactions:
The result indicates that transaction 14 is not committed. However, it cannot be determined that the current transaction causes the table to hold the MDL because the trx_query field is empty. You can perform operations based on the trx_started field value. If the time in the trx_started field is significantly different from the current time, it is very likely that transaction 14 causes the table to hold the MDL. In this case, use the hint to specify a read-only node on which you can execute the /*force_node='pi-bp10k7631d6k3****'*/ kill 14 statement to kill the trx_mysql_thread_id thread.

Additional information

The execution process of a DDL statement in PolarDB:

If you want to change the schema of a table in the execution process of a DDL statement, the primary node obtains the MDL and then writes a redo log.
When a read-only node parses the redo log, it tries to obtain the MDL in the same table. The following situations may occur:
- If the read-only node obtains the MDL, the next step starts.
- Otherwise, the read-only node returns the error to the primary node.
The primary node waits until all read-only nodes are synchronized to the latest log sequence number (LSN). Within a period of time, the primary node determines whether all read-only nodes have parsed the redo log and obtained the MDL. The following situations may occur:
- All read-only nodes are synchronized to the latest LSN, the DDL statement is executed as expected.
- Some read-only nodes are not synchronized to the latest LSN. The DDL statement is rolled back and an error is reported. Two types of errors may occur: one is due to synchronization timeout, and the other is due to failure to obtain the MDL. The main cause for failure to obtain the MDL is that ongoing queries or uncommitted transactions occur on the read-only node.