[MySQL]A New Feature of MySQL • 8.0.0: Persistent Auto Increment Value
Created#More Posted time:Oct 14, 2016 14:18 PM
This is one of the ancient bugs MySQL8.0 has fixed. The bug was found by the CEO of Percona in 2003 (At that time Percona didn’t exist),bug#199. Seeing only the number of this bug, we can feel a strong air of “ancient times” blowing by.
The nature of the problem is the very way InnoDB initializes AUTO_INCREMENT. Each time when it reboots, the maximum auto-increment in the table is always calculated as the maximum value, from which the next allocation starts. This means that if many records at the right leaf node of the btree are deleted, the auto-increment could be reused after InnoDB reboots. In many scenarios it could cause problems, including but not limited to: active-standby switch, historical data migration, etc. From a bunch of feedback below bug#199, we can see a lot of complaints from our peers.
From a very early version onwards, Alibaba’s MySQL solved this problem. Its main idea is to take an unused long integer field from the btree root page (PAGE_MAX_TRX_ID of the page header) and then to make the value of the auto-increment in the current table persistent (Fortunately InnoDB does not support multiple auto-increment columns at the moment). Since the root page of general tables is resident in memory, the effect of pure memory operations on performance is almost negligible.
The official fix is more elegant. Instead of changing any existing storage, it fixes the bug through redo log. This patch is realized on a basis of WL#7816 architecture. If we want to understand it, then we have to find out what changes WL#7816 has made
According to the description of the Worklog, when InnoDB finds a damaged index, it will mark it with a “corruption” flag and make it persistent in the internal data dictionary as well as in the persistent storage. But while a new global data dictionary (DD) is placed in the upper layer of the storage engine, any attempt to update the data dictionary from the underlying engine may lead to a deadlock. It also looks strange to transfer the “corruption” information up layer by layer.
To solve this problem, InnoDB manages the persistence of “corruption” mark itself within the engine, using a private engine system table and a specific redo log. The general idea is,
1. When an index corruption is found, a redo log is written in without updating the data dictionary.
2. A private InnoDB engine system table called DD Buffer Table is introduced. Each time before checkpoint, the index corruption bit will be saved in it.
3. At the time of crash recovery, the index corruption bit is read simultaneously from both the redo log and the DD Buffer Table; then the result is combined and both the table and the index object in the memory are marked.
In this worklog, the persistence of the corruption bit is to be addressed. But the framework of implementation is also applicable to other purposes, such as update_time, auto_inc, count(*), etc. So the code has undergone a generic abstraction.
The Initialization of Persister
Currently, Persister has only 2 types. One is used for the persistence of the corruption bit, the other for the persistence of the auto increment column. The corresponding class is:
Persister corresponds to dict_persist_t::persisters, the global object. We can find the corresponding Persister through persistent_type_t. Currently, we only have PM_INDEX_CORRUPTED and PM_TABLE_AUTO_INC. However, the notes suggest that more extensions will be made in the future.
At the startup stage, Persister calls the function dict_persist_initfor initialization.
A new system table
The new system table is SYS_TABLE_INFO_BUFFER, whose corresponding management class is DDTableBuffer. The pointer is stored in dict_persist->table_buffer.
Table ID is DICT_TBL_BUFFER_ID, whose value is 0xFFFFFFFFFF000000ULL. ROOT PAGE is the 8th page (FSP_TBL_BUFFER_TREE_ROOT_PAGE_NO) of ibdata.
The system table contains two columns: TABLE_ID and METADATA of BLOB type (ref DDTableBuffer::init). The METADATA column contains all the metadata that needs to be persisted.
When the index is found corrupted, dict_set_corrupted is called to mark the index corruption and the log is written(Persister::write_log):
• The written contents include space ID and index ID.
• The log format is:
| 1byte: type = MLOG_TABLE_DYNAMIC_META
| Table ID
| 1byte: SUB-TYPE: PM_INDEX_CORRUPTED
| 1byte: Num: Number of corrupted indexs
| 4bytes: space id
| 8bytes: index id
## The structure is a bit strange. In theory, the same table index should exist in the same space. Here only one space ID record is needed.
Once the log is written, a log flush will occur, and the log will be persisted into the disk. As index corruption is a low probability event, it will not cause performance problems.
Then the table state is set dirty (dict_table_mark_dirty). Here we define three states for the table:
METADATA_CLEAN: In the DDTableBuffer table there is no cached data.
METADATA_BUFFERED: In the DDTableBuffer system table, there is at least one column of data, which is to be written back into the DD in the future.
METADATA_DIRTY: Some persistent metadata is modified in memory and needs to be written back into the DDTableBuffer.
If the current table state is METADATA_CLEAN, the object needs to be added into the global linked table dict_persist_t::dirty_dict_tables. The linked table used for maintenance state is the table object of METADATA_DIRTYor METADATA_BUFFERED.
After dirty_status calling dict_table_mark_dirty, it is set to METADATA_DIRTY and will be in the dict_persist_t::dirty_dict_tables linked table.
When the persistence of the AUTOINC column is executed at the time of insertion or update, note that the persistence of the temporary table is not needed.
Before the clustered index record is inserted (row_ins_clust_index_entry_low), the counter will be taken out of the entry, and then recorded in the log.
When the record is updated (row_upd_clust_rec), if there is an autoinc column in the table and it has been updated to a larger value (row_upd_check_autoinc_counter), an attempt to record it in the log will occur.
The log writing function of the AUTOINC persistence is AutoIncLogMtr::log. When the new counter is greater than the persisted dict_table_t::autoinc_persisted, update autoinc_persisted to a new counter and set the dirty_status of the table dirty (If necessary).
The recorded log format is
| 1byte: type = MLOG_TABLE_DYNAMIC_META
| Table ID
| 1byte: Sub-type: PM_TABLE_AUTO_INC
| Autoinc Counter
Note that after you write the data to the log, no flush log operation is performed for performance reasons. Therefore, if it crashes, we cannot guarantee that the assigned autoinc will not be reused. But from the user's perspective (transaction level), the autoinc will not be reused.
Write back the DDTableBuffer
In several cases, the memory modification will be written back into the DDTableBuffer:
1. Before carrying out the checkpoint (log_checkpoint), all the table objects corresponding to the persist metadata in the dirty_dict_tables linked table need to be written back into the DDTableBuffer (dict_persist_to_dd_table_buffer).
2. When a table object is removed from memory (dict_table_remove_from_cache_low), an attempt to write back will be made if necessary.
3. After the DDL is added to the table that contains the auto increment column, the persistence of the counter is needed. In the following functions, we will calldict_table_set_and_persist_autoinc:
// for example: alter table..auto_increment = ??
// rename from temporary table to normal table
The process of writing back is relatively simple (dict_table_persist_to_dd_table_buffer_low):
1. The Metadata data that needs to be written back through the initialization of the table object: The corrupt index and autoinc value (dict_init_dynamic_metadata)
2. Build record values and insert the DDTableBuffer system table (DDTableBuffer::replace(). If the record exists, a pessimistic update operation is to be performed.
3. The dirty_status of the table object is modified to METADATA_BUFFERED, which indicates the existence of the buffer metadata.
Recovery and Startup
In the crash recovery, when the log of MLOG_TABLE_DYNAMIC_META is parsed (MetadataRecover::parseMetadataLog), the parsing process will continue and the parsed data will be stored in the collection (MetadataRecover::m_tables). If the items with the same table-id exist, they should be replaced to ensure they are always up-to-date.
When the recovery process is completed, the collected meta information is to be stored temporarily in srv_dict_metadata. Then start the apply process (srv_dict_recover_on_restart). The apply process is simple: load the table objects and then update them (MetadataRecover::apply). For example, for the autoinc column, we always need to choose the larger value.
To finish up…
See the code commit dcb8792b371601dc5fc4e9f42fb9c479532fc7c2 for details.
This bug has been open for quite a long time. Those guys who see it as one of "characteristics" of InnoDB must note this change.