MySQL Memory Allocation and Management (Part I)

This article introduces the memory allocation manager at the InnoDB layer and the SQL layer, including ut_allocator, mem_heap_allocator, and MEM_ROOT.

By Huaxiong Song, from the ApsaraDB RDS for MySQL kernel team

1. Memory Allocation Manager at the InnoDB Layer

1.1 ut_allocator

In non-UNIV_PFS_MEMORY compilation mode, UT_NEW and the related macros call the native interfaces such as new, delete, malloc, and free directly to allocate and release memory. In UNIV_PFS_MEMORY compilation mode, they go through ut_allocator, an internal wrapper that adds information such as memory tracking, which can be displayed through the PFS tables.

ut_allocator can be used as the memory allocator of standard containers, such as std::map, so that the internal memory of the container is allocated through the traceable allocation path provided by InnoDB. The following describes the different memory allocation methods provided by ut_allocator.

#ifdef UNIV_PFS_MEMORY
#define UT_NEW(expr, key) ::new (ut_allocator<decltype(expr)>(key).allocate(1, NULL, key, false, false)) expr
...
#define ut_malloc(n_bytes, key) static_cast<void *>(ut_allocator<byte>(key).allocate(n_bytes, NULL, UT_NEW_THIS_FILE_PSI_KEY, false, false))
...
#else /* UNIV_PFS_MEMORY */
#define UT_NEW(expr, key) ::new (std::nothrow) expr
...
#define ut_malloc(n_bytes, key) ::malloc(n_bytes)
...
#endif
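To illustrate how an allocator plugged into a std container can make the container's memory traceable, here is a minimal sketch. It is not the ut_allocator implementation: TrackingAllocator, tracked_bytes, and TrackedMap are hypothetical names, and a plain global counter stands in for the PFS statistics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <map>
#include <new>
#include <utility>

// Hypothetical global counter standing in for PFS memory statistics.
inline std::size_t &tracked_bytes() {
  static std::size_t bytes = 0;
  return bytes;
}

// Minimal C++11-style allocator that records every allocation it serves.
template <typename T>
struct TrackingAllocator {
  using value_type = T;

  TrackingAllocator() = default;
  template <typename U>
  TrackingAllocator(const TrackingAllocator<U> &) noexcept {}

  T *allocate(std::size_t n) {
    void *p = std::malloc(n * sizeof(T));
    if (p == nullptr) throw std::bad_alloc();
    tracked_bytes() += n * sizeof(T);  // account for the new memory
    return static_cast<T *>(p);
  }

  void deallocate(T *p, std::size_t n) noexcept {
    tracked_bytes() -= n * sizeof(T);  // remove it from the statistics
    std::free(p);
  }
};

template <typename T, typename U>
bool operator==(const TrackingAllocator<T> &, const TrackingAllocator<U> &) {
  return true;
}
template <typename T, typename U>
bool operator!=(const TrackingAllocator<T> &, const TrackingAllocator<U> &) {
  return false;
}

using TrackedMap = std::map<int, int, std::less<int>,
                            TrackingAllocator<std::pair<const int, int>>>;

// Returns the number of bytes tracked while a map holds three entries.
inline std::size_t tracked_bytes_with_three_entries() {
  TrackedMap m;
  m[1] = 10;
  m[2] = 20;
  m[3] = 30;
  return tracked_bytes();
}
```

While the map is alive, every node it allocates is visible in the counter; once the map is destroyed, the counter returns to zero.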

1.1.1 Single-block Memory Allocation

allocate

When PFS_MEMORY is enabled, each allocation reserves an extra ut_new_pfx_t header in front of the user data. The header stores information such as the key, size, and owner.

// Reserve extra space for the pfx header.
total_bytes += sizeof(ut_new_pfx_t);
// Allocate the memory.
...
// The starting address of the memory is returned.
return (reinterpret_cast<pointer>(pfx + 1));

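Allocation is retried on failure: the loop pauses for one second between attempts and gives up after alloc_max_retries attempts.

for (size_t retries = 1;; retries++) {
  // Allocate with malloc, or with calloc, depending on the request.
  ptr = malloc(total_bytes);  // or calloc()
  if (ptr != nullptr || retries >= alloc_max_retries) break;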
  std::this_thread::sleep_for(std::chrono::seconds(1));
}
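The retry loop can be exercised in isolation with the following sketch. The function alloc_with_retry, its parameters, and the FlakyAlloc helper are hypothetical; the allocation function and the delay are passed in only so that failures can be simulated without exhausting memory.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>

// Calls alloc_fn until it succeeds or max_retries attempts have been made,
// sleeping for `delay` after each failed attempt except the last one.
template <typename AllocFn>
void *alloc_with_retry(AllocFn &alloc_fn, std::size_t n_bytes,
                       std::size_t max_retries,
                       std::chrono::milliseconds delay) {
  void *ptr = nullptr;
  for (std::size_t retries = 1;; retries++) {
    ptr = alloc_fn(n_bytes);
    if (ptr != nullptr || retries >= max_retries) break;
    std::this_thread::sleep_for(delay);
  }
  return ptr;
}

// Hypothetical allocator that fails a fixed number of times before
// delegating to malloc. It also counts how often it was called.
struct FlakyAlloc {
  explicit FlakyAlloc(int failures) : failures_left(failures), calls(0) {}

  void *operator()(std::size_t n) {
    calls++;
    if (failures_left > 0) {
      failures_left--;
      return nullptr;
    }
    return std::malloc(n);
  }

  int failures_left;
  int calls;
};
```

With two simulated failures and a limit of five attempts, the third call succeeds; with a limit of three and an allocator that always fails, the function returns nullptr after exactly three calls.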

deallocate

Update the tracking statistics recorded in pfx first, then free the whole allocation, which starts at pfx.

deallocate_trace(pfx);
free(pfx);
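The header-in-front-of-the-data technique used by allocate and deallocate can be sketched as follows. This is an illustration under simplifying assumptions, not the InnoDB code: Prefix is a hypothetical stand-in for ut_new_pfx_t, and traced_alloc, traced_free, prefix_of, and traced_bytes are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for ut_new_pfx_t. The alignment keeps the user data
// that follows the header suitably aligned for any type.
struct alignas(std::max_align_t) Prefix {
  int key;           // instrumentation key of the caller
  std::size_t size;  // total bytes, header included
};

// Hypothetical counter standing in for the tracing statistics.
inline std::size_t &traced_bytes() {
  static std::size_t bytes = 0;
  return bytes;
}

inline void *traced_alloc(std::size_t n_bytes, int key) {
  const std::size_t total = n_bytes + sizeof(Prefix);
  Prefix *pfx = static_cast<Prefix *>(std::malloc(total));
  if (pfx == nullptr) return nullptr;
  pfx->key = key;
  pfx->size = total;
  traced_bytes() += total;
  return pfx + 1;  // the caller sees only the memory after the header
}

inline const Prefix *prefix_of(const void *ptr) {
  return static_cast<const Prefix *>(ptr) - 1;
}

inline void traced_free(void *ptr) {
  if (ptr == nullptr) return;
  Prefix *pfx = static_cast<Prefix *>(ptr) - 1;  // step back to the header
  traced_bytes() -= pfx->size;                   // update statistics first
  std::free(pfx);                                // then free the whole block
}
```

Because the size is stored in the header, the free path needs only the user pointer to update the statistics correctly.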

reallocate

Similar to allocate, it recalculates the size, reallocates the memory, and replaces the tracking information of the old ut_new_pfx_t (pfx_old) with that of the new one (pfx_new).

1.1.2 Large Memory Allocation

allocate_large

allocate_large requests the large memory used in buf_chunk_init() and records pfx information for it. Note that memory obtained through mmap typically does not consume physical memory until it is actually accessed, and that this memory cannot be tracked by tools such as jemalloc.

pointer ptr = reinterpret_cast<pointer>(os_mem_alloc_large(&n_bytes));
    |->mmap()/shmget(), shmat(), shmctl()
...
allocate_trace(n_bytes, PSI_NOT_INSTRUMENTED, pfx);

deallocate_large

Update the tracking statistics through pfx, then release the large memory.

deallocate_trace(pfx);
os_mem_free_large(ptr, pfx->m_size);
  |->munmap()/shmdt()
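As a rough illustration of the mmap path, the sketch below reserves and releases an anonymous mapping. It assumes a POSIX system such as Linux; large_alloc and large_free are hypothetical helpers and do not reproduce os_mem_alloc_large, which also handles huge pages and shared memory.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Reserves n_bytes of zero-filled anonymous memory. Returns nullptr on failure.
inline void *large_alloc(std::size_t n_bytes) {
  void *p = mmap(nullptr, n_bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? nullptr : p;
}

// Releases a mapping created by large_alloc. The size must be supplied again,
// which is why the allocator has to remember it.
inline bool large_free(void *p, std::size_t n_bytes) {
  return munmap(p, n_bytes) == 0;
}
```

Since munmap needs the mapping size, keeping the size next to the pointer (as pfx->m_size does in the snippet above) is what makes the release call possible.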

1.1.3 aligned_memory Allocation

The aligned_memory family, including aligned_pointer and aligned_array_pointer, is encapsulated separately in the code, but its underlying layer is still ut_alloc and ut_free, so we will not go into details here. For example, when the log_t structure is built this way, the aligned memory can match the sector size used for I/O writes, which improves I/O efficiency.

1.2 mem_heap_allocator

Similar to ut_allocator, mem_heap_allocator can also be used as the allocator of STL containers. Note, however, that this allocator only provides memory allocation through mem_heap_alloc. It performs no memory release, reuse, or merge operations.

class mem_heap_allocator {
...
  pointer allocate(size_type n, const_pointer hint = nullptr) {
    return (reinterpret_cast<pointer>(mem_heap_alloc(m_heap, n * sizeof(T)))); // Memory is allocated by mem_heap_alloc.
  }
  void deallocate(pointer p, size_type n) {}; // Release is a no-op.
...
}
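The effect of a no-op deallocate can be seen with a small arena-backed allocator. This is a sketch under simplifying assumptions: Arena is a hypothetical fixed-size bump arena standing in for mem_heap_t, and ArenaAllocator is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-size bump arena standing in for mem_heap_t.
struct Arena {
  alignas(std::max_align_t) unsigned char buf[4096];
  std::size_t used = 0;

  void *alloc(std::size_t n) {
    // Round up so that every returned pointer stays max-aligned.
    const std::size_t a = alignof(std::max_align_t);
    n = (n + a - 1) / a * a;
    if (used + n > sizeof(buf)) return nullptr;
    void *p = buf + used;
    used += n;
    return p;
  }
};

// STL-compatible allocator that takes memory from the arena and never
// gives it back.
template <typename T>
struct ArenaAllocator {
  using value_type = T;

  explicit ArenaAllocator(Arena *a) : arena(a) {}
  template <typename U>
  ArenaAllocator(const ArenaAllocator<U> &o) noexcept : arena(o.arena) {}

  T *allocate(std::size_t n) {
    void *p = arena->alloc(n * sizeof(T));
    if (p == nullptr) throw std::bad_alloc();
    return static_cast<T *>(p);
  }
  void deallocate(T *, std::size_t) noexcept {}  // intentionally a no-op

  Arena *arena;
};

template <typename T, typename U>
bool operator==(const ArenaAllocator<T> &a, const ArenaAllocator<U> &b) {
  return a.arena == b.arena;
}
template <typename T, typename U>
bool operator!=(const ArenaAllocator<T> &a, const ArenaAllocator<U> &b) {
  return a.arena != b.arena;
}
```

When a std::vector using this allocator grows, each old buffer stays allocated in the arena. The memory is reclaimed only when the arena itself goes away, which mirrors how memory handed out by mem_heap_allocator lives until the heap is freed.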

1.2.1 mem_heap_t

Data Structure

This data structure is a non-empty linked list of memory blocks, made up of mem_block_t nodes of different sizes. Let's focus on free_block and buf_block. To some extent, these two pointers define where the data is actually stored: depending on the request type, data is stored in the memory pointed to by one of the two. With mem_heap_t, multiple small allocations can be served from a single larger one, and subsequent memory requests can be satisfied within the InnoDB engine. This reduces the time and performance overhead caused by frequent calls of the malloc and free functions.

typedef struct mem_block_info_t mem_block_t;
typedef mem_block_t mem_heap_t;
...
/** The info structure stored at the beginning of a heap block */
struct mem_block_info_t {
...
  UT_LIST_BASE_NODE_T(mem_block_t) base; /* Base node of the linked list, defined only in the first block. */
  UT_LIST_NODE_T(mem_block_t) list;   /* The block linked list. */
  ulint len;        /*!< The size of the current block. */
  ulint total_size; /*!< The total size of all blocks. */
  ulint type;       /*!< The allocation type. */
  ulint free;       /*!< The offset of the first free position in the current block. */
  ulint start;      /*!< The value of free when the block was created. (I haven't seen many uses.) */
  void *free_block; /* In a heap whose type contains MEM_HEAP_BTR_SEARCH, the heap root carries a free_block from which more memory can be obtained. For other types, the pointer is null. */
  void *buf_block;  /* If the memory was obtained from the buffer pool, the buf_block_t pointer is saved here. Otherwise, the pointer is null. */
};

Memory Type

mem_heap_t can be classified into the following types based on the source of the requested memory:

#define MEM_HEAP_DYNAMIC 0 /* The default type. Memory is allocated through InnoDB's ut_allocator. */
#define MEM_HEAP_BUFFER 1 /* Obtain memory from the buffer pool. */
#define MEM_HEAP_BTR_SEARCH 2 /* Use memory in free_block. */

More combined allocation modes are defined on this basis, making memory allocation more flexible.

/** Different type of heaps in terms of which data structure is using them */
#define MEM_HEAP_FOR_BTR_SEARCH (MEM_HEAP_BTR_SEARCH | MEM_HEAP_BUFFER)
#define MEM_HEAP_FOR_PAGE_HASH (MEM_HEAP_DYNAMIC)
#define MEM_HEAP_FOR_RECV_SYS (MEM_HEAP_BUFFER)
#define MEM_HEAP_FOR_LOCK_HEAP (MEM_HEAP_BUFFER)

1.2.2 Construction of mem_heap_t: mem_heap_create_func

mem_heap_create_func builds a memory heap structure based on the input size and heap type. If no size is given, the default MEM_BLOCK_START_SIZE of 64 bytes is used. We can see from the internal construction logic that the maximum size of a single mem_block is the same as the defined page size, which is generally 16 KB.

To create mem_heap_t, first build the root node of the linked list mentioned earlier by calling the block creation function mem_heap_create_block. Its first argument is heap=nullptr, which indicates that the block is the first node in mem_heap_t. If the type contains the MEM_HEAP_BTR_SEARCH bit, construction may fail. The detailed logic and reasons for failure are presented later.

After the first block is created, it is set as the base node and the linked list information is updated, which completes the root node of mem_heap_t.

mem_heap_t *mem_heap_create_func(ulint size, ulint type) {
  mem_block_t *block;
  if (!size) {
    size = MEM_BLOCK_START_SIZE;
  }
  // Create the first block of mem_heap. The first parameter specified is nullptr.
  block = mem_heap_create_block(nullptr, size, type, file_name, line);
  // In the MEM_HEAP_BTR_SEARCH mode, there is a possibility that the construction fails and a null pointer is returned.
  if (block == nullptr) {
    return (nullptr);
  }
  // Due to the possibility of BP resizing, the first block cannot be obtained from BP.
  ut_ad(block->buf_block == nullptr);
  // Initialize the base node of the linked list. If the base is not null, the node is marked as the base node.
  UT_LIST_INIT(block->base, &mem_block_t::list);
  UT_LIST_ADD_FIRST(block->base, block);
    
  return (block);
}

1.2.3 Release of mem_heap_t: mem_heap_free

As mentioned earlier, if the type includes a MEM_HEAP_BTR_SEARCH operation bit, the data may be stored in a memory unit corresponding to the free_block. In this case, you need to release the created free_block separately, and then release blocks on the mem_heap_t linked list one by one in reverse order.

void mem_heap_free(mem_heap_t *heap) {
  ...
// Obtain the last node in the linked list.
  block = UT_LIST_GET_LAST(heap->base);
    
// Release the free_block node that is created in the MEM_HEAP_BTR_SEARCH mode.
  if (heap->free_block) {
    mem_heap_free_block_free(heap);
  }
    
// Release blocks one by one in reverse order.
  while (block != nullptr) {
    /* Store the contents of info before freeing current block
    (it is erased in freeing) */
    prev_block = UT_LIST_GET_PREV(list, block);
    mem_heap_block_free(heap, block);
    block = prev_block;
  }
}

1.2.4 Construction of the Block: mem_heap_create_block

1) Request for the Block

This function is the core of the entire mem_heap_t memory allocation. It implements a different allocation strategy for each type. Three cases are distinguished:

Case 1 - When the type is MEM_HEAP_DYNAMIC or the requested size is less than half a page, ut_malloc_nokey is used.
Case 2 - When the type contains MEM_HEAP_BTR_SEARCH and the current block is not the root block, memory is taken from the memory block that free_block points to.
Case 3 - In other cases, buf_block_alloc is used to allocate memory from the buffer pool.

// case 1
if (type == MEM_HEAP_DYNAMIC || len < UNIV_PAGE_SIZE / 2) {
  ut_ad(type == MEM_HEAP_DYNAMIC || n <= MEM_MAX_ALLOC_IN_BUF);
  block = static_cast<mem_block_t *>(ut_malloc_nokey(len));
} else {
  len = UNIV_PAGE_SIZE;
    
  // case 2
  if ((type & MEM_HEAP_BTR_SEARCH) && heap) {
    // Obtain the memory from the free block of the heap root.
    buf_block = static_cast<buf_block_t *>(heap->free_block);
    heap->free_block = nullptr;
    if (UNIV_UNLIKELY(!buf_block)) {
      return (nullptr);
    }
  } else {
    // case 3
    buf_block = buf_block_alloc(nullptr);
  }
  block = (mem_block_t *)buf_block->frame;
}

This code achieves the following effects:

It caps the size of a single block at UNIV_PAGE_SIZE.
The assignment heap->free_block = nullptr ensures that the free block of the root node is not reused. This also explains why memory allocation may fail when the type contains the MEM_HEAP_BTR_SEARCH bit. There are two possible reasons:
- The current block type is incompatible with the type of mem_heap_t->base. If the root node was created without the MEM_HEAP_BTR_SEARCH bit, its free block is nullptr, so the read of heap->free_block yields a null buf_block and nullptr is returned directly.
- The free block of mem_heap_t->base that the current block relies on has already been used. As the assignment heap->free_block = nullptr shows, once the free block has been used, it is marked as empty, and the memory is handed over to buf_block.
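The one-shot behavior of free_block described above can be reduced to a few lines. In this sketch, Page, HeapRoot, and take_free_block are hypothetical names; only the hand-over logic mirrors the snippet.

```cpp
#include <cassert>

// Hypothetical stand-ins for buf_block_t and the heap root.
struct Page {
  char frame[64];
};
struct HeapRoot {
  Page *free_block = nullptr;
};

// Hands the root's free block to the caller and clears the slot, so a second
// request finds nothing until someone attaches a new free block.
inline Page *take_free_block(HeapRoot &heap) {
  Page *p = heap.free_block;
  heap.free_block = nullptr;
  return p;  // nullptr means the block request fails
}
```

A root that never had a free block attached fails on the first request, and a root with one free block fails on the second, which corresponds to the two failure reasons listed above.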

2) Initialization of the Block

This step mainly sets the header fields of the block, including len, type, and free. This article focuses on the setting of buf_block and free_block, which is rather subtle.

UNIV_MEM_FREE(block, len);
UNIV_MEM_ALLOC(block, MEM_BLOCK_HEADER_SIZE);
block->buf_block = buf_block;
block->free_block = nullptr;

The first two lines mark the memory of the block as free and mark the header as allocated, in preparation for the initialization of len and the other fields. The values set by the last two lines depend on the case:

Case 1 - The type is MEM_HEAP_DYNAMIC: In this case, block->buf_block=nullptr, which conforms to the definition of this type in mem_heap_t. The block then consists of an initialized header followed by free space.

Case 2 - The type is MEM_HEAP_BTR_SEARCH: The memory of the block is allocated from free_block. At this time, the memory in free_block is transferred to buf_block, and the data required by the block is constructed from buf_block.

Case 3 - The type is MEM_HEAP_BUFFER: The memory is allocated by buf_block_alloc from BP.

The final form of the memory structure in Case 2/3 is the same, except that the structure in Case 2 is converted from free_block to buf_block, while that in Case 3 is directly applied from BP. The free_block parameter is generally specified during the construction of mem_heap_t.

It can be seen that whether in Case 1, Case 2, Case 3, or a combination of different cases, the data can be set correctly through the modification of buf_block and free_block.

1.2.5 Release of the Block: mem_heap_block_free

Obtain buf_block. For a block allocated through ut_malloc, this is nullptr.
Remove the block from the mem_heap_t linked list and update total_size.
If the block was allocated through ut_malloc, release it with ut_free. Otherwise, mark the block memory as initialized (because the block obtained from BP/free_block may be in the free state except for the header) and release it with buf_block_free, after which it becomes a free page available in BP.

1.2.6 Apply for Memory from mem_heap_t: mem_heap_alloc

Obtain the last block and try to allocate memory from it.
If the last block does not have enough free space for the requested size, call mem_heap_add_block to add a new block. With MEM_HEAP_BTR_SEARCH, this step may fail for the same reason as above.
Update the free value (the available space shrinks after the allocation), initialize the memory area, and return the data pointer buf (block address + free offset).

1.2.7 Policy for Adding a Block: mem_heap_add_block

Each time a block is added, its size is twice the size of the previous block, subject to the per-block upper limit described in 1.2.2.
Call mem_heap_create_block and add the new block to the end of the linked list.
Finally, the new block is returned.
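The allocate-from-last-block and doubling policies of 1.2.6 and 1.2.7 can be sketched together. MiniHeap is a hypothetical, heavily simplified model: it ignores the block header, the different heap types, and the buffer pool, and it takes the per-block cap as a constructor argument.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

class MiniHeap {
 public:
  MiniHeap(std::size_t start_size, std::size_t max_block)
      : max_block_(max_block), total_(0) {
    add_block(start_size);
  }

  // Allocates from the last block, adding a new block when space runs out.
  void *alloc(std::size_t n) {
    n = (n + 7) / 8 * 8;  // keep offsets 8-byte aligned
    if (blocks_.back().len - blocks_.back().free < n) {
      // Double the previous block size, but never exceed the cap unless a
      // single request needs more.
      const std::size_t next = std::min(blocks_.back().len * 2, max_block_);
      add_block(std::max(next, n));
    }
    Block &b = blocks_.back();
    void *p = b.data.get() + b.free;
    b.free += n;  // the free offset moves forward, nothing is ever returned
    return p;
  }

  std::size_t block_count() const { return blocks_.size(); }
  std::size_t last_block_size() const { return blocks_.back().len; }
  std::size_t total_size() const { return total_; }

 private:
  struct Block {
    std::unique_ptr<unsigned char[]> data;
    std::size_t len;   // size of this block
    std::size_t free;  // offset of the first free byte
  };

  void add_block(std::size_t len) {
    blocks_.push_back(
        Block{std::unique_ptr<unsigned char[]>(new unsigned char[len]), len, 0});
    total_ += len;
  }

  std::vector<Block> blocks_;
  std::size_t max_block_;
  std::size_t total_;
};
```

Starting from 64 bytes with a cap of 256, the block sizes grow as 64, 128, 256, 256. Space left at the end of an older block is never revisited, which is the source of the waste mentioned in the summary.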

2. Memory Allocation Manager MEM_ROOT at the SQL Layer

In addition to the basic form of alloc/free, the SQL layer mainly uses the MEM_ROOT structure to reduce the time and resource consumption for memory operation. This article focuses on MEM_ROOT.

As a generic memory management object, MEM_ROOT is widely used at the SQL layer. For example, it is included as a memory allocator in structures such as THD and TABLE_SHARE. In fact, MEM_ROOT is only responsible for memory management; the structure that actually holds the memory is the block. MEM_ROOT itself only points to one block, the current one. Each block contains a pointer to the previous block, so the blocks form a linked list.

Unlike mem_heap_t described in 1.2.1, which is implemented separately in InnoDB and is responsible for memory allocation at the InnoDB layer, MEM_ROOT is mainly responsible for memory allocation at the SQL layer. However, the structure and the implementation of the two are similar.

2.1 MEM_ROOT Data Structure

Block is the core structure from which all memory allocations originate. A block contains the pointer prev, which points to the previous block, and the pointer end, which marks the end of the memory range managed by the block.
m_block_size records the size of the block that MEM_ROOT will allocate next. Each time a new block is allocated, the value is updated to 1.5 times its previous value.
m_allocated_size records the total memory allocated by MEM_ROOT from the OS. This value is also updated each time a new block is allocated.
m_current_block, m_current_free_start, and m_current_free_end record the block currently managed, the start of its free space, and the end of its free space, respectively.
m_max_capacity defines the maximum memory managed by MEM_ROOT, m_error_for_capacity_exceeded is the switch that controls whether exceeding the upper limit is reported as an error, m_error_handler is the pointer to the function that handles this error, and m_psi_key is the PFS memory instrumentation key.

2.2 Key Interfaces of MEM_ROOT

2.2.1 Constructor and Assignment

The basic constructor of MEM_ROOT is very simple. Only m_block_size, m_orig_block_size, and m_psi_key are assigned values. In addition, a MEM_ROOT can take over the memory held by another MEM_ROOT through its move constructor and move assignment operator. The logic is as follows:

// Move constructor.
MEM_ROOT(MEM_ROOT &&other) noexcept
    : m_xxx(other.m_xxx),
      ... {
  other.m_xxx = nullptr/0/origin_value;
  ...
}
// Move assignment.
MEM_ROOT &operator=(MEM_ROOT &&other) noexcept {
  Clear();
  ::new (this) MEM_ROOT(std::move(other));
  return *this;
}
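The pattern of implementing move assignment as Clear() followed by placement-new move construction can be tried out on a small class. Owner is a hypothetical type that owns a single malloc'ed buffer; like the snippet above, this sketch does not guard against self-assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <utility>

class Owner {
 public:
  explicit Owner(std::size_t n) : m_data(std::malloc(n)), m_size(n) {}

  // Move constructor: take over the buffer and reset the source.
  Owner(Owner &&other) noexcept : m_data(other.m_data), m_size(other.m_size) {
    other.m_data = nullptr;
    other.m_size = 0;
  }

  // Move assignment: release our own buffer, then rebuild in place.
  Owner &operator=(Owner &&other) noexcept {
    Clear();
    ::new (this) Owner(std::move(other));
    return *this;
  }

  Owner(const Owner &) = delete;
  Owner &operator=(const Owner &) = delete;

  ~Owner() { Clear(); }

  void Clear() {
    std::free(m_data);
    m_data = nullptr;
    m_size = 0;
  }

  void *data() const { return m_data; }
  std::size_t size() const { return m_size; }

 private:
  void *m_data;
  std::size_t m_size;
};
```

Reusing the move constructor keeps the member-by-member transfer logic in one place, at the cost of having to clear the target first.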

2.2.2 Alloc

The Alloc function returns a new starting address from the block currently managed, according to the requested size, and updates the memory usage information at the same time. If the free space of the current block does not meet the request, the AllocSlow function is called to allocate and manage a new block. Also, note that the returned address is always 8-byte aligned.
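The fast path can be pictured as a bump pointer with size rounding. The following is a minimal sketch, not the MEM_ROOT code: BumpRegion and align_size are hypothetical names, the region's start is assumed to be 8-byte aligned, and a null return marks the point where the slow path would take over.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kAlign = 8;

// Rounds n up to the next multiple of 8.
constexpr std::size_t align_size(std::size_t n) {
  return (n + kAlign - 1) & ~(kAlign - 1);
}

// Hypothetical view of the free part of the current block.
struct BumpRegion {
  char *free_start;
  char *free_end;

  void *Alloc(std::size_t length) {
    length = align_size(length);
    if (static_cast<std::size_t>(free_end - free_start) < length) {
      return nullptr;  // not enough room: the slow path would run here
    }
    char *ret = free_start;
    free_start += length;  // only the free-start pointer moves
    return ret;
  }
};
```

Because every length is rounded up to a multiple of 8 and the region starts aligned, each returned address stays 8-byte aligned without any per-allocation header.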

2.2.3 AllocSlow

The function AllocSlow is used to allocate a new block. Internally, it follows one of two allocation paths depending on the scenario, and the returned memory addresses are also aligned.

When the required memory is large (at least m_block_size) or exclusive memory is required, the newly allocated block is not made the current block (unless it is MEM_ROOT's first allocation). Instead, it is inserted as the second-to-last block in the linked list, that is, as the previous node of current_block. The designers do not want large or exclusive requests to interfere with subsequent allocations. A large request would raise the base on which the 1.5x growth of later blocks is computed, making the size of later allocations hard to control. In addition, if later allocations were served from a block that was meant to be exclusive, memory control would become complicated. Keeping the original current_block avoids both problems.
In other cases, a new block is appended after current_block and becomes the new current_block, and the memory is allocated from it.

void *MEM_ROOT::AllocSlow(size_t length) {
  // The memory applied is very large or an exclusive memory is required.
  if (length >= m_block_size || MEM_ROOT_SINGLE_CHUNKS) {
    Block *new_block =
        AllocBlock(/*wanted_length=*/length, /*minimum_length=*/length);
    if (new_block == nullptr) return nullptr;
    if (m_current_block == nullptr) {
      new_block->prev = nullptr;
      m_current_block = new_block;
      m_current_free_end = new_block->end;
      m_current_free_start = m_current_free_end;
    } else {
      // Insert the new block in the second-to-last position.
      new_block->prev = m_current_block->prev;
      m_current_block->prev = new_block;
    }
    return pointer_cast<char *>(new_block) + ALIGN_SIZE(sizeof(*new_block));
  } else {  // Normal conditions.
    if (ForceNewBlock(/*minimum_length=*/length)) {
      return nullptr;
    }
    char *new_mem = m_current_free_start;
    m_current_free_start += length;
    return new_mem;
  }
}

2.2.4 AllocBlock

AllocBlock is the basic function of block allocation. Internally, it calls the my_malloc function to allocate memory. Statistics are recorded based on the PSI key and the PFS switches. The my_malloc and my_free functions will be briefly described later.

Large memory requests may fail when the error flag for exceeding the memory limit is set. AllocBlock takes the wanted_length and minimum_length parameters. In some cases, a block smaller than wanted_length, but at least minimum_length, is allocated instead. After each allocation, the value of m_block_size is set to 1.5 times its current value, which reduces the number of subsequent block allocations.
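As a worked example of the 1.5x growth, the sketch below lists the block sizes that successive allocations would use. It assumes integer arithmetic of the form size + size / 2; block_size_sequence is a hypothetical helper written only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the block sizes used by the first n block allocations when the
// size grows by half after every allocation.
inline std::vector<std::size_t> block_size_sequence(std::size_t initial,
                                                    int n) {
  std::vector<std::size_t> sizes;
  std::size_t size = initial;
  for (int i = 0; i < n; i++) {
    sizes.push_back(size);
    size += size / 2;  // 1.5 times the current value
  }
  return sizes;
}
```

Starting from 1024 bytes, the first four blocks would be 1024, 1536, 2304, and 3456 bytes, so the number of blocks grows only logarithmically with the total memory requested.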

2.2.5 ForceNewBlock

The ForceNewBlock function corresponds to the second allocation path of AllocSlow. It directly calls AllocBlock to allocate a memory block, then appends it to the end of the block linked list and makes it the current block managed by MEM_ROOT.

2.2.6 Clear

The Clear function has simple execution logic and involves the following operations:

Set all states of MEM_ROOT to the initial state.
Traverse and release the nodes in the block linked list.

2.2.7 ClearForReuse

When the previously used memory is no longer needed, but you want to keep using the MEM_ROOT without going through block allocation again, ClearForReuse plays a vital role. Unlike the Clear function, which frees all blocks, ClearForReuse keeps the current block and releases the other nodes. In other words, after the ClearForReuse operation, only the last node is left in the block linked list. However, in the exclusive memory scenario, ClearForReuse simply falls back to Clear().
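The difference between Clear and ClearForReuse can be sketched on a bare list of blocks linked through prev. MiniRoot, PushBlock, BlockCount, and CurrentSize are hypothetical names; the sketch leaves out alignment, statistics, and the exclusive memory mode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

struct Block {
  Block *prev;       // previous block in the list
  std::size_t size;  // usable bytes following this header
};

class MiniRoot {
 public:
  MiniRoot() : m_current(nullptr) {}
  MiniRoot(const MiniRoot &) = delete;
  MiniRoot &operator=(const MiniRoot &) = delete;
  ~MiniRoot() { Clear(); }

  // Appends a new block and makes it the current one.
  bool PushBlock(std::size_t size) {
    Block *b = static_cast<Block *>(std::malloc(sizeof(Block) + size));
    if (b == nullptr) return false;
    b->prev = m_current;
    b->size = size;
    m_current = b;
    return true;
  }

  // Frees every block.
  void Clear() {
    FreeChain(m_current);
    m_current = nullptr;
  }

  // Frees every block except the current one, which stays ready for reuse.
  void ClearForReuse() {
    if (m_current == nullptr) return;
    FreeChain(m_current->prev);
    m_current->prev = nullptr;
  }

  std::size_t BlockCount() const {
    std::size_t n = 0;
    for (Block *b = m_current; b != nullptr; b = b->prev) n++;
    return n;
  }
  std::size_t CurrentSize() const { return m_current ? m_current->size : 0; }

 private:
  static void FreeChain(Block *b) {
    while (b != nullptr) {
      Block *prev = b->prev;  // read the link before the block is freed
      std::free(b);
      b = prev;
    }
  }

  Block *m_current;
};
```

Keeping the current block means the next round of allocations can start immediately in the most recently allocated block, without another trip to the system allocator.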

2.2.8 Others

The MEM_ROOT memory allocation method is byte-aligned. It operates by rounding the required memory length in the upper-layer interface such as alloc. However, MEM_ROOT also provides interfaces for "non-standard" operations. It provides functions such as peek and RawCommit and supports direct operations on the underlying blocks. Note that such operations do not occur frequently, and the memory will be rounded again the next time an operation such as alloc is used.

2.3 Application of MEM_ROOT in THD

MEM_ROOT is frequently used at the SQL layer, for example in THD, THD::transactions, Prepared_statement, TABLE_SHARE, sp_head, and table_mapping. Taking the commonly used THD scenario as an example, this article briefly introduces the application of MEM_ROOT at the SQL layer.

THD contains three MEM_ROOT members (objects and pointers): main_mem_root, user_var_events_alloc, and mem_root.

2.3.1 main_mem_root

The MEM_ROOT object, which is destructed with the THD structure, is mainly used for parsing and runtime data storage involved in the execution of SQL statements.

This memory root is used for two purposes:
- For conventional queries, to allocate structures stored in main_lex during parsing, and to allocate runtime data (execution plan, etc.) during execution.
- For prepared queries, only to allocate runtime data. The parsed tree itself is reused between executions and thus is stored elsewhere.

THD::THD(bool enable_plugins)
    : Query_arena(&main_mem_root, STMT_REGULAR_EXECUTION),
      ...
      lex_returning(new im::Lex_returning(false, &main_mem_root)),
      ... {
  main_lex->reset();
  set_psi(nullptr);
  mdl_context.init(this);
  init_sql_alloc(key_memory_thd_main_mem_root, &main_mem_root,
                  global_system_variables.query_alloc_block_size,
                  global_system_variables.query_prealloc_size);
  ...
 }

2.3.2 mem_root

The current mem_root pointer points to main_mem_root during THD initialization, but it will change in practice. MEM_ROOT of other objects is used to apply for memory by temporarily changing the mem_root pointer. After that, the mem_root pointer is changed to point to the initial memory address (main_mem_root).

Q: Why is mem_root designed as a changeable object? Why is the memory pointer of mem_root embedded into THD?

A: This design allows for convenient control of memory size. If thd->mem_root always points to main_mem_root, the corresponding memory will persist until THD is destructed. By changing the mem_root pointer, we can better control the memory life cycle, release temporarily occupied memory, and separate it from long-standing memory. Embedding the mem_root pointer in THD (essentially its parent class Query_arena) can provide clearer statistics on memory occupied by THD and simplify the management process. Although this portion of memory is generated during statement execution rather than directly by THD, the "responsibilities" are attributed to THD. This simplifies parameter transfers and reduces the need for an additional MEM_ROOT parameter, as parameters can be directly transferred to THD.

THD::THD(bool enable_plugins)
    : Query_arena(&main_mem_root, STMT_REGULAR_EXECUTION),
  ...
MEM_ROOT* old_mem_root = thd->mem_root; // Save the original mem_root (main_mem_root).
thd->mem_root = xxx_mem_root; // mem_root is mostly temporary MEM_ROOT.
// do something using memory
...
thd->mem_root = old_mem_root; // Restore to the original mem_root (main_mem_root).
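The save, swap, and restore steps are a natural fit for an RAII guard, which is also what guard classes such as Swap_mem_root_guard are for. The sketch below only illustrates the idea: MemRoot, Session, and SwapMemRootGuard are hypothetical stand-ins and do not reproduce the MySQL classes.

```cpp
#include <cassert>

// Hypothetical stand-ins for MEM_ROOT and THD.
struct MemRoot {
  const char *name;
};
struct Session {
  MemRoot main_mem_root{"main"};
  MemRoot *mem_root = &main_mem_root;  // points to main_mem_root initially
};

// Redirects session->mem_root for the lifetime of the guard and restores the
// previous pointer on scope exit, including early returns and exceptions.
class SwapMemRootGuard {
 public:
  SwapMemRootGuard(Session *session, MemRoot *temporary)
      : m_session(session), m_old(session->mem_root) {
    session->mem_root = temporary;
  }
  ~SwapMemRootGuard() { m_session->mem_root = m_old; }

  SwapMemRootGuard(const SwapMemRootGuard &) = delete;
  SwapMemRootGuard &operator=(const SwapMemRootGuard &) = delete;

 private:
  Session *m_session;
  MemRoot *m_old;
};
```

Compared with restoring the pointer by hand, a guard cannot be skipped by an early return, so the session always ends up pointing at its original memory root.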

The temporary replacement of mem_root occurs in the following locations. Thanks to the design of MEM_ROOT, such as its move construction, the memory statistics continue to use the previous PSI_MEMORY_KEY, so the replacement does not cause complexity or confusion in the statistics.

// sql/dd_table_share.cc
open_table_def()
// sql/sp_head.cc
sp_parser_data::start_parsing_sp_body() &&
sp_parser_data::finish_parsing_sp_body()
// sql/sp_instr.cc PSI_NOT_INSTRUMENTED
LEX *sp_lex_instr::parse_expr()
// sql/sql_cursor.cc
Query_result_materialize::start_execution()
// sql/sql_table.cc
rm_table_do_discovery_and_lock_fk_tables()
drop_base_table()
lock_check_constraint_names()
// sql/thd_raii.h The type and where it is called (sql/auth/sql_auth_cache.cc:grant_load()).
class Swap_mem_root_guard;
// sql/auth/sql_authorization.cc
mysql_table_grant() // Stores table-level and column-level permissions.
mysql_routine_grant() // Stores routine-level permissions.
/* sql/dd/upgrade_57/global.h  storage/ndb/plugin/ndb_dd_upgrade_table.cc
    The type and where it is called. */
class Thd_mem_root_guard

2.3.3 user_var_events_alloc

The MEM_ROOT pointer that is used to allocate Binlog_user_var_event array elements in THD. It usually points to the same location as the thd->mem_root pointer.

3. Summary

MySQL has made significant efforts and optimizations in the allocation, usage, and management of memory. Each module is a separate memory allocation management system, and its design and usage policies are worth learning.

In the latest 8.0 version, ut_allocator has been removed from the InnoDB layer, and the corresponding memory allocation and release code has been rewritten as template functions. By using mem_heap_t, memory fragmentation is effectively reduced, which makes it suitable for scenarios where small amounts of memory are allocated many times within a short cycle. However, mem_heap_t does not free memory while it is in use, which leads to some memory waste when a block is left partly idle.

MEM_ROOT is the most commonly used memory allocator at the SQL layer. Similar to mem_heap_t, it also faces the issue of block fragmentation, but it provides the ClearForReuse interface to release previously occupied memory in a timely manner. In addition, MEM_ROOT treats exclusive memory and large memory requests separately, so that they do not inflate the size of subsequent block allocations. Furthermore, the flexible use of the MEM_ROOT pointer in the THD structure provides new ideas for memory usage, which is worth learning from.