Memory usage is a key metric for monitoring an ApsaraDB for MongoDB instance. High memory usage on an instance may result in out-of-memory (OOM) errors. This topic describes how to view memory usage details and troubleshoot high memory usage on an instance.

Background information

A MongoDB process loads its binary files and dependent system libraries into memory. It also allocates and releases memory as it manages client connections, runs the storage engine, and processes requests. By default, MongoDB uses TCMalloc from Google as its memory allocator. Most of the memory is consumed by the WiredTiger storage engine, client connections, and request processing.

View memory usage

For a sharded cluster instance, the memory usage on each shard node can be analyzed in the same way as on a replica set instance. The Configserver node stores only configuration metadata and is rarely a memory bottleneck. The memory usage on mongos nodes is affected by aggregated result sets, the number of connections, and the size of metadata.

For a replica set instance, you can use the following methods to view the memory usage.
  • View memory usage in monitoring charts

    A replica set instance consists of multiple node roles. Each node role can correspond to one or more physical nodes. ApsaraDB for MongoDB allows you to use primary, secondary, and read-only nodes.

    On the Monitoring Info page of an instance in the ApsaraDB for MongoDB console, view the memory usage on the corresponding node in monitoring charts.

  • View memory usage by running commands
    To view and analyze the memory usage on an instance, run the db.serverStatus().mem command in the mongo shell. A response similar to the following one is returned:
    { "bits" : 64, "resident" : 13116, "virtual" : 20706, "supported" : true }
    //resident indicates the physical memory that is consumed by the mongod process. Unit: MB. 
    //virtual indicates the virtual memory that is consumed by the mongod process. Unit: MB. 

    The serverStatus output of MongoDB also contains the memory consumed by the WiredTiger storage engine and TCMalloc. To view these details, run the db.serverStatus().wiredTiger.cache and db.serverStatus().tcmalloc commands. For more information, see serverStatus.
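
    As a minimal convenience sketch in the mongo shell, the following helper prints the main memory consumers in one pass. The helper name summarizeMemory is illustrative; all field names are standard serverStatus output.
      // Summarize the main memory consumers from a single serverStatus() call.
      function summarizeMemory() {
          var s = db.serverStatus();
          print("resident (MB): " + s.mem.resident);
          print("virtual (MB): " + s.mem.virtual);
          print("WiredTiger cache (bytes): " + s.wiredTiger.cache["bytes currently in the cache"]);
          var t = s.tcmalloc.tcmalloc;
          // Memory cached by TCMalloc but not yet returned to the operating system.
          print("TCMalloc free (bytes): " + (Number(t.pageheap_free_bytes) + Number(t.total_free_bytes)));
      }
      summarizeMemory();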

Common causes

The following list describes common causes of high memory usage:
  • High memory usage of the engine

    MongoDB uses TCMalloc as its memory allocator, and the WiredTiger storage engine consumes the largest portion of memory. The maximum memory available to the storage engine is determined by the cachesize parameter. For compatibility and security purposes, ApsaraDB for MongoDB sets the cachesize parameter to 60% of the allocated memory size. For more information, see Table 1 (Instance specifications).

    If the size of cached data exceeds 95% of the configured cache size, the system runs a high risk of performance degradation. To prevent user requests from being blocked, WiredTiger starts evicting pages when memory usage approaches the thresholds. For more information, see Table 2 and the ratio-check sketch after the following list.

    You can use the following methods to view the memory usage of the engine:
    • Run the db.serverStatus().wiredTiger.cache command in the mongo shell. The value of bytes currently in the cache indicates the size of data in the cache. Sample output:
      {
         ......
         "bytes belonging to page images in the cache":6511653424,
         "bytes belonging to the cache overflow table in the cache":65289,
         "bytes currently in the cache":8563140208,
         "bytes dirty in the cache cumulative":NumberLong("369249096605399"),
         ......
      }
    • On the Performance Trend page of an instance in the DAS console, view the percentage of dirty data in the WiredTiger cache. For more information, see Performance trends.
    • Use the mongostat utility that comes with MongoDB to view the current percentage of dirty data in the WiredTiger cache. For more information, visit https://docs.mongodb.com/v4.2/reference/program/mongostat/.
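
    The following mongo shell sketch (a minimal example that uses standard WiredTiger statistics from serverStatus) computes the cache fill and dirty ratios and compares them with the eviction thresholds in Table 2:
      // Compare the current cache usage against the eviction thresholds.
      var c = db.serverStatus().wiredTiger.cache;
      var max = Number(c["maximum bytes configured"]);        // the configured cachesize
      var used = Number(c["bytes currently in the cache"]);
      var dirty = Number(c["tracked dirty bytes in the cache"]);
      print("cache used:  " + (100 * used / max).toFixed(1) +
            "% (eviction_target 80, eviction_trigger 95)");
      print("cache dirty: " + (100 * dirty / max).toFixed(1) +
            "% (eviction_dirty_target 5, eviction_dirty_trigger 20)");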
  • High memory usage of connections and requests
    If a large number of client connections are established to the instance, memory may be consumed for the following reasons:
    • Each connection is handled by a corresponding thread that processes its requests in the background. Each thread can use up to 1 MB of stack space, although in most cases the overhead is dozens to hundreds of KB.
    • Each TCP connection has read and write buffers at the kernel layer. The buffer size is determined by TCP kernel parameters such as tcp_rmem and tcp_wmem, and you do not need to specify it. However, more concurrent connections consume larger amounts of socket cache space, which results in higher memory usage by TCP.
    • Each request has its own context, and multiple temporary buffers may be allocated for request packets, response packets, and sorting. These buffers are released at the end of each request: first to the TCMalloc cache, and from there gradually back to the operating system. In many cases, memory usage is high because TCMalloc does not promptly release the memory that requests have consumed; dozens of GB can accumulate in the TCMalloc cache before the memory is returned to the operating system. To query the size of memory that TCMalloc has not released to the operating system, run the db.serverStatus().tcmalloc command. The TCMalloc cache size is the sum of the pageheap_free_bytes and total_free_bytes values. Sample output:
      {
         "generic":{
                 "current_allocated_bytes":NumberLong("9641570544"),
                 "heap_size":NumberLong("19458379776")
         },
         "tomalloc":{
                 "pageheap_free_bytes":NumberLong("3048677376"),
                 "pageheap_unmapped_bytes":NumberLong("544994184"),
                 "current_total_thread_cache_bytes":95717224,
                 "total_free_byte":NumberLong(1318185960),
      ......
         }
      }
      For more information about TCMalloc, visit https://mongoing.com/archives/34751.
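
      If the TCMalloc cache stays large for a long time, recent MongoDB releases provide the tcmallocReleaseRate server parameter to make TCMalloc return free memory to the operating system more aggressively. The following is an example sketch only; verify that your instance version and account permissions support this parameter before you use it.
        // Check the current release rate, then raise it so that TCMalloc
        // returns free pages to the operating system more aggressively.
        db.adminCommand({ getParameter: 1, tcmallocReleaseRate: 1 })
        db.adminCommand({ setParameter: 1, tcmallocReleaseRate: 5.0 })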
  • High memory usage of metadata

    Metadata includes the memory consumed by databases, collections, and indexes. Pay special attention to instances that contain large numbers of collections and indexes. In versions earlier than MongoDB 4.0 in particular, a full logical backup may open large numbers of file handles, which may not be promptly returned to the operating system and can cause rapid memory usage growth. File handles may also fail to be released after large numbers of collections are deleted, which results in memory leaks.
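
    To gauge the amount of metadata on an instance, a sketch such as the following counts collections and indexes per database in the mongo shell. All shell helpers used here are standard.
      // Count collections and indexes per database.
      db.adminCommand("listDatabases").databases.forEach(function (d) {
          var cur = db.getSiblingDB(d.name);
          var colls = cur.getCollectionNames();
          var indexes = 0;
          colls.forEach(function (c) {
              indexes += cur.getCollection(c).getIndexes().length;
          });
          print(d.name + ": " + colls.length + " collections, " + indexes + " indexes");
      });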

  • High memory usage of index creation

    During normal data writes, secondary nodes maintain a buffer of about 256 MB for oplog replay. After the primary node creates indexes, secondary nodes may consume more memory to replay the index builds. In versions earlier than MongoDB 4.2, indexes created in the background on the primary node are replayed serially on secondary nodes, which may consume a maximum of 500 MB of memory. In MongoDB 4.2 and later, the background option is no longer available and secondary nodes replay index builds in parallel. This requires more memory, and instance OOM errors may occur when multiple indexes are created at a time. For more information, visit https://docs.mongodb.com/manual/core/index-creation/#index-build-impact-on-database-performance and https://docs.mongodb.com/manual/core/index-creation/#index-build-process.
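
    As a precaution, create indexes one at a time and wait for each build to complete so that replay memory on secondary nodes stays bounded. In the following sketch, the collection and key names are placeholders:
      // Build indexes serially instead of submitting several builds at once.
      db.orders.createIndex({ customerId: 1 });   // wait for this build to complete
      db.orders.createIndex({ createdAt: -1 });   // then start the next one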

  • High memory usage of PlanCache

    If a request can use a large number of execution plans, the plan cache may consume large amounts of memory. To view the memory usage of the plan cache, run the db.serverStatus().metrics.query.planCacheTotalSizeEstimateBytes command. Sample output:
      mgset-xxx:PRIMARY> db.serverStatus().metrics.query.planCacheTotalSizeEstimateBytes
      NumberLong(750695)
    For more information, visit https://jira.mongodb.org/browse/SERVER-48400.
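
    If the plan cache of a collection grows too large, you can clear it from the mongo shell, as in the following sketch. The collection name is a placeholder.
      // Inspect the estimated plan cache size, then clear the plan cache of one collection.
      db.serverStatus().metrics.query.planCacheTotalSizeEstimateBytes
      db.orders.getPlanCache().clear()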

Solutions

The goal of memory optimization is not to minimize memory usage. Instead, memory optimization seeks a balance between resource consumption and performance: ideally, memory usage remains sufficient and stable, and system performance is not affected. You cannot change the cachesize value that ApsaraDB for MongoDB configures. We recommend that you use the following methods to optimize memory usage:
  • Control the number of concurrent connections. Based on the results of performance tests, a database delivers optimal performance with about 100 persistent connections. By default, a MongoDB driver establishes a connection pool that contains up to 100 connections with the backend. If a large number of clients exist, reduce the connection pool size of each client (see the connection string sketch after this list). We recommend that you establish no more than 1,000 persistent connections in a database. Otherwise, memory and multi-thread context switching overheads increase and cause higher request latency.
  • Reduce the memory overhead of a single request. For example, create indexes to reduce collection scans and in-memory sorts.
  • If the number of connections is appropriate but the memory usage continues to increase, we recommend that you upgrade the memory configurations. Otherwise, system performance may sharply decline due to OOM errors and extensive cache clearing.
  • In scenarios where memory leaks may occur, contact Alibaba Cloud technical support.
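
The per-client connection pool size mentioned above is typically capped with the maxPoolSize option in the connection string. In the following sketch, the host names, replica set name, and pool size are placeholders; size the pool so that the total across all clients stays within the recommended limit.
  // Example connection string: cap each client at 20 pooled connections.
  mongodb://user:****@dds-bpxxxx1.mongodb.rds.aliyuncs.com:3717,dds-bpxxxx2.mongodb.rds.aliyuncs.com:3717/admin?replicaSet=mgset-xxxxxx&maxPoolSize=20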

References

Table 1. Instance specifications
Instance type | Available memory size | Allocated memory size | WiredTiger cachesize value
dds.mongo.small | 1024 MB | 2048 MB | 1 GB
dds.mongo.mid | 2048 MB | 4096 MB | 1 GB
dds.mongo.standard | 4096 MB | 7168 MB | 2 GB
dds.mongo.large | 8192 MB | 12288 MB | 5 GB
dds.mongo.xlarge | 16384 MB | 24576 MB | 10 GB
dds.mongo.2xlarge | 32768 MB | 49152 MB | 20 GB
dds.mongo.4xlarge | 65536 MB | 98304 MB | 40 GB
dds.mongo.monopolize | 450560 MB | 450560 MB | 264 GB
mongo.x8.medium | 16384 MB | 16384 MB | 10 GB
mongo.x8.large | 32768 MB | 32768 MB | 20 GB
mongo.x8.xlarge | 65536 MB | 65536 MB | 40 GB
mongo.x8.2xlarge | 131072 MB | 131072 MB | 77 GB
mongo.x8.4xlarge | 262144 MB | 262144 MB | 154 GB
dds.sn4.8xlarge.3 | 131072 MB | 131072 MB | 64 GB
Table 2. Eviction parameters
Parameter | Default value (% of cachesize) | Description
eviction_target | 80 | When the used cache size exceeds this percentage, eviction threads evict clean pages in the background.
eviction_trigger | 95 | When the used cache size exceeds this percentage, user threads also start to evict clean pages, which can block requests.
eviction_dirty_target | 5 | When the dirty data in the cache exceeds this percentage, eviction threads evict dirty pages in the background.
eviction_dirty_trigger | 20 | When the dirty data in the cache exceeds this percentage, user threads also start to evict dirty pages, which can block requests.