Troubleshoot high memory usage on a Tair instance

Tair provides a highly efficient database service. Insufficient memory may cause issues such as frequently evicted keys, increased response time, and an unstable number of queries per second (QPS). These issues may interrupt your business. If the memory usage exceeds 95%, you must respond in a timely manner.

Memory usage of Tair

The memory usage of Tair is divided into the following three parts:


Memory usage	Description
Memory consumed by link-related operations	Includes the memory consumed by the input buffer, the output buffer, the JIT overhead, the Fake Lua Link, and the cache of executed Lua scripts. This memory consumption changes dynamically. You can run the INFO command and obtain the client cache information from the Clients section of the output. Note The memory consumed by the input and output buffers is typically small and varies with the number of connections from each client. When a client initiates range-based operations, or sends and receives large keys at low speeds, the memory consumed by the input and output buffers increases. As a result, the memory that can be used to store data decreases, and out of memory (OOM) issues may occur.
Memory consumed by data	Includes the memory consumed to store field values. This is the part of memory consumption that you should focus on in your analysis.
Memory consumed by management operations	Includes the memory consumed by hash sets, the memory consumed by the replication buffer, and the memory consumed by the append-only file (AOF) buffer. The memory consumption remains stable within the range of 32 MB to 64 MB, which is small. Note A large number of keys, such as hundreds of millions, consume large amounts of memory.

Note Most OOM issues occur due to inefficient management of dynamically acquired and released memory. For example, if a large number of requests are piled up due to throttling, the amount of dynamically acquired memory rapidly increases. OOM issues may also occur due to complex or inappropriate Lua scripts.
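
The client-related figures described in the preceding table can be read from the Clients section of the INFO output. The following minimal Python sketch splits raw INFO text into sections. The sample text and its values are illustrative only, and the fields that your instance returns may differ based on the engine version.

```python
def parse_info(text):
    """Split raw INFO output into {section: {field: value}}."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Section headers look like "# Clients".
            current = line.lstrip("#").strip()
            sections[current] = {}
        elif ":" in line and current is not None:
            field, _, value = line.partition(":")
            sections[current][field] = value
    return sections


# Illustrative sample only; the values are made up.
SAMPLE_INFO = """\
# Clients
connected_clients:12
blocked_clients:0

# Memory
used_memory:79307776
"""

clients = parse_info(SAMPLE_INFO)["Clients"]
print(clients["connected_clients"])
```

Values are kept as strings so that the sketch does not guess the type of fields it does not know about.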

Step 1: Analyze memory usage

View the memory usage of your Tair instance over a specified time range. For more information, see View monitoring data.

In the following example, the Memory Usage metric remains at approximately 100%.

Figure 1. Memory usage example

Note If you select Data Node Aggregation to view the memory usage of a cluster instance or read/write splitting instance, the Memory Usage metric indicates the average memory usage of all data nodes except read replica nodes in the instance.
Check whether the total number of evicted keys and the maximum command latency significantly increase.

In the following example, the total number of evicted keys and the maximum command latency increased at 16:00:00 (UTC+8) on January 7, 2021. This indicates that the available memory resources are insufficient.

Note The metrics that you must view are Evicted Keys and Max Rt. The Evicted Keys metric indicates the total number of evicted keys. The Max Rt metric indicates the maximum amount of time that a data node requires to return a response after the data node receives a command.

Figure 2. Performance monitoring example
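
If you also want to check these signals from a client, the following Python sketch compares two INFO snapshots. It assumes that the instance reports the standard used_memory, maxmemory, and evicted_keys fields in the INFO output and that the snapshots are already parsed into dictionaries of integers. The function name, the sample values, and the rule for flagging an instance are hypothetical. The metrics in the console remain the primary source.

```python
def memory_pressure(before, after, threshold=0.95):
    """Compare two INFO snapshots (dicts of integer fields).

    Returns (usage_ratio, newly_evicted, needs_attention).
    """
    maxmemory = after["maxmemory"]
    # A maxmemory of 0 means that no limit is reported, so no ratio is computed.
    usage = after["used_memory"] / maxmemory if maxmemory else 0.0
    newly_evicted = after["evicted_keys"] - before["evicted_keys"]
    # Assumption: flag the instance if usage exceeds the threshold
    # or if keys were evicted between the two snapshots.
    needs_attention = usage > threshold or newly_evicted > 0
    return usage, newly_evicted, needs_attention


# Hypothetical snapshots taken some time apart.
earlier = {"used_memory": 1020000000, "maxmemory": 1073741824, "evicted_keys": 1200}
later = {"used_memory": 1030000000, "maxmemory": 1073741824, "evicted_keys": 1500}

usage, newly_evicted, needs_attention = memory_pressure(earlier, later)
print(f"{usage:.1%}", newly_evicted, needs_attention)
```

The default threshold of 0.95 mirrors the 95% memory usage level mentioned at the beginning of this topic.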

Optional: If the memory usage of your Tair instance does not meet your expectations, perform the following operations to analyze the memory usage in detail.

Use redis-cli to connect to a Tair instance.

In redis-cli, run the MEMORY STATS command to query the memory usage of your Tair instance.

The memory consumption of a Tair instance consists of the following major parts:

The memory consumed by business data. This is the part that you should focus on in your analysis.
The memory consumed by non-business data. This includes the memory consumed by the backlog buffer of master-replica replication and the memory consumed to initialize the Tair process.

Sample responses and parameters:

Note In the following sample responses, the size of consumed memory is measured in bytes.

 1) "peak.allocated" // The peak memory that the Tair process has consumed over its lifetime so far. 
 2) (integer) 79492312
 3) "total.allocated" // The total number of bytes that are allocated to run the Tair process. This is the current total memory usage. 
 4) (integer) 79307776
 5) "startup.allocated" // The memory consumed by the Tair process at startup. 
 6) (integer) 45582592
 7) "replication.backlog" // The size of the replication backlog buffer. 
 8) (integer) 33554432
 9) "clients.slaves" // The size of the read and write buffer in all replica nodes for master-replica replication. 
10) (integer) 17266
11) "clients.normal" // The size of the read and write buffers in other clients that are connected to all data nodes except replica nodes. 
12) (integer) 119102
13) "aof.buffer" // The cache used for AOF persistence and the cache generated during AOF rewrite operations. 
14) (integer) 0
15) "db.0"  // The memory overhead of database 0. Each database is reported as a separate db.X entry. 
16) 1) "overhead.hashtable.main" // The total memory consumed by the hash tables in the current database. This is the memory consumed to store metadata. 
    2) (integer) 144
    3) "overhead.hashtable.expires" // The memory consumed by the hash table that stores the expiration times of keys. 
    4) (integer) 0
17) "overhead.total" // The value of the overhead.total parameter is calculated based on the following formula: overhead.total = startup.allocated + replication.backlog + clients.slaves + clients.normal + aof.buffer + db.X. 
18) (integer) 79273616
19) "keys.count" // The total number of keys in the current Tair instance.
20) (integer) 2
21) "keys.bytes-per-key" // The average size per key in the current Tair instance. Formula: (total.allocated-startup.allocated)/keys.count. 
22) (integer) 16862592
23) "dataset.bytes" // The memory consumed by business data. 
24) (integer) 34160
25) "dataset.percentage" // The percentage of the memory consumed by business data. Formula: dataset.bytes × 100/(total.allocated - startup.allocated). 
26) "0.1012892946600914"
27) "peak.percentage" // The percentage of the current total memory usage to the historical peak memory usage. Formula: total.allocated × 100/peak.allocated. 
28) "99.767860412597656"
29) "fragmentation" // The memory fragmentation rate. 
30) "0.45836541056632996"
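
The derived fields in the preceding sample can be recomputed from the raw values by using the formulas in the comments. The following Python sketch is a worked check against the sample response only. The values are copied from the sample, and small differences in the last decimal places are expected because the server reports rounded floating-point values.

```python
# Values copied from the sample MEMORY STATS response above (bytes).
stats = {
    "peak.allocated": 79492312,
    "total.allocated": 79307776,
    "startup.allocated": 45582592,
    "overhead.total": 79273616,
    "keys.count": 2,
    "dataset.bytes": 34160,
}

# Memory allocated after startup; used by two of the formulas.
net = stats["total.allocated"] - stats["startup.allocated"]

bytes_per_key = net // stats["keys.count"]
dataset_percentage = stats["dataset.bytes"] * 100 / net
peak_percentage = stats["total.allocated"] * 100 / stats["peak.allocated"]

print(bytes_per_key)                  # 16862592
print(round(dataset_percentage, 4))   # 0.1013
print(round(peak_percentage, 4))      # 99.7679

# In this sample, dataset.bytes also equals total.allocated - overhead.total.
print(stats["total.allocated"] - stats["overhead.total"])  # 34160
```

In this sample, most of the allocated memory is overhead, such as the startup allocation and the replication backlog, and only about 0.1% of the memory allocated after startup holds business data.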

In redis-cli, run the MEMORY USAGE command to query the memory consumed by a specified key. Unit: bytes.
Sample command:
```
MEMORY USAGE Key0089393003
```
Sample output:
```
(integer) 1000072
```
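
To compare many keys instead of checking them one at a time, you can iterate over the keyspace with SCAN and call MEMORY USAGE for each key. The following Python sketch assumes a client object that provides scan_iter and memory_usage methods, which the redis-py client does. The largest_keys helper, the StubClient class, and every key and size other than the sample key above are hypothetical and exist only so that the sketch runs without a server.

```python
import heapq


def largest_keys(client, limit=10, scan_count=1000):
    """Return up to `limit` (size_in_bytes, key) pairs, largest first."""
    sizes = []
    for key in client.scan_iter(count=scan_count):
        size = client.memory_usage(key)
        # MEMORY USAGE returns nil (None) if the key no longer exists.
        if size is not None:
            sizes.append((size, key))
    return heapq.nlargest(limit, sizes)


class StubClient:
    """Stand-in for a real client so that the sketch runs without a server."""

    def __init__(self, sizes):
        self._sizes = sizes

    def scan_iter(self, count=None):
        return iter(list(self._sizes))

    def memory_usage(self, key):
        return self._sizes.get(key)


stub = StubClient({"Key0089393003": 1000072, "session:1": 96, "gone": None})
print(largest_keys(stub, limit=2))
```

Scanning adds load to the instance, so for large production instances the offline key analysis described in Step 2 is likely the safer choice.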

In redis-cli, run the MEMORY DOCTOR command to obtain memory diagnostic suggestions.

After you run the MEMORY DOCTOR command, the diagnostic suggestions for your Tair instance are provided from the following dimensions. You can make optimization decisions based on the diagnostic suggestions.

    int empty = 0;     /* Instance is empty or almost empty. */
    int big_peak = 0;       /* Memory peak is much larger than used mem. */
    int high_frag = 0;      /* High fragmentation. */
    int high_alloc_frag = 0;/* High allocator fragmentation. */
    int high_proc_rss = 0;  /* High process rss overhead. */
    int high_alloc_rss = 0; /* High rss overhead. */
    int big_slave_buf = 0;  /* Slave buffers are too big. */
    int big_client_buf = 0; /* Client buffers are too big. */
    int many_scripts = 0;   /* Script cache has too many scripts. */

Step 2: Optimize memory usage

Check whether the existing keys meet business requirements and delete unnecessary keys in a timely manner.
Use the cache analytics feature to analyze the distribution of large keys and the time-to-live (TTL) of keys. For more information, see Offline key analysis.
1. Check whether proper TTL values are configured for keys.
  
  Note In the following example, no TTL values are configured for keys. We recommend that you configure proper TTL values on your client based on your business requirements.
  
  Figure 4. Example distribution of TTL values for keys
2. Evaluate large keys and split these keys based on your business requirements.
  
  Figure 5. Example of large key analysis
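
As a complement to the TTL check above, keys without a TTL can also be listed from a client. The following Python sketch relies on the TTL command returning -1 for a key that has no expiration and -2 for a key that does not exist. It assumes a client object that provides scan_iter and ttl methods, which the redis-py client does. The keys_without_ttl helper, the StubClient class, and the key names are hypothetical.

```python
def keys_without_ttl(client, scan_count=1000):
    """Yield keys that exist but have no expiration set."""
    for key in client.scan_iter(count=scan_count):
        # TTL returns -1 for a key without expiration and -2 for a missing key.
        if client.ttl(key) == -1:
            yield key


class StubClient:
    """Stand-in for a real client so that the sketch runs without a server."""

    def __init__(self, ttls):
        self._ttls = ttls

    def scan_iter(self, count=None):
        return iter(list(self._ttls))

    def ttl(self, key):
        # Keys that are not in the mapping are treated as missing.
        return self._ttls.get(key, -2)


stub = StubClient({"cache:a": 3600, "cache:b": -1, "cache:c": -1})
print(sorted(keys_without_ttl(stub)))
```

The helper is a generator, so you can stop iterating as soon as you have enough examples instead of scanning the whole keyspace.
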
Configure a proper eviction policy or modify the value of the maxmemory-policy parameter based on your business requirements. For more information, see Modify parameters of an instance.

Note By default, the eviction policy of Tair is volatile-lru. For more information, see Supported parameters.
Set the frequency of deleting expired keys to a proper value or modify the value of the hz parameter based on your business requirements. For more information, see Change the frequency of background tasks.

Note We recommend that you set the hz parameter to a value that is smaller than 100. A larger value increases CPU utilization. You can also configure the system to automatically modify the value if your instance is a DRAM-based or persistent memory-optimized instance. For more information, see Enable dynamic frequency control for background tasks.
If the memory usage is still high after you perform the preceding optimizations, upgrade your instance to an instance type that has more memory. An upgrade improves instance performance and allows the instance to handle more traffic. For more information, see Change the configurations of an instance.

Note Before you upgrade your Tair instance, you can purchase a pay-as-you-go instance to test whether the upgrade specifications meet your workload requirements. You can release the pay-as-you-go instance after you complete the test. For more information, see Release pay-as-you-go instances.