Community Blog Learn Nearly Everything About Redis Through An Incident

Learn Nearly Everything About Redis Through An Incident

This article summarizes almost everything you need to know about Redis from an incident.

By Kanpo

Simple Review

To recap briefly:

(1) During the day, operations on large keys in the Tair (Redis®* OSS-Compatible) instance increased as traffic grew and the bandwidth usage reached 100% at peak hours.


(2) As a result, the memory usage of the instance surged from 0% to 100% within 5 minutes.


(3) At 11:22:02, all GET and SET commands timed out.


(4) Timeout errors were returned.

Unsolved Mystery

Question: Does 100% memory usage mean that Redis® is unavailable?

Answer: No, 100% memory usage does not mean that Redis® is unavailable under normal circumstances.

Redis® has a cache eviction mechanism. When the memory usage reaches 100%, Redis® does not crash. Instead, it uses eviction policies to free up memory and ensure system stability.


For more information, see Replacement strategies: What do I do if the cache is full?

The following figure shows the eviction policy of the instance.


Most users tend to leave this configuration as it is.

By default, the volatile-lru policy is used.

• The volatile-lru policy uses the Least Recently Used (LRU) algorithm to evict keys, and evicts only the least recently used keys for which an expiration time is set. By comparison, the allkeys-lru policy selects the least recently used keys from all keys, regardless of whether an expiration time is set.

• The volatile-lru policy is applicable if you use Redis® to cache data and do not mind losing some data.

You must configure an appropriate eviction policy based on your business requirements to ensure that the system can stably process new requests and perform write operations as expected when the memory usage reaches the upper limit. The noeviction policy is not supported.
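As a rough illustration of how volatile-lru chooses its victims, the following Python sketch models a toy cache. The class and its structure are illustrative assumptions, not the Redis® implementation: when the key limit is reached, the least recently used key among the keys that have an expiration time is evicted.

```python
import time

class VolatileLRUCache:
    """Toy model of volatile-lru (illustrative, not the Redis implementation):
    when the key limit is reached, evict the least recently used key among
    keys that have an expiration time set. Lazy TTL expiry is not modeled;
    only the eviction choice is."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self.data = {}       # key -> value
        self.expires = {}    # key -> expiry timestamp (volatile keys only)
        self.last_used = {}  # key -> logical access clock
        self.clock = 0

    def _touch(self, key):
        self.clock += 1
        self.last_used[key] = self.clock

    def set(self, key, value, ttl=None):
        if key not in self.data and len(self.data) >= self.max_keys:
            # volatile-lru considers only keys with a TTL as eviction candidates
            candidates = [k for k in self.data if k in self.expires]
            if not candidates:
                raise MemoryError("OOM: no volatile keys to evict")
            victim = min(candidates, key=lambda k: self.last_used[k])
            del self.data[victim], self.expires[victim], self.last_used[victim]
        self.data[key] = value
        if ttl is not None:
            self.expires[key] = time.time() + ttl
        self._touch(key)

    def get(self, key):
        if key in self.data:
            self._touch(key)
            return self.data[key]
        return None
```

With allkeys-lru, the `candidates` list would simply be all keys; that one-line difference is the entire distinction between the two policies.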

With the instance in this incident, SET and GET commands should have been processed as expected when the memory usage reached 100%. (More complex commands are not considered here.)

• Redis® does not crash easily. However, if memory is exhausted and no eviction policy is configured or the eviction policy fails to take effect, Redis® may reject new write operations and return the error: OOM command not allowed when used memory > 'maxmemory'.

• If the system is not properly configured or the memory of the operating system is not properly managed, the Redis® process may be killed by the operating system.

Question: In this incident, the instance was unavailable when the memory usage reached 100%. How did that happen?

Guess 1: The incident was caused by a performance bottleneck due to an untimely eviction.

In other words, the write speed was considerably faster than the eviction speed.

Answer: This cannot be the case with normal business data write operations.

• Redis® relies on pure memory, which ensures a high eviction speed.

• The related business in this incident does not involve highly frequent write operations.

Only a small number of keys are stored in the instance, and they occupy less than 5% of the entire memory.


The memory usage surge in this incident cannot be attributed to an increasing number of keys.

A preliminary conclusion can be drawn from the preceding discussion: Redis® crashes are generally not caused by the write speed exceeding the eviction speed, especially when an appropriate eviction policy is used. If the write speed is so high that the eviction policy cannot clear expired data in time, Redis® may frequently search for expired keys and evict them, resulting in performance degradation.

For more information, see Fluctuating response latency: How do I deal with slow Redis®? (Part 1).

Performance degradation can be caused by the following mechanism:

Redis® uses an automatic deletion mechanism to delete expired keys and reclaim memory. This mechanism is applicable to a wide range of scenarios. However, this process can block Redis® operations and degrade the Redis® performance.

Redis® supports expiration time on keys. By default, Redis® deletes some expired keys every 100 milliseconds based on the following algorithm:

  1. Sample a number of keys and delete expired keys. The number of keys to be sampled is specified by the ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP parameter.
  2. If more than 25% of the sampled keys have expired, repeat the sampling and deletion process until the proportion of expired keys is at or below 25%.

ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP is a parameter of Redis®. Its default value is 20, which means that up to 200 keys can be sampled, and the expired ones deleted, per second (20 keys every 100 milliseconds). This mechanism is helpful in clearing expired keys and freeing up memory, and deleting about 200 expired keys per second does not affect the performance of Redis®.

However, if the second rule of the algorithm is continuously triggered, Redis® keeps deleting keys to free up memory. Take note that delete operations cause blocking. (Redis® 4.0 and later use an asynchronous thread mechanism to reduce blocking impacts.) Therefore, if the second rule is continuously triggered, the Redis® thread keeps deleting keys and cannot serve other key-value operations as expected. This causes increased delays in other key-value operations and causes Redis® to run slowly.

A major reason why the second rule is continuously triggered is that EXPIREAT commands with the same timestamp are frequently used to set the expiration times of keys. This causes a large number of keys to expire at the same second.

The impact is similar to the impact on the performance caused by frequent garbage collection (GC) on the Java VM (JVM).
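The sampling algorithm described above can be sketched in Python. This is a toy model of the periodic expiration pass, not the Redis® source code; the 25% rule is the stopping condition from the two-step algorithm above.

```python
import random

ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP = 20  # Redis default sample size per loop

def active_expire_cycle(expires, now, rng=random):
    """Toy model of the periodic expiration pass: sample up to 20 volatile
    keys, delete the expired ones, and repeat while more than 25% of the
    sampled keys had expired."""
    deleted = 0
    while True:
        keys = list(expires)
        if not keys:
            break
        sample = rng.sample(keys, min(ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP, len(keys)))
        expired = [k for k in sample if expires[k] <= now]
        for k in expired:
            del expires[k]
            deleted += 1
        # stop once 25% or less of the sampled keys were expired
        if len(expired) * 4 <= len(sample):
            break
    return deleted
```

If a burst of EXPIREAT commands makes nearly every sampled key expired, the inner loop keeps repeating, which is exactly the blocking behavior described above.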

Guess 2: The incident was caused by excessive abnormal business data write operations.

What exactly caused the surge in memory usage?


Clues

For more information, see How do I resolve the sudden increase in memory usage of an instance?

Evidence


Truth

The memory usage surge was caused by buffer overflow.

For more information, see Buffer: A place that could start a 'disaster'

Knowledge Point: Composition of Redis® Memory Usage


Use the INFO MEMORY command to analyze the memory usage. The following example shows the command output in a simulated case of buffer overflow. It is not the command output in the incident described in this article.

# Memory
used_memory:1072693248
used_memory_human:1023.99M
used_memory_rss:1090519040
used_memory_rss_human:1.02G
used_memory_peak:1072693248
used_memory_peak_human:1023.99M
used_memory_peak_perc:100.00%
used_memory_overhead:1048576000
used_memory_startup:1024000
used_memory_dataset:23929848
used_memory_dataset_perc:2.23%
allocator_allocated:1072693248
allocator_active:1090519040
allocator_resident:1090519040
total_system_memory:16777216000
total_system_memory_human:16.00G
used_memory_lua:37888
used_memory_lua_human:37.89K
used_memory_scripts:1024000
used_memory_scripts_human:1.00M
maxmemory:1073741824
maxmemory_human:1.00G
maxmemory_policy:noeviction
allocator_frag_ratio:1.02
allocator_frag_bytes:17825792
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.00
rss_overhead_bytes:0
mem_fragmentation_ratio:1.02
mem_fragmentation_bytes:17825792
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:1048576000
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

Analysis and Interpretation

The following fields in the preceding output of the INFO MEMORY command reveal that the buffers have consumed almost all the memory.

1. Overall memory usage:

used_memory: 1072693248
maxmemory: 1073741824

This indicates that the used memory has almost reached the specified upper limit and the memory has been used up.

2. Memory used by buffers:

used_memory_overhead: 1048576000

This parameter specifies the memory overheads that Redis® allocated for buffers, connections, and other metadata. In this example, these memory overheads (used_memory_overhead: 1048576000) constitute most of the overall memory usage (used_memory: 1072693248).

3. Memory used by datasets:

used_memory_dataset: 23929848
used_memory_dataset_perc: 2.23%

This indicates that the stored data occupies only about 23.93 MB of memory. Most memory is occupied by buffers.

4. Memory used by client buffers:

mem_clients_normal: 1048576000

This indicates that normal client connections have consumed about 1 GB of memory, which usually means that the output buffers may have approached or reached the specified limit.

5. Memory fragmentation ratios:

allocator_frag_ratio: 1.02
mem_fragmentation_ratio: 1.02

The fragmentation ratios are low, which indicates that the allocator is working efficiently. The problem is not fragmentation but the disproportionate share of memory occupied by buffers.

Summary

As you can see from the above example, Redis® memory is almost exhausted by buffers. Conclusions:

• The used memory (used_memory: 1072693248) is close to the maximum memory (maxmemory: 1073741824).

• The memory overheads (used_memory_overhead) are large and are mainly used by normal client connections (possibly the output buffers). Data occupies only a small amount of memory.

• The allocator fragmentation ratio (allocator_frag_ratio) and Resident Set Size (RSS) fragmentation ratio (mem_fragmentation_ratio) are low, which indicates that fragmentation is not an issue.
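This kind of analysis can also be done programmatically by parsing the INFO MEMORY output. The following Python sketch is illustrative (the helper names are assumptions; the field names come from the sample output above) and computes the share of used memory taken by overhead versus the dataset.

```python
def parse_info_memory(text):
    """Parse `INFO MEMORY` output into a dict, converting integer fields."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        key, _, value = line.partition(":")
        stats[key] = int(value) if value.isdigit() else value
    return stats

def buffer_pressure(stats):
    """Share of used memory taken by overhead (buffers, connections,
    metadata) and by the dataset itself, plus how full the instance is."""
    used = stats["used_memory"]
    return {
        "overhead_pct": 100 * stats["used_memory_overhead"] / used,
        "dataset_pct": 100 * stats["used_memory_dataset"] / used,
        "maxmemory_pct": 100 * used / stats["maxmemory"],
    }
```

Applied to the sample output above, this reports roughly 98% overhead and about 2.23% dataset, matching the used_memory_dataset_perc field.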

Theoretical maximum buffer size

Purpose of buffers

How Redis® works (single-client simplified diagram):


How Redis® works (single-client detailed diagram):


Buffers are a memory area used to temporarily store commands and data. This prevents data loss and performance issues caused when data and commands are processed slower than they are transmitted.

How Redis® works with buffers (multi-client simplified diagram):


Query buffer

Definition

The query buffer (also known as the input buffer) is a per-client memory area in which the Redis® server accumulates the commands sent by a client before they are parsed and executed.

Memory usage

You cannot modify the size of query buffers.

The upper limit of the query buffer for each client is set to 1 GB in the code. In other words, the Redis® server allows each client to buffer up to 1 GB of commands and data. A 1-GB buffer is suitable for most production environments. It is enough for handling the requests from most clients. A higher limit may cause Redis® to crash due to excessive memory consumption by clients.

Output buffer

Definition

The output buffer is a per-client memory area in which the Redis® server temporarily stores command responses before they are sent to the client over the network.

Memory usage

The size of the output that the Redis® server returns is usually not controllable. If large keys are requested, the server returns a large amount of data. If an excessive number of commands are processed, output data is generated faster than it can be sent to clients. The responses accumulate on the server, and the output buffers grow and consume a disproportionate amount of memory. In the worst case, the system crashes. Redis® uses the client-output-buffer-limit parameter to protect system security.

client-output-buffer-limit pubsub 8mb 2mb 60

In the preceding code, pubsub specifies that the specified limits apply to Publisher/Subscriber (Pub/Sub) clients. 8mb specifies that the upper limit of an output buffer is 8 MB. If the size of the output buffer reaches 8 MB, the server closes the client connection. 2mb and 60 specify that the server closes the client connection if the size of the output buffer remains larger than 2 MB for 60 seconds.
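The semantics of these limits can be sketched as a small decision function. This is a toy model with the limits from the example above (8 MB hard, 2 MB soft, 60 seconds); the function name and signature are illustrative, and Redis® applies the soft limit by tracking how long the buffer has continuously exceeded it.

```python
def should_close(buf_bytes, over_soft_secs, hard=8 * 2**20, soft=2 * 2**20,
                 soft_seconds=60):
    """Toy model of `client-output-buffer-limit pubsub 8mb 2mb 60`:
    close the connection if the output buffer reaches the hard limit,
    or stays above the soft limit for soft_seconds or longer."""
    if buf_bytes >= hard:
        return True  # hard limit: close immediately
    if buf_bytes >= soft and over_soft_secs >= soft_seconds:
        return True  # soft limit held too long: close
    return False
```

A brief spike above the soft limit is tolerated; only a sustained backlog, or a single crossing of the hard limit, costs the client its connection.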

To deal with output buffer overflow, take note of the following items:

  • Avoid performing large key operations that return a large amount of data.
  • Avoid continuously using MONITOR commands in online environments.
  • Use the client-output-buffer-limit parameter to set an appropriate upper limit on output buffer size, or a limit on continuously writing data to the output buffer for a specific period of time.

The default value of the client-output-buffer-limit parameter for Tair (Redis® OSS-Compatible) instances is pubsub 32mb 8mb 60.


For Pub/Sub connections:

• Hard limit on output buffer size: 32 MB for each client connection.

• Maximum number of connections in the connection pool: 300. The theoretical maximum output buffer size is reached when all connections are processing Pub/Sub messages and the hard limit on output buffer size is reached.

Theoretical maximum output buffer size:

Theoretical maximum output buffer size = Hard limit on output buffer size × Maximum number of connections in the connection pool = 32 MB × 300 = 9,600 MB = 9.375 GB

For the instance discussed in this topic, the output buffer of all Pub/Sub connections can theoretically consume up to 9.375 GB of memory.

If the output buffers and query buffers reach the upper limits, the memory usage reaches 100%.


The sum of the maximum memory that can be used by the output buffer and query buffer is 10.375 GB, which is considerably larger than 2 GB, the memory of the instance.
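The arithmetic above can be verified in a few lines, using the figures from this instance's configuration (32 MB hard limit per Pub/Sub connection, 300 connections, a 1 GB query buffer cap, and 2 GB of instance memory):

```python
MB_PER_GB = 1024

# Figures from this instance's configuration
output_buffer_max_mb = 32 * 300          # 32 MB hard limit x 300 connections = 9,600 MB
query_buffer_max_mb = 1 * MB_PER_GB      # 1 GB query buffer cap
total_gb = (output_buffer_max_mb + query_buffer_max_mb) / MB_PER_GB
instance_memory_gb = 2                   # total memory of the instance
```

The theoretical buffer ceiling of 10.375 GB is more than five times the 2 GB of memory the instance actually has, which is why buffers alone could exhaust it.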

Results:

The stored objects are subject to expiration time and are deleted when they expire.

• The buffer size grows.
• The objects in memory are regularly cleared.
• The memory is subject to the upper limit.

Eventually, the memory is entirely occupied by buffers.

When SET commands are sent to the server, no memory is available to receive the commands because the eviction policy evicts objects but not buffered data.

Conclusion:

After the entire memory is occupied by buffers, Redis® cannot run as expected, such as processing SET and GET commands to write and read data.

Reason for the buffer surge

The preceding discussion analyzed how Redis® stops working as expected after buffers consume the entire memory. The following section analyzes how the buffer size surged to occupy all the memory and make the instance unavailable.

Instance information


Code


Incident replay

1.  Natural traffic growth led to a continuous increase in the outbound bandwidth, up to 96 MB/s.


2.  The outbound bandwidth exceeded 96 MB/s, and the memory used by output buffers surged and even overflowed. As discussed in the preceding text, the output buffer size can be up to about 9 GB with 300 client connections.


3.  The output buffer size reached the upper limit and client connections were closed.

4.  After the client connections were closed, all requests were directed to the database.

5.  After the requests were processed by the database, the database sent SET commands to the instance.

6.  The number of SET commands with large keys soared, causing the inbound bandwidth to surge. Even though the queries per second (QPS) was only 50, the maximum inbound bandwidth could be reached shortly if each write operation wrote 2 MB of data.


7.  The main thread model of Redis® could not process large key requests fast enough, causing intermittent blocking. Requests could not be processed in time and accumulated in the query buffers.

8.  The query buffer size surged.


9.  Eventually, the memory was completely consumed by query and output buffers.

10.  Subsequent SET and GET commands could not enter the query buffers, and the resulting blocking lasted for the timeout period configured on the clients. Meanwhile, the inbound and outbound bandwidth continued to increase. The total bandwidth reached 216 MB/s, which exceeded 192 MB/s, the maximum bandwidth of the instance.


11.  The instance became unavailable. Before subsequent commands could reach the server, the commands that had accumulated in the query buffers had to be processed to free up memory. However, the main thread of Redis® could not process them fast enough, and the QPS plummeted, as shown in the following figure.


After 11:35, the instance was unavailable and all traffic was sent to the database.
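The bandwidth arithmetic in the replay can be checked directly. The figures come from the incident above; the inbound estimate assumes each large-key SET writes about 2 MB, as stated in step 6, so the 196 MB/s computed here is a lower bound that already exceeds the 192 MB/s cap (the observed total later reached 216 MB/s):

```python
# Figures from the incident replay above
qps = 50                # write queries per second after the fallback to the database
mb_per_set = 2          # approximate size of each large-key SET (assumed, per step 6)
inbound_mb_s = qps * mb_per_set       # 100 MB/s of inbound traffic
outbound_mb_s = 96                    # observed outbound bandwidth at its cap
total_mb_s = inbound_mb_s + outbound_mb_s
instance_limit_mb_s = 192             # maximum total bandwidth of the instance
```

Even at a modest 50 QPS, large values alone are enough to saturate the instance's bandwidth, which is why the standards below cap value sizes rather than request rates.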


Development and O&M Standards

As you can see from the preceding discussion, the performance of Redis® is subject to limits. High performance is not always guaranteed.

You must use benchmarks to understand performance.


The following table describes the issues that can cause performance bottlenecks in Redis®.


• Computing resources: Wildcard characters, concurrent Lua scripts, one-to-many Pub/Sub models, and hot keys consume a large amount of computing resources. For cluster instances, these items can also cause skewed requests and underutilization of data shards.

• Storage resources: Streaming jobs and large keys consume a large amount of storage resources. For cluster instances, these items can also cause data skew and underutilization of data shards.

• Network resources: Database-wide scans (by running the KEYS command) and range queries of large keys and values (by running the HGETALL command) consume a large amount of network resources and often cause thread congestion.
Important
The high-concurrency capability of Tair (Redis® OSS-Compatible) does not compensate for these resource-intensive operations; instead, they degrade the overall performance of Tair (Redis® OSS-Compatible). For example, storing large values in Tair (Redis® OSS-Compatible) does not improve access performance to a large degree.

In the incident described in this topic, high network resource consumption and high storage resource consumption occurred.

The following tables describe the development and O&M standards for Tair (Redis® OSS-Compatible) in terms of business deployment, key design, SDK usage, command usage, and O&M management.

• Business deployment standards: https://www.alibabacloud.com/help/en/redis/use-cases/development-and-o-and-m-standards-for-apsaradb-for-redis

• Key design standards: https://www.alibabacloud.com/help/en/redis/use-cases/development-and-o-and-m-standards-for-apsaradb-for-redis

• SDK usage standards: https://www.alibabacloud.com/help/en/redis/use-cases/development-and-o-and-m-standards-for-apsaradb-for-redis

• Command usage standards: https://www.alibabacloud.com/help/en/redis/use-cases/development-and-o-and-m-standards-for-apsaradb-for-redis

• O&M management standards: https://www.alibabacloud.com/help/en/redis/use-cases/development-and-o-and-m-standards-for-apsaradb-for-redis

Business Deployment Standards

Importance Standard Description
★★★★★ Determine whether Tair (Redis® OSS-Compatible) is used as a high-speed cache or an in-memory database. High-speed cache: We recommend that you disable append-only file (AOF) persistence to reduce overheads and avoid strong dependence on cached data because the cached data may be evicted. For example, when the memory usage of a Tair (Redis® OSS-Compatible) instance reaches the upper limit, the eviction policy is triggered to reclaim memory for new data to be written. Depending on the write workload of your business, this can result in increased latency.
Important
To use the data flashback feature, you must enable AOF persistence.
In-memory database: We recommend that you choose persistent memory-optimized instances of Redis Enhanced Edition (Tair). Persistent memory-optimized instances offer command-level persistence. In addition, you can monitor memory usage by configuring alert rules. For more information, see the "Alert settings" topic in the Tair (Redis® OSS-Compatible) documentation.
★★★★★ Deploy your business close to Tair (Redis® OSS-Compatible) instances. For example, you can deploy your business on an Elastic Compute Service (ECS) instance that resides in the same virtual private cloud (VPC) as your Tair (Redis® OSS-Compatible) instances. Tair (Redis® OSS-Compatible) is a high-performance database service. However, if you deploy your business server far from Tair (Redis® OSS-Compatible) instances and the business server is connected to the instances over the Internet, the performance of Tair (Redis® OSS-Compatible) is greatly compromised due to network latency.
Description
For cross-region deployment, you can use the geo-replication capability of Global Distributed Cache to implement geo-disaster recovery or active geo-redundancy, reduce network latency, and simplify business design. For more information, see the "Overview of Global Distributed Cache for Tair" topic in the Tair (Redis® OSS-Compatible) documentation.
★★★★☆ Create a Tair (Redis® OSS-Compatible) instance for each service. Do not use a Tair (Redis® OSS-Compatible) instance for different services. For example, do not use a Tair (Redis® OSS-Compatible) instance as a high-speed cache and an in-memory database. Otherwise, the eviction policy, slow queries, and FLUSHDB commands of one service affect other services.
★★★★☆ Configure appropriate eviction policies to evict expired keys. The default eviction policy for Tair (Redis® OSS-Compatible) is volatile-lru. For more information about eviction policies, see the "Parameters that can be configured for Redis Open-Source Edition instances" topic in the Tair (Redis® OSS-Compatible) documentation.
★★★☆☆ Properly manage stress testing data and duration. Tair (Redis® OSS-Compatible) does not delete stress testing data. To prevent impacts on your business, you must properly manage stress testing data and duration by yourself.

Key Design Standards

Importance Standard Description
★★★★★ Control the size of key values. We recommend that you keep the size of a value below 10 KB. Excessively large values can cause data skew, hot key issues, full bandwidth usage, and full CPU utilization. You can prevent these issues from the beginning by making sure that key values are of appropriate size.
★★★★★ Configure appropriate key names of appropriate length. Key names:
Use descriptive strings as key names. If you want to combine a database name, table name, and field name into a key name, we recommend that you use colons (:) to separate them. Example: project:user:001.
Shorten key names without compromising their descriptivity. For example, username can be shortened to u.
In Redis, braces ({}) are recognized as hash tags. If you use cluster instances, you must use braces correctly in key names to prevent data skew. For more information, see "Redis cluster specification" in the Redis documentation.
Note
For a cluster instance, if you want to manage multiple keys by running a command such as RENAME and do not use hash tags to ensure that the keys reside in the same data shard, the command cannot be run.
Length: We recommend that you keep key names within 128 bytes and preferably shorter.
★★★★★ For complex data structures that support sub-keys, you must avoid including excessive sub-keys in one key. We recommend that you include less than 1,000 sub-keys in a key.
Description
Common complex data structures include hashes, sets, Zsets, GEO structures, streams, and structures specific to Redis Enhanced Edition (Tair), such as exHash, Bloom, and TairGIS.
The time complexity of specific commands such as HGETALL is directly related to the number of sub-keys. Excessive sub-keys increase the time complexity of a command. If you frequently run commands whose time complexity is O(N) or higher, issues such as slow queries, data skew, and hot key issues occur.
★★★★☆ Use the serialization method to convert values into readable structures. The bytecode of a programming language may change when the version of the language changes. If you store naked objects (such as Java objects and C# objects) in Tair (Redis® OSS-Compatible) instances, the software stack may be difficult to upgrade. We recommend that you use the serialization method to convert values into readable structures.

SDK Usage Standards

Importance Standard Description
★★★★★ Use JedisPool or JedisCluster to connect to Tair (Redis® OSS-Compatible) instances.
Description
We recommend that you use the TairJedis client to connect to DRAM-based instances of Redis Enhanced Edition (Tair), because the TairJedis client provides encapsulation classes for new data structures. For more information, see the "Use a client to connect to an instance" topic in the Tair (Redis® OSS-Compatible) documentation.
If you use a single connection, the client cannot automatically reconnect to Tair (Redis® OSS-Compatible) instances after the connection times out. For more information about how to use JedisPool to connect to Tair (Redis® OSS-Compatible) instances, see the "Use a client to connect to an instance" and "JedisPool optimization" topics in the Tair (Redis® OSS-Compatible) documentation, and "Class JedisCluster" at javadoc.io.
★★★★☆ Design proper fault tolerance mechanisms for your clients. Network fluctuations and high usage of resources may cause connection timeouts or slow queries. To prevent these issues, you must design proper fault tolerance mechanisms for your clients.
★★★★☆ Set longer retry intervals for your clients. If retry intervals are shorter than required, such as shorter than 200 milliseconds, a large number of retries may occur in a short period of time. This can result in a service avalanche. For more information, see the "Retry mechanisms for clients" topic in the Tair (Redis® OSS-Compatible) documentation.

Command Usage Standards

Importance Standard Description
★★★★★ Avoid range queries, such as those executed by running the KEYS * command. Instead, use multiple point queries or run the SCAN command to reduce latency. Range queries may cause service interruptions, slow queries, or congestion.
★★★★★ Use extended data structures to perform complex operations. Do not use Lua scripts. Lua scripts consume a large amount of computing and memory resources and do not support multi-threading acceleration. Overly complex or improper Lua scripts may result in the exhaustion of resources.
★★★★☆ Use pipelines to reduce the round-trip time (RTT) of data packets. If you want to send multiple commands to a server and your client does not depend on each response from the server, you can use a pipeline to send the commands at a time. When you use pipelines, take note of the following items:
A pipeline occupies its connection exclusively while its commands are being sent. We recommend that you establish a dedicated connection for pipeline operations to separate them from regular operations.
Each pipeline must contain a proper number of commands. We recommend that you use each pipeline to send no more than 100 commands.
★★★★☆ Correctly use Redis commands. When you use transaction commands, take note of the following limits:
Unlike transactions in relational databases, transactions in Redis cannot be rolled back.
If you want to run transaction commands on cluster instances, use hash tags to ensure that the keys to be managed are allocated in the same hash slot. You must also prevent skewed storage that hash tags may cause.
Do not encapsulate transaction commands in Lua scripts, because the compilation and loading of these commands consume a large amount of computing resources.
★★★★☆ Do not use Pub/Sub commands to perform a large number of message distribution tasks. Pub/Sub commands do not support data persistence or acknowledge mechanisms that ensure data reliability. We recommend that you do not use Pub/Sub commands to perform a large number of message distribution tasks. For example, if you use these commands to distribute a message whose size is greater than 1 KB to more than 100 subscriber clients, server resources may be exhausted and subscriber clients may not receive the message.
Note
To improve performance and balance, Tair (Redis® OSS-Compatible) is optimized for Pub/Sub commands. In cluster instances, proxy nodes calculate the hash values of commands based on channel names and allocate commands to corresponding data nodes.

O&M Management Standards

Importance Standard Description
★★★★★ Understand the impacts of different instance management operations. Configuration changes or restarts affect the status of a Tair (Redis® OSS-Compatible) instance. For example, transient connections may occur on the instance. Before you perform the preceding operations, make sure that you understand the impacts. For more information, see the "Instance states and impacts" topic in the Tair (Redis® OSS-Compatible) documentation.
★★★★★ Verify the error handling capabilities or disaster recovery logic of a client. Tair (Redis® OSS-Compatible) can monitor the health status of nodes. If a master node in an instance becomes unavailable, Tair (Redis® OSS-Compatible) automatically triggers a master-replica switchover. The roles of master and replica nodes are switched over to ensure the high availability of the instance. Before a client is generally available, we recommend that you manually trigger a master-replica switchover. This can help you verify the error handling capabilities or disaster recovery logic of the client. For more information, see the "Manually switch workloads from a master node to a replica node" topic in the Tair (Redis® OSS-Compatible) documentation.
★★★★★ Disable time-consuming or high-risk commands. In a production environment, abuse of commands may cause problems. For example, the FLUSHALL command can delete all data. The KEYS command may cause network congestion. To improve the stability and efficiency of services, you can disable specific commands to minimize risks. For more information, see the "Disable high-risk commands" topic in the Tair (Redis® OSS-Compatible) documentation.
★★★★☆ Handle pending events in a timely manner. To enhance user experience and provide improved service performance and stability, Alibaba Cloud occasionally generates pending events to upgrade the hardware and software of specific servers or replace network facilities. For example, a pending event is generated when the minor version of databases needs to be updated. After you receive an event notification from Alibaba Cloud, you can check the impacts of the event and change the scheduled time of the event to meet your business requirements. For more information, see the "View and manage scheduled events" topic in the Tair (Redis® OSS-Compatible) documentation.
★★★★☆ Configure alerts for core metrics to better monitor the status of your instances. Configure alerts for core metrics such as CPU utilization, memory usage, and bandwidth usage to monitor the status of your instances in real time. For more information, see the "Alert settings" topic in the Tair (Redis® OSS-Compatible) documentation.
★★★★☆ Use the O&M features provided by Tair (Redis® OSS-Compatible) to check the status of instances on a regular basis or troubleshoot resource usage exceptions. Use slow query logs to troubleshoot timeout issues: Slow query logs help you locate slow queries and the IP addresses of the clients that send the query requests. Slow query logs provide a reliable basis for addressing timeout issues.
View performance monitoring data: Tair (Redis® OSS-Compatible) supports a variety of performance metrics. These metrics allow you to gain insights into the status of Tair (Redis® OSS-Compatible) instances and troubleshoot issues at the earliest opportunity.
Create a diagnostic report: Diagnostic reports help you evaluate the status of Tair (Redis® OSS-Compatible) instances, such as performance level, skewed requests, and slow query logs. Diagnostic reports also help you identify exceptions on Tair (Redis® OSS-Compatible) instances.
Use the offline key analysis feature: You can use the offline key analysis feature to identify large keys in Tair (Redis® OSS-Compatible) instances. You can also learn about the memory usage, distribution, and TTL of large keys.
Use the real-time key statistics feature: The real-time key statistics feature helps you identify hot keys in Tair (Redis® OSS-Compatible) instances and allows you to further optimize your databases.
★★★☆☆ Enable the audit log feature and evaluate audit logs. After you enable the audit log feature, the audit statistics about write operations are recorded. Tair (Redis® OSS-Compatible) also allows you to query, analyze, and export audit logs. These features help you monitor the security and performance of your Tair (Redis® OSS-Compatible) instances. For more information, see the "Enable the audit log feature" topic in the Tair (Redis® OSS-Compatible) documentation.
Important
After you enable the audit log feature, the performance of Tair (Redis® OSS-Compatible) instances may degrade by 5% to 15%. The actual performance degradation varies based on the number of write operations or audit operations. If your business expects a large number of write operations, we recommend that you enable the audit log feature only when you perform O&M operations, such as troubleshooting. This helps you prevent performance degradation.

Key Points

Large Keys

A large key is not a key with an excessively long name but a key whose operations result in slow queries.

For large keys of the STRING type, slow queries are caused by the large size of their values.

For large keys of other types, slow queries are caused by the large number of members in the keys. These slow queries have high time complexity.


Identify large keys

For more information, see the "Identify and handle large keys and hotkeys" topic in the Tair (Redis® OSS-Compatible) documentation.

Optimize large keys

To optimize large keys, you can split them into multiple keys.

  1. For keys of the STRING type, keep the size of values within 10 KB at the business layer. For excessively large values, you can use serialization or compression algorithms to process them. Common serialization algorithms include protostuff, Kryo, and FST.
  2. For keys of collection types, split the keys into multiple keys that have an appropriate number of members.
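One common way to split a collection-type large key is to route each member to one of N smaller keys by hashing the member. The following Python sketch is illustrative: the `base:{n}` naming scheme and the shard count of 16 are assumptions, not from the source.

```python
import zlib

def shard_key(base, member, shards=16):
    """Route a member of an oversized collection key to one of `shards`
    smaller keys, so that no single key accumulates too many members.
    The base:{n} naming scheme and shard count are illustrative."""
    n = zlib.crc32(member.encode("utf-8")) % shards
    return f"{base}:{n}"
```

For example, instead of `HSET user:sessions <member> <value>` on one huge hash, a client would call `HSET shard_key("user:sessions", member) <member> <value>`, keeping each shard's member count, and the cost of commands such as HGETALL, bounded.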

Stress Testing

1.  Check whether data is skewed on each shard.

2.  Check whether large keys and hot keys exist by using the CloudDBA module on the instance details page.

  • During stress testing, view the information on the Top Key Statistics and Slow Queries pages.
  • After stress testing, view the information on the Offline Key Analysis page and create a diagnostic report that covers the stress testing period on the Diagnostic Reports page.

3.  Check the trends of CPU utilization, memory usage, and inbound and outbound bandwidth usage within a cache cycle.

4.  View the audit logs to check whether the write logs conform to the code logic.

Common O&M Commands

When an incident occurs, you can use the following commands to troubleshoot issues and record related data.

CLIENT LIST


INFO MEMORY


MEMORY USAGE



*Redis is a registered trademark of Redis Ltd. Any rights therein are reserved to Redis Ltd. Any use by Alibaba Cloud is for referential purposes only and does not indicate any sponsorship, endorsement or affiliation between Redis and Alibaba Cloud.
