From 'Firefighting' to 'Prevention': Building a Proactive Defense System for Redis Big Keys and Hot Keys

This article introduces using Alibaba Cloud DAS and SLS to build a proactive, time-series audit system for preventing and governing Redis Big Keys and Hot Keys.

Target Audience
This guide is designed for technical professionals such as database administrators (DBAs), site reliability engineers (SREs), backend and system architects, and DevOps engineers who manage high-concurrency Tair/Redis® clusters. It is particularly valuable for teams struggling with intermittent latency spikes, memory instability, and other problems caused by Big Keys and Hot Keys, or for those seeking to move beyond basic monitoring toward data-driven, long-term governance. It offers proactive governance strategies that mitigate risks and streamline operations.

Overview

Redis instability often stems from Big Keys and Hot Keys, yet traditional monitoring offers only static snapshots with limited history. By integrating Alibaba Cloud DAS (Database Autonomy Service) with SLS (Simple Log Service), you can transform discrete monitoring data into a long-term, queryable time-series audit system. This enables proactive risk prevention, precise root cause analysis, and a closed-loop governance strategy.

1. Core Challenges and Governance Pain Points

Redis governance must evolve from static monitoring to dynamic auditing. The core risks of Big Keys and Hot Keys stem from localized resource overload and single-thread blocking. Without long-term, granular data, traditional monitoring leaves significant blind spots, and the related issues stay hidden until they cause outages.

1.1. Core Risks: Mechanisms and Impact

In high-concurrency production environments, Big Keys and Hot Keys are the primary triggers for cluster jitter or even collapse.

  • Big Keys (Resource Blocking):
    Keys that consume significantly more memory than average or contain an excessive number of members.

    • Main Thread Blocking: Redis uses a single-threaded execution model. Reading, deleting, or syncing large data structures (e.g., Hashes with tens of thousands of members) blocks the CPU for milliseconds to seconds, leading to request backlogs and slow queries.
    • Stability Crisis: Big Keys increase memory fragmentation. During master-replica synchronization (PSYNC), they can cause buffer overflows, leading to frequent disconnections or even OOM (Out Of Memory) crashes.
  • Hot Keys (Single-Point Performance Bottleneck):
    Specific keys accessed with extremely high frequency within a short period.

    • Traffic Skew: Extreme traffic concentrates on a single shard, causing its CPU or network bandwidth to hit physical limits rapidly.
    • The Scaling Trap: Since pressure is isolated to a single node, horizontal scaling (adding shards) does not alleviate the load. This often leads to single-point failure and cascading avalanches.
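
To make the scaling trap concrete, the following Python sketch simulates a hypothetical workload in which one key receives 90% of all requests. The routing function (CRC32 modulo the shard count) and the key names are assumptions chosen for illustration, not the routing algorithm Tair/Redis® actually uses; any deterministic key-to-shard mapping shows the same effect.

```python
import zlib
from collections import Counter


def busiest_shard_share(requests, n_shards):
    """Return the fraction of all requests served by the most loaded shard."""
    # Stand-in routing: a key always maps to the same shard (assumption).
    load = Counter(zlib.crc32(key.encode()) % n_shards for key in requests)
    return max(load.values()) / len(requests)


# Hypothetical workload: one hot key receives 9,000 of 10,000 requests.
requests = ["product:1001"] * 9000 + [f"user:{i}" for i in range(1000)]

# Share of traffic on the busiest shard for 4, 8, and 16 shards.
shares = {n: busiest_shard_share(requests, n) for n in (4, 8, 16)}
for n, share in shares.items():
    print(f"{n} shards: busiest shard serves {share:.0%} of requests")
```

Because every request for the hot key lands on the same shard, the busiest shard never drops below the hot key's 90% share, no matter how many shards are added.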

1.2. The Governance Challenge: Static Snapshots vs. Time-Series Auditing

Basic Top Key monitoring usually provides only static snapshots (e.g., redis-cli --bigkeys) with limited historical retention (typically 7 days). This creates significant logical gaps in complex production environments:

  • Blind Spots in Time Dimension: Limited retention prevents us from distinguishing whether a Big Key results from a sudden traffic spike or from a "boiling frog" accumulation caused by business logic bugs over 30+ days.
  • No Closed-Loop Validation: Without long-term time-series data, we cannot verify if a cleaned-up Key silently reappears after a week.
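
As an illustration of closed-loop validation, the sketch below scans the timestamps at which a key was observed and reports every gap after which the key came back. The function name, the input shape (a plain list of epoch seconds), and the default gap of two 60-second collection cycles are assumptions for this example, not part of DAS or SLS.

```python
def reappearances(timestamps, cycle_s=60, tolerance_cycles=2):
    """Return (last_seen_before_gap, first_seen_after_gap) pairs.

    A gap is any pause between consecutive observations that is longer
    than `tolerance_cycles` collection cycles.
    """
    max_gap = cycle_s * tolerance_cycles
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


# Hypothetical key: seen for three cycles, cleaned up, then back a week later.
week = 7 * 86400
observed = [0, 60, 120, week, week + 60]
print(reappearances(observed))
```

This check only works when the observation history spans the whole gap, which is why retention longer than 7 days matters here.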

In actual O&M, DBAs need answers to high-precision, long-cycle audit questions:

  • Incremental Identification: Which abnormal Big Keys emerged in the last 24 hours?
  • Trend Analysis: Is a Hot Key’s QPS a transient pulse (e.g., flash sale) or a sustained rise? This dictates the governance strategy.
  • Governance Verification: After code refactoring or key cleanup, did single-node memory usage and fragmentation rates drop as expected?

Simply put: Basic snapshot monitoring only reaches the accident scene. Time-series auditing reconstructs the truth and predicts future risks.

2. The Solution: DAS + SLS Integration

To shift from "firefighting" to "proactive governance", Alibaba Cloud offers a deep integration of Database Autonomy Service (DAS) and Simple Log Service (SLS). This solution transforms discrete monitoring snapshots into traceable, aggregatable time-series metrics.

2.1. Solution Value Proposition

This deeply integrated solution provides the following technical benefits.

  • Standardization & Persistence: DAS automatically collects Redis engine-level metrics (Big Keys every 60s, Hot Keys every 10s), converts them to Prometheus format, and pushes them to SLS. This breaks the 7-day retention limit, enabling long-term storage.
  • Programmable Governance (SQL-Driven): Leveraging SLS’s SQL analysis capabilities moves governance from "visual inspection" to "computational precision."

    • Precise Tracing: "Find the Top 10 Keys whose memory usage increased by >50% hour-over-hour."
    • Multi-dimensional Aggregation: "Compare QPS evolution curves of specific Hot Keys with the same period yesterday to help you decide whether to use a local cache or split the key."
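
The "Precise Tracing" question can be sketched offline in Python. The example assumes you have already extracted two hypothetical mappings of key name to size, one for the previous hour and one for the current hour; the function name and sample values are illustrative only and do not come from the article's data.

```python
def top_growing_keys(prev_hour, curr_hour, min_growth=0.5, limit=10):
    """Return up to `limit` (key, growth) pairs with growth above `min_growth`.

    Growth is (current - previous) / previous. Keys with no previous
    size (brand-new keys) are skipped because growth is undefined.
    """
    rows = []
    for key, now in curr_hour.items():
        before = prev_hour.get(key)
        if before and (now - before) / before > min_growth:
            rows.append((key, (now - before) / before))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]


# Hypothetical sizes in bytes.
prev_hour = {"cart:a": 100, "cart:b": 200, "cart:c": 50}
curr_hour = {"cart:a": 160, "cart:b": 210, "cart:c": 100, "cart:d": 999}
growing = top_growing_keys(prev_hour, curr_hour)
print(growing)
```

Brand-new keys such as cart:d are deliberately excluded here; they are better handled by the lifecycle analysis in section 3.3.2.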

2.2. Architecture Overview

The data flow consists of four layers, ensuring seamless collection, storage, and analysis.

[Figure 1: architecture diagram of the DAS + SLS data flow]

| Layer | Component/Product | Description |
|---|---|---|
| Data Collection | DAS O&M Service | Collects Top Key data from Tair/Redis® instances (Big Key interval: 60s; Hot Key interval: 10s). |
| Data Delivery | DAS O&M Service | Delivers Top Key data in Prometheus format. |
| Storage | SLS MetricStore | Stores data in the redis_top_key_log time-series MetricStore. |
| Analysis | SLS SQL Query | Supports Prometheus Query Language (PromQL), SQL analysis, and visualization. |

2.3. Data Structure

Key fields delivered to SLS include:

| Field | Description |
|---|---|
| __labels__ | Label set containing instance_id, node_id, key, key_type, topkey_type. |
| __value__ | Metric value (Size in bytes for Big Keys; QPS for Hot Keys). |
| __time_nano__ | Timestamp (millisecond precision). |
| __name__ | Metric name: bigKeyItemCnt (Big Key) or hotKeyQpsLowerBound (Hot Key). |
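
To show how these fields fit together, here is a minimal Python model of one delivered sample. The class, its attribute names, and every label value are hypothetical placeholders; only the field names and the two metric names come from the table above. The timestamp field is left out because the sketch does not depend on it.

```python
from dataclasses import dataclass

# Metric names taken from the field table; the short labels are our own.
METRIC_KINDS = {
    "bigKeyItemCnt": "big_key",
    "hotKeyQpsLowerBound": "hot_key",
}


@dataclass
class TopKeySample:
    name: str      # corresponds to __name__
    labels: dict   # corresponds to __labels__
    value: float   # corresponds to __value__

    @property
    def kind(self):
        """Classify the sample as a Big Key or Hot Key metric."""
        return METRIC_KINDS[self.name]


# Hypothetical sample; all label values are placeholders.
sample = TopKeySample(
    name="bigKeyItemCnt",
    labels={
        "instance_id": "<your_instance_id>",
        "node_id": "<node_id>",
        "key": "cart:42",
        "key_type": "hash",
        "topkey_type": "<topkey_type>",
    },
    value=204800.0,
)
print(sample.kind, sample.labels["key"], sample.value)
```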

2.4. Core Capabilities

Based on the data delivered to SLS, you can easily perform:

  • Key Lifecycle Tracking (Count Changes): Identify new, disappeared, or persistent Keys.
  • Size Trend Analysis: Monitor Big Key size fluctuations over time.
  • QPS Trend Analysis: Visualize Hot Key QPS changes.
  • Top N Ranking: Quickly locate critical Keys sorted by size or QPS.

3. How to Implement the Solution

This section outlines practical steps for implementing the DAS and SLS solution for effective key governance.

3.1. Prepare the Environment

This solution requires:

  1. A cloud-managed Tair/Redis® instance.
  2. DAS O&M Service enabled (Paid feature).

    • Note: Without this service, Top Key data is only available as real-time snapshots with 7-day history. Enabling it allows automatic synchronization to SLS with retention up to 10 years.
    • Enable via the Console with one click.

Activate Alibaba Cloud DAS O&M Service with one click in the console as follows:

[Figures 2-4: console screenshots of activating the DAS O&M Service]

3.2. Verify Data Delivery

After configuration, check the SLS Console. If the redis_top_key_log MetricStore exists and contains the metrics bigKeyItemCnt and hotKeyQpsLowerBound, delivery is successful.

[Figure 5: SLS Console showing the redis_top_key_log MetricStore]

Data Retention Policy: Default is 30 days. You can modify this to up to 10 years depending on compliance and audit needs. Change the retention period as follows:

[Figure 6: retention period setting in the SLS Console]

3.3. Analyze Top Keys & Governance

SLS supports two query languages for analysis: PromQL (for trends) and SQL (for details).

3.3.1. PromQL: Overall Trend Analysis

Scenarios: Use this method when you want to visualize Big/Hot Key trends over a period and get a rough estimate of governance effectiveness.

Basic usage (PromQL statements):

  • Big Key Trend: bigKeyItemCnt{instance_id="xxx", node_id="xxx"}
  • Hot Key Trend: hotKeyQpsLowerBound{instance_id="xxx", node_id="xxx"}

Steps: Select time range (e.g., last 15 mins) -> Input PromQL -> View trend graph.

[Figures 7-8: PromQL query and the resulting trend graph]

3.3.2. SQL: Detailed Status & Lifecycle Analysis

Scenarios: Use this method when the overall trend is not enough. SQL lets you drill down into detailed data, track changes in the number of Top Keys (e.g., examine specific Keys to identify additions and removals), and compute week-over-week or month-over-month comparisons of Top Keys over a period of time.

Common Uses: Use SQL queries to find out how many new Top Keys appeared in a specific period and how certain Top Keys formed. For example, did they appear suddenly and persist, or appear suddenly and disappear?

Design Logic:

For a given time range, we determine a Key’s status by comparing its first_seen and last_seen timestamps against the query window boundaries (t_start, t_end), adding a tolerance buffer (2 collection cycles) to avoid boundary errors.

Status Definitions:

  • existing: Persisted throughout the period.
  • new: Added during the period and still exists.
  • disappeared: Existed at start but vanished during the period.
  • new -> disappeared: Appeared and vanished within the period.

Timeline:  t_start                                          t_end
            |──────────────────────────────────────────|
            
existing:   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■|  Present at start and end
new:                    |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■|  Appeared midway, present at end
disappeared:|■■■■■■■■■■■■■■■■■■■■|                        Present at start, disappeared midway
new->disappeared:             |■■■■■■■■■■■■|              Appeared midway, disappeared midway

  • first_seen ≤ t_start + tolerance indicates that the Key already existed at the start of the period (within the tolerance).
  • last_seen ≥ t_end - tolerance indicates that the Key still existed at the end of the period (within the tolerance).

Tolerance: Add 2 collection cycles to avoid boundary misjudgment.

  • Big Key Cycle: 60s → Tolerance: 120s.
  • Hot Key Cycle: 10s → Tolerance: 20s.
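
The status rules above can be expressed as a small Python function, which is handy for checking edge cases before running a query. This is a sketch of the same decision logic, assuming all timestamps are epoch seconds; the function and parameter names are ours.

```python
def classify(first_seen, last_seen, t_start, t_end, cycle_s):
    """Classify a key's lifecycle status within [t_start, t_end).

    The tolerance is 2 collection cycles: 120s for Big Keys (60s cycle)
    and 20s for Hot Keys (10s cycle).
    """
    tolerance = 2 * cycle_s
    present_at_start = first_seen <= t_start + tolerance
    present_at_end = last_seen >= t_end - tolerance

    if present_at_start and present_at_end:
        return "existing"
    if not present_at_start and not present_at_end:
        return "new -> disappeared"
    if not present_at_start:
        return "new"
    return "disappeared"


# One-hour window with the Big Key cycle of 60 seconds.
print(classify(60, 3540, 0, 3600, 60))    # seen near both boundaries
print(classify(600, 3590, 0, 3600, 60))   # first seen after the start tolerance
print(classify(0, 1800, 0, 3600, 60))     # last seen well before the end
print(classify(600, 1800, 0, 3600, 60))   # inside the window only
```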

The following section uses Top Key change analysis as a typical scenario to show how to query Top Keys using SQL.

3.3.2.1. Big Key Lifecycle Analysis

Display of results:

[Figures 9-10: Big Key lifecycle query results]

SQL Implementation:

Based on the design logic above, the following SQL detects the new, disappeared, and persistent states of Big Keys within a specified time period:

* | select 
    instance_id,
    node_id,
    key_name,
    key_type,
    case 
        when first_seen <= t_start + 120 and last_seen >= t_end - 120 then 'existing'
        when first_seen > t_start + 120 and last_seen < t_end - 120 then 'new -> disappeared'
        when first_seen > t_start + 120 then 'new'
        when last_seen < t_end - 120 then 'disappeared'
    end as status,
    case 
        when first_seen > t_start + 120 
        then date_format(from_unixtime(first_seen), '%Y-%m-%d %H:%i:%s')
    end as new_at,
    case 
        when last_seen < t_end - 120 
        then date_format(from_unixtime(last_seen), '%Y-%m-%d %H:%i:%s')
    end as disappeared_at,
    round(max_size / 1024.0, 2) as max_size_kb,
    round(latest_size / 1024.0, 2) as latest_size_kb
from (
  select 
  element_at(__labels__, 'instance_id') as instance_id,
  element_at(__labels__, 'node_id') as node_id,
  element_at(__labels__, 'key') as key_name,
  arbitrary(element_at(__labels__, 'key_type')) as key_type,
  min(__time_nano__ / 1000000) as first_seen,
  max(__time_nano__ / 1000000) as last_seen,
  max(__value__) as max_size,
  max_by(__value__, __time_nano__) as latest_size,
  arbitrary(t_start) as t_start,
  arbitrary(t_end) as t_end
  from "redis_top_key_log.prom"
  cross join (
    select 
    to_unixtime(timestamp '2026-03-27 00:00:00') as t_start,      -- Set Start Time
    to_unixtime(timestamp '2026-03-27 15:00:00') as t_end         -- Set End Time
  ) t
  where __name__ = 'bigKeyItemCnt'
  and element_at(__labels__, 'instance_id') = '<your_instance_id>'     
  and __time_nano__ / 1000000 >= t_start
  and __time_nano__ / 1000000 <  t_end
  group by instance_id, node_id, key_name
)
order by instance_id, node_id, status, key_name

Parameters:

  • Time Range: Modify t_start and t_end to define your analysis window.
  • Instance ID: Replace <your_instance_id> with your actual Instance ID.
  • Tolerance: Keep 120 seconds for Big Keys (2 × 60s collection cycle).

3.3.2.2. Hot Key Lifecycle Analysis

Display of results:

[Figures 11-12: Hot Key lifecycle query results]

QPS Fields in the Results:

  • max_qps: Max QPS of the key in period.
  • avg_qps: Average QPS of the key in period.
  • latest_qps: QPS at last collection.
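
For readers who want to see how the three QPS fields relate, this Python sketch derives them from a hypothetical list of (timestamp, qps) samples for one key. It follows the field definitions above (maximum, average, and the value at the most recent timestamp); the function name and sample data are assumptions.

```python
def qps_summary(samples):
    """Summarize (timestamp, qps) samples for a single key."""
    values = [qps for _, qps in samples]
    # latest_qps is the value at the newest timestamp, not the last list item.
    latest = max(samples, key=lambda sample: sample[0])[1]
    return {
        "max_qps": max(values),
        "avg_qps": round(sum(values) / len(values), 2),
        "latest_qps": latest,
    }


# Hypothetical samples collected every 10 seconds, deliberately out of order.
samples = [(20, 300.0), (10, 100.0), (30, 200.0)]
summary = qps_summary(samples)
print(summary)
```

A large gap between max_qps and avg_qps suggests a short burst, while values that stay close together suggest sustained load.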

SQL Implementation:

Based on the design logic above, the following SQL detects the new, disappeared, and persistent states of Hot Keys within a specified time period:

* | select 
    instance_id,
    node_id,
    key_name,
    key_type,
    case 
        when first_seen <= t_start + 20 and last_seen >= t_end - 20 then 'existing'
        when first_seen > t_start + 20 and last_seen < t_end - 20 then 'new -> disappeared'
        when first_seen > t_start + 20 then 'new'
        when last_seen < t_end - 20 then 'disappeared'
    end as status,
    case 
        when first_seen > t_start + 20 
        then date_format(from_unixtime(first_seen), '%Y-%m-%d %H:%i:%s')
    end as new_at,
    case 
        when last_seen < t_end - 20 
        then date_format(from_unixtime(last_seen), '%Y-%m-%d %H:%i:%s')
    end as disappeared_at,
    round(max_qps, 2) as max_qps,
    round(avg_qps, 2) as avg_qps,
    round(latest_qps, 2) as latest_qps
from (
  select 
  element_at(__labels__, 'instance_id') as instance_id,
  element_at(__labels__, 'node_id') as node_id,
  element_at(__labels__, 'key') as key_name,
  arbitrary(element_at(__labels__, 'key_type')) as key_type,
  min(__time_nano__ / 1000000) as first_seen,
  max(__time_nano__ / 1000000) as last_seen,
  max(__value__) as max_qps,
  avg(__value__) as avg_qps,
  max_by(__value__, __time_nano__) as latest_qps,
  arbitrary(t_start) as t_start,
  arbitrary(t_end) as t_end
  from "redis_top_key_log.prom"
  cross join (
    select 
    to_unixtime(timestamp '2026-03-27 00:00:00') as t_start,       -- Set Start Time
    to_unixtime(timestamp '2026-03-27 15:00:00') as t_end          -- Set End Time
  ) t
  where __name__ = 'hotKeyQpsLowerBound'
  and element_at(__labels__, 'instance_id') = '<your_instance_id>'    
  and __time_nano__ / 1000000 >= t_start
  and __time_nano__ / 1000000 <  t_end
  group by instance_id, node_id, key_name
)
order by instance_id, node_id, status, key_name

Parameters:

  • Time Range: Modify t_start and t_end to define your analysis window.
  • Instance ID: Replace <your_instance_id> with your actual Instance ID.
  • Tolerance: Keep default (20s) for Hot Keys.
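
Once a lifecycle query has returned its rows, a quick tally by status answers "how many new Top Keys appeared in this period". The sketch below assumes the rows have been exported as a list of dictionaries with a status field; the sample rows are hypothetical.

```python
from collections import Counter


def status_counts(rows):
    """Count lifecycle query result rows per status value."""
    return Counter(row["status"] for row in rows)


# Hypothetical result rows from a lifecycle query.
rows = [
    {"key_name": "cart:1", "status": "existing"},
    {"key_name": "cart:2", "status": "new"},
    {"key_name": "cart:3", "status": "new"},
    {"key_name": "cart:4", "status": "disappeared"},
]
counts = status_counts(rows)
print(counts["new"], counts["disappeared"], counts["new -> disappeared"])
```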

4. Advanced Usage Recommendations

To maximize the value of this solution, we recommend leveraging SLS's scheduled inspection and alerting features for automated operations.

4.1. Configure Proactive Alerts

Scenarios: Use SLS alert rules to proactively notify your team when:

  • The count of Big Keys exceeds a threshold.
  • Hot Key QPS spikes suddenly (e.g., > 10,000 QPS).
  • New Big/Hot Keys are detected.
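
The three conditions can be prototyped as a plain Python check before they are turned into alert rules. This is only a sketch of the decision logic: the function name, the input shapes, and the Big Key count threshold of 50 are assumptions, while the 10,000 QPS threshold is the example value from the list above.

```python
def alert_reasons(big_key_count, hot_key_qps, new_keys,
                  max_big_keys=50, max_qps=10_000):
    """Return a list of human-readable reasons to notify the team."""
    reasons = []
    if big_key_count > max_big_keys:
        reasons.append(f"big key count {big_key_count} > {max_big_keys}")
    for key, qps in hot_key_qps.items():
        if qps > max_qps:
            reasons.append(f"hot key {key} at {qps} QPS > {max_qps}")
    for key in sorted(new_keys):
        reasons.append(f"new top key detected: {key}")
    return reasons


# Hypothetical readings that trip all three conditions.
for reason in alert_reasons(51, {"cart:1": 12000}, {"session:9"}):
    print(reason)
```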

References: https://www.alibabacloud.com/help/sls/alarm-settings-quick-start

Configuration method: Configure alert rules, notification recipients, and notification policies through the Alert Center.

[Figures 13-14: alert rule configuration in the Alert Center]

Example Alert SQL (QPS Spike > 10,000):

The following is the sample SQL in the preceding figure. The query returns the latest QPS of the specified key; the alert fires when that value exceeds the 10,000 threshold set in the alert rule.

* | select 
    instance_id,
    node_id,
    key_name,
    key_type,
    round(latest_qps, 2) as latest_qps
from (
  select 
  element_at(__labels__, 'instance_id') as instance_id,
  element_at(__labels__, 'node_id') as node_id,
  element_at(__labels__, 'key') as key_name,
  arbitrary(element_at(__labels__, 'key_type')) as key_type,
  max_by(__value__, __time_nano__) as latest_qps                     -- Value at the most recent sample
  from "redis_top_key_log.prom"
  where __name__ = 'hotKeyQpsLowerBound'
  and element_at(__labels__, 'instance_id') = '<your_instance_id>'
  and element_at(__labels__, 'key') = '<key_name>'
  group by instance_id, node_id, key_name
)
order by instance_id, node_id

4.2. Build Inspection Dashboards

Save common queries as SLS Dashboards for daily health checks and troubleshooting. This provides a visual overview of cluster health and key trends.

References: https://www.alibabacloud.com/help/sls/dashboard-overview

Display of results:

[Figures 15-16: inspection dashboard examples]

Configuration method:

[Figure 17: dashboard configuration]

5. Summary

The combined solution of Database Autonomy Service (DAS) and Simple Log Service (SLS) enables full traceability of Tair/Redis® keys. It lets you detect and clean up risky keys before they affect your business, and makes Big/Hot Key management in Tair/Redis® easier.

Comparison: Basic Monitoring vs. DAS+SLS Solution

| Comparison Item | Basic Monitoring on Redis® Console | DAS + SLS Solution |
|---|---|---|
| View Current Top Keys | | |
| Historical Trend Analysis | | ✅ (PromQL + SQL) |
| Identify New/Disappeared Keys | | ✅ (via status field) |
| Big Key Size Changes | | ✅ (max_size_kb / latest_size_kb) |
| Hot Key QPS Changes | | ✅ (max_qps / avg_qps / latest_qps) |
| Custom Time Range Query | | ✅ (Parameterized via SQL) |
| Long-term Data Retention | | ✅ (Up to 10 Years) |

Solution Cost:

  • DAS: Standard price is 45 USD/instance/month; Promotional price is 15 USD/instance/month.
  • SLS: Pay-as-you-go based on write volume, storage duration, and size.

If you are interested in this solution or have any questions, contact us.


ApsaraDB
