By Jietao Xiao and Shichun Feng

In the cloud-native era, while Kubernetes (K8s) has become the gold standard for container orchestration, its complex resource management continues to challenge O&M teams. Node and container Out of Memory (OOM) events and abnormal memory usage are particularly prevalent, manifesting in scenarios such as:
• Persistent High Memory Usage: Nodes frequently hover near memory pressure thresholds, triggering Kubelet eviction mechanisms. This forces pod migrations and compromises business stability. Worse, high memory pressure negatively impacts node scheduling scores, preventing new pods from being deployed effectively.
• Frequent Container OOM Events: Containers are killed by the kernel OOM killer when their cgroup exceeds its memory limit (status: OOMKilled), leading to frequent service restarts that are difficult to trace to a root cause.
• Silent Application Memory Leaks: Applications with memory leaks may pass short-term stress tests but gradually consume more memory over days or weeks until an OOM occurs. These issues are highly elusive and often only surface in production.
• Imbalanced Resource Quotas: Incorrect requests/limits configurations are common. Under-provisioning leads to frequent OOM evictions, while over-provisioning results in massive resource waste and reduced scheduling efficiency. Determining the "optimal value" is highly dependent on specific business logic and historical data.
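The OOMKilled status mentioned above is recorded in the pod's container statuses, so it can be checked programmatically. The following is a minimal Python sketch, assuming the pod object was exported with `kubectl get pod <name> -o json`. The sample pod and the function name `find_oomkilled` are illustrative and do not come from the ACK tooling.

```python
import json
from typing import Dict, List

def find_oomkilled(pod: Dict) -> List[Dict]:
    """Return one record per container whose current or last termination was OOMKilled."""
    hits = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # A container that restarted after an OOM kill keeps the record in lastState;
        # one that has not restarted yet shows it in state.
        for key in ("state", "lastState"):
            term = cs.get(key, {}).get("terminated")
            if term and term.get("reason") == "OOMKilled":
                hits.append({
                    "container": cs["name"],
                    "restarts": cs.get("restartCount", 0),
                    "finishedAt": term.get("finishedAt"),
                    "exitCode": term.get("exitCode"),
                })
                break
    return hits

# Hypothetical, trimmed-down pod object used only for illustration.
SAMPLE_POD = json.loads("""
{
  "metadata": {"name": "demo-pod", "namespace": "default"},
  "status": {
    "containerStatuses": [
      {
        "name": "app",
        "restartCount": 7,
        "state": {"running": {"startedAt": "2026-01-01T10:05:00Z"}},
        "lastState": {
          "terminated": {
            "reason": "OOMKilled",
            "exitCode": 137,
            "finishedAt": "2026-01-01T10:04:30Z"
          }
        }
      },
      {"name": "sidecar", "restartCount": 0, "state": {"running": {}}}
    ]
  }
}
""")

for hit in find_oomkilled(SAMPLE_POD):
    print(hit)
```

A check like this only confirms that an OOM kill happened and how often. It says nothing about why, which is the gap the rest of this article addresses.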
Cloud-native memory issues are notoriously difficult to debug, typically requiring cross-functional experts and days of investigation to identify the root cause and find a suitable fix.
| Pain point | Description |
|---|---|
| Fragmented Diagnostic Chain | K8s events, pod descriptions, node monitoring, container logs, and kernel OOM logs are scattered across different systems. Site Reliability Engineers (SREs) must constantly switch between multiple interfaces. |
| Inconsistent Data Dimensions | A single issue involves Prometheus metrics, container metadata, Linux kernel metrics, and application logs. There is no unified "narrative" for the problem. |
| Experience-Dependent Analysis | Traditional tools provide "symptoms" rather than "conclusions," leaving final decisions to a handful of experts who "know the system." |
To address these pain points, Alibaba Cloud’s Container Service team has introduced the Computing AI Assistant (ACK AI Assistant) and the ACK MCP toolset. In collaboration with the Alibaba Cloud Basic Software team, the SysOM MCP toolset was developed. By integrating SysOM’s professional system diagnostic capabilities into the ACK AI Assistant via the Model Context Protocol (MCP), users can now resolve cloud-native memory issues with a single query.
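For readers unfamiliar with MCP: the protocol is based on JSON-RPC 2.0, and a client invokes a server-side tool with a `tools/call` request. The sketch below only builds such a request in Python. The tool name `oom_diagnosis` and its arguments are hypothetical placeholders, not the actual names exposed by the SysOM or ACK MCP servers.

```python
import json
from itertools import count
from typing import Dict

_request_ids = count(1)  # JSON-RPC requests need a unique id per call

def build_tool_call(tool_name: str, arguments: Dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)

# Hypothetical tool name and arguments, for illustration only.
message = build_tool_call(
    "oom_diagnosis",
    {"node": "node-1", "pod": "demo-pod", "namespace": "default"},
)
print(message)
```

In practice the AI assistant chooses the tool and fills in the arguments from the user's question, so the user never writes this message by hand.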
The ACK AI Assistant is an intelligent operations helper built on Alibaba Cloud Container Service for Kubernetes (ACK).
It deeply integrates OS capabilities to provide an intelligent O&M experience across the full container lifecycle (Day 0 to Day 2). Based on "Well-Architected" principles, it provides best-practice guidance for stability, cost, security, and performance.
Core capabilities include:
• Intelligent Diagnosis: Full environment awareness and multi-turn dialogue to supplement context. It coordinates multiple expert Agents to perform "joint consultations," combining observability data with domain expertise to close the loop from anomaly detection to one-click remediation.
• Cluster Optimization: Automatically analyzes cost, security, architecture, and elasticity configurations to generate actionable optimization plans with predicted outcomes.
• Smart Health Checks: Performs dynamic anomaly detection across clusters, nodes, workloads, networks, and storage. It leverages Large Language Models (LLMs) and algorithms to move beyond traditional threshold-based alerting.
• Automated AIOps: Supports fully automated AIOps workflows for complex scenarios, with future goals for automated application creation and resource management (self-healing).

ACK also provides the open-source ack-mcp-server toolset on GitHub, allowing users to build their own SRE agents for ACK and Kubernetes environments: https://github.com/aliyun/alibabacloud-ack-mcp-server/
The SysOM MCP project includes over 20 production-grade diagnostic tools for nodes and containers:
• Memory Analysis: Full-spectrum memory diagnosis, application memory profiling, and OOM diagnosis.
• IO Diagnosis: One-click I/O diagnosis and I/O traffic analysis.
• Network Troubleshooting: Network packet loss and jitter diagnosis.
• Scheduling Diagnosis: System load and scheduling jitter diagnosis.
• Disk Diagnosis: Disk analysis and diagnostics.
• System Crash Diagnosis: Crash analysis (dmesg analysis) and in-depth vmcore analysis.
For memory issues, SysOM memory tools provide full-spectrum analysis spanning from kernel to application memory, covering over 10 memory anomaly scenarios.

It appears we already have two robust tools—one with business-level insights and the other with deep kernel awareness. However, for the cloud-native memory challenges highlighted here, neither is sufficient on its own. Effective troubleshooting demands a synergy of both cloud-native and OS expertise—this necessity is exactly why we must bring them together.
| Tool | Limitation | Details |
|---|---|---|
| ACK AI Assistant | Lacks Underlying Data | Prometheus only shows high-level metrics, such as container Resident Set Size (RSS) and node available memory, without process-level or kernel-level details. |
| ACK AI Assistant | Missing Diagnostic Rules | Relies on Retrieval-Augmented Generation (RAG) for docs. Without "executable rules," it can only provide a list of "possible causes" for deep issues. |
| ACK AI Assistant | Difficulty Determining Root Cause Analysis (RCA) | It's hard to distinguish between "app leaks vs. low limits vs. noisy neighbors" based on monitoring metrics alone. |
| SysOM MCP | Lacks K8s Metadata | Unaware of native K8s objects (Pods, Deployments, DaemonSets). Cannot associate kernel data with business chains or deployment patterns. |
| SysOM MCP | Lacks Log Context | Cannot use application logs to determine what the business was doing during a memory spike. |
| SysOM MCP | Disconnected from Metrics | Limited awareness of time-series metrics (Prometheus), making historical trend analysis difficult. |
Through the ACK MCP and SysOM MCP toolchains, the ACK AI Assistant achieves:
• Automated Metadata Association: A single question allows the AI to automatically link Namespace → Deployment → Pod → Node → Instance Specs, mapping SysOM’s process data to K8s objects. SysOM explains "What" is happening (kernel-level RCA), while ACK MCP explains "Why" (K8s configuration context).
• Fusion of Logs, Events, and Metrics: When an OOM occurs, the system automatically pulls container logs, K8s events, Prometheus metrics, and audit logs. SysOM provides the "current state" (memory snapshot), Prometheus provides "historical trends" (when it started), and audit logs provide "change events" (correlation with releases). Cross-referencing these allows the AI to distinguish between a traffic surge and a version defect.
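To make the metadata association concrete, here is one way kernel-side process data can be tied back to a Kubernetes object: the cgroup path of a containerized process normally embeds the pod UID. The Python sketch below extracts that UID for both the cgroupfs and systemd cgroup driver layouts and looks it up in a pod index. It illustrates the idea only and is not how the ACK or SysOM tools are implemented; the sample paths and pod index are hypothetical.

```python
import re
from typing import Dict, Optional

# Pod UIDs are 8-4-4-4-12 hex groups; the systemd cgroup driver writes "_" instead of "-".
_POD_UID = re.compile(r"pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})")

def pod_uid_from_cgroup(cgroup_path: str) -> Optional[str]:
    """Extract the pod UID from a kubepods cgroup path, or None if absent."""
    match = _POD_UID.search(cgroup_path)
    return match.group(1).replace("_", "-") if match else None

def owner_of(cgroup_path: str, pods_by_uid: Dict[str, Dict]) -> Optional[Dict]:
    """Map a process cgroup path to pod metadata via a UID index."""
    uid = pod_uid_from_cgroup(cgroup_path)
    return pods_by_uid.get(uid) if uid else None

# Hypothetical index, e.g. built from the metadata.uid field of listed pods.
PODS_BY_UID = {
    "0a1b2c3d-1111-2222-3333-444455556666": {
        "namespace": "default", "pod": "demo-pod", "node": "node-1",
    },
}

# Hypothetical cgroup paths in the two common layouts.
CGROUPFS_PATH = "/kubepods/burstable/pod0a1b2c3d-1111-2222-3333-444455556666/abcdef"
SYSTEMD_PATH = (
    "/kubepods.slice/kubepods-burstable.slice/"
    "kubepods-burstable-pod0a1b2c3d_1111_2222_3333_444455556666.slice/"
    "cri-containerd-abcdef.scope"
)

print(owner_of(CGROUPFS_PATH, PODS_BY_UID))
print(owner_of(SYSTEMD_PATH, PODS_BY_UID))
```

Once the pod is known, the owning Deployment, namespace, and node follow from ordinary Kubernetes metadata, which is the chain described in the first bullet.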
Problem Scenario:
A customer found that kubectl top node showed 60% memory usage, while the cloud monitoring console showed 85%, a gap of more than 20 percentage points. This made it impossible to judge the actual load or decide whether to scale.
Traditional Solution:
Manually consult experts, investigate calculation formulas, check for hidden memory usage, and reconcile the differences.
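Gaps like this often come down to the formula used. The Python sketch below contrasts two common definitions of "memory usage": a working-set percentage (usage minus inactive file cache, relative to allocatable memory) and a used-including-cache percentage (total minus free, relative to total memory). Which formula each tool used in this customer's case is not stated here, so treat the mapping as an assumption. All numbers are hypothetical and chosen only to reproduce a 60% versus 85% split.

```python
def working_set_pct(usage_mib: int, inactive_file_mib: int, allocatable_mib: int) -> float:
    """Working set = usage minus easily reclaimable inactive file cache."""
    working_set = max(usage_mib - inactive_file_mib, 0)
    return 100.0 * working_set / allocatable_mib

def used_including_cache_pct(total_mib: int, free_mib: int) -> float:
    """Everything that is not free counts as used, page cache included."""
    return 100.0 * (total_mib - free_mib) / total_mib

# Hypothetical node: 20 GiB total, 3 GiB free, about 6.5 GiB of inactive file cache.
TOTAL_MIB = 20480
FREE_MIB = 3072
ALLOCATABLE_MIB = 18000          # capacity minus system reservations (assumed)
USAGE_MIB = TOTAL_MIB - FREE_MIB
INACTIVE_FILE_MIB = 6608

ws = working_set_pct(USAGE_MIB, INACTIVE_FILE_MIB, ALLOCATABLE_MIB)
used = used_including_cache_pct(TOTAL_MIB, FREE_MIB)
print(f"working set: {ws:.1f}%  used incl. cache: {used:.1f}%")
```

With these inputs the same node reads as comfortably loaded under one definition and close to pressure under the other, even though nothing about the node differs between the two measurements.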
With ACK AI Assistant:


Problem Scenario:
After running in production for some time, a Netty service began to experience frequent OOMKilled restarts. The container was configured with a 4 GiB memory limit, and the JVM heap was set to -Xmx3g, which theoretically should have been sufficient. However, the pod continued to be OOM-killed every few hours, leading to complaints from business teams about service instability.
Traditional Solution:
Java developers use various profiling tools (jmap, jstat) to find the memory leak, leading to long discussions on JVM parameters.
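One reason heap-focused tools can miss the problem is that the container limit applies to the whole process, while -Xmx caps only the Java heap. Off-heap direct buffers (which Netty uses heavily), metaspace, and thread stacks all sit outside the heap. The Python sketch below adds up such a budget. Only the 4 GiB limit and the 3 GiB heap come from the scenario; every other figure is a hypothetical assumption, and this arithmetic is not a statement of the actual root cause in this case.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def jvm_footprint(heap_max: int, direct_buffers: int, metaspace: int,
                  threads: int, stack_size: int, other: int = 0) -> int:
    """Rough upper bound on JVM process memory, in bytes."""
    return heap_max + direct_buffers + metaspace + threads * stack_size + other

CONTAINER_LIMIT = 4 * GIB          # from the scenario
estimate = jvm_footprint(
    heap_max=3 * GIB,              # -Xmx3g, from the scenario
    direct_buffers=768 * MIB,      # assumed off-heap buffer usage
    metaspace=256 * MIB,           # assumed
    threads=300,                   # assumed thread count
    stack_size=1 * MIB,            # assumed per-thread stack size
)
headroom = CONTAINER_LIMIT - estimate

print(f"estimated footprint: {estimate / MIB:.0f} MiB")
print(f"headroom vs limit:   {headroom / MIB:.0f} MiB")
```

Under these assumptions the process can exceed the limit while the heap itself still looks healthy in jmap or jstat.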
With ACK AI Assistant:


Problem Scenario:
A data processing pod was OOMKilled, but logs showed no anomalies and app memory usage was well below limits.
Traditional Solution:
SSH into the node, locate the cgroup path, manually parse memory.stat, and cross-reference with Pod specs. This requires deep kernel knowledge and multiple system switches.
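The manual memory.stat step can be sketched in Python as follows. The file is a list of `key value` lines; the keys used here (`anon`, `file`, `shmem`, `slab`, and so on) follow the cgroup v2 layout, and cgroup v1 uses different names such as `rss` and `cache`. The sample content is hypothetical, and the grouping into categories is an illustrative choice rather than the method SysOM uses.

```python
from typing import Dict

def parse_memory_stat(text: str) -> Dict[str, int]:
    """Parse the 'key value' lines of a cgroup memory.stat file into a dict of bytes."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats

def summarize(stats: Dict[str, int]) -> Dict[str, int]:
    """Group cgroup v2 counters into coarse categories (bytes)."""
    kernel_keys = ("kernel_stack", "pagetables", "slab", "sock", "percpu")
    return {
        "anon": stats.get("anon", 0),      # heap, stacks, anonymous mmap
        "file": stats.get("file", 0),      # page cache, including tmpfs/shared memory
        "shmem": stats.get("shmem", 0),    # subset of file backed by tmpfs/shared memory
        "kernel": sum(stats.get(k, 0) for k in kernel_keys),
    }

# Hypothetical memory.stat content; on a node it would be read from the
# container's cgroup directory.
SAMPLE = """\
anon 536870912
file 3221225472
kernel_stack 4194304
pagetables 8388608
slab 104857600
sock 0
shmem 2147483648
inactive_file 524288000
active_file 549453824
"""

summary = summarize(parse_memory_stat(SAMPLE))
for name, value in summary.items():
    print(f"{name:>6}: {value / 1024 ** 2:8.0f} MiB")
```

In this made-up sample the application's anonymous memory is only 512 MiB, while file-backed memory, much of it shared memory, accounts for most of the charge against the cgroup. That pattern fits a pod that is OOMKilled even though its own heap looks small.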
With ACK AI Assistant:


By combining ACK AI Assistant with SysOM & ACK MCP, cloud-native memory management evolves from "experience-based" to a standardized, rule-driven, and tool-supported closed-loop capability.
This isn't just a stacking of tools; it's a deep fusion of the "Cloud-Native Perspective" and the "OS Perspective," giving SREs a complete diagnostic report and actionable recommendations from the business layer down to the kernel with just one sentence.
ACK AI Assistant Documentation: https://www.alibabacloud.com/help/ack/ack-managed-and-ack-dedicated/user-guide/use-container-ai-assistant-for-troubleshooting-and-intelligent-q-a
Official Open-Source ACK MCP Toolset:
🌟 GitHub Link: https://github.com/aliyun/alibabacloud-ack-mcp-server/blob/master/README.md
SysOM MCP:
🌟 GitHub Link: https://github.com/alibaba/sysom_mcp
Operating System Console: https://help.aliyun.com/alinux/product-overview/what-is-the-operating-system-console