Ending the Cloud-Native Memory "Black Box": Intelligent Operations with SysOM MCP and ACK AI Assistant

This article shows how ACK AI Assistant and SysOM MCP enable single-conversation, full-stack cloud-native memory troubleshooting via Model Context Protocol.

By Jietao Xiao and Shichun Feng

Frequent Memory Issues in Cloud-Native Scenarios: The Never-ending Cycle of Expansion

In the cloud-native era, while Kubernetes (K8s) has become the gold standard for container orchestration, its complex resource management continues to challenge O&M teams. Node and container Out of Memory (OOM) events and abnormal memory usage are particularly prevalent, manifesting in scenarios such as:

Persistent High Memory Usage: Nodes frequently hover near memory pressure thresholds, triggering Kubelet eviction mechanisms. This forces pod migrations and compromises business stability. Worse, high memory pressure negatively impacts node scheduling scores, preventing new pods from being deployed effectively.

Frequent Container OOM Events: Pods are terminated by cgroups for exceeding memory limits (status: OOMKilled), leading to frequent service restarts that are difficult to trace to a root cause.

Silent Application Memory Leaks: Applications with memory leaks may pass short-term stress tests but gradually consume more memory over days or weeks until an OOM occurs. These issues are highly elusive and often only surface in production.

Imbalanced Resource Quotas: Incorrect requests/limits configurations are common. Under-provisioning leads to frequent OOM evictions, while over-provisioning results in massive resource waste and reduced scheduling efficiency. Determining the "optimal value" is highly dependent on specific business logic and historical data.
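The OOMKilled status described above is recorded in each pod's container statuses, so a first pass at finding affected workloads can be scripted. The following is a minimal sketch, assuming the input is the JSON produced by `kubectl get pods -A -o json`; the sample data and the function name `find_oomkilled` are illustrative and not part of ACK or SysOM.

```python
def find_oomkilled(pod_list: dict) -> list:
    """Return (namespace, pod, container, restart_count) for every container
    whose most recent termination reason was OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # lastState.terminated is only present after a container restart.
            terminated = cs.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append((
                    meta.get("namespace", ""),
                    meta.get("name", ""),
                    cs.get("name", ""),
                    cs.get("restartCount", 0),
                ))
    return hits


# Hypothetical sample shaped like `kubectl get pods -A -o json` output.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "prod", "name": "api-7d9f"},
            "status": {"containerStatuses": [{
                "name": "app",
                "restartCount": 5,
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }]},
        },
        {
            "metadata": {"namespace": "prod", "name": "web-1"},
            "status": {"containerStatuses": [{
                "name": "nginx", "restartCount": 0, "lastState": {},
            }]},
        },
    ]
}

for ns, pod, container, restarts in find_oomkilled(SAMPLE):
    print(f"{ns}/{pod} container={container} restarts={restarts}")
```

A list like this only answers "which containers were killed"; it says nothing about why, which is the gap the rest of the article addresses.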

The Difficulty of Troubleshooting

Cloud-native memory issues are notoriously difficult to debug, typically requiring cross-functional experts and days of investigation to identify the root cause and find a suitable fix.

Pain points:

Fragmented Diagnostic Chain: K8s events, pod descriptions, node monitoring, container logs, and kernel OOM logs are scattered across different systems. Site Reliability Engineers (SREs) must constantly switch between multiple interfaces.

Inconsistent Data Dimensions: A single issue involves Prometheus metrics, container metadata, Linux kernel metrics, and application logs. There is no unified "narrative" for the problem.

Experience-Dependent Analysis: Traditional tools provide "symptoms" rather than "conclusions," leaving final decisions to a handful of experts who "know the system."

A New Solution: ACK AI Assistant with ACK & SysOM MCP

To address these pain points, Alibaba Cloud’s Container Service team has introduced the Container AI Assistant (ACK AI Assistant) and the ACK MCP toolset. In collaboration with the Alibaba Cloud Basic Software team, the SysOM MCP toolset was developed. By integrating SysOM’s professional system diagnostic capabilities into the ACK AI Assistant via the Model Context Protocol (MCP), users can now investigate cloud-native memory issues with a single query.
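For readers unfamiliar with MCP, the protocol exchanges JSON-RPC 2.0 messages, and a client discovers and invokes tools through the `tools/list` and `tools/call` methods. The sketch below only builds such requests; the tool name `memory_diagnosis` and its arguments are hypothetical placeholders, not the actual SysOM MCP tool names or parameters.

```python
import json
from itertools import count
from typing import Optional

_request_ids = count(1)  # JSON-RPC requests carry a unique id


def mcp_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize one JSON-RPC 2.0 request as used by MCP."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


def call_tool(name: str, arguments: dict) -> str:
    """Build a `tools/call` request for a named tool."""
    return mcp_request("tools/call", {"name": name, "arguments": arguments})


# Ask the server which tools it exposes.
list_request = mcp_request("tools/list")

# Hypothetical tool name and arguments, for illustration only.
diagnose_request = call_tool(
    "memory_diagnosis",
    {"node": "node-1", "pod": "api-7d9f", "namespace": "prod"},
)

print(list_request)
print(diagnose_request)
```

In practice the assistant, not the user, issues these calls; the user only sees the natural-language question and the resulting report.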

ACK AI Assistant + ACK MCP: An "Intelligent SRE" for Cloud-Native Business

The ACK AI Assistant is an intelligent operations helper built on Alibaba Cloud Container Service for Kubernetes (ACK).

It deeply integrates OS capabilities to provide an intelligent O&M experience across the full container lifecycle (Day 0 to Day 2). Based on "Well-Architected" principles, it provides best-practice guidance for stability, cost, security, and performance.

Core capabilities include:

Intelligent Diagnosis: Full environment awareness and multi-turn dialogue to supplement context. It coordinates multiple expert Agents to perform "joint consultations," combining observability data with domain expertise to close the loop from anomaly detection to one-click remediation.

Cluster Optimization: Automatically analyzes cost, security, architecture, and elasticity configurations to generate actionable optimization plans with predicted outcomes.

Smart Health Checks: Performs dynamic anomaly detection across clusters, nodes, workloads, networks, and storage. It leverages Large Language Models (LLMs) and algorithms to move beyond traditional threshold-based alerting.

Automated AIOps: Supports fully automated AIOps workflows for complex scenarios, with future goals for automated application creation and resource management (self-healing).
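To illustrate why trend-based detection can catch what a static threshold misses, consider a slow memory leak: usage stays below the alert line for days, yet the growth rate already predicts an OOM. The sketch below fits a least-squares line to memory samples and projects the time until a limit is reached. It is a generic illustration under invented sample data, not the algorithm the ACK AI Assistant uses.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (hours since start, memory in MiB)


def slope_per_hour(samples: List[Sample]) -> float:
    """Least-squares slope of memory over time, in MiB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def hours_until_limit(samples: List[Sample], limit_mib: float) -> Optional[float]:
    """Projected hours until the limit is hit, or None if usage is not growing."""
    slope = slope_per_hour(samples)
    if slope <= 0:
        return None
    latest = samples[-1][1]
    return max(0.0, (limit_mib - latest) / slope)


# Invented data: usage grows 10 MiB/hour from 1000 MiB, sampled every 12 hours.
samples = [(float(h), 1000.0 + 10.0 * h) for h in range(0, 49, 12)]
limit = 2048.0  # hypothetical container limit in MiB

print(f"growth: {slope_per_hour(samples):.1f} MiB/h")
print(f"projected hours to limit: {hours_until_limit(samples, limit):.1f}")
```

With these samples, usage sits at roughly 72% of the limit, which a typical 80% threshold alert would ignore, while the projection shows the limit being reached in under three days.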

ACK also provides the open-source ack-mcp-server toolset on GitHub, allowing users to build their own SRE agents for ACK and Kubernetes environments: https://github.com/aliyun/alibabacloud-ack-mcp-server/

SysOM MCP: A "Specialist Doctor" for Deep OS Diagnosis

The SysOM MCP project includes over 20 production-grade diagnostic tools for nodes and containers:

Memory Analysis: Full-spectrum memory diagnosis, application memory profiling, and OOM diagnosis.
IO Diagnosis: One-click I/O diagnosis and I/O traffic analysis.
Network Troubleshooting: Network packet loss and jitter diagnosis.
Scheduling Diagnosis: System load and scheduling jitter diagnosis.
Disk Diagnosis: Disk analysis and diagnostics.
System Crash Diagnosis: Crash analysis (dmesg analysis) and in-depth vmcore analysis.

For memory issues, SysOM memory tools provide full-spectrum analysis spanning from kernel to application memory, covering over 10 memory anomaly scenarios.

[Figure 2: memory anomaly scenarios covered by the SysOM memory tools]

The Power of Integration: Closing the Loop on Memory Diagnosis

Why Combine Them?

At first glance we already have two robust tools: one with business-level insight and the other with deep kernel awareness. However, for the cloud-native memory challenges described in this article, neither is sufficient on its own. Effective troubleshooting requires both cloud-native and OS expertise, which is why the two must be brought together.

ACK AI Assistant limitations:

Lacks Underlying Data: Prometheus only shows high-level metrics, such as container Resident Set Size (RSS) and node available memory, without process-level or kernel-level details.

Missing Diagnostic Rules: Relies on Retrieval-Augmented Generation (RAG) over documentation. Without "executable rules," it can only provide a list of "possible causes" for deep issues.

Difficulty with Root Cause Analysis (RCA): It is hard to distinguish between "app leaks vs. low limits vs. noisy neighbors" based on monitoring metrics alone.

SysOM MCP limitations:

Lacks K8s Metadata: Unaware of native K8s objects (Pods, Deployments, DaemonSets). Cannot associate kernel data with business chains or deployment patterns.

Lacks Log Context: Cannot use application logs to determine what the business was doing during a memory spike.

Disconnected from Metrics: Limited awareness of time-series metrics (Prometheus), making historical trend analysis difficult.

Full-Dimensional Data Integration

Through the ACK MCP and SysOM MCP toolchains, the ACK AI Assistant achieves:

Automated Metadata Association: A single question allows the AI to automatically link Namespace → Deployment → Pod → Node → Instance Specs, mapping SysOM’s process data to K8s objects. SysOM explains "What" is happening (kernel-level RCA), while ACK MCP explains "Why" (K8s configuration context).

Fusion of Logs, Events, and Metrics: When an OOM occurs, the system automatically pulls container logs, K8s events, Prometheus metrics, and audit logs. SysOM provides the "current state" (memory snapshot), Prometheus provides "historical trends" (when it started), and audit logs provide "change events" (correlation with releases). Cross-referencing these allows the AI to distinguish between a traffic surge and a version defect.
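The metadata association step can be pictured as a walk along Kubernetes owner references: a Pod is owned by a ReplicaSet, which is in turn owned by a Deployment, and the Pod spec names the node it runs on. The sketch below shows that walk over already-fetched objects. It is an illustrative reconstruction, not the ACK MCP implementation, and the sample objects and function names are invented.

```python
from typing import Optional


def owner(obj: dict, kind: str) -> Optional[str]:
    """Name of the first owner of the given kind, if any."""
    for ref in obj.get("metadata", {}).get("ownerReferences", []):
        if ref.get("kind") == kind:
            return ref.get("name")
    return None


def describe_pod(pod: dict, replicasets: dict) -> dict:
    """Resolve Namespace -> Deployment -> Pod -> Node for one pod.

    `replicasets` maps (namespace, name) to ReplicaSet objects.
    """
    namespace = pod["metadata"]["namespace"]
    rs_name = owner(pod, "ReplicaSet")
    rs = replicasets.get((namespace, rs_name)) if rs_name else None
    return {
        "namespace": namespace,
        "deployment": owner(rs, "Deployment") if rs else None,
        "pod": pod["metadata"]["name"],
        "node": pod.get("spec", {}).get("nodeName"),
    }


# Invented sample objects, trimmed to the fields used above.
POD = {
    "metadata": {
        "namespace": "prod",
        "name": "api-7d9f-x2k4p",
        "ownerReferences": [{"kind": "ReplicaSet", "name": "api-7d9f"}],
    },
    "spec": {"nodeName": "node-1"},
}
REPLICASETS = {
    ("prod", "api-7d9f"): {
        "metadata": {"ownerReferences": [{"kind": "Deployment", "name": "api"}]},
    },
}

print(describe_pod(POD, REPLICASETS))
```

Once a process-level finding from the node is tied to a pod name, a chain like this lets the report speak in terms of the Deployment the team actually owns.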

Real-World Cases

CASE 1: Inconsistency between kubectl top node and Node Monitoring

Problem Scenario:

A customer found that kubectl top node showed 60% memory usage, while the cloud monitoring console showed 85%, a gap of 25 percentage points. This made it impossible to judge actual load or decide on scaling.

Traditional Solution:

Manually consult experts, investigate calculation formulas, check for hidden memory usage, and reconcile the differences.
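One reason such figures can diverge is that tools define "used" memory differently, for example by counting or excluding reclaimable page cache. The sketch below computes two common definitions from `/proc/meminfo` fields. The sample values are invented so that they reproduce a 60% versus 85% split; the article does not state which formulas the two tools in this case actually use, so treat this as an illustration of the general effect only.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields


def usage_views(meminfo: dict) -> dict:
    """Two definitions of memory usage, as percentages of MemTotal."""
    total = meminfo["MemTotal"]
    return {
        # Treats reclaimable memory (mostly page cache) as available.
        "excluding_reclaimable": 100 * (total - meminfo["MemAvailable"]) / total,
        # Treats everything that is not strictly free as used.
        "including_page_cache": 100 * (total - meminfo["MemFree"]) / total,
    }


# Invented sample; on a real node, read the text from /proc/meminfo.
SAMPLE_MEMINFO = """\
MemTotal:       16000000 kB
MemFree:         2400000 kB
MemAvailable:    6400000 kB
"""

for name, pct in usage_views(parse_meminfo(SAMPLE_MEMINFO)).items():
    print(f"{name}: {pct:.0f}%")
```

The denominator matters as well: a percentage computed against a node's allocatable memory will differ from one computed against its total capacity.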

With ACK AI Assistant:

[Figures 3 and 4: screenshots of the ACK AI Assistant diagnosis]

CASE 2: Java Application Pods Suffering Frequent OOMKilled Events

Problem Scenario:

After running in production for some time, a Netty service began to experience frequent OOMKilled restarts. The container was configured with a 4 GiB memory limit, and the JVM heap was set to -Xmx3g, which in theory should have been sufficient. However, the pod continued to be OOM-killed every few hours, leading to complaints from business teams about service instability.

Traditional Solution:

Java developers use various profiling tools (jmap, jstat) to find the memory leak, leading to long discussions on JVM parameters.
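A point that heap-focused discussions often miss is that the container limit applies to the whole process, not only the Java heap. Netty commonly allocates I/O buffers in direct (off-heap) memory, which -Xmx does not cap. The arithmetic below is a sketch of a memory budget under that assumption. Only the 4 GiB limit and the 3 GiB heap come from the case; the metaspace, thread, and miscellaneous figures are invented, and the article does not confirm that off-heap memory was the cause here.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def direct_memory_headroom(limit: int, heap_max: int, non_heap: dict) -> int:
    """Bytes left for off-heap buffers after the heap and other JVM areas."""
    return limit - heap_max - sum(non_heap.values())


LIMIT = 4 * GIB      # container memory limit (from the case)
HEAP_MAX = 3 * GIB   # -Xmx3g (from the case)

# Hypothetical non-heap consumers; real values depend on the application.
NON_HEAP = {
    "metaspace": 256 * MIB,
    "thread_stacks": 200 * 1 * MIB,      # 200 threads at 1 MiB each
    "code_cache_gc_misc": 128 * MIB,
}

headroom = direct_memory_headroom(LIMIT, HEAP_MAX, NON_HEAP)
print(f"room left for direct buffers: {headroom / MIB:.0f} MiB")
```

Under these assumed numbers only a few hundred MiB remain for direct buffers, so a service whose off-heap usage grows with load can exceed the limit even when the heap itself is healthy.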

With ACK AI Assistant:

[Figures 5 and 6: screenshots of the ACK AI Assistant diagnosis]

CASE 3: Improper EmptyDir Usage Leading to Pod OOMKilled

Problem Scenario:

A data processing pod was OOMKilled, but logs showed no anomalies and app memory usage was well below limits.

Traditional Solution:

SSH into the node, locate the cgroup path, manually parse memory.stat, and cross-reference with Pod specs. This requires deep kernel knowledge and multiple system switches.
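The manual step of parsing memory.stat can be sketched as follows. Memory-backed emptyDir volumes are tmpfs, and tmpfs pages are charged to the container's cgroup as shared memory, so a large `shmem` value alongside modest anonymous memory is one pattern that fits "app usage well below limits, yet OOMKilled". The sample values are invented, and the field handling assumes the standard cgroup v1 and v2 key names.

```python
MIB = 1024 ** 2


def parse_memory_stat(text: str) -> dict:
    """Parse cgroup memory.stat text ("key value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def breakdown(stats: dict) -> dict:
    """Split usage into anonymous, tmpfs/shared, and other file cache bytes."""
    # cgroup v2 names these anon/file; cgroup v1 names them rss/cache.
    anon = stats.get("anon", stats.get("rss", 0))
    file_total = stats.get("file", stats.get("cache", 0))
    shmem = stats.get("shmem", 0)
    return {
        "anonymous": anon,
        "shmem_tmpfs": shmem,
        # File cache totals include shmem, so subtract it to avoid double counting.
        "file_cache_excl_shmem": max(file_total - shmem, 0),
    }


# Invented cgroup v2 sample: small heap, large tmpfs footprint.
SAMPLE_STAT = """\
anon 209715200
file 3758096384
shmem 3670016000
"""

for name, value in breakdown(parse_memory_stat(SAMPLE_STAT)).items():
    print(f"{name}: {value / MIB:.0f} MiB")
```

Read against the Pod spec, a result like this points toward volume configuration (for example, an emptyDir without a size limit) rather than the application's own allocations.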

With ACK AI Assistant:

[Figures 7 and 8: screenshots of the ACK AI Assistant diagnosis]

Conclusion

By combining ACK AI Assistant with SysOM & ACK MCP, cloud-native memory management evolves from "experience-based" to a standardized, rule-driven, and tool-supported closed-loop capability.

This is more than stacking tools. It fuses the "Cloud-Native Perspective" with the "OS Perspective," giving SREs a complete diagnostic report and actionable recommendations, from the business layer down to the kernel, in response to a single question.

References:

ACK AI Assistant Documentation: https://www.alibabacloud.com/help/ack/ack-managed-and-ack-dedicated/user-guide/use-container-ai-assistant-for-troubleshooting-and-intelligent-q-a

Official Open-Source ACK MCP Toolset (GitHub): https://github.com/aliyun/alibabacloud-ack-mcp-server/blob/master/README.md

SysOM MCP (GitHub): https://github.com/alibaba/sysom_mcp

Operating System Console: https://help.aliyun.com/alinux/product-overview/what-is-the-operating-system-console
