When a Kubernetes node or pod runs out of memory, symptoms are often intermittent and hard to trace: applications crash unexpectedly, pods get evicted, or cluster performance degrades without a clear cause. The memory diagnostics feature of Container Intelligence Service helps you identify the root cause of common memory issues in ACK clusters — including memory leaks, memory fragmentation, and out of memory (OOM) errors. Diagnostic results appear as charts and tables so you can assess system memory health at a glance.
Memory diagnostics covers three areas: memory overview, memory analysis, and OOM analysis. You can inspect memory usage at both the node and pod level.
Diagnostic items may vary based on your cluster configuration. The items shown on the diagnostics page reflect your actual cluster state.
When you run the diagnostics feature, ACK runs a data collection program on each node to gather diagnostic results. Collected data includes the system version, load status, Docker and kubelet status, and key error messages in system logs. ACK does not collect business information or sensitive data.
Diagnostic workflow
Use the three diagnostic areas in sequence to narrow down a memory issue:
1. Memory overview: check for high-level memory risks, including leaked memory, memory fragmentation, unreleased Memcg entries, and THP waste. Use the charts to confirm whether abnormal usage is in kernel memory or application memory.
2. Memory analysis: drill down to process-level and pod-level memory usage to identify which process or container is consuming excessive anonymous memory, page cache, or shared memory.
3. OOM analysis: review OOM event counts and types to determine whether OOM errors are occurring at the node (Host) or container (cgroup) level, and which containers have hit their memory limits.
Memory overview
Memory overview surfaces diagnostic items related to memory risks. The following table describes each item.
| Diagnostic item | Description |
|---|---|
| Leaked Memory | Checks for system kernel memory leaks in the Slab, Vmalloc, and buddy system (allocpage). |
| Memory Usage | Displays system memory utilization. |
| Memcg | Evaluates whether unreleased memory cgroups (Memcg) are degrading system performance or causing statistical errors. |
| Memory Fragmentation | Checks for memory fragmentation that degrades system performance. |
| THPZeroPage | Evaluates the ratio of Transparent Huge Page (THP) waste. |
System memory usage is also displayed in charts, broken down into three categories:
Kernel memory (kernel): total memory used by the operating system kernel.
Application memory (app): total memory used by programs in user mode.
Free memory (free): total free system memory.
Key concepts
The following terms are used throughout memory diagnostics.
| Term | Description |
|---|---|
| Memory leak | A memory leak occurs when memory dynamically allocated to a program is never released, causing system memory utilization to grow continuously. Unresolved memory leaks degrade program performance and can cause system crashes. |
| Memory utilization | Memory utilization = (Total memory - Free memory) × 100 / Total memory. Page cache is counted as free memory in this formula because the kernel can reclaim and reuse it at any time, so it does not raise memory utilization. |
| Unreleased Memcg | A memory cgroup that was not released due to a system exception. Unreleased Memcg entries can degrade system performance. |
| Memory fragmentation | After a system runs for an extended period, free contiguous memory blocks become too small to satisfy large contiguous allocation requests. This delays memory allocation and causes application jitter. |
| Ratio of THP waste | Ratio of THP waste = Number of zero THPs x 100% / Total number of THPs. See THP details below. |
| Buddy system | The Linux kernel algorithm for managing memory pages. It divides memory pages into 11 groups and manages blocks in powers of two: 4 KB, 8 KB, 16 KB, 32 KB ... 4 MB. Most memory pages are 4 KB. |
| Slab | A memory allocator that allocates small pieces of memory on top of the buddy system. |
| Vmalloc | A memory allocator that uses nonlinear mapping on top of the buddy system. |
| Page cache (filecache) | When Linux reads or writes a file, it caches the file content in memory for faster subsequent access. |
| Anonymous memory | Memory dynamically allocated to a process's heap and stack, for example through new, malloc, or mmap. It is not backed by a file. |
| Shared memory | A memory block shared by two or more processes for inter-process communication. |
| tmpfs | A Linux temporary file system backed by memory. Content read or written to tmpfs is cached in memory. |
| hugetlb | Memory consumed by huge pages in a file system. |
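Two of the definitions above are easy to check numerically. The following is a minimal sketch, not the service's implementation: the function names and the sample node sizes are hypothetical, and the free-memory figure is assumed to already include reclaimable page cache, as the memory utilization definition states.

```python
def memory_utilization(total_kib: int, free_kib: int) -> float:
    """Memory utilization (%) = (total - free) x 100 / total."""
    if total_kib <= 0:
        raise ValueError("total_kib must be positive")
    return (total_kib - free_kib) * 100 / total_kib


def buddy_block_sizes_kib(page_kib: int = 4, groups: int = 11) -> list:
    """Block size of each buddy-system group: page size x 2**order."""
    return [page_kib * 2**order for order in range(groups)]


# Hypothetical node: 16 GiB total, 2 GiB unused, 6 GiB page cache.
# Page cache is treated as free, so 8 GiB counts as free memory.
total_kib = 16 * 1024 * 1024
free_kib = (2 + 6) * 1024 * 1024
print(memory_utilization(total_kib, free_kib))  # 50.0

sizes = buddy_block_sizes_kib()
print(sizes[0], sizes[-1])  # 4 4096 (4 KB up to 4 MB)
```

With 4 KB pages and 11 groups, the largest block is 4 KB × 2^10 = 4 MB, which matches the range given in the buddy system entry.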
THP details
Transparent Huge Pages (THP) are huge pages sized 2 MiB or 1 GiB in the kernel. Each subpage is 4 KiB, so one 2-MiB THP equals 512 subpages.
When THP is enabled, the kernel dynamically allocates THPs to reduce Translation Lookaside Buffer (TLB) misses and improve application performance. However, THP can cause memory bloat and memory overcommitment: when an application requests only 8 KiB (2 subpages), the kernel allocates a full 2-MiB THP — leaving 510 zero subpages that waste resident set size (RSS) and can trigger OOM errors.
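The arithmetic behind THP waste can be sketched as follows. This is an illustration under stated assumptions: it models a single 2 MiB THP with 4 KiB subpages, and the function names are hypothetical rather than part of any ACK or kernel API.

```python
THP_SIZE_KIB = 2 * 1024  # one 2 MiB THP
SUBPAGE_KIB = 4          # one 4 KiB subpage


def zero_subpages(requested_kib: int) -> int:
    """Subpages of one 2 MiB THP left untouched by a request."""
    subpages_per_thp = THP_SIZE_KIB // SUBPAGE_KIB
    used = -(-requested_kib // SUBPAGE_KIB)  # ceiling division
    if used > subpages_per_thp:
        raise ValueError("request does not fit in a single THP")
    return subpages_per_thp - used


def thp_waste_ratio(zero_thps: int, total_thps: int) -> float:
    """Ratio of THP waste (%) = zero THPs x 100 / total THPs."""
    if total_thps <= 0:
        raise ValueError("total_thps must be positive")
    return zero_thps * 100 / total_thps


print(THP_SIZE_KIB // SUBPAGE_KIB)  # 512 subpages per THP
print(zero_subpages(8))             # 510 zero subpages for an 8 KiB request
print(thp_waste_ratio(25, 100))     # 25.0
```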
Kernel memory metrics
In most cases, kernel memory leaks show up as abnormal usage in SUnreclaim or the buddy system. Monitor these metrics closely.
| Metric | Description |
|---|---|
| SReclaimable | Memory that the Slab can reclaim. |
| SUnreclaim | Memory that the Slab cannot reclaim. Abnormal growth here is a strong indicator of a kernel memory leak. |
| PageTables | Memory occupied by kernel page tables. |
| Vmalloc | Memory allocated by the Vmalloc function. |
| KernelStack | Total memory occupied by the kernel stacks of all tasks. |
| AllocPages | Memory allocated directly from the buddy system by functions such as alloc_pages. This memory is not reported in any statistics file on the node, so excessive use creates a memory black hole: memory that is in use but cannot be attributed to any consumer. |
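To watch these metrics outside the console, one option is to compare two snapshots of `/proc/meminfo`. The sketch below is an assumption-laden illustration, not how the diagnostics feature collects data: it assumes the Vmalloc metric corresponds to the `VmallocUsed` field, it omits AllocPages because that memory has no `/proc/meminfo` field, and the sample snapshots are made-up values.

```python
import re

# Assumed /proc/meminfo field names for the kernel metrics in the table.
KERNEL_FIELDS = ("SReclaimable", "SUnreclaim", "PageTables",
                 "VmallocUsed", "KernelStack")


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value in KiB}."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"^([^:]+):\s+(\d+)", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def kernel_metrics(text: str) -> dict:
    """Keep only the kernel-side fields listed in KERNEL_FIELDS."""
    parsed = parse_meminfo(text)
    return {field: parsed.get(field, 0) for field in KERNEL_FIELDS}


def sunreclaim_growth_kib(before: str, after: str) -> int:
    """Growth of unreclaimable Slab memory between two snapshots."""
    return parse_meminfo(after)["SUnreclaim"] - parse_meminfo(before)["SUnreclaim"]


# Made-up snapshots; on a node you would read open("/proc/meminfo").read().
SAMPLE_BEFORE = """\
MemTotal:       16265208 kB
SReclaimable:     412340 kB
SUnreclaim:       198764 kB
KernelStack:       12288 kB
PageTables:        35120 kB
VmallocUsed:       41232 kB
"""
SAMPLE_AFTER = SAMPLE_BEFORE.replace("198764", "1247340")

print(kernel_metrics(SAMPLE_BEFORE)["SUnreclaim"])         # 198764
print(sunreclaim_growth_kib(SAMPLE_BEFORE, SAMPLE_AFTER))  # 1048576
```

A steady increase across several snapshots, rather than a single large value, is the pattern that suggests a leak.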
Application memory metrics
When analyzing user-mode memory usage, focus on anonymous memory, shared memory, and page cache.
| Metric | Description |
|---|---|
| filecache | Page cache that can be reclaimed by dropping caches (writing to /proc/sys/vm/drop_caches). |
| anon | Anonymous memory used by a program's heap and stack. High anon usage can indicate a process memory leak or memory bloat caused by THP. |
| mlock | Memory locked by the system. |
| huge | Memory used by huge pages. |
| buffer | Memory used by block device and file system metadata. |
| shmem | Shared memory (tmpfs). Memory leaks occur if a tmpfs file is not deleted after the process exits, or if a file is deleted while it is still open. |
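The second shmem leak pattern, a file deleted while it is still open, can be reproduced with a few lines. This sketch is only an illustration of the mechanism: it uses the default temporary directory, which may or may not be tmpfs on a given system, it assumes a POSIX system where an open file can be unlinked, and the function name is hypothetical.

```python
import os
import tempfile


def deleted_but_open_demo(size_bytes: int = 1024 * 1024) -> dict:
    """Unlink a file while it is still open, then inspect the open handle."""
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(b"\0" * size_bytes)
        handle.flush()
        os.unlink(handle.name)              # the name disappears from the directory
        info = os.fstat(handle.fileno())    # the data is still reachable via the handle
        return {
            "path_exists": os.path.exists(handle.name),
            "link_count": info.st_nlink,
            "size_bytes": info.st_size,
        }
    # Leaving the with block closes the handle; only then is the space released.


result = deleted_but_open_demo()
print(result["path_exists"])  # False
print(result["size_bytes"])   # 1048576
```

The file no longer shows up in directory listings, yet its contents stay allocated until the last open handle is closed. On tmpfs, that allocation is memory, which is why a long-running process holding such handles looks like a shmem leak.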
Memory analysis
Memory analysis is split into two views: process memory and pod memory.
Process memory
The process memory view shows memory usage per process, including anonymous memory, page cache, and shared memory.
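A comparable per-process breakdown can be approximated from `/proc/<pid>/status`, which on reasonably recent Linux kernels reports `RssAnon`, `RssFile`, and `RssShmem`. The sketch below is a rough stand-in, not the view's actual data source: mapping `RssFile` to page cache is an assumption, and the process names and values are made up.

```python
# Assumed mapping from /proc/<pid>/status fields to the view's categories.
STATUS_FIELDS = {"RssAnon": "anon", "RssFile": "filecache", "RssShmem": "shmem"}


def parse_process_rss(status_text: str) -> dict:
    """Extract anonymous, file-backed, and shared RSS (KiB) from status text."""
    result = {label: 0 for label in STATUS_FIELDS.values()}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in STATUS_FIELDS and rest.split():
            result[STATUS_FIELDS[key]] = int(rest.split()[0])
    return result


def top_consumer(processes: dict, category: str) -> str:
    """Return the process name with the largest usage in one category."""
    return max(processes, key=lambda name: parse_process_rss(processes[name])[category])


# Made-up samples; on a node you would read /proc/<pid>/status for each PID.
SAMPLES = {
    "app-server": "Name:\tapp-server\nRssAnon:\t 2097152 kB\nRssFile:\t   51200 kB\nRssShmem:\t    1024 kB\n",
    "log-agent": "Name:\tlog-agent\nRssAnon:\t   65536 kB\nRssFile:\t  409600 kB\nRssShmem:\t       0 kB\n",
}

print(parse_process_rss(SAMPLES["app-server"])["anon"])  # 2097152
print(top_consumer(SAMPLES, "anon"))                     # app-server
print(top_consumer(SAMPLES, "filecache"))                # log-agent
```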
Pod memory
The pod memory view shows which files are occupying page cache and shared memory in each container and pod, along with active and inactive cache ratios.
| Diagnostic item | Description |
|---|---|
| Pod | The name of the pod. |
| Container | The name of the container. |
| File | The full path of the file, including the file name. |
| Cache | The page cache (filecache) occupied by the file. |
| Container Cache | The container-level cache occupied by the file. Multiple processes in the same container may reference the same file. |
| Active Cache | Page cache that is currently in use. |
| Inactive Cache | Page cache that is not in use and is eligible for reclaim. |
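At the container level, the split between active and inactive cache can be estimated from the `active_file` and `inactive_file` counters in the cgroup's `memory.stat` file. The following sketch assumes those counters are a fair proxy for Active Cache and Inactive Cache. The sample values are invented, and the per-file breakdown shown in the table is not reproduced here.

```python
def cache_ratios(memory_stat_text: str) -> dict:
    """Return active and inactive page cache as percentages of their sum."""
    stats = {}
    for line in memory_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    active = stats.get("active_file", 0)
    inactive = stats.get("inactive_file", 0)
    total = active + inactive
    if total == 0:
        return {"active": 0.0, "inactive": 0.0}
    return {"active": active * 100 / total, "inactive": inactive * 100 / total}


# Invented sample: 300 MiB active and 100 MiB inactive file cache, in bytes.
SAMPLE_MEMORY_STAT = """\
anon 536870912
active_file 314572800
inactive_file 104857600
"""

print(cache_ratios(SAMPLE_MEMORY_STAT))  # {'active': 75.0, 'inactive': 25.0}
```

A high inactive share means most of the container's cache is eligible for reclaim, so a large cache figure alone is not necessarily a problem.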
OOM analysis
OOM analysis diagnoses out of memory errors and shows the following diagnostic items.
| Diagnostic item | Description |
|---|---|
| OS OOM Count | Total number of OOM errors from host startup to the time of diagnosis. |
| Available Memory | Current free system memory. |
| Low Watermark | The low memory threshold. When available memory drops below this value, the kernel triggers an asynchronous memory reclaim operation to free up memory. |
| Container | The name of the pod, ID of the container, or name of the cgroup. |
| limit | The memory limit configured for the container. |
| usage | Current memory used by the container. |
| OOM Count | Total number of OOM errors that have occurred in the container. |
| OOM Type | The type of OOM error: Host or cgroup. |
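The Host versus cgroup distinction can also be seen in kernel log messages. The sketch below counts OOM events by type from log text. It is a heuristic: it assumes the kernel prefixes container-level kills with "Memory cgroup out of memory" and node-level kills with "Out of memory:", and exact wording varies across kernel versions. The sample lines are made up, and the diagnostics feature may derive its counts differently.

```python
from collections import Counter


def classify_oom_line(line: str):
    """Return 'cgroup', 'Host', or None for a single kernel log line."""
    if "Memory cgroup out of memory" in line:
        return "cgroup"
    if "Out of memory:" in line:
        return "Host"
    return None


def count_oom_types(log_text: str) -> dict:
    """Count OOM events by type across all lines of a kernel log."""
    counts = Counter()
    for line in log_text.splitlines():
        kind = classify_oom_line(line)
        if kind is not None:
            counts[kind] += 1
    return dict(counts)


# Made-up log lines for illustration only.
SAMPLE_LOG = """\
Memory cgroup out of memory: Killed process 4121 (java)
systemd[1]: Started Session 12 of user root.
Out of memory: Killed process 977 (python3)
Memory cgroup out of memory: Killed process 5310 (node)
"""

print(count_oom_types(SAMPLE_LOG))  # {'cgroup': 2, 'Host': 1}
```

A cgroup OOM points to a container reaching its configured limit, while a Host OOM means the node as a whole ran short of memory.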