
Container Service for Kubernetes:Best practices for container observability

Last Updated: Mar 26, 2026

Container Service for Kubernetes (ACK) integrates with Alibaba Cloud observability services out of the box, giving you metrics, logs, and traces across four layers: infrastructure, operating system, container/cluster, and application.


The recommended observability setup covers three areas of cluster stability:

  • Control plane — health and capacity of core Kubernetes components

  • Data plane — node health, storage, and network components

  • Applications — pod status, logs, APM, and tracing

Observability signals at a glance

The following table shows which tools cover each signal type in ACK:

| Signal | Tool | Coverage |
| --- | --- | --- |
| Control plane metrics | Managed Service for Prometheus | kube-apiserver, etcd, kube-scheduler, kube-controller-manager |
| Control plane logs | Simple Log Service (SLS) | Centralized log storage for managed cluster components |
| Node events | ack-node-problem-detector + SLS Event Center | 90-day retention; 1-minute check intervals |
| Node process metrics | CloudMonitor | Top 5 resource-intensive processes per node |
| Node OS journal logs | SLS | kubelet, kernel, and container engine logs |
| Container/Pod metrics | Managed Service for Prometheus (kubelet + kube-state-metrics) | CPU, memory, network, storage per Pod |
| Container storage metrics | Managed Service for Prometheus + csi-plugin | NAS, CPFS, OSS, and disk volumes |
| Container network metrics | Managed Service for Prometheus (kubelet) | Inbound/outbound traffic per Pod |
| Application logs | SLS | Non-intrusive log collection from stdout or log files |
| APM / distributed tracing | Application Real-Time Monitoring Service (ARMS) | Java, Python, Go, OpenTelemetry |
| Frontend monitoring | ARMS Real User Monitoring (RUM) | Web, mobile, miniapp |
| GPU metrics | Managed Service for Prometheus + ack-gpu-exporter | NVIDIA DCGM-compatible metrics, shared and exclusive GPU |
| Kernel-level container monitoring | SysOM | Memory issues, page cache, OS-level visibility |
| Alerts | ACK alert management | Preconfigured rule sets for nodes, Pods, workloads, and network |

Quick setup checklist

Enable these baseline capabilities before diving into each section:

  1. Deploy ack-node-problem-detector for node event monitoring.

  2. Enable Managed Service for Prometheus for cluster and container metrics.

  3. Connect SLS for log collection from pods and control plane components.

  4. Configure alert rule sets for nodes, Pods, workloads, and network exceptions.

  5. Enable the CloudMonitor plugin on ECS node pools for host-level process metrics.

After enabling event monitoring and Prometheus monitoring, configure contacts and contact groups for the personnel responsible for clusters or applications, set up the corresponding alert rules, and subscribe the contact groups to the relevant notification objects.

1. Cluster infrastructure control plane


The control plane handles API operations, workload scheduling, Kubernetes resource orchestration, cloud resource provisioning, and metadata storage. Key components include kube-apiserver, kube-scheduler, kube-controller-manager, cloud-controller-manager, and etcd.

ACK fully manages the control plane of ACK managed clusters with Service Level Agreement (SLA) guarantees. The following features give you real-time visibility into control plane health.

Enable control plane component monitoring

ACK extends the Kubernetes RESTful API so that external clients and in-cluster components (such as Prometheus) can scrape control plane metrics.

Collect control plane component logs

ACK clusters support centralized logging to your SLS projects. See Collect control plane component logs of ACK managed clusters.

Configure alert management

ACK includes default alert rules for core container anomalies. To add or customize rules, use the alert management feature.

2. Cluster infrastructure data plane

Cluster nodes

ACK worker nodes provide the resource environment for workload execution. While Kubernetes has built-in scheduling, preemption, and eviction mechanisms to tolerate transient node issues, comprehensive stability requires proactive monitoring of node state and resource load.

Event monitoring with ack-node-problem-detector

ack-node-problem-detector is ACK's node event monitoring component. It provides:

  • Full compatibility with upstream Kubernetes Node Problem Detector

  • ACK-specific enhancements for the cluster node environment, OS compatibility, and container engine

  • Enhanced node inspection plug-ins with 1-minute check intervals

  • 90-day event data retention via the integrated kube-eventer Deployment, which streams Kubernetes events to SLS Event Center — bypassing etcd's default 1-hour retention

When ack-node-problem-detector detects an abnormal node state, run kubectl describe node ${NodeName} to inspect the node's Condition, or view the node list on the Nodes page in the ACK console.
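The condition check that `kubectl describe node` surfaces can also be scripted. The sketch below is plain Python over a node object shaped like `kubectl get node <name> -o json` output (the sample data is hypothetical); it flags any condition in an unhealthy state, which is the same signal the Nodes page visualizes:

```python
def abnormal_conditions(node):
    """Return conditions indicating a problem: Ready must be "True";
    every other condition type (MemoryPressure, DiskPressure, PIDPressure,
    NetworkUnavailable, plus NPD-injected types) must be "False"."""
    problems = []
    for cond in node["status"]["conditions"]:
        healthy = (cond["status"] == "True") if cond["type"] == "Ready" \
                  else (cond["status"] == "False")
        if not healthy:
            problems.append((cond["type"], cond["status"], cond.get("reason", "")))
    return problems

# Hypothetical node object, as returned by `kubectl get node <name> -o json`
node = {
    "status": {
        "conditions": [
            {"type": "MemoryPressure", "status": "False"},
            {"type": "DiskPressure", "status": "True", "reason": "KubeletHasDiskPressure"},
            {"type": "Ready", "status": "True"},
        ]
    }
}
print(abnormal_conditions(node))  # [('DiskPressure', 'True', 'KubeletHasDiskPressure')]
```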

To receive alerts on pod startup failures and unavailable Service endpoints, configure alert management and subscribe to event-based notifications. SLS monitoring also supports event alerts.

Supported check items include GPU faults, which are detected by ack-node-problem-detector and by node diagnosis plug-ins.

ECS process-level monitoring for cluster nodes

Each ACK node corresponds to an ECS instance. Enable the process monitoring feature in CloudMonitor to get:

  • Historical process analysis: Top 5 resource-intensive processes by memory consumption, CPU usage, and open file descriptors

  • Host-level OS monitoring: CPU, memory, network, disk usage, inode count, network traffic, and simultaneous connection count

Important

Process monitoring configurations apply only to nodes added after you enable CloudMonitor on a node pool.

Host metrics vs. node metrics

Both measure resource usage, but differ in scope and calculation:

| Dimension | Host metrics | Node metrics |
| --- | --- | --- |
| Scope | Physical or virtual machine resources | Container engine resources |
| Memory numerator | Total memory used by all processes (Usage) | Total working memory (WorkingSet), including allocated memory, used memory, and page cache |
| Memory denominator | Host memory capacity (Capacity) | Total allocatable memory (Allocatable), excluding resources reserved for the container engine |
| Formula | Usage / Capacity | WorkingSet / Allocatable |

For more information, see Resource reservation policy.
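The two formulas can be compared side by side. A minimal sketch with hypothetical numbers for a 16 GiB node, showing why the node metric typically reads higher than the host metric for the same machine:

```python
# Hypothetical values, all in GiB, for a node with 16 GiB of physical memory
# where 4 GiB is reserved for kubelet and the container engine.
capacity = 16.0          # host memory capacity
allocatable = 12.0       # capacity minus container-engine reservations
usage = 8.0              # memory used by all processes (host view)
working_set = 9.0        # WorkingSet: used memory plus active page cache

host_utilization = usage / capacity           # host metric: Usage / Capacity
node_utilization = working_set / allocatable  # node metric: WorkingSet / Allocatable

print(f"host: {host_utilization:.0%}, node: {node_utilization:.0%}")  # host: 50%, node: 75%
```

Because the node metric has a larger numerator (page cache counts) and a smaller denominator (reservations are excluded), alert thresholds set on one metric cannot be reused directly for the other.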

Alert management for node anomalies

Enable and subscribe to alert rule sets for node health:

  • Alert Rule Set for Node Exceptions — abnormal node conditions

  • Alert Rule Set for Resource Exceptions — resource usage threshold breaches

Configure these through alert management.

Node OS journal log monitoring

systemd serves as the init system and service manager on Linux nodes. Its journal component provides system log collection, storage, real-time query, and analysis.

Collect and persist OS journal logs to SLS in scenarios such as:

  • Node stability monitoring (kubelet, OS kernel)

  • Workloads sensitive to OS or container engine changes, such as containers running in privileged mode, nodes with frequent resource overcommit, or workloads using OS resources directly

See Collect systemd journal logs of the node.

GPU and AI workload monitoring

For AI training and machine learning tasks, ACK provides GPU health monitoring and resource monitoring at the pod level.

Cluster data plane system components

Container storage

ACK supports the following storage types:

  • Local node storage: System disks, data disks, HostPath volumes (not monitored by Kubernetes — require manual monitoring), and emptyDir volumes (managed via ephemeral storage Resource Requests and Limits). Ephemeral storage monitoring covers: non-tmpfs emptyDir volumes, pod log files on nodes, and writable layers of all containers in the pod.

  • Secret/ConfigMap: Used for cluster resource metadata; no strict storage monitoring requirements.

  • External storage via PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs): Additionally mounted disk volumes, NAS (Network Attached Storage) volumes, Cloud Parallel File Storage (CPFS) volumes, and Object Storage Service (OSS) volumes.

The csi-plugin component exposes monitoring metrics for all supported storage types, which Managed Service for Prometheus collects into out-of-the-box dashboards. For a complete overview of supported and unsupported storage types, see Overview of container storage monitoring.

Container network

CoreDNS

CoreDNS is the cluster's DNS service discovery component. Monitor:

  • CoreDNS resource usage in the data plane

  • The Responses (by rcode) metric — abnormal resolution response codes including NXDOMAIN, SERVFAIL, and FormErr

Recommended setup:

  • Managed Service for Prometheus users: Use the built-in CoreDNS monitoring dashboards.

  • Self-managed Prometheus users: Configure metrics collection using community CoreDNS monitoring methods.

  • Enable alert management and subscribe to:

    • Alert Rule Set for Network Exceptions — CoreDNS configuration reload failures and status anomalies

    • Alert Rule Set for Pod Exceptions — CoreDNS Pod status and resource issues

  • Analyze CoreDNS logs to diagnose slow resolution and high-risk domain requests.
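As one way to turn the Responses (by rcode) metric into an alertable signal, the sketch below computes an error ratio from rcode-labeled counters. The label values follow community CoreDNS conventions; the sample numbers are hypothetical:

```python
def dns_error_ratio(responses_by_rcode):
    """Share of DNS responses with a server-side error rcode, given counter
    values keyed by rcode (as exposed by CoreDNS's Responses-by-rcode metric)."""
    # NXDOMAIN is excluded here: a nonexistent domain is often a legitimate answer.
    error_rcodes = {"SERVFAIL", "FORMERR", "REFUSED"}
    total = sum(responses_by_rcode.values())
    errors = sum(v for k, v in responses_by_rcode.items() if k in error_rcodes)
    return errors / total if total else 0.0

# 100 SERVFAILs out of 10,000 responses -> 1% error ratio
print(dns_error_ratio({"NOERROR": 9_400, "NXDOMAIN": 500, "SERVFAIL": 100}))
```

Whether NXDOMAIN belongs in the error set depends on your workloads; clusters with misconfigured search domains can see high NXDOMAIN rates that are worth alerting on separately.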

Ingress

When using Ingress for external traffic routing, monitor traffic volumes and call details, and alert on abnormal routing states.

Basic container network traffic monitoring

ACK clusters expose community-standard container metrics through the node kubelet, covering pod inbound and outbound traffic, abnormal traffic detection, and packet-level monitoring.

Pods configured with HostNetwork mode inherit the host's process network behavior. In this case, basic container monitoring metrics do not accurately reflect pod-level network traffic.

Monitoring options:

  • Managed Service for Prometheus: View pod-level network metrics directly in pod monitoring dashboards.

  • Self-managed Prometheus: Scrape kubelet metrics using community methods.

  • ECS-level network monitoring: Monitor host ECS network in the ECS console.
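The kubelet exposes pod traffic as cumulative byte counters (for example, `container_network_receive_bytes_total`), which dashboards convert into rates. A minimal sketch of that conversion, including the counter-reset case that occurs when a container restarts:

```python
def bytes_per_second(sample_a, sample_b):
    """Convert two cumulative counter samples (timestamp_s, value) into a rate,
    analogous to how PromQL's rate() treats kubelet network counters."""
    (t1, v1), (t2, v2) = sample_a, sample_b
    if v2 < v1:
        # Counter reset (e.g. container restarted): count only the new value.
        v1 = 0
    return (v2 - v1) / (t2 - t1)

# Two scrapes 30 s apart: ~3 MiB received in between
print(bytes_per_second((0, 1_000_000), (30, 4_145_728)))  # ~104857.6 B/s
```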

3. User applications

Container pod monitoring

Pods are the fundamental unit of application deployment in ACK. Their status and resource consumption directly affect application performance.

  • Prometheus-based pod metrics: Use Managed Service for Prometheus or self-managed Prometheus to collect community-standard container metrics from the node kubelet. Combine with kube-state-metrics (included in Managed Service for Prometheus or the ACK-provided prometheus-operator Helm chart) for comprehensive Pod metrics including CPU, memory, storage, and network. ACK clusters integrated with Managed Service for Prometheus include out-of-the-box pod monitoring dashboards.

  • Event monitoring for pod anomalies: Pod status changes trigger events. Enable event monitoring to track abnormal states such as OOM kills and pods not becoming ready. View real-time data on the Event Center page and historical data in SLS (retained for 90 days). Analyze pod lifecycle timelines through the pod event monitoring dashboard.

  • Alert subscriptions: After enabling alert management and event monitoring, subscribe to the following alert rule sets. See Best practices for configuring alert rules in Prometheus.

    • Alert Rule Set for Workload Exceptions

    • Alert Rule Set for Pod Exceptions

  • Custom Prometheus alert rules: Create custom rules for application-specific thresholds. See Create an alert rule for a Prometheus instance and use sample PromQL from Pod anomalies as a starting point.
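The filtering that event-based alerting applies can be sketched in a few lines of Python over event objects shaped like items from `kubectl get events -o json` (the sample data is hypothetical):

```python
def warning_events(events):
    """Keep only Warning-type events, newest first, as (reason, object, message)."""
    warn = [e for e in events if e["type"] == "Warning"]
    warn.sort(key=lambda e: e["lastTimestamp"], reverse=True)
    return [(e["reason"], e["involvedObject"]["name"], e["message"]) for e in warn]

events = [
    {"type": "Normal", "reason": "Pulled", "lastTimestamp": "2026-03-26T10:00:00Z",
     "involvedObject": {"name": "web-0"}, "message": "Container image already present"},
    {"type": "Warning", "reason": "BackOff", "lastTimestamp": "2026-03-26T10:05:00Z",
     "involvedObject": {"name": "web-0"}, "message": "Back-off restarting failed container"},
]
print(warning_events(events))
```

Reasons such as BackOff, FailedScheduling, and OOMKilling are the typical triggers behind the preconfigured pod and workload alert rule sets.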

Container application log monitoring

ACK provides non-intrusive log collection for application pods. Collect application logs in ACK clusters and use SLS log analysis features for anomaly diagnosis and operational status assessment.

If your business applications do not implement log file rotation and splitting, collect logs from the pod's stdout instead.

Fine-grained memory monitoring

In Kubernetes, real-time container memory usage is measured by Working Set Size (WSS) — the metric Kubernetes uses for scheduling and resource allocation. WSS includes:

  • Anonymous memory and active file-backed page cache (inactive file cache is excluded, because the kernel can reclaim it cheaply)

  • OS-layer memory components, such as kernel memory attributed to the container
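The WorkingSet computation that kubelet and cAdvisor apply is simple to state: total cgroup memory usage minus inactive file-backed page cache. A minimal sketch:

```python
def working_set_bytes(usage_bytes, total_inactive_file):
    """WorkingSet as kubelet/cAdvisor compute it: total cgroup memory usage
    minus inactive file-backed page cache (cheaply reclaimable by the kernel).
    Clamped at zero, matching the upstream behavior."""
    return max(usage_bytes - total_inactive_file, 0)

# Container using 600 MiB of which 150 MiB is inactive page cache
mib = 1024 * 1024
print(working_set_bytes(600 * mib, 150 * mib) // mib)  # prints 450
```

This is why a container that writes large log files can show a WSS well below its raw memory usage, while one dominated by anonymous memory cannot.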


Abnormal WSS growth can trigger pod OOMKilled events or node-level memory pressure and pod evictions. A common pattern appears in Java applications that use Log4j or Logback with default new I/O (NIO) or memory-mapped file (mmap) configurations:

  • Anonymous memory spikes from frequent read/write operations under high log volume

  • Memory allocation "black holes": WSS growth that application-level profiling tools cannot attribute

ACK provides kernel-level container monitoring based on SysOM to surface OS-layer memory details. See Observe and resolve container memory issues through SysOM.

Integrate custom application metrics

If your team writes application code, use the Prometheus client to expose business-specific metrics through instrumentation. Collect and visualize these metrics in Prometheus to build unified dashboards for infrastructure and application teams, accelerating incident response and reducing mean time to recovery (MTTR).
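In practice you would use an official Prometheus client library for your language; the sketch below instead renders the text exposition format with the Python standard library only, to show how little is involved in exposing a business metric (the metric name app_orders_total and port are hypothetical):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

ORDERS_TOTAL = 0  # business counter, incremented by application code

def render_metrics():
    """Render metrics in the Prometheus text exposition format."""
    return (
        "# HELP app_orders_total Orders processed since start.\n"
        "# TYPE app_orders_total counter\n"
        f"app_orders_total {ORDERS_TOTAL}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To expose the endpoint for scraping:
#   HTTPServer(("", 8000), MetricsHandler).serve_forever()
print(render_metrics())
```

Once the pod exposes a /metrics endpoint, annotate or configure the Prometheus service discovery so the collector scrapes it alongside the standard kubelet metrics.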

APM and distributed tracing

Application Real-Time Monitoring Service (ARMS) provides application performance monitoring (APM) for multiple runtimes. Choose the integration method based on your application language.

| Language | Instrumentation type | Capabilities | Setup |
| --- | --- | --- | --- |
| Java | Non-intrusive (zero code changes) | Application topology, 3D dependency maps, interface/JVM monitoring, exception and slow transaction capture | Java application monitoring |
| Python | Intrusive (code instrumented) | Django/Flask/FastAPI support; LlamaIndex/Langchain AI/LLM tracking; topology, traces, API diagnostics | Install ack-onepilot and adjust the Dockerfile. See Python application monitoring. |
| Go | Binary instrumentation | Application topology, database query analysis, API call monitoring | Install ack-onepilot and compile with instgo. See Go application monitoring. |
| OpenTelemetry | Intrusive | End-to-end distributed tracing, request analysis, topology, and dependency analysis | Managed Service for OpenTelemetry; see the Integration guide for language-specific setup. |

Frontend monitoring

For web applications, mobile apps, and miniapps that serve external users, enable the Real User Monitoring (RUM) feature in ARMS. RUM provides:

  • Full reconstruction of user interaction flows

  • Performance metrics: page load speed and API request tracing

  • Failure analysis: JavaScript errors and network failures

  • Stability monitoring: JavaScript loading errors, crashes, and Application Not Responding (ANR) errors

  • Log correlation to accelerate root cause diagnosis

See Integrate applications to get started.

Service Mesh observability

Alibaba Cloud Service Mesh (ASM) is a fully managed service mesh platform compatible with open-source Istio. It handles traffic routing and splitting, secures inter-service communication, and provides mesh observability — reducing development and operations overhead.

ASM supports full lifecycle observability across three operational phases:

| Phase | Focus |
| --- | --- |
| Day 0 (planning) | Validate traffic configuration states during system release |
| Day 1 (deployment) | Monitor real-time traffic distribution across microservices |
| Day 2 (maintenance) | Enforce stability based on Service Level Objective (SLO) metrics |

ASM provides unified Service Mesh observability capabilities through a converged telemetry pipeline.

Multi-cloud and hybrid cloud observability

Distributed Cloud Container Platform for Kubernetes (ACK One) is Alibaba Cloud's enterprise platform for hybrid cloud, multi-cluster management, distributed computing, and disaster recovery. It provides:

  • Cross-infrastructure cluster management across any region or infrastructure

  • Unified governance for compute, network, storage, security, monitoring, logging, jobs, applications, and traffic

  • Community-aligned APIs for seamless integration

Observability for ACK One registered clusters

ACK One registered clusters provide the same observability capabilities as standard ACK clusters, including integrations for SLS, Event Center, alerts, ARMS, and Managed Service for Prometheus.

Registered clusters require additional network configuration and authorization due to heterogeneous network environments and permission systems.

ACK One Fleet global monitoring

ACK One Fleet aggregates Prometheus metrics from multiple clusters into a unified monitoring dashboard through global aggregation instances, eliminating the need for manual metric comparison across clusters. Enable global monitoring after creating an ACK One Fleet instance and associating at least two clusters with it.

Unified alert management

Manage alert rules at the Fleet level to enforce consistency across all associated clusters.

GitOps observability

Argo Workflows monitoring

Argo Workflows is a cloud-native workflow engine for batch data processing, machine learning pipelines, infrastructure automation, and CI/CD. When deploying Argo Workflows on ACK or using Kubernetes clusters for distributed Argo workflows, enable the following:

  • Log persistence with SLS: Native Kubernetes garbage collection purges pod and workflow logs after resource cleanup. Integrate SLS with workflow clusters to collect and persist logs generated during workflow execution. View workflow logs through Argo CLI or Argo UI.

  • Prometheus monitoring: Enable Managed Service for Prometheus for workflow running status and cluster health monitoring.

Knative application observability

ACK Knative is ACK's serverless framework built on community Knative. Knative provides request-driven auto scaling (including scale-to-zero) and version management with canary rollouts. ACK Knative adds capabilities such as reducing cold start latency by retaining instances and workload forecasting through Advanced Horizontal Pod Autoscaler (AHPA). See Knative observability for an overview.