Container Service for Kubernetes (ACK) integrates with Alibaba Cloud observability services out of the box, giving you metrics, logs, and traces across four layers: infrastructure, operating system, container/cluster, and application.
The recommended observability setup covers three areas of cluster stability:
- Control plane — health and capacity of core Kubernetes components
- Data plane — node health, storage, and network components
- Applications — pod status, logs, APM, and tracing
Observability signals at a glance
The following table shows which tools cover each signal type in ACK:
| Signal | Tool | Coverage |
|---|---|---|
| Control plane metrics | Managed Service for Prometheus | kube-apiserver, etcd, kube-scheduler, kube-controller-manager |
| Control plane logs | Simple Log Service (SLS) | Centralized log storage for managed cluster components |
| Node events | ack-node-problem-detector + SLS Event Center | 90-day retention; 1-minute check intervals |
| Node process metrics | CloudMonitor | Top 5 resource-intensive processes per node |
| Node OS journal logs | SLS | kubelet, kernel, and container engine logs |
| Container/Pod metrics | Managed Service for Prometheus (kubelet + kube-state-metrics) | CPU, memory, network, storage per Pod |
| Container storage metrics | Managed Service for Prometheus + csi-plugin | NAS, CPFS, OSS, and disk volumes |
| Container network metrics | Managed Service for Prometheus (kubelet) | Inbound/outbound traffic per Pod |
| Application logs | SLS | Non-intrusive log collection from stdout or log files |
| APM / distributed tracing | Application Real-Time Monitoring Service (ARMS) | Java, Python, Go, OpenTelemetry |
| Frontend monitoring | ARMS Real User Monitoring (RUM) | Web, mobile, miniapp |
| GPU metrics | Managed Service for Prometheus + ack-gpu-exporter | NVIDIA DCGM-compatible metrics, shared and exclusive GPU |
| Kernel-level container monitoring | SysOM | Memory issues, page cache, OS-level visibility |
| Alerts | ACK alert management | Preconfigured rule sets for nodes, Pods, workloads, and network |
Quick setup checklist
Enable these baseline capabilities before diving into each section:
- Deploy ack-node-problem-detector for node event monitoring.
- Enable Managed Service for Prometheus for cluster and container metrics.
- Connect SLS for log collection from pods and control plane components.
- Configure alert rule sets for nodes, Pods, workloads, and network exceptions.
- Enable the CloudMonitor plugin on ECS node pools for host-level process metrics.
After enabling event monitoring and Prometheus monitoring, configure contacts and contact groups for the personnel responsible for clusters or applications, set up the corresponding alert rules, and subscribe those contact groups to the relevant notifications.
1. Cluster infrastructure control plane
The control plane handles API operations, workload scheduling, Kubernetes resource orchestration, cloud resource provisioning, and metadata storage. Key components include kube-apiserver, kube-scheduler, kube-controller-manager, cloud-controller-manager, and etcd.
ACK fully manages the control plane of ACK managed clusters with Service Level Agreement (SLA) guarantees. The following features give you real-time visibility into control plane health.
Enable control plane component monitoring
ACK extends the Kubernetes RESTful API so that external clients and in-cluster components (such as Prometheus) can scrape control plane metrics.
- Managed Service for Prometheus: Use the out-of-the-box control plane monitoring dashboards for ACK managed Pro clusters.
- Self-managed Prometheus: Follow Use a self-managed Prometheus instance to collect control plane metrics and configure alerts.
Collect control plane component logs
ACK clusters support centralized logging to your SLS projects. See Collect control plane component logs of ACK managed clusters.
Configure alert management
ACK includes default alert rules for core container anomalies. To add or customize rules:
- Managed Service for Prometheus, SLS, or CloudMonitor data sources: See Alert management.
- Self-managed Prometheus: See Best practices for configuring alert rules in Prometheus.
2. Cluster infrastructure data plane
Cluster nodes
ACK worker nodes provide the resource environment for workload execution. While Kubernetes has built-in scheduling, preemption, and eviction mechanisms to tolerate transient node issues, comprehensive stability requires proactive monitoring of node state and resource load.
Event monitoring with ack-node-problem-detector
ack-node-problem-detector is ACK's node event monitoring component. It provides:
- Full compatibility with upstream Kubernetes Node Problem Detector
- ACK-specific enhancements for the cluster node environment, OS compatibility, and container engine
- Enhanced node inspection plug-ins with 1-minute check intervals
- 90-day event data retention via the integrated kube-eventer Deployment, which streams Kubernetes events to SLS Event Center — bypassing etcd's default 1-hour retention
When ack-node-problem-detector detects an abnormal node state, run `kubectl describe node ${NodeName}` to inspect the node's Condition, or view the node list on the Nodes page in the ACK console.
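The same Condition check can be scripted against `kubectl get node <name> -o json` output. A minimal sketch in Python, assuming the standard Kubernetes node status schema (the sample node object below is illustrative):

```python
def unhealthy_conditions(node: dict) -> list[str]:
    """Return the condition types on a node that signal a problem.

    In the Kubernetes node schema, `Ready` is healthy when its status is
    "True"; pressure/problem conditions (MemoryPressure, DiskPressure,
    PIDPressure, and NPD-reported conditions) are healthy when "False".
    A status of "Unknown" is treated as unhealthy either way.
    """
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        if cond["type"] == "Ready":
            healthy = cond["status"] == "True"
        else:
            healthy = cond["status"] == "False"
        if not healthy:
            problems.append(cond["type"])
    return problems

# Illustrative node object, e.g. parsed from `kubectl get node <name> -o json`
node = {
    "status": {
        "conditions": [
            {"type": "Ready", "status": "True"},
            {"type": "MemoryPressure", "status": "False"},
            {"type": "DiskPressure", "status": "True"},  # abnormal
        ]
    }
}
print(unhealthy_conditions(node))  # → ['DiskPressure']
```

A loop like this over all nodes gives a quick cluster-wide health snapshot when you do not want to open the console.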
To receive alerts on pod startup failures and unavailable Service endpoints, configure alert management and subscribe to event-based notifications. SLS monitoring also supports event alerts.
For the full list of supported check items, see GPU faults detected by ack-node-problem-detector and Node diagnosis plug-ins.
ECS process-level monitoring for cluster nodes
Each ACK node corresponds to an ECS instance. Enable the process monitoring feature in CloudMonitor to get:
- Historical process analysis: Top 5 resource-intensive processes by memory consumption, CPU usage, and open file descriptors
- Host-level OS monitoring: CPU, memory, network, disk usage, inode count, network traffic, and simultaneous connection count
Process monitoring configurations apply only to nodes added after you enable CloudMonitor on a node pool.
Host metrics vs. node metrics
Both measure resource usage, but differ in scope and calculation:
| Dimension | Host metrics | Node metrics |
|---|---|---|
| Scope | Physical or virtual machine resources | Container engine resources |
| Memory numerator | Total memory used by all processes (Usage) | Total working memory (WorkingSet), including allocated memory, used memory, and page cache |
| Memory denominator | Host memory capacity (Capacity) | Total allocatable memory (Allocatable), excluding resources reserved for the container engine |
| Formula | Usage / Capacity | WorkingSet / Allocatable |
For more information, see Resource reservation policy.
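Because WorkingSet counts page cache while Allocatable is smaller than Capacity, the two formulas can diverge noticeably on the same machine. An illustrative calculation (all numbers are made up):

```python
GIB = 1024 ** 3

def host_memory_utilization(usage: int, capacity: int) -> float:
    # Host metric: memory used by all processes over total machine memory
    return usage / capacity

def node_memory_utilization(working_set: int, allocatable: int) -> float:
    # Node metric: WorkingSet (including page cache) over Allocatable
    # (machine memory minus container engine / system reservations)
    return working_set / allocatable

capacity = 16 * GIB      # physical memory of the ECS instance
allocatable = 14 * GIB   # after kubelet / container engine reservations
usage = 8 * GIB          # memory used by processes
working_set = 11 * GIB   # usage plus active page cache

print(f"host: {host_memory_utilization(usage, capacity):.0%}")           # host: 50%
print(f"node: {node_memory_utilization(working_set, allocatable):.0%}")  # node: 79%
```

The gap explains why a node can look comfortable in CloudMonitor yet trigger memory-pressure evictions in Kubernetes.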
Alert management for node anomalies
Enable and subscribe to alert rule sets for node health:
- Alert Rule Set for Node Exceptions — abnormal node conditions
- Alert Rule Set for Resource Exceptions — resource usage threshold breaches
Configure these through alert management.
Node OS journal log monitoring
systemd serves as the init system and service manager on Linux nodes. Its journal component provides system log collection, storage, real-time query, and analysis.
Collect and persist OS journal logs to SLS in scenarios such as:
- Node stability monitoring (kubelet, OS kernel)
- Workloads sensitive to OS or container engine changes, such as containers running in privileged mode, nodes with frequent resource overcommit, or workloads using OS resources directly
GPU and AI workload monitoring
For AI training and machine learning tasks, ACK provides GPU health monitoring and resource monitoring at the pod level.
- GPU failure inspection: Update ack-node-problem-detector to V1.2.20 or later. See GPU faults detected by ack-node-problem-detector and Common GPU faults and solutions.
- GPU resource monitoring: Cluster GPU monitoring provides pod-level GPU consumption via the ack-gpu-exporter component, exposing NVIDIA DCGM-compatible metrics for both shared GPU and exclusive GPU scenarios. See Introduction to metrics and Best practices for monitoring GPU resources.
- GPU alerts: Enable alert management and subscribe to node GPU anomaly alerts.
Cluster data plane system components
Container network
CoreDNS
CoreDNS is the cluster's DNS service discovery component. Monitor:
- CoreDNS resource usage in the data plane
- The Responses (by rcode) metric — abnormal resolution response codes including NXDOMAIN, SERVFAIL, and FormErr
Recommended setup:
- Managed Service for Prometheus users: Use the built-in CoreDNS monitoring dashboards.
- Self-managed Prometheus users: Configure metrics collection using community CoreDNS monitoring methods.
- Enable alert management and subscribe to:
  - Alert Rule Set for Network Exceptions — CoreDNS configuration reload failures and status anomalies
  - Alert Rule Set for Pod Exceptions — CoreDNS Pod status and resource issues
- Analyze CoreDNS logs to diagnose slow resolution and high-risk domain requests.
Ingress
When using Ingress for external traffic routing, monitor traffic volumes and call details, and alert on abnormal routing states.
- Metrics monitoring: Use Managed Service for Prometheus with the ACK Ingress controller for preconfigured traffic dashboards. Monitor and analyze Ingress logs through SLS.
- Tracing: Enable ACK's Ingress tracing to report NGINX Ingress controller telemetry to Managed Service for OpenTelemetry for real-time aggregation, topology mapping, and persistent trace storage. See Enable Xtrace through Albconfig for trace tracking for ALB Ingress trace data.
- Alerts: Subscribe to the Alert Rule Set for Network Exceptions through alert management.
Basic container network traffic monitoring
ACK clusters expose community-standard container metrics through the node kubelet, covering pod inbound and outbound traffic, abnormal traffic detection, and packet-level monitoring.
Pods configured with HostNetwork mode inherit the host's process network behavior. In this case, basic container monitoring metrics do not accurately reflect pod-level network traffic.
Monitoring options:
- Managed Service for Prometheus: View pod-level network metrics directly in pod monitoring dashboards.
- Self-managed Prometheus: Scrape kubelet metrics using community methods.
- ECS-level network monitoring: Monitor host ECS network in the ECS console.
3. User applications
Container pod monitoring
Pods are the fundamental unit of application deployment in ACK. Their status and resource consumption directly affect application performance.
- Prometheus-based pod metrics: Use Managed Service for Prometheus or self-managed Prometheus to collect community-standard container metrics from the node kubelet. Combine with kube-state-metrics (included in Managed Service for Prometheus or the ACK-provided prometheus-operator Helm chart) for comprehensive Pod metrics including CPU, memory, storage, and network. ACK clusters integrated with Managed Service for Prometheus include out-of-the-box pod monitoring dashboards.
- Event monitoring for pod anomalies: Pod status changes trigger events. Enable event monitoring to track abnormal states such as OOM kills and pods not becoming ready. View real-time data on the Event Center page and historical data in SLS (retained for 90 days). Analyze pod lifecycle timelines through the pod event monitoring dashboard.
- Alert subscriptions: After enabling alert management and event monitoring, subscribe to the following rule sets (see Best practices for configuring alert rules in Prometheus):
  - Alert Rule Set for Workload Exceptions
  - Alert Rule Set for Pod Exceptions
- Custom Prometheus alert rules: Create custom rules for application-specific thresholds. See Create an alert rule for a Prometheus instance and use sample PromQL from Pod anomalies as a starting point.
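As a concrete example of the arithmetic behind such rules, a typical restart alert compares the increase of a restart counter (such as kube-state-metrics' `kube_pod_container_status_restarts_total`) over a time window against a threshold — which is what PromQL's `increase()` computes. A simplified Python sketch of that evaluation (the window samples and threshold are made up):

```python
def restarts_increase(samples: list[tuple[float, float]]) -> float:
    """Increase of a monotonic counter over a window of (timestamp, value)
    samples — roughly what PromQL's increase() computes, ignoring counter
    resets and range extrapolation for simplicity."""
    if len(samples) < 2:
        return 0.0
    return samples[-1][1] - samples[0][1]

def should_alert(samples: list[tuple[float, float]], threshold: float = 3) -> bool:
    # Fire when the container restarted more than `threshold` times in the window
    return restarts_increase(samples) > threshold

# A 10-minute window of scrapes: the restart counter climbed from 2 to 7
window = [(0, 2), (300, 4), (600, 7)]
print(should_alert(window))  # → True
```

In a real rule, the same comparison is expressed directly in PromQL and evaluated by the Prometheus instance rather than in application code.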
Container application log monitoring
ACK provides non-intrusive log collection for application pods. Collect application logs in ACK clusters and use SLS log analysis features for anomaly diagnosis and operational status assessment.
If business applications do not implement log file splitting, use the pod's stdout logs.
Fine-grained memory monitoring
In Kubernetes, real-time container memory usage is measured by Working Set Size (WSS) — the metric Kubernetes uses for scheduling and resource allocation. WSS includes:
- Active memory tracked by the OS kernel (excluding inactive file-backed page cache)
- OS-layer memory components
Abnormal WSS growth can trigger PodOOMKilled events or node-level memory pressure and pod evictions. A common pattern in Java applications using Log4J or Logback with default new I/O (NIO) and memory-mapped file (mmap) configurations:
- Anonymous memory spikes from frequent read/write operations under high log volume
- Memory allocation black holes causing invisible WSS growth
ACK provides kernel-level container monitoring based on SysOM to surface OS-layer memory details. See Observe and resolve container memory issues through SysOM.
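For reference, the kubelet derives WorkingSet from cgroup memory statistics as total usage minus inactive file-backed page cache, which is why cache-heavy workloads can report a WSS well above their anonymous memory. A simplified sketch of that calculation (field names follow the cgroup v1 `memory.stat` convention; the sample values are illustrative):

```python
def working_set_bytes(usage_in_bytes: int, total_inactive_file: int) -> int:
    """Approximate the kubelet/cAdvisor WorkingSet calculation:
    cgroup memory usage minus inactive file-backed page cache,
    floored at zero."""
    return max(0, usage_in_bytes - total_inactive_file)

MIB = 1024 * 1024
usage = 1024 * MIB          # total cgroup memory usage: 1 GiB
inactive_file = 300 * MIB   # inactive page cache: 300 MiB

print(working_set_bytes(usage, inactive_file) // MIB)  # → 724
```

Active page cache (for example, hot mmap'd log buffers) stays inside WorkingSet, which is exactly the Log4J/Logback pattern described above.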
Integrate custom application metrics
If your team writes application code, use the Prometheus client to expose business-specific metrics through instrumentation. Collect and visualize these metrics in Prometheus to build unified dashboards for infrastructure and application teams, accelerating incident response and reducing mean time to recovery (MTTR).
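In practice you would use an official Prometheus client library (for example `prometheus_client` in Python), but the text exposition format those libraries produce is simple enough to show directly. A stdlib-only sketch of a `/metrics` endpoint — the metric name `myapp_orders_total` and the port are made up for illustration:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

ORDERS_TOTAL = 0  # business counter; real code would increment this per order


def render_metrics() -> str:
    """Emit one counter in the Prometheus text exposition format:
    # HELP / # TYPE comment lines followed by a sample line."""
    return (
        "# HELP myapp_orders_total Orders processed by this service.\n"
        "# TYPE myapp_orders_total counter\n"
        f"myapp_orders_total {ORDERS_TOTAL}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To serve: HTTPServer(("", 9100), MetricsHandler).serve_forever()
# Prometheus would then scrape http://<pod-ip>:9100/metrics
```

Once the pod exposes such an endpoint, configure Managed Service for Prometheus to scrape it (for example, through a service discovery rule), and the counter becomes available for dashboards and alert rules alongside infrastructure metrics.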
APM and distributed tracing
Application Real-Time Monitoring Service (ARMS) provides application performance monitoring (APM) for multiple runtimes. Choose the integration method based on your application language.
| Language | Instrumentation type | Capabilities | Setup |
|---|---|---|---|
| Java | Non-intrusive (zero code changes) | Application topology, 3D dependency maps, interface/JVM monitoring, exception and slow transaction capture | Java application monitoring |
| Python | Intrusive (code instrumented) | Django/Flask/FastAPI support; LlamaIndex/Langchain AI/LLM tracking; topology, traces, API diagnostics | Install ack-onepilot and adjust Dockerfile. See Python application monitoring. |
| Go | Binary instrumentation | Application topology, database query analysis, API call monitoring | Install ack-onepilot and compile with instgo. See Go application monitoring. |
| OpenTelemetry | Intrusive | End-to-end distributed tracing, request analysis, topology, and dependency analysis | Managed Service for OpenTelemetry — see the Integration guide for language-specific setup. |
Frontend monitoring
For web applications, mobile apps, and miniapps that serve external users, enable the Real User Monitoring (RUM) feature in ARMS. RUM provides:
- Full reconstruction of user interaction flows
- Performance metrics: page load speed and API request tracing
- Failure analysis: JavaScript errors and network failures
- Stability monitoring: JavaScript loading errors, crashes, and Application Not Responding (ANR) errors
- Log correlation to accelerate root cause diagnosis
See Integrate applications to get started.
Service Mesh observability
Alibaba Cloud Service Mesh (ASM) is a fully managed service mesh platform compatible with open-source Istio. It handles traffic routing and splitting, secures inter-service communication, and provides mesh observability — reducing development and operations overhead.
ASM supports full lifecycle observability across three operational phases:
| Phase | Focus |
|---|---|
| Day 0 (planning) | Validate traffic configuration states during system release |
| Day 1 (deployment) | Monitor real-time traffic distribution across microservices |
| Day 2 (maintenance) | Enforce stability based on Service Level Objective (SLO) metrics |
ASM provides unified Service Mesh observability capabilities through a converged telemetry pipeline:
- Control plane monitoring:
  - Diagnostics: Diagnose ASM instances to detect anomalies that could affect service mesh function.
  - Alerting: Configure real-time anomaly log alerts for prompt response.
- Data plane observability:
  - Access logs: Collect data plane access requests through SLS for centralized log analysis and dashboard visualization.
  - Prometheus metrics: Collect data plane metrics to Managed Service for Prometheus for gateway status, global mesh errors, service-level errors, and workload monitoring. See Integrate Managed Service for Prometheus to monitor ASM instances.
- Microservice network topology: Use data plane Prometheus metrics as the source to visualize traffic and latency between microservices through Mesh Topology.
- SLO management: Define SLOs to quantify microservice performance, track error rates and latency patterns, and continuously assess service health. See SLO management.
- Distributed tracing: Instrument application code with OpenTelemetry and enable distributed tracing in ASM for end-to-end request analysis and dependency mapping.
Multi-cloud and hybrid cloud observability
Distributed Cloud Container Platform for Kubernetes (ACK One) is Alibaba Cloud's enterprise platform for hybrid cloud, multi-cluster management, distributed computing, and disaster recovery. It provides:
- Cross-infrastructure cluster management across any region or infrastructure
- Unified governance for compute, network, storage, security, monitoring, logging, jobs, applications, and traffic
- Community-aligned APIs for seamless integration
Observability for ACK One registered clusters
ACK One registered clusters provide the same observability capabilities as standard ACK clusters, including integrations for SLS, Event Center, alerts, ARMS, and Managed Service for Prometheus.
Registered clusters require additional network configuration and authorization due to heterogeneous network environments and permission systems.
ACK One Fleet global monitoring
ACK One Fleet aggregates Prometheus metrics from multiple clusters into a unified monitoring dashboard through global aggregation instances, eliminating the need for manual metric comparison across clusters. Enable global monitoring after creating an ACK One Fleet instance and associating two or more clusters with it.
Unified alert management
Manage alert rules at the Fleet level to enforce consistency across all associated clusters:
- Centralized rule management: Create or update alert rules at the Fleet level and automatically sync them to associated clusters.
- Differentiated alerting: Configure cluster-specific alert rules when individual clusters require different thresholds.
GitOps observability
- Fleet monitoring: Track core component health (APIServer, etcd) and the operation and performance of fully managed Argo CD.
- Logs and alerts: Enable collection of GitOps control plane and audit logs, and configure Argo CD alerts.
Argo Workflows monitoring
Argo Workflows is a cloud-native workflow engine for batch data processing, machine learning pipelines, infrastructure automation, and CI/CD. When deploying Argo Workflows on ACK or using Kubernetes clusters for distributed Argo workflows, enable the following:
- Log persistence with SLS: Native Kubernetes garbage collection purges pod and workflow logs after resource cleanup. Integrate SLS with workflow clusters to collect and persist logs generated during workflow execution. View workflow logs through Argo CLI or Argo UI.
- Prometheus monitoring: Enable Managed Service for Prometheus for workflow running status and cluster health monitoring.
Knative application observability
ACK Knative is ACK's serverless framework built on community Knative. Knative provides request-driven auto scaling (including scale-to-zero) and version management with canary rollouts. ACK Knative adds capabilities such as reducing cold start latency by retaining instances and workload forecasting through Advanced Horizontal Pod Autoscaler (AHPA). See Knative observability for an overview.
- Log collection: ACK clusters integrate with SLS for non-intrusive log collection. Implement log collection on Knative through DaemonSet to automatically run a log agent on each node.
- Prometheus monitoring: After deploying Knative applications, view real-time Knative data in Grafana dashboards — including pod scaling trends, response latency, request concurrency, and CPU/memory usage.