Container Service for Kubernetes:Best practices for container observability

Last Updated: Aug 05, 2025

Container Service for Kubernetes (ACK) integrates with Alibaba Cloud observability services by default and provides various observability features to help you quickly build an operations system for container scenarios. Use the basic features recommended in this topic to quickly set up a container observability system, or explore each section in depth to build a complete monitoring system that covers the ACK cluster infrastructure data plane, control plane, and application systems, thereby improving cluster stability.

Introduction to observability

Observability ensures system stability during environmental changes by enabling real-time failure detection, prevention, manual or automated recovery, and scalable remediation. It provides visibility into cluster health through metrics, logs, and traces, accelerating troubleshooting and performance optimization.

In container ecosystems, observability operates across four layers: infrastructure, operating system, container/cluster, and application.


From an observability perspective, maintaining cluster stability can be divided into the following three aspects:

  • Cluster infrastructure control plane

    • The health status and capacity of cluster components such as kube-apiserver, etcd, kube-scheduler, kube-controller-manager, and cloud-controller-manager.

    • Key resources related to services provided by control plane components, such as the Server Load Balancer (SLB) bandwidth and the connection count of kube-apiserver.

  • Cluster infrastructure data plane

    • The health status of cluster nodes, such as abnormal node states, resource anomalies, node GPU card failures, and high node memory usage.

    • The stability of cluster user-side components, such as components or functions of container storage and container network.

  • User business systems deployed in the cluster (applications)

    • Application health status, such as the health status of user business pods and applications. Common exception states include pod termination or process exit due to insufficient memory (OOM Kills) and pods not being ready.

Recommended features to enable

This section introduces the out-of-the-box observability features provided by ACK to quickly build an operations system for container scenarios.

General stability scenarios

Service Mesh scenarios

For Service Mesh scenarios, enable the following features:

Multi-cloud hybrid cloud ACK One scenarios

In scenarios that use ACK One registered clusters, or a multi-cloud or hybrid cloud environment built with an ACK One multi-cluster Fleet, enable the following features:


The following sections introduce recommended observability best practice solutions for cluster infrastructure control plane, cluster infrastructure data plane, and user business systems (applications).

1. Cluster infrastructure control plane


The Kubernetes cluster control plane manages core functions such as:

  • API layer operations

  • Workload scheduling

  • Kubernetes resource orchestration

  • Cloud resource provisioning

  • Metadata storage

The key components of the control plane include kube-apiserver, kube-scheduler, kube-controller-manager, cloud-controller-manager, and etcd.

The control plane of ACK managed clusters is fully managed by Alibaba Cloud with Service Level Agreement (SLA) guarantees. To improve control plane observability and facilitate cluster management, configuration, and optimization, use the following features, which monitor control plane workload health in real time and provide preconfigured alerts by default. This helps prevent anomalies and ensures continuous, stable operations.

Enable control plane component monitoring

ACK extends and enhances the Kubernetes RESTful API interfaces provided by the kube-apiserver component. This allows external clients and other components within the cluster (such as Prometheus metric collection) to interact with ACK clusters and obtain Prometheus metrics for the control plane components.

Monitor control plane component logs

ACK clusters support centralized logging to your account's SLS projects. For more information, see Collect control plane component logs of ACK managed clusters.

Configure alert management

ACK provides default alert rules for core container anomalies. You can also configure additional alert rules as needed.

2. Cluster infrastructure data plane

Cluster nodes

ACK clusters use worker nodes to provide resource environments for workload deployment. To ensure containerized environment stability, you must monitor the abnormal states and resource load of each node. While Kubernetes provides scheduling, preemption, and eviction mechanisms to tolerate transient node anomalies for overall system reliability, comprehensive stability requires proactive measures to prevent, observe, respond to, and recover from node anomalies.

Event monitoring with ack-node-problem-detector

The ack-node-problem-detector component is provided by ACK for event monitoring with:

  • Full compatibility with the upstream Kubernetes Node Problem Detector

  • Optimized checks for ACK-specific environments:

    • The node-problem-detector DaemonSet in the component has been adapted and enhanced for the cluster node environment, operating system compatibility, and container engine.

    • Enhanced node inspection functionality is provided as plug-ins with 1-minute check intervals, meeting most daily node O&M scenarios.

    • 90-day persistence of event monitoring data through its integrated kube-eventer Deployment, which streams all Kubernetes events to the SLS event center. By default, Kubernetes events are stored in etcd and retained for only one hour, so this persistence mechanism enables historical event queries well beyond etcd's limitation.

When ack-node-problem-detector detects an abnormal node state, use the kubectl describe node ${NodeName} command to view the abnormal Condition of the node, or view the abnormal node state in the node list on the Nodes page.

Use the alert management feature with ACK event monitoring to subscribe to events such as startup failures of critical pods and unavailable Service endpoints. SLS monitoring service also supports sending event alerts.

Enable ECS process-level monitoring for cluster nodes

Each node in an ACK cluster corresponds to an ECS instance. To gain deep infrastructure visibility, enable the process monitoring feature provided by CloudMonitor for ECS instances.

CloudMonitor (ECS monitoring) provides:

  • Historical process analysis: Track the top five resource-intensive processes over time, ranked by memory consumption, CPU usage, and open file descriptors.

  • OS monitoring on ECS hosts: Monitor basic resource metrics at the host level, such as host CPU, memory, network, disk usage, inode count, host network traffic, and simultaneous network connection count.

    Both host and node metrics measure resource usage, but they differ in scope and calculation method. Host metrics reflect the resource usage of the entire host, while node metrics indicate resource consumption and allocation at the container layer. For more information, see Resource reservation policy.

    | Dimension | Host metrics | Node metrics |
    | --- | --- | --- |
    | Scope | Physical or virtual machine resources | Container engine resources |
    | Memory calculation: numerator | Total memory used by all processes on the host (Usage) | Total working memory on the node (WorkingSet), including allocated memory, used memory, and page cache |
    | Memory calculation: denominator | Host memory capacity (Capacity) | Total allocatable memory on the node (Allocatable), excluding resources reserved for the node container engine |
    | Usage formula | Usage / Capacity | WorkingSet / Allocatable |
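The difference between the two formulas can be sketched with hypothetical numbers (all values below are illustrative, not taken from a real cluster):

```python
GiB = 1024 ** 3

# Hypothetical host: 16 GiB capacity, 6 GiB used by all processes.
host_usage = 6 * GiB          # Usage
host_capacity = 16 * GiB      # Capacity

# Same machine as a node: ~2 GiB is reserved for the container
# engine, so only 14 GiB is Allocatable.
node_working_set = 6 * GiB    # WorkingSet
node_allocatable = 14 * GiB   # Allocatable

host_pct = host_usage / host_capacity * 100          # Usage / Capacity
node_pct = node_working_set / node_allocatable * 100  # WorkingSet / Allocatable

print(f"host memory usage: {host_pct:.1f}%")  # 37.5%
print(f"node memory usage: {node_pct:.1f}%")  # 42.9%
```

The same amount of used memory therefore yields a higher percentage at the node level, which explains why node dashboards can alert before host dashboards do.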

Important

Process monitoring applies only to nodes that are added to node pools after CloudMonitor is enabled.

Alert management and system monitoring configuration

Alert management supports cluster event monitoring. You can configure alerts for node anomaly events and resource usage thresholds. We recommend that you enable and subscribe to multiple node-related alert rule sets, such as Alert Rule Set for Node Exceptions and Alert Rule Set for Resource Exceptions.

Node OS journal log monitoring and persistence

systemd serves as the init system and service manager in Linux systems, responsible for all services after system startup. The journal component within systemd provides:

  • System log collection and storage

  • Real-time log query and analysis

Using systemd journal to collect and monitor OS logs is preferred in scenarios such as:

  • Collecting node stability signals, such as kubelet and OS kernel logs

  • Workloads that are sensitive to the OS or container engine and require enhanced monitoring, such as containers running in privileged mode, node resources with frequent overcommitment, and other scenarios that use OS resources directly

To collect the log data and store it in an SLS project, see Collect systemd journal logs of the node.

GPU and AI training scenario monitoring

For AI training and machine learning tasks deployed in an ACK cluster, ACK provides functions such as node GPU health monitoring and GPU resource monitoring. Recommended solutions:

Cluster data plane system components

Container storage

Container storage types in ACK clusters include local storage on nodes, Secret/ConfigMap, and external storage such as Network Attached Storage (NAS) volumes, Cloud Parallel File Storage (CPFS) volumes, and Object Storage Service (OSS) volumes.


  • Local storage on nodes: Includes system disks and data disks mounted to nodes. Pods can use these storage resources by declaring HostPath or EmptyDir volumes.

    • HostPath volumes: not monitored by Kubernetes and require manual storage monitoring.

    • EmptyDir volumes: part of pod ephemeral storage and can be managed by declaring Ephemeral Storage Resource Request and Limit in the pod.

    Ephemeral storage monitoring scope includes:

    • Non-tmpfs EmptyDir volumes mounted by the pod (tmpfs EmptyDir volumes are stored in memory)

    • Pod log files stored on nodes

    • Writable layers of all containers in the pod

  • Secret/ConfigMap: These types are typically used to store cluster resource metadata. They do not have any strict storage monitoring requirements.

  • External storage: Pods can use these storage resources through persistent volumes (PVs) and persistent volume claims (PVCs). ACK supports CSI-provisioned storage types:

    • Additionally mounted disk volumes (not pre-mounted data volumes in node pools)

    • NAS volumes and CPFS volumes

    • OSS volumes
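The ephemeral storage requests and limits mentioned above can be declared in a pod spec like the following sketch (the pod name, image, and sizes are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-demo                       # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      requests:
        ephemeral-storage: "2Gi"   # scheduler places the pod on a node with capacity
      limits:
        ephemeral-storage: "4Gi"   # exceeding this triggers eviction by the kubelet
```

If the pod's ephemeral usage (non-tmpfs EmptyDir volumes, log files, and writable container layers) exceeds the limit, the kubelet evicts the pod.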

ACK uses the csi-plugin component to uniformly expose monitoring metrics for the above container storage types. These metrics are then collected by Managed Service for Prometheus to provide out-of-the-box monitoring dashboards. For a comprehensive overview of supported and unsupported storage monitoring types and their corresponding monitoring approaches, see Overview of container storage monitoring.

Container network

CoreDNS

CoreDNS is a key component of the cluster's DNS service discovery mechanism. To ensure component stability, monitor:

  • The resource usage of CoreDNS in the cluster data plane.

  • Key metrics such as abnormal resolution response codes (rcode). This metric is the Responses (by rcode) metric in CoreDNS dashboards.

  • DNS resolution anomalies such as NXDOMAIN, SERVFAIL, and FormErr.
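For example, the rcode breakdown behind the Responses (by rcode) panel can be queried in PromQL (metric names follow the upstream CoreDNS prometheus plugin and may differ across CoreDNS versions):

```promql
# Response rate by response code over the last 5 minutes
sum(rate(coredns_dns_responses_total[5m])) by (rcode)

# Share of SERVFAIL responses among all responses
sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
  / sum(rate(coredns_dns_responses_total[5m]))
```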

Recommended practices:

  • CoreDNS monitoring

    • For Managed Service for Prometheus users: Use the built-in CoreDNS monitoring dashboards in ACK clusters.

    • For self-managed Prometheus users: Configure metrics collection using community CoreDNS monitoring methods.

  • Enable container service alert management and subscribe to alert rule sets such as:

    • Alert Rule Set for Network Exceptions, including CoreDNS configuration reloading failures and other CoreDNS status anomaly alerts.

    • Alert Rule Set for Pod Exceptions covering CoreDNS pod status and resource issues.

  • Analyze CoreDNS logs to resolve issues such as slow CoreDNS resolutions and high-risk domain requests.

Ingress

When using Ingress for customer-facing traffic routing, monitor traffic volumes and call details through Ingress, and set up alerts for abnormal Ingress routing states.

Recommended practices:

  • Monitoring

  • Tracing

    Enable ACK's Ingress tracing feature to report NGINX Ingress controller telemetry to Managed Service for OpenTelemetry. This enables real-time aggregation, topology mapping, and persistent storage of trace data for troubleshooting. For example, you can enable Xtrace through Albconfig for trace tracking to observe ALB Ingress trace data.

  • Alert management

    Enable the container service alert management feature and subscribe to alert rule sets such as Alert Rule Set for Network Exceptions, receiving alerts for abnormal Ingress routing states.

Basic container network traffic monitoring

ACK clusters expose community-standard container metrics through the node kubelet, covering essential network traffic monitoring requirements, including pod inbound and outbound traffic, abnormal network traffic detection, and packet-level network monitoring.

However, pods configured with HostNetwork mode share the host's network namespace, so basic container monitoring metrics cannot accurately reflect pod-level network traffic for them. Recommended practices:

  • Prometheus-based monitoring on kubelet metrics

    • Option 1: Managed Service for Prometheus

      ACK provides out-of-the-box basic network traffic monitoring capabilities. You can view them directly in pod monitoring dashboards.

    • Option 2: Self-managed Prometheus

  • ECS-level network monitoring

    Monitor host ECS network in the ECS console.
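For reference, pod-level traffic can be queried from the kubelet/cAdvisor metrics with PromQL such as the following (keep in mind that these numbers are not meaningful for HostNetwork pods):

```promql
# Bytes received per pod over the last 5 minutes
sum(rate(container_network_receive_bytes_total[5m])) by (namespace, pod)

# Bytes transmitted per pod over the last 5 minutes
sum(rate(container_network_transmit_bytes_total[5m])) by (namespace, pod)
```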

3. User applications

ACK provides monitoring capabilities for container pods and application logs to ensure the stability of applications deployed on clusters.

Container pod monitoring

Applications deployed on the cluster run as pods. The status and resource load of pods directly affect the performance of applications running on them. Recommended practices:

  • Prometheus-based pod status and resource monitoring

    • Collect community-standard container metrics exposed by ACK clusters through node kubelet using Managed Service for Prometheus or self-managed Prometheus

    • Combine with state data of Kubernetes objects exposed by the kube-state-metrics component (included in the Managed Service for Prometheus or the ACK-provided community prometheus-operator Helm Chart) to monitor comprehensive pod metrics, including CPU, memory, storage, and basic container network traffic.

    • ACK clusters integrated with Managed Service for Prometheus provide out-of-the-box pod monitoring dashboards.

  • Event monitoring for pod abnormalities

    • Pod status changes trigger events. When pods enter abnormal states, enable event monitoring to track anomalies.

    • View real-time monitoring on the Event Center page of the console and persist events to the SLS event center for historical analysis (retained for 90 days).

    • Analyze pod lifecycle timelines through the pod event monitoring dashboard to identify abnormal states.

  • Subscribe to alerts for workloads and pods

    After enabling container service alert management and event monitoring, subscribe to critical alert rule sets related to workloads and container pods, including Alert Rule Set for Workload Exceptions and Alert Rule Set for Pod Exceptions. For details, see Best practices for configuring alert rules in Prometheus.

  • Create custom Prometheus alert rules and subscribe to resource alerts

    Address diverse application requirements, such as varying resource thresholds and critical resource priorities, by creating custom Prometheus alert rules.
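A custom rule might look like the following sketch, using the prometheus-operator PrometheusRule CRD (the rule name, namespace, threshold, and labels are illustrative placeholders):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-pod-memory-alert   # placeholder name
  namespace: monitoring           # placeholder namespace
spec:
  groups:
  - name: pod-resource.rules
    rules:
    - alert: PodMemoryNearLimit
      expr: |
        max by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
          / max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
          > 0.9
      for: 5m
      labels:
        severity: warning   # illustrative severity
      annotations:
        summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} exceeds 90% of its memory limit."
```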

Container application log monitoring

Applications in clusters generate operational logs that record critical processes. Monitoring these logs aids anomaly diagnosis and operational status assessment.

ACK provides Kubernetes logging capabilities for troubleshooting. ACK clusters provide non-intrusive log management. You can collect application logs in ACK clusters and use various log analysis functions provided by SLS.

Fine-grained memory monitoring for application processes

In Kubernetes, real-time container memory usage (pod memory) is measured by the Working Set Size (WSS), a critical metric for the Kubernetes scheduling and eviction mechanisms when managing pod memory resources. WSS includes:

  • Anonymous memory, such as process heap and stack pages

  • Active file-backed page cache and kernel memory components (inactive file-backed page cache is excluded)

If your application process uses specific memory components, such as excessive page cache generated when writing to the file system, you may encounter internal memory "black hole" issues that require close monitoring in production systems.
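The calculation can be sketched as follows, mirroring how cAdvisor derives the WorkingSet (total memory usage minus inactive file-backed page cache; the numbers are illustrative):

```python
def working_set_bytes(memory_usage_bytes: int, inactive_file_bytes: int) -> int:
    """WorkingSet = total memory usage minus inactive file-backed page cache.

    Active page cache still counts toward the WorkingSet, which is why
    heavy file I/O (for example, log writing) can inflate pod memory
    even though much of that cache is reclaimable."""
    return max(0, memory_usage_bytes - inactive_file_bytes)

MiB = 1024 ** 2
# A pod using 500 MiB in total, of which 200 MiB is inactive page cache:
print(working_set_bytes(500 * MiB, 200 * MiB) // MiB)  # 300
```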


Abnormal memory usage patterns in containerized environments may lead to WorkingSet inflation, potentially triggering PodOOMKilling incidents or node-level memory pressure and resulting in pod evictions. For example, Java applications using log frameworks such as Log4J and Logback with default new I/O (NIO) and memory-mapped file (mmap) configurations for log handling often exhibit:

  • Anonymous memory spikes from frequent read/write operations when processing large volumes of logs

  • Memory allocation black holes under high log throughput scenarios

To address troubleshooting challenges caused by the opaque container engine layer, ACK provides kernel-level container monitoring based on SysOM to observe and resolve container memory issues.

Integrate application metrics with Prometheus monitoring and create custom dashboards

If you possess application development capabilities, we recommend using Prometheus client to expose business-specific metrics through instrumentation. These metrics can then be collected and visualized via Prometheus monitoring systems to create unified dashboards. Customized dashboards can be created for different teams (such as infrastructure and applications) to support daily operations monitoring and enable rapid incident response scenarios, effectively reducing mean time to recovery (MTTR).
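In practice you would use an official Prometheus client library (for example, the prometheus_client package for Python). The sketch below uses only the standard library to show the text exposition format that such libraries produce on a scrape endpoint; the metric name and labels are made up:

```python
def render_counter(name: str, help_text: str, value: float, labels: dict) -> str:
    """Render one counter in the Prometheus text exposition format."""
    label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name}{{{label_str}}} {value}\n"
    )

# A hypothetical business metric as it would appear to a Prometheus scrape:
print(render_counter(
    "orders_processed_total",
    "Total orders processed by the service.",
    42,
    {"status": "ok", "region": "cn-hangzhou"},
))
```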

Enable APM and tracing capabilities

Application performance monitoring (APM) is a common solution for monitoring performance of application processes. Alibaba Cloud ARMS provides multiple APM product capabilities. Choose the appropriate monitoring method based on your development language.

  • APM monitoring for Java applications (non-intrusive)

    • Capabilities: Automatically discovers application topologies, generates 3D dependency maps, monitors interfaces/JVM resources, and captures exceptions/slow transactions.

    • Implementation: Refer to Java application monitoring for Java application integration with zero-code modification.

  • APM monitoring for Python applications (intrusive/code-instrumented)

    • Scope: web applications using Django/Flask/FastAPI frameworks, and AI/LLM applications developed based on LlamaIndex/Langchain

    • Capabilities: application performance monitoring, including topology mapping, trace analysis, API diagnostics, anomaly detection, and LLM interaction tracking.

    • Implementation: Install the ack-onepilot component and adjust the Dockerfile to implement Python application performance monitoring.

  • APM monitoring for Go applications (intrusive/binary instrumentation)

    • Capabilities: Monitoring data such as application topology, database query analysis, and API calls is provided in ARMS.

    • Implementation: Install the ack-onepilot component and compile the application's Go binary file using instgo during container image build. For more information, see Go application monitoring.

  • OpenTelemetry-based instrumentation (intrusive)

    • Capabilities: Provides end-to-end distributed tracing, request analysis, trace topology, and dependency analysis for identifying bottlenecks and improving diagnostics in distributed microservice architectures.

    • Implementation: Managed Service for OpenTelemetry offers multiple data integration methods. Refer to the Integration guide to choose the appropriate integration method for your application based on development language and solution stability/maturity.

Frontend Web behavior monitoring

When applications in your cluster expose web interfaces to external users, frontend page stability and continuity require robust assurance. The Real User Monitoring (RUM) feature provided by ARMS specializes in performance observability across web pages, mobile apps, and mini programs, focusing on user experience by:

  • Full reconstruction of user interaction flows

  • Performance metrics, including page load speed and API request tracing

  • Failure analysis, including JavaScript errors and network failures

  • Stability monitoring, including JavaScript loading errors and crashes/Application Not Responding (ANR) errors

  • Log correlation to accelerate root cause diagnosis

For details, see Integrate applications.

Service behavior observability via Service Mesh

Alibaba Cloud Service Mesh (ASM) delivers a fully managed service mesh platform compatible with the open-source Istio service mesh. It simplifies service governance by managing traffic routing and splitting in service calls, securing inter-service communication with authentication, and offering mesh observability capabilities, significantly reducing the development and O&M workload.

Multi-phase stability assurance

ASM's enhanced observability framework supports full lifecycle stability monitoring, detailed as follows:

  1. Day 0 (planning phase)

    Validate traffic configuration states during system release

  2. Day 1 (deployment and configuration phase)

    Monitor real-time traffic distribution across microservices

  3. Day 2 (maintenance and optimization phase)

    Enforce stability based on SLO metrics

Observability implementation path

ASM provides unified standardized Service Mesh observability capabilities, offering a converged telemetry pipeline configuration model to enhance observability support for cloud-native applications.

  • Control plane traffic routing monitoring

  • Data plane observability

  • Microservice network topology

    With data plane Prometheus monitoring metrics as the data source, visualize and evaluate whether traffic and latency between microservices meet expectations through Mesh Topology.

  • SLO measurement for error rates and latency patterns

    SLO defines quantified monitoring metrics to describe microservice performance, assess application reliability, and continuously track service health. It serves as a measurable benchmark for service quality assessment and continuous optimization.

  • Distributed tracing

    • Capabilities: Provides end-to-end distributed tracing, request analysis, trace topology, and application dependency analysis.

    • Implementation: Instrument your application code with OpenTelemetry and enable distributed tracing in ASM.
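The SLO measurement described above can be made concrete with an error-budget calculation (a generic sketch with illustrative numbers, not ASM-specific):

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int):
    """Return (remaining allowed failures, fraction of budget consumed)."""
    allowed_failures = total_requests * (1 - slo_target)
    remaining = allowed_failures - failed_requests
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return remaining, consumed

# A 99.9% availability SLO over 1,000,000 requests with 250 failures
# allows 1,000 failures, so a quarter of the budget is consumed:
remaining, consumed = error_budget(0.999, 1_000_000, 250)
print(f"remaining budget: {remaining:.0f} failures, consumed: {consumed:.0%}")
```

Tracking budget consumption over time is what turns an SLO from a static target into an actionable stability signal.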

Unified observation for multi-cloud and hybrid cloud environments

Distributed Cloud Container Platform for Kubernetes (ACK One) is an enterprise-level cloud-native platform developed by Alibaba Cloud for scenarios such as hybrid cloud, multi-cluster management, distributed computing, and disaster recovery. It enables the following:

  • Cross-infrastructure management: Connect and manage Kubernetes clusters across any region or infrastructure.

  • Consistent governance: Unified operations for compute, network, storage, security, monitoring, logging, jobs, applications, and traffic.

  • API compatibility: Community-aligned APIs for seamless integration.

Core scenarios include:

  • Hybrid cloud: Use ACK One registered clusters to register self-managed Kubernetes clusters in on-premises data centers to the cloud for hybrid cloud, enabling elastic scaling of cloud-based computing resources.

  • Multi-cloud: Use ACK One's Fleet instances to manage multiple ACK clusters or registered clusters, implementing multi-cloud zone-disaster recovery, unified application configuration distribution, and multi-cluster offline scheduling.

The following observability capabilities are recommended for common multi-cloud and hybrid cloud scenarios:

  • Observability integration for ACK One registered clusters

    ACK One registered clusters provide observability consistent with ACK clusters, including integrations with SLS, the event center, alerts, ARMS, and Managed Service for Prometheus.

    Note

    This feature requires additional network configuration and authorization due to heterogeneous network environments and permission systems.

  • ACK One Fleet global monitoring

    As enterprises scale, they require multiple Kubernetes clusters to meet requirements such as isolation, high availability, and disaster recovery. Traditional monitoring systems often lack multi-cluster visibility, creating operational blind spots. ACK One Fleet provides global monitoring:

    • Unified metric aggregation

      Collects Prometheus metrics from multiple clusters into a unified monitoring dashboard through global aggregation instances.

    • Operational and diagnostic efficiency:

      • Performs cross-cluster correlation analysis

      • Eliminates manual metric comparison across clusters

      • Reduces context switching between monitoring systems

    You can enable global monitoring after an ACK One Fleet instance is created and associated with two clusters.

  • Unified alert management

  • GitOps observability

Monitor applications in Argo Workflows

Argo Workflows is a robust cloud-native workflow engine widely used in scenarios including batch data processing, machine learning pipelines, infrastructure automation, and continuous integration and continuous delivery (CI/CD). When deploying applications through Argo Workflows on ACK or using Kubernetes clusters for distributed Argo workflows, enabling the following observability capabilities is critical for stability and maintainability:

  • Log persistence with SLS

    • Challenges:

      • Native Kubernetes garbage collection purges pod/workflow logs after resource cleanup.

      • Control plane and workflow controller resource usage grows linearly when no proper reclaim policies are in place.

    • Solution:

      Integrate SLS with workflow clusters to collect logs generated by pods during workflow execution and report them to SLS projects. You can view workflow logs through Argo CLI or Argo UI.

  • Prometheus monitoring

    Enable Managed Service for Prometheus in workflow clusters for comprehensive observability capabilities, such as workflow running status and cluster health.

Observability for Knative-based applications

Knative is a Kubernetes-based serverless framework. It provides:

  • Request-driven auto scaling (including scale-to-zero)

  • Version management with canary rollouts

Building on full compatibility with community Knative and Kubernetes APIs, ACK Knative offers enhanced capabilities in multiple dimensions, such as reducing cold start latency by retaining instances and implementing workload forecasting based on Advanced Horizontal Pod Autoscaler (AHPA). For more information, see Knative observability.

Implementation

  • Log collection for Knative Services

    ACK clusters are compatible with SLS and support non-intrusive log collection. Implement log collection for Knative Services through a DaemonSet, which automatically runs a log agent on each node and improves operational efficiency.

  • Prometheus monitoring integration

    After deploying applications in Knative, integrate Knative Service monitoring data with Prometheus. You can then view real-time Knative data in Grafana dashboards, including pod scaling trends, response latency, request concurrency, and CPU/memory resource usage.