Container Service for Kubernetes (ACK) integrates with Alibaba Cloud observability services by default and provides various observability features to help you quickly build an operations system for container scenarios. Use the basic features recommended in this topic to quickly set up a container observability system. You can also explore each section in depth to build a complete monitoring system covering the ACK cluster infrastructure data plane, control plane, and application system, which will improve the cluster's stability.
Introduction to observability
Observability ensures system stability during environmental changes by enabling real-time failure detection, prevention, manual or automated recovery, and scalable remediation. It provides visibility into cluster health through metrics, logs, and traces, accelerating troubleshooting and performance optimization.
In container ecosystems, observability operates across four layers: infrastructure, operating system, container/cluster, and application.

From an observability perspective, maintaining cluster stability can be divided into the following three aspects:
Cluster infrastructure control plane
The health status and capacity of cluster components such as kube-apiserver, etcd, kube-scheduler, kube-controller-manager, and cloud-controller-manager.
Key resources related to services provided by control plane components, such as the Server Load Balancer (SLB) bandwidth and the connection count of kube-apiserver.
Cluster infrastructure data plane
The health status of cluster nodes, such as abnormal node states, resource anomalies, node GPU card failures, and high node memory usage.
The stability of cluster user-side components, such as components or functions of container storage and container network.
User business systems deployed in the cluster (applications)
Application health status, such as the health status of user business pods and applications. Common exception states include pod termination or process exit due to insufficient memory (OOM Kills) and pods not being ready.
Recommended features to enable
This section introduces the out-of-the-box observability features provided by ACK to quickly build an operations system for container scenarios.
General stability scenarios
To enable event monitoring, deploy ack-node-problem-detector in the ACK cluster. For more information about the supported cluster check items, see Events of GPU faults detected by ack-node-problem-detector and Node diagnosis plug-ins supported by ack-node-problem-detector.
To monitor the health of clusters and containers in real time, deploy Prometheus in the ACK cluster, including Managed Service for Prometheus (recommended) and open source Prometheus. After you deploy Managed Service for Prometheus, you can monitor the following resources:
Control plane component monitoring for ACK managed clusters
Basic container resource monitoring, including node, workload, and pod
After you enable event monitoring and Prometheus monitoring, we recommend that you further configure contacts and contact groups for the personnel responsible for clusters or applications, configure the corresponding alert rules, and subscribe the contact groups to the relevant alert notifications.
Enable the Elastic Compute Service (ECS) node CloudMonitor plug-in to obtain basic metric monitoring for the ECS host layer, such as process monitoring and network monitoring.
Connect pod logs to Alibaba Cloud Simple Log Service (SLS). If business applications have not implemented log file splitting, we recommend that you collect the pods' stdout logs (see the sketch after this list). For more information, see Collect container logs from ACK clusters.
For business applications that use Ingress for traffic routing in the cluster, enable the Ingress log details monitoring dashboard. For more information, see Analyze and monitor the access log of nginx-ingress-controller.
To monitor operating system layer resources, enable kernel-level container monitoring based on SysOM. For more information, see Kernel-level container monitoring based on SysOM.
Enable application performance management (APM) for important business applications deployed in the cluster. Choose the appropriate APM solution based on the development language and environment of the actual application. For more information, see Java Application Monitoring, Python Application Monitoring, Go Application Monitoring, What is Managed Service for OpenTelemetry, and Integration guide.
For web applications, mobile apps, and mini programs deployed in Container Service that require user experience and behavior monitoring, we recommend that you enable the Real User Monitoring (RUM) feature in Alibaba Cloud Application Real-Time Monitoring Service (ARMS). This solution supports end-to-end tracing from page loading performance to API request failure rates.
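As a concrete example of the stdout log collection recommended earlier in this list, collection can be declared directly on the workload through Logtail environment variables. The following is a minimal sketch that assumes the cluster's Logtail components are installed; the logstore name app-stdout, the labels, and the image are placeholders.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: demo-app
        image: registry.example.com/demo-app:latest   # placeholder image
        env:
        # Logtail environment-variable collection: the stdout/stderr of this
        # container is collected into a logstore named "app-stdout".
        - name: aliyun_logs_app-stdout
          value: stdout
```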
Service Mesh scenarios
For Service Mesh scenarios, enable the following features:
Monitor Service Mesh control plane logs to ensure correct traffic routing configuration. Use the ASM diagnostics feature and integrate alert handling suggestions for real-time monitoring.
Monitor Service Mesh data plane access logs to monitor all access requests.
Monitor Service Mesh data plane metrics using Managed Service for Prometheus. For more information, see Integrate Managed Service for Prometheus to monitor ASM instances.
Enable network topology monitoring to observe and evaluate whether the traffic and latency between microservices meet your expectations. Use Prometheus monitoring metrics from the Service Mesh data plane as a data source to view the visualization interface of related services and configurations. For more information, see Enable Mesh Topology to improve observability.
Define service level objectives (SLOs) to observe call errors and latency behavior. For more information, see SLO management.
Integrate the OpenTelemetry protocol in the application code and enable distributed tracing in ASM.
Multi-cloud hybrid cloud ACK One scenarios
In scenarios of ACK One registered clusters or a multi-cloud hybrid cloud built using the ACK One multi-cluster Fleet, enable the following features:
ACK One registered clusters provide the same observability features as ACK, including connecting SLS to registered clusters, Event Center to registered clusters, alert configuration to registered clusters, ARMS to registered clusters, and Managed Service for Prometheus to registered clusters. Refer to the specific documentation to set up the network environment and implement monitoring capabilities.
Enable global monitoring for ACK One multi-cluster Fleets to obtain unified monitoring capabilities. Enable unified alert management to ensure consistency of alert rules across associated clusters. For more information, see Multi-cluster alert management and Differentiated multi-cluster alerting configurations for individual child clusters.
When using ACK One GitOps, monitor the stability of Fleet cluster core components and the operation and performance of GitOps for fully managed Argo CD. For more information, see Fleet monitoring. We recommend that you enable GitOps control plane logs and audit logs and configure GitOps ArgoCD alerts. For more information, see Configure ACK One Argo CD alerts.
The following sections introduce recommended observability best practice solutions for cluster infrastructure control plane, cluster infrastructure data plane, and user business systems (applications).
1. Cluster infrastructure control plane

The Kubernetes cluster control plane manages core functions such as:
API layer operations
Workload scheduling
Kubernetes resource orchestration
Cloud resource provisioning
Metadata storage
The key components of the control plane include kube-apiserver, kube-scheduler, kube-controller-manager, cloud-controller-manager, and etcd.
The control plane of ACK managed clusters is fully managed by Alibaba Cloud with Service Level Agreement (SLA) guarantees. To improve control plane observability and facilitate cluster management, configuration, and optimization, use the following features to monitor and observe control plane workload health in real time with preconfigured alerts by default. Doing so helps prevent anomalies and ensure continuous stable operations.
Enable control plane component monitoring
ACK extends and enhances the Kubernetes RESTful APIs provided by the kube-apiserver component. This allows external clients and other components within the cluster (such as Prometheus metric collection) to interact with ACK clusters and obtain Prometheus metrics for the control plane components.
If you use Managed Service for Prometheus, you can use out-of-the-box control plane monitoring dashboards in ACK managed Pro clusters.
If you use a self-managed Prometheus instance, see Use a self-managed Prometheus instance to collect metrics of control plane components and configure alerts for data integration.
Monitor control plane component logs
ACK clusters support centralized logging to your account's SLS projects. For more information, see Collect control plane component logs of ACK managed clusters.
Configure alert management
ACK provides default alert rules for core container anomalies. You can also configure additional alert rules as needed.
For alert rule evaluation and anomaly notification of monitoring data from Managed Service for Prometheus, SLS, and CloudMonitor of associated cloud resources in ACK clusters, see Alert management.
If you use a self-managed Prometheus instance, see Best practices for configuring alert rules in Prometheus to configure alerts.
2. Cluster infrastructure data plane
Cluster nodes
ACK clusters use worker nodes to provide resource environments for workload deployment. To ensure containerized environment stability, you must monitor the abnormal states and resource load of each node. While Kubernetes provides scheduling, preemption, and eviction mechanisms to tolerate transient node anomalies for overall system reliability, comprehensive stability requires proactive measures to prevent, observe, respond to, and recover from node anomalies.
Event monitoring with ack-node-problem-detector
The ack-node-problem-detector component is provided by ACK for event monitoring with:
Full compatibility with the upstream Kubernetes Node Problem Detector
Optimized checks for ACK-specific environments:
The node-problem-detector DaemonSet in the component has been adapted and enhanced for the cluster node environment, operating system compatibility, and container engine.
Enhanced node inspection functionality is provided as plug-ins with 1-minute check intervals, meeting most daily node O&M scenarios.
90-day event monitoring data persistence through its integrated kube-eventer Deployment, which streams all Kubernetes events to SLS event center. By default, Kubernetes events are stored in etcd with only 1-hour query retention. This persistence mechanism enables systems to perform historical event queries beyond etcd's limitations.
When ack-node-problem-detector detects an abnormal node state, use the kubectl describe node ${NodeName} command to view the abnormal Condition of the node, or view the abnormal node state in the node list on the Nodes page.
Use the alert management feature with ACK event monitoring to subscribe to events such as startup failures of critical pods and unavailable Service endpoints. SLS monitoring service also supports sending event alerts.
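For reference, an abnormal state reported by ack-node-problem-detector surfaces as a node Condition. The following excerpt shows how it appears in the node's status; the condition type, reason, and message are illustrative examples, not a fixed list.
```yaml
# Excerpt of `kubectl get node <node-name> -o yaml` (illustrative values)
status:
  conditions:
  - type: Ready                      # built-in kubelet condition
    status: "True"
    reason: KubeletReady
  - type: NTPProblem                 # example condition reported by the node-problem-detector
    status: "True"
    reason: NTPIsDown
    message: "node ntp service is down"
    lastTransitionTime: "2025-01-01T00:00:00Z"
```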
Enable ECS process-level monitoring for cluster nodes
Each node in an ACK cluster corresponds to an ECS instance. To gain deep infrastructure visibility, enable the process monitoring feature provided by CloudMonitor for ECS instances.
CloudMonitor (ECS monitoring) provides:
Historical process analysis: Track historical top 5 resource-intensive processes by memory consumption, CPU usage, and open file descriptors.
OS monitoring on ECS hosts: Monitor basic resource metrics at the host level, such as host CPU, memory, network, disk usage, inode count, host network traffic, and simultaneous network connection count.
Process monitoring configurations only apply to nodes added after enabling CloudMonitor in node pools.
Alert management and system monitoring configuration
Alert management supports cluster event monitoring. You can configure alerts for node anomaly events and resource usage thresholds. We recommend that you enable and subscribe to multiple node-related alert rule sets, such as Alert Rule Set for Node Exceptions and Alert Rule Set for Resource Exceptions.
Node OS journal log monitoring and persistence
systemd serves as the init system and service manager in Linux systems, responsible for all services after system startup. The journal component within systemd provides:
System log collection and storage
Real-time log query and analysis
Using systemd journal to collect and monitor OS logs is preferred in scenarios such as:
Node stability signal collection, such as kubelet and OS kernel logs
Workloads that are sensitive to the OS or container engine and require enhanced monitoring, such as containers running in privileged mode, nodes with frequently overcommitted resources, and other scenarios that use OS resources directly
To collect the log data and store it in an SLS project, see Collect systemd journal logs of the node.
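As a hedged sketch of such a configuration, journal collection can be declared with an AliyunLogConfig resource that uses the Logtail service_journal plugin. The resource, logstore, and unit names below are placeholders, and the exact plugin fields can differ by Logtail version, so treat the linked topic as authoritative.
```yaml
apiVersion: log.alibabacloud.com/v1alpha1
kind: AliyunLogConfig
metadata:
  name: node-journal                  # placeholder name
spec:
  logstore: node-journal              # logstore that receives the journal entries
  logtailConfig:
    inputType: plugin
    configName: node-journal
    inputDetail:
      plugin:
        inputs:
        - type: service_journal       # Logtail systemd journal input plugin
          detail:
            JournalPaths:
            - /var/log/journal
            SeekPosition: tail        # collect only new entries
            Kernel: true              # include kernel messages
            Units:                    # restrict collection to the units you care about
            - kubelet.service
            - containerd.service
```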
GPU and AI training scenario monitoring
For AI training and machine learning tasks deployed in an ACK cluster, ACK provides functions such as node GPU health monitoring and GPU resource monitoring. Recommended solutions:
GPU failure inspection: Install and update ack-node-problem-detector to V1.2.20 or later.
See Common GPU faults and solutions for troubleshooting.
GPU resource monitoring: Cluster GPU monitoring provides pod-level GPU resource consumption. Through the ack-gpu-exporter component, you can obtain GPU monitoring metrics compatible with NVIDIA DCGM standards. ACK provides pod-level GPU monitoring metrics for both shared GPU and exclusive GPU scenarios. For more information, see Introduction to metrics.
For detailed steps on implementing comprehensive monitoring of cluster GPU nodes, see Best practices for monitoring GPU resources.
Container service alert management: Enable event monitoring and subscribe to node GPU anomaly alerts. See Common GPU faults and solutions for troubleshooting.
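For illustration, once the DCGM-compatible GPU metrics mentioned above are collected by Prometheus, they can be alerted on with a standard Prometheus rule. The metric names below follow the NVIDIA DCGM exporter convention and the thresholds are placeholders; align them with the metric names listed in Introduction to metrics.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts                    # placeholder name
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GpuXidErrorDetected
      # DCGM-style metric: a nonzero value reports the most recent XID error code,
      # which usually indicates a GPU hardware or driver fault.
      expr: DCGM_FI_DEV_XID_ERRORS > 0
      for: 1m
      labels:
        severity: critical
    - alert: GpuSustainedHighUtilization
      # Sustained near-saturation utilization; tune the threshold to your workloads.
      expr: DCGM_FI_DEV_GPU_UTIL > 95
      for: 30m
      labels:
        severity: warning
```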
Cluster data plane system components
Container storage
Container storage types in ACK clusters include local storage on nodes, Secret/ConfigMap, and external storage such as Network Attached Storage (NAS) volumes, Cloud Parallel File Storage (CPFS) volumes, and Object Storage Service (OSS) volumes.
ACK uses the csi-plugin component to uniformly expose monitoring metrics for the above container storage types. These metrics are then collected by Managed Service for Prometheus to provide out-of-the-box monitoring dashboards. For a comprehensive overview of supported and unsupported storage monitoring types and their corresponding monitoring approaches, see Overview of container storage monitoring.
Container network
CoreDNS
CoreDNS is a key component of the cluster's DNS service discovery mechanism. To ensure component stability, monitor:
The resource usage of CoreDNS in the cluster data plane.
Key metrics such as abnormal resolution response codes (rcode). This metric is the Responses (by rcode) metric in CoreDNS dashboards.
DNS resolution anomalies such as NXDOMAIN, SERVFAIL, and FormErr.
Recommended practices:
CoreDNS monitoring
For Managed Service for Prometheus users: Use the built-in CoreDNS monitoring dashboards in ACK clusters.
For self-managed Prometheus users: Configure metrics collection using community CoreDNS monitoring methods.
Enable container service alert management and subscribe to alert rule sets such as:
Alert Rule Set for Network Exceptions, including CoreDNS configuration reloading failures and other CoreDNS status anomaly alerts.
Alert Rule Set for Pod Exceptions covering CoreDNS pod status and resource issues.
Analyze CoreDNS logs to resolve issues such as slow CoreDNS resolutions and high-risk domain requests.
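As an example, the abnormal response codes (rcode) described above can be turned into a Prometheus alert. The metric name below follows recent CoreDNS versions (older versions expose coredns_dns_response_rcode_count_total instead), and the 5% threshold is a placeholder.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: coredns-alerts                # placeholder name
spec:
  groups:
  - name: coredns.rules
    rules:
    - alert: CoreDNSServfailRateHigh
      # Ratio of SERVFAIL responses to all responses over the last 5 minutes.
      expr: |
        sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
          /
        sum(rate(coredns_dns_responses_total[5m])) > 0.05
      for: 5m
      labels:
        severity: warning
```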
Ingress
When using Ingress for customer-facing traffic routing, monitor traffic volumes and call details through Ingress, and set up alerts for abnormal Ingress routing states.
Recommended practices:
Monitoring
Use Managed Service for Prometheus and the Ingress controller in ACK clusters for preconfigured traffic monitoring dashboards.
Tracing
Enable ACK's Ingress tracing feature to report NGINX Ingress controller telemetry to Managed Service for OpenTelemetry. This enables real-time aggregation, topology mapping, and persistent storage of trace data for troubleshooting. For example, you can enable Xtrace through AlbConfig to observe ALB Ingress trace data.
Alert management
Enable the container service alert management feature and subscribe to alert rule sets such as Alert Rule Set for Network Exceptions, receiving alerts for abnormal Ingress routing states.
Basic container network traffic monitoring
ACK clusters expose community-standard container metrics through the node kubelet, covering essential network traffic monitoring requirements, including pod inbound and outbound traffic, abnormal network traffic detection, and packet-level network monitoring.
However, pods configured with HostNetwork mode share the host's network stack, so basic container monitoring metrics cannot accurately reflect pod-level network traffic. Recommended practices:
Prometheus-based monitoring of kubelet metrics
Option 1: Managed Service for Prometheus
ACK provides out-of-the-box basic network traffic monitoring capabilities. You can view them directly in pod monitoring dashboards.
Option 2: Self-managed Prometheus. Collect the same kubelet metrics with your self-managed Prometheus instance and build equivalent dashboards.
ECS-level network monitoring. Use CloudMonitor OS monitoring on the ECS hosts to observe node-level network traffic, which is especially useful for pods that run in HostNetwork mode.
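For reference, the kubelet-exposed traffic metrics mentioned above can be queried or pre-aggregated with standard PromQL, for example as Prometheus recording rules (the rule and resource names are placeholders):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-network-recording         # placeholder name
spec:
  groups:
  - name: pod-network.rules
    rules:
    # Per-pod inbound traffic rate over the last 5 minutes.
    - record: namespace_pod:container_network_receive_bytes:rate5m
      expr: sum by (namespace, pod) (rate(container_network_receive_bytes_total[5m]))
    # Per-pod outbound traffic rate over the last 5 minutes.
    - record: namespace_pod:container_network_transmit_bytes:rate5m
      expr: sum by (namespace, pod) (rate(container_network_transmit_bytes_total[5m]))
```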
3. User applications
ACK provides monitoring capabilities for container pods and application logs to ensure the stability of applications deployed on clusters.
Container pod monitoring
Applications deployed on the cluster run as pods. The status and resource load of pods directly affect the performance of applications running on them. Recommended practices:
Prometheus-based pod status and resource monitoring
Collect community-standard container metrics exposed by ACK clusters through node kubelet using Managed Service for Prometheus or self-managed Prometheus
Combine them with the Kubernetes object state data exposed by the kube-state-metrics component (included in Managed Service for Prometheus and in the ACK-provided community prometheus-operator Helm chart) to monitor comprehensive pod metrics, including CPU, memory, storage, and basic container network traffic.
ACK clusters integrated with Managed Service for Prometheus provide out-of-the-box pod monitoring dashboards.
Event monitoring for pod abnormalities
Pod status changes trigger events. When pods enter abnormal states, enable event monitoring to track anomalies.
View real-time monitoring on the Event Center page of the console and persist events to the SLS event center for historical analysis (retained for 90 days).
Analyze pod lifecycle timelines through the pod event monitoring dashboard to identify abnormal states.
Subscribe to alerts for workloads and pods
After enabling container service alert management and event monitoring, subscribe to critical alert rule sets related to workloads and container pods, including Alert Rule Set for Workload Exceptions and Alert Rule Set for Pod Exceptions. For details, see Best practices for configuring alert rules in Prometheus.
Create custom Prometheus alert rules and subscribe to resource alerts
Address diverse application requirements, such as varying resource thresholds and critical resource priorities, by creating custom Prometheus alert rules:
Use standard PromQL to customize alerts. Refer to sample alert rules for workload exceptions and pod exceptions in Pod anomalies and modify PromQL expressions to tailor rules to specific needs.
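The following is a minimal sketch of such a custom rule file in the prometheus-operator PrometheusRule format, using standard kube-state-metrics metrics; the thresholds and durations are placeholders to adapt to your workloads.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-custom-alerts             # placeholder name
spec:
  groups:
  - name: workload.rules
    rules:
    - alert: PodFrequentRestarts
      # kube-state-metrics counter: more than 3 container restarts within 30 minutes.
      expr: increase(kube_pod_container_status_restarts_total[30m]) > 3
      for: 5m
      labels:
        severity: warning
    - alert: DeploymentReplicasUnavailable
      # Desired replicas of a Deployment are not fully available for 10 minutes.
      expr: kube_deployment_status_replicas_unavailable > 0
      for: 10m
      labels:
        severity: critical
```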
Container application log monitoring
Applications in clusters generate operational logs that record critical processes. Monitoring these logs aids anomaly diagnosis and operational status assessment.
ACK provides Kubernetes logging capabilities for troubleshooting. ACK clusters provide non-intrusive log management. You can collect application logs in ACK clusters and use various log analysis functions provided by SLS.
Fine-grained memory monitoring for application processes
In Kubernetes, real-time container memory usage (pod memory) is measured by the Working Set Size (WSS), a critical metric for the Kubernetes scheduling mechanism to allocate pod memory resources. WSS is calculated as the container's total memory usage minus its inactive file-backed page cache; it therefore includes anonymous memory, active page cache, and OS kernel memory components.
If your application process uses specific memory components, such as excessive page cache generated when writing to the file system, you may encounter internal memory "black hole" issues that require close monitoring in production systems.

Abnormal memory usage patterns in containerized environments may lead to WorkingSet inflation, potentially triggering PodOOMKilling incidents or node-level memory pressure and resulting in pod evictions. For example, Java applications using log frameworks such as Log4J and Logback with default new I/O (NIO) and memory-mapped file (mmap) configurations for log handling often exhibit:
Anonymous memory spikes from frequent read/write operations when processing large volumes of logs
Memory allocation black holes under high log throughput scenarios
To address troubleshooting challenges caused by the opacity of container engine-layer memory usage, ACK provides kernel-level container monitoring based on SysOM to observe and resolve container memory issues.
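In addition, a hedged Prometheus sketch of the WorkingSet inflation risk described above: the rule below flags containers whose WorkingSet approaches the configured memory limit before OOM killing occurs (the resource name and the 90% threshold are placeholders).
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workingset-alerts             # placeholder name
spec:
  groups:
  - name: memory.rules
    rules:
    - alert: ContainerWorkingSetNearLimit
      # WorkingSet (usage minus inactive file cache) close to the memory limit.
      expr: |
        max by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
          /
        max by (namespace, pod, container) (container_spec_memory_limit_bytes{container!=""} > 0)
          > 0.9
      for: 10m
      labels:
        severity: warning
```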
Integrate application metrics with Prometheus monitoring and create custom dashboards
If you possess application development capabilities, we recommend that you use a Prometheus client library to expose business-specific metrics through instrumentation. These metrics can then be collected and visualized via Prometheus monitoring systems to create unified dashboards. Customized dashboards can be created for different teams (such as infrastructure and applications) to support daily operations monitoring and enable rapid incident response, effectively reducing mean time to recovery (MTTR).
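As a sketch of the collection side, once the application exposes a /metrics endpoint behind a Service, a prometheus-operator-style ServiceMonitor (supported by Managed Service for Prometheus and self-managed prometheus-operator setups) declares how it is scraped; the names, label selector, and port below are placeholders.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: demo-app-metrics              # placeholder name
  namespace: default
spec:
  selector:
    matchLabels:
      app: demo-app                   # must match the labels of the application's Service
  endpoints:
  - port: http-metrics                # named Service port that serves /metrics
    path: /metrics
    interval: 30s
```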
Enable APM and tracing capabilities
Application performance monitoring (APM) is a common solution for monitoring performance of application processes. Alibaba Cloud ARMS provides multiple APM product capabilities. Choose the appropriate monitoring method based on your development language.
APM monitoring for Java applications (non-intrusive)
Capabilities: Automatically discovers application topologies, generates 3D dependency maps, monitors APIs and JVM resources, and captures exceptions and slow transactions.
Implementation: Refer to Java application monitoring for Java application integration with zero-code modification.
APM monitoring for Python applications (intrusive/code-instrumented)
Scope: web applications that use the Django/Flask/FastAPI frameworks, and AI/LLM applications built on LlamaIndex/LangChain
Capabilities: application performance monitoring, including topology mapping, trace analysis, API diagnostics, anomaly detection, and LLM interaction tracking.
Implementation: Install the ack-onepilot component and adjust the Dockerfile to implement Python application performance monitoring.
APM monitoring for Go applications (intrusive/binary instrumentation)
Capabilities: Monitoring data such as application topology, database query analysis, and API calls is provided in ARMS.
Implementation: Install the ack-onepilot component and compile the application's Go binary using instgo during the container image build. For more information, see Go application monitoring.
OpenTelemetry-based instrumentation (intrusive)
Capabilities: Provides end-to-end distributed tracing, request analysis, trace topology, and dependency analysis to identify bottlenecks in distributed application architectures and improve diagnostics for microservices.
Implementation: Managed Service for OpenTelemetry offers multiple data integration methods. Refer to the Integration guide to choose the appropriate integration method for your application based on development language and solution stability/maturity.
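For instance, an application instrumented with an OpenTelemetry SDK typically only needs the standard OTLP environment variables to report traces. In the sketch below, the endpoint and authentication token are placeholders that you obtain from the Managed Service for OpenTelemetry console, and the image name is an assumption.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-otel-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-otel-app
  template:
    metadata:
      labels:
        app: demo-otel-app
    spec:
      containers:
      - name: demo-otel-app
        image: registry.example.com/demo-otel-app:latest   # placeholder image
        env:
        - name: OTEL_SERVICE_NAME
          value: demo-otel-app
        # OTLP endpoint and authentication copied from the Managed Service for OpenTelemetry console.
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "<your-otlp-endpoint>"                     # placeholder
        - name: OTEL_EXPORTER_OTLP_HEADERS
          value: "Authentication=<your-token>"              # placeholder
```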
Frontend Web behavior monitoring
When your system exposes web pages to external users from the cluster, frontend stability and continuity require robust assurance. The Real User Monitoring (RUM) feature provided by ARMS specializes in performance observability across web applications, mobile apps, and mini programs, focusing on user experience through:
Full reconstruction of user interaction flows
Performance metrics, including page load speed and API request tracing
Failure analysis, including JavaScript errors and network failures
Stability monitoring, including JavaScript loading errors and crashes/Application Not Responding (ANR) errors
Log correlation to accelerate root cause diagnosis
For details, see Integrate applications.
Service behavior observability via Service Mesh
Alibaba Cloud Service Mesh (ASM) delivers a fully managed service mesh platform compatible with the open-source Istio service mesh. It simplifies service governance by managing traffic routing and splitting in service calls, securing inter-service communication with authentication, and offering mesh observability capabilities, significantly reducing the development and O&M workload.
Multi-phase stability assurance
ASM's enhanced observability framework supports full lifecycle stability monitoring, detailed as follows:
Day 0 (planning phase)
Validate traffic configuration states during system release
Day 1 (deployment and configuration phase)
Monitor real-time traffic distribution across microservices
Day 2 (maintenance and optimization phase)
Enforce stability based on SLO metrics
Observability implementation path
ASM provides unified standardized Service Mesh observability capabilities, offering a converged telemetry pipeline configuration model to enhance observability support for cloud-native applications.
Control plane traffic routing monitoring
Diagnostics: Diagnose ASM instances to manually detect potential anomalies affecting normal service mesh function.
Alerting: Configure real-time anomaly log alerts and promptly handle anomalies.
Data plane observability
Access logs: Collect data plane access requests through SLS for centralized log analysis and dashboard visualization.
Prometheus metrics: Collect data plane metrics to Managed Service for Prometheus for comprehensive monitoring in dimensions such as gateway status, global-level mesh errors, service-level mesh errors, and mesh workload. Integrate Managed Service for Prometheus to monitor ASM instances, discovering potential issues and making timely adjustments and optimizations.
Microservice network topology
With data plane Prometheus monitoring metrics as the data source, visualize and evaluate whether traffic and latency between microservices meet expectations through Mesh Topology.
SLO measurement for error rates and latency patterns
SLO defines quantified monitoring metrics to describe microservice performance, assess application reliability, and continuously track service health. It serves as a measurable benchmark for service quality assessment and continuous optimization.
Distributed tracing
Capabilities: Provides end-to-end distributed tracing, request analysis, trace topology, and application dependency analysis.
Implementation: Instrument your application code with OpenTelemetry and enable distributed tracing in ASM.
Unified observation for multi-cloud and hybrid cloud environments
Distributed Cloud Container Platform for Kubernetes (ACK One) is an enterprise-level cloud-native platform developed by Alibaba Cloud for scenarios such as hybrid cloud, multi-cluster management, distributed computing, and disaster recovery. It enables the following:
Cross-infrastructure management: Connect and manage Kubernetes clusters across any region or infrastructure.
Consistent governance: Unified operations for compute, network, storage, security, monitoring, logging, jobs, applications, and traffic.
API compatibility: Community-aligned APIs for seamless integration.
Core scenarios include:
Hybrid cloud: Use ACK One registered clusters to register self-managed Kubernetes clusters in on-premises data centers to the cloud for hybrid cloud, enabling elastic scaling of cloud-based computing resources.
Multi-cloud: Use ACK One's Fleet instances to manage multiple ACK clusters or registered clusters, implementing multi-cloud zone-disaster recovery, unified application configuration distribution, and multi-cluster offline scheduling.
The following observability capabilities are recommended for common multi-cloud and hybrid cloud scenarios:
Observability integration for ACK One registered clusters
ACK One registered clusters provide observability capabilities consistent with ACK, including integration with SLS, the event center, alerts, ARMS, and Managed Service for Prometheus.
Note: This feature requires additional network configuration and authorization due to heterogeneous network environments and permission systems.
ACK One Fleet global monitoring
As enterprises scale, they require multiple Kubernetes clusters to meet requirements such as isolation, high availability, and disaster recovery. Traditional monitoring systems often lack multi-cluster visibility, creating operational blind spots. ACK One Fleet provides global monitoring:
Unified metric aggregation
Collects Prometheus metrics from multiple clusters into a unified monitoring dashboard through global aggregation instances.
Operational and diagnostic efficiency:
Performs cross-cluster correlation analysis
Eliminates manual metric comparison across clusters
Reduces context switching between monitoring systems
You can enable global monitoring after an ACK One Fleet instance is created and associated with two clusters.
Unified alert management
Centralized rule management: Create or update alert rules at Fleet level.
Consistency enforcement: Automatically synchronize rules to all associated clusters. To configure differentiated alert rules for specific clusters, see Differentiated multi-cluster alerting configurations.
GitOps observability
Fleet monitoring: ACK One Fleet provides monitoring for core components (including APIServer and etcd) and GitOps monitoring. This lets you track the operation and performance of the Fleet and the fully managed Argo CD.
GitOps logs: Enable GitOps control plane and audit log collection, and configure Argo CD alerts.
Monitor applications in Argo Workflows
Argo Workflows is a robust cloud-native workflow engine widely used in scenarios including batch data processing, machine learning pipelines, infrastructure automation, and continuous integration and continuous delivery (CI/CD). When deploying applications through Argo Workflows on ACK or using Kubernetes clusters for distributed Argo workflows, enabling the following observability capabilities is critical for stability and maintainability:
Log persistence with SLS
Challenges:
Native Kubernetes garbage collection purges pod/workflow logs after resource cleanup.
Linear growth of control plane and workflow controller resources without proper reclaim policies.
Solution:
Integrate SLS with workflow clusters to collect logs generated by pods during workflow execution and report them to SLS projects. You can view workflow logs through Argo CLI or Argo UI.
Prometheus monitoring
Enable Managed Service for Prometheus in workflow clusters for comprehensive observability capabilities, such as workflow running status and cluster health.
Observability for Knative-based applications
Knative is a Kubernetes-based serverless framework. It provides:
Request-driven auto scaling (including scale-to-zero)
Version management with canary rollouts
Building on full compatibility with community Knative and Kubernetes APIs, ACK Knative offers enhanced capabilities in multiple dimensions, such as reducing cold start latency by retaining instances and implementing workload forecasting based on Advanced Horizontal Pod Autoscaler (AHPA). For more information, see Knative observability.
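As an illustrative sketch of the request-driven scaling behavior described above, these capabilities are declared directly on the Knative Service; the service name, image, and annotation values below are placeholders.
```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: demo-knative-app              # placeholder name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"    # allow scale-to-zero when idle
        autoscaling.knative.dev/max-scale: "10"   # upper bound for request-driven scaling
        autoscaling.knative.dev/target: "50"      # target concurrent requests per instance
    spec:
      containers:
      - image: registry.example.com/demo-app:latest   # placeholder image
```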
Implementation
Log collection for Knative Services
ACK clusters are compatible with SLS and support non-intrusive log collection. Implement log collection for Knative Services in DaemonSet mode, which automatically runs a log agent on each node and improves O&M efficiency.
Prometheus monitoring integration
After deploying applications in Knative, integrate Knative Service monitoring data with Prometheus. You can then view real-time Knative data in Grafana dashboards, including pod scaling trends, response latency, request concurrency, and CPU/memory resource usage.