Integrate cloud GPU computing power - Container Service for Kubernetes

ACK One registered clusters provide a unified platform for orchestrating and managing heterogeneous computing resources, which helps improve the resource utilization and efficiency of Kubernetes clusters that run heterogeneous computing workloads.

Node pool architecture

ACK One registered clusters use node pools to efficiently manage cluster nodes. A node pool is a collection of nodes that share the same configuration. You can create multiple node pools with different configurations in a single cluster.

Feature overview

General node pool management

Feature	Description	Related documentation
Lifecycle management	Create node pools in the console and configure their basic information, network settings, instance specifications, storage, and desired number of nodes. Edit selected settings of existing node pools. Delete a node pool when its nodes are no longer needed; whether the nodes are released depends on the node pool's desired number of nodes setting and on the billing method of the nodes. View node pool details, including basic information, resource monitoring dashboards, the node list, and scaling activities.	Create and manage node pools
Scaling	Manually scale a node pool by adjusting its desired number of nodes. The node pool is then kept at that size, which helps control resource costs. Alternatively, configure auto scaling to add or remove nodes automatically based on workload demand.	Manually scale node pools; Configure auto scaling
Removing nodes	Remove nodes that are no longer needed from a cluster or node pool. Follow the standard procedure to avoid unexpected behavior.	Remove nodes from a node pool
Custom user data	Use a custom script to ensure that the node pool of a registered cluster correctly syncs the node status and meets cloud scheduling requirements. The custom script must accept the system environment variables from the ACK One registered cluster.	Create custom scripts for node pools
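
The "desired number of nodes" model described in the Scaling row can be pictured as a small reconciliation step: compare the current node count with the desired one and derive how many nodes to add or remove. The sketch below only illustrates that idea. `ScalingPlan` and `plan_scaling` are hypothetical names and are not part of any ACK API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPlan:
    """Hypothetical result type: how many nodes to add or remove."""
    add: int
    remove: int


def plan_scaling(current_nodes: int, desired_nodes: int) -> ScalingPlan:
    """Return the change needed to bring a node pool to its desired size."""
    if current_nodes < 0 or desired_nodes < 0:
        raise ValueError("node counts must be non-negative")
    delta = desired_nodes - current_nodes
    # A positive delta means scale out; a negative delta means scale in.
    return ScalingPlan(add=max(delta, 0), remove=max(-delta, 0))
```

For example, `plan_scaling(3, 5)` asks for two additional nodes, while `plan_scaling(5, 5)` asks for no change.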

GPU node pools

Feature	Description	Related documentation
Adding GPU nodes	Container Service for Kubernetes (ACK) provides unified scheduling and operations management for different types of compute-optimized GPU resources. This capability significantly improves the resource utilization of GPU clusters.	Create an ACK cluster with GPU-accelerated nodes
NVIDIA driver versions	View the NVIDIA driver versions that ACK supports.	NVIDIA driver versions supported by ACK
Custom GPU drivers	The default NVIDIA driver version depends on the type and version of the ACK One registered cluster. If your application or CUDA library requires a specific NVIDIA driver version, you can specify the driver version to install on your GPU nodes.	Specify an NVIDIA driver version for nodes by adding a label
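
As the Custom GPU drivers row notes, the driver version is selected by adding a label to the nodes. The following is a minimal sketch of building such a label with a basic format check. The label key `ack.aliyun.com/nvidia-driver-version` is an assumption here and should be confirmed in "Specify an NVIDIA driver version for nodes by adding a label"; the helper `driver_version_label` is hypothetical.

```python
import re

# Assumed label key; verify it against the linked documentation before use.
DRIVER_VERSION_LABEL = "ack.aliyun.com/nvidia-driver-version"

# NVIDIA driver versions are dotted numbers with two or three parts,
# for example "535.161.07".
_VERSION_RE = re.compile(r"\d+\.\d+(\.\d+)?")


def driver_version_label(version: str) -> dict[str, str]:
    """Build the node label that requests a specific NVIDIA driver version."""
    version = version.strip()
    if not _VERSION_RE.fullmatch(version):
        raise ValueError(f"unexpected driver version format: {version!r}")
    return {DRIVER_VERSION_LABEL: version}
```

The format check only catches obvious mistakes such as passing `"latest"`. It does not verify that the version appears in the list of driver versions that ACK supports.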

GPU monitoring

Feature	Description	Related documentation
Enable GPU monitoring	GPU monitoring is built on NVIDIA Data Center GPU Manager (DCGM). To use Managed Service for Prometheus, see Enable Managed Service for Prometheus for a registered cluster. To use self-managed monitoring, install the ack-gpu-exporter component.	Enable GPU monitoring for ACK clusters
Dashboard panels	Learn what each panel in the GPU monitoring dashboard shows.	Panels
Metric reference	GPU Monitoring 2.0 uses an Exporter, Prometheus, and Grafana architecture to provide broader GPU observability. View the list of GPU metrics exposed by the GPU Exporter, which you can use to build custom Grafana dashboards.	Introduction to metrics
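
Because the GPU Exporter publishes metrics for Prometheus to scrape, its output can also be inspected directly when prototyping a custom dashboard. The sketch below parses a few lines in the Prometheus text exposition format and averages GPU utilization per host. The sample data is invented for illustration, and the metric name `DCGM_FI_DEV_GPU_UTIL` and the `Hostname` label are assumptions based on common DCGM exporter output; check "Introduction to metrics" for the names your cluster actually exposes.

```python
from collections import defaultdict

# Invented sample in Prometheus text exposition format (illustration only).
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 80
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 40
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-b"} 10
"""


def parse_labels(raw: str) -> dict[str, str]:
    """Parse 'k="v",k2="v2"' naively; label values must not contain commas."""
    labels = {}
    for pair in raw.split(","):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        labels[key.strip()] = value.strip().strip('"')
    return labels


def mean_utilization_by_host(text: str, metric: str = "DCGM_FI_DEV_GPU_UTIL") -> dict[str, float]:
    """Average the given gauge across the GPUs of each host."""
    totals = defaultdict(lambda: [0.0, 0])  # host -> [sum, sample count]
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and samples of other metrics.
        if not line or line.startswith("#") or not line.startswith(metric + "{"):
            continue
        label_part, _, value_part = line[len(metric) + 1:].partition("}")
        host = parse_labels(label_part).get("Hostname", "unknown")
        # The first token after the labels is the value; a timestamp may follow.
        totals[host][0] += float(value_part.split()[0])
        totals[host][1] += 1
    return {host: total / count for host, (total, count) in totals.items()}
```

For anything beyond a quick check, a full exposition-format parser or a PromQL query against Prometheus is the more robust choice, since this sketch does not handle escaped characters in label values.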

GPU fault diagnosis and recovery

Feature	Description	Related documentation
Fault detection and isolation	Automatically detect GPU failures and isolate the affected nodes to prevent workloads from being scheduled on unhealthy hardware.	GPU fault detection and automatic isolation