Container Service for Kubernetes: Integrate cloud GPU computing power

Last Updated: Oct 17, 2025

ACK One registered clusters provide a unified platform for orchestrating and managing heterogeneous computing resources. This improves the resource utilization and efficiency of Kubernetes clusters that run heterogeneous computing workloads.

Node pool architecture

ACK One registered clusters use node pools to efficiently manage cluster nodes. A node pool is a collection of nodes that share the same configuration. You can create multiple node pools with different configurations in a single cluster.

Feature overview

General node pool management

| Feature | Description | Related documentation |
| --- | --- | --- |
| Lifecycle management | Create node pools in the console and configure basic information, network settings, instance specifications, storage configurations, and the desired number of nodes. Edit some configurations of existing node pools. Delete a node pool when its nodes are no longer needed; whether the nodes are released depends on the desired number of nodes setting and the billing method of the nodes. View node pool details, including basic information, resource monitoring dashboards, the node list, and scaling activities. | Create and manage node pools |
| Scaling | Manually scale node pools by adjusting the desired number of nodes. The node pool is then kept at the desired number of nodes, which helps control resource costs. Configure auto scaling to automatically add or remove nodes based on workload demands. | |
| Removing nodes | Remove unneeded nodes from a cluster or node pool. Follow the standard procedure to avoid unexpected behavior. | Remove nodes from a node pool |
| Custom user data | Use a custom script to ensure that the node pool of a registered cluster correctly syncs the node status and meets cloud scheduling requirements. The custom script must accept the system environment variables from the ACK One registered cluster. | Create custom scripts for node pools |
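
The custom script requirement described above can be illustrated with a minimal sketch. It assumes the registered cluster passes node identity, labels, and taints through environment variables and that the script turns them into kubelet flags. The variable names below are assumptions made for illustration, so confirm the actual names in "Create custom scripts for node pools". The kubelet flags themselves are standard.

```python
import os
from typing import List, Mapping

# Assumed environment variable names (not confirmed by this page),
# mapped to standard kubelet command-line flags.
ENV_TO_KUBELET_FLAG = {
    "ALIBABA_CLOUD_PROVIDER_ID": "--provider-id",
    "ALIBABA_CLOUD_NODE_NAME": "--hostname-override",
    "ALIBABA_CLOUD_LABELS": "--node-labels",
    "ALIBABA_CLOUD_TAINTS": "--register-with-taints",
}


def kubelet_args_from_env(env: Mapping[str, str]) -> List[str]:
    """Build kubelet flags from the variables that are set and non-empty."""
    args = []
    for var, flag in ENV_TO_KUBELET_FLAG.items():
        value = env.get(var, "").strip()
        if value:  # skip variables that are missing or blank
            args.append(f"{flag}={value}")
    return args


if __name__ == "__main__":
    # In a real script the values would come from the process environment.
    print(kubelet_args_from_env(os.environ))
```

For example, `kubelet_args_from_env({"ALIBABA_CLOUD_NODE_NAME": "gpu-node-1"})` returns `['--hostname-override=gpu-node-1']`. Skipping unset variables keeps the script usable on nodes where only some of the values are provided.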

GPU node pools

| Feature | Description | Related documentation |
| --- | --- | --- |
| Adding GPU nodes | Container Service for Kubernetes (ACK) provides unified scheduling and operations management for different types of compute-optimized GPU resources, which improves the resource utilization of GPU clusters. | Create an ACK cluster with GPU-accelerated nodes |
| NVIDIA driver versions | ACK supports a defined set of NVIDIA driver versions. | NVIDIA driver versions supported by ACK |
| Custom GPU drivers | Different types and versions of ACK One registered clusters install different default versions of NVIDIA drivers. If your application or CUDA library requires a specific NVIDIA driver version, you can customize the driver version installed on your GPU nodes. | Specify an NVIDIA driver version for nodes by adding a label |
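
Because the driver version is selected through a node label, a small helper can build and sanity-check that label before it is added to a node pool. This is a sketch only: the label key below is an assumption that should be verified against "Specify an NVIDIA driver version for nodes by adding a label", and the version check only validates the dotted-number shape. It does not check whether ACK supports that version.

```python
import re
from typing import Dict

# Assumed label key; verify it in the linked topic before use.
DRIVER_LABEL_KEY = "ack.aliyun.com/nvidia-driver-version"

# Accepts dotted numeric versions such as "535.161.07" or "470.82".
_VERSION_RE = re.compile(r"^\d+\.\d+(\.\d+)?$")


def driver_version_label(version: str) -> Dict[str, str]:
    """Return the node label that requests a specific NVIDIA driver version."""
    if not _VERSION_RE.match(version):
        raise ValueError(f"not a dotted numeric driver version: {version!r}")
    return {DRIVER_LABEL_KEY: version}
```

The returned dictionary can be merged into the node labels of a node pool definition. Rejecting malformed values early avoids creating nodes that carry a label the installer cannot interpret.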

GPU monitoring

| Feature | Description | Related documentation |
| --- | --- | --- |
| Enable GPU monitoring | GPU monitoring is built on NVIDIA Data Center GPU Manager (DCGM). | Enable GPU monitoring for ACK clusters |
| Dashboard panels | Learn what each panel in the GPU monitoring dashboard shows. | Panels |
| Metric reference | GPU Monitoring 2.0 uses an Exporter, Prometheus, and Grafana architecture to provide richer GPU observability. View the list of GPU metrics exposed by the GPU Exporter, which you can use to build custom Grafana dashboards. | Introduction to metrics |
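
To show how exporter metrics could feed a custom dashboard or check, the sketch below parses text in the Prometheus exposition format and collects the samples of one metric. The metric name `DCGM_FI_DEV_GPU_UTIL` and the sample payload are assumptions for illustration; the metrics actually exposed are listed in "Introduction to metrics". The parser is deliberately simple and ignores label contents.

```python
import re
from statistics import mean
from typing import List

# Matches "name{labels} value" or "name value"; labels are captured but unused.
_SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)")


def metric_values(exposition_text: str, metric_name: str) -> List[float]:
    """Collect all sample values of one metric from exposition-format text."""
    values = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE_RE.match(line)
        if match and match.group(1) == metric_name:
            values.append(float(match.group(3)))
    return values


# Hypothetical exporter output used only for this example.
SAMPLE = """\
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 40
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 60
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-a"} 1024
"""

if __name__ == "__main__":
    utilization = metric_values(SAMPLE, "DCGM_FI_DEV_GPU_UTIL")
    print(mean(utilization))  # average utilization across the two sample GPUs
```

In practice the same aggregation is usually written as a Prometheus query behind a Grafana panel. The Python version is only meant to make the data shape concrete.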

GPU fault diagnosis and recovery

| Feature | Description | Related documentation |
| --- | --- | --- |
| Fault detection and isolation | Automatically detect GPU failures and isolate the affected nodes to prevent workloads from being scheduled on unhealthy hardware. | GPU fault detection and automatic isolation |
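
The detect-then-isolate idea can be sketched as follows. This is not the ACK implementation, which this page does not describe. It assumes a hypothetical health report that maps each node to the per-GPU health flags of that node, and it models isolation as cordoning, that is, setting `spec.unschedulable` on the Kubernetes Node object.

```python
from typing import Dict, List

# Patch body that marks a Kubernetes node as unschedulable (cordon).
CORDON_PATCH = {"spec": {"unschedulable": True}}


def nodes_to_isolate(gpu_health: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return the sorted names of nodes that have at least one unhealthy GPU.

    gpu_health maps node name -> {gpu id -> True if healthy}. The report
    format is hypothetical and exists only for this example.
    """
    return sorted(
        node for node, gpus in gpu_health.items() if not all(gpus.values())
    )


if __name__ == "__main__":
    report = {
        "gpu-node-1": {"0": True, "1": True},
        "gpu-node-2": {"0": True, "1": False},
    }
    for name in nodes_to_isolate(report):
        # A real controller would send CORDON_PATCH to the API server here.
        print(name, CORDON_PATCH)
```

Cordoning only stops new pods from being scheduled to the node. Pods that are already running there are unaffected unless they are drained separately.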