ACK One registered clusters provide a unified platform for orchestrating and managing heterogeneous computing resources, which improves the resource utilization and efficiency of Kubernetes clusters used for heterogeneous computing.
Node pool architecture
ACK One registered clusters use node pools to efficiently manage cluster nodes. A node pool is a collection of nodes that share the same configuration. You can create multiple node pools with different configurations in a single cluster.
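The relationship between a cluster, its node pools, and the nodes in each pool can be sketched as a small data model. This is only an illustration of the concept: the class names and the fields `instance_type` and `labels` are hypothetical and do not correspond to any ACK API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NodeConfig:
    """Configuration shared by every node in a pool (hypothetical fields)."""
    instance_type: str
    labels: tuple = ()  # (key, value) pairs applied to each node


@dataclass
class NodePool:
    name: str
    config: NodeConfig
    nodes: list = field(default_factory=list)

    def add_node(self, node_name: str) -> None:
        # A node that joins the pool is governed by the pool's configuration.
        self.nodes.append(node_name)


@dataclass
class Cluster:
    name: str
    pools: dict = field(default_factory=dict)

    def create_pool(self, name: str, config: NodeConfig) -> NodePool:
        if name in self.pools:
            raise ValueError(f"node pool {name!r} already exists")
        pool = NodePool(name=name, config=config)
        self.pools[name] = pool
        return pool

    def config_of(self, node_name: str) -> NodeConfig:
        # Look up a node's configuration through the pool that owns it.
        for pool in self.pools.values():
            if node_name in pool.nodes:
                return pool.config
        raise KeyError(node_name)


# One cluster, two pools with different configurations.
cluster = Cluster("demo")
cpu = cluster.create_pool("cpu-pool", NodeConfig("general-purpose"))
gpu = cluster.create_pool("gpu-pool", NodeConfig("gpu-accelerated", (("gpu", "true"),)))
cpu.add_node("node-a")
gpu.add_node("node-b")
gpu.add_node("node-c")
```

In this model, `node-b` and `node-c` resolve to the same configuration because they belong to the same pool, while `node-a` resolves to a different one.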
Feature overview
General node pool management
Feature | Description | Related documentation |
Lifecycle management | | |
Scaling | | |
Removing nodes | Remove unneeded nodes from a cluster or node pool. Follow the standard procedure to avoid unexpected behaviors. | |
Custom user data | | |
GPU node pools
Feature | Description | Related documentation |
Adding GPU nodes | Container Service for Kubernetes (ACK) provides unified scheduling and operations management for different types of compute-optimized GPU resources. This capability significantly improves the resource utilization of GPU clusters. | |
NVIDIA driver versions | View the NVIDIA driver versions that ACK supports. | |
Custom GPU drivers | The default NVIDIA driver version depends on the type and version of the ACK One registered cluster. If your application or CUDA library requires a specific NVIDIA driver version, you can customize the driver version installed on your GPU nodes. | Specify an NVIDIA driver version for nodes by adding a label |
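As a rough illustration of the label-based approach, the sketch below merges a driver-version label into a node pool's label set. The label key `ack.aliyun.com/nvidia-driver-version` is an assumption that this overview does not state, and the version string is an example value only; confirm both against the linked topic and the supported driver list. The helper `with_driver_label` is hypothetical.

```python
import re

# Assumption: this key is not given in the overview; verify it in the
# "Specify an NVIDIA driver version for nodes by adding a label" topic.
DRIVER_VERSION_LABEL = "ack.aliyun.com/nvidia-driver-version"

# Kubernetes label values: at most 63 characters, alphanumerics plus
# '-', '_' and '.', starting and ending with an alphanumeric character.
_LABEL_VALUE = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9_.-]{0,61}[A-Za-z0-9])?$")


def with_driver_label(labels, version, label_key=DRIVER_VERSION_LABEL):
    """Return a copy of `labels` with the driver-version label set."""
    if not _LABEL_VALUE.match(version):
        raise ValueError(f"{version!r} is not a valid label value")
    merged = dict(labels)
    merged[label_key] = version
    return merged


# Example value only; choose a version from the supported list.
pool_labels = with_driver_label({"workload": "training"}, "535.161.07")
```

Validating the value up front catches malformed version strings before they reach the cluster, where an invalid label would be rejected by the Kubernetes API.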
GPU monitoring
Feature | Description | Related documentation |
Enable GPU monitoring | GPU monitoring is built on NVIDIA Data Center GPU Manager (DCGM). | |
Dashboard panels | Learn about the meaning of each panel in the GPU monitoring dashboard. | |
Metric reference | GPU Monitoring 2.0 uses an architecture of an Exporter, Prometheus, and Grafana to provide richer GPU observability. View the list of GPU metrics exposed by the GPU Exporter, which you can use to build custom Grafana dashboards. | |
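To give a sense of what exporter metrics look like before they reach Prometheus and Grafana, the sketch below parses a few lines in the Prometheus text exposition format and picks the busiest GPU. The sample text is made up for illustration, and the metric names follow NVIDIA DCGM field naming; whether the GPU Exporter exposes these exact names and labels is an assumption, so check the metric reference. The parser is deliberately simplified and does not handle escaped characters in label values.

```python
import re

# Illustrative sample only; real exporter output will differ.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-b"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-b"} 12
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-b"} 30412
'''

# metric_name{label="value",...} value
LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def busiest_gpu(samples, metric="DCGM_FI_DEV_GPU_UTIL"):
    """Return the sample with the highest value for `metric`, or None."""
    candidates = [s for s in samples if s[0] == metric]
    return max(candidates, key=lambda s: s[2]) if candidates else None


samples = parse_metrics(SAMPLE)
top = busiest_gpu(samples)
```

In practice a Grafana dashboard would query Prometheus rather than parse raw text, so this is useful mainly for inspecting an exporter endpoint by hand.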
GPU fault diagnosis and recovery
Feature | Description | Related documentation |
Fault detection and isolation | Automatically detect GPU failures and isolate the affected nodes to prevent workloads from being scheduled on unhealthy hardware. | |
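The isolation step can be pictured with a generic Kubernetes mechanism: marking a node unschedulable, which is what `kubectl cordon` does by setting `spec.unschedulable` to `true`. The sketch below is not the ACK implementation, which this overview does not describe. The health report format and the helper `plan_isolation` are hypothetical.

```python
def plan_isolation(gpu_health):
    """Map each node that has at least one unhealthy GPU to a cordon patch.

    `gpu_health` is a hypothetical report: {node_name: {gpu_index: is_healthy}}.
    """
    patches = {}
    for node, gpus in gpu_health.items():
        if not all(gpus.values()):
            # Same effect as `kubectl cordon`: new pods are not scheduled here.
            patches[node] = {"spec": {"unschedulable": True}}
    return patches


report = {
    "node-b": {"0": True, "1": False},  # one failed GPU
    "node-c": {"0": True},              # healthy
}
patches = plan_isolation(report)
```

Cordoning only blocks new scheduling. Pods that are already running on the node are not evicted, so recovery usually also involves draining or restarting the affected workloads.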