After you deploy a cluster that has the cloud-native AI component set installed, you can allocate cluster resources and view resource usage based on multiple dimensions. This helps you optimize the utilization of cluster resources. This topic describes basic O&M operations that you can perform on the cloud-native AI component set, such as installing the component set, viewing resource dashboards, and managing users and quotas.
Background information
After you deploy a cluster that has the cloud-native AI component set installed, you can allocate cluster resources and view resource usage based on multiple dimensions. This helps you optimize the utilization of cluster resources.
If a cluster is used by multiple users, you must allocate a fixed amount of resources to each user to prevent the users from competing for resources. The traditional method is to use Kubernetes resource quotas to allocate a fixed amount of resources to each user. However, resource utilization varies across user groups, and fixed quotas can leave allocated resources idle. To improve the overall utilization of cluster resources, you can allow the users to share resources after you allocate cluster resources to them.
The following figure shows the organizational structure of an enterprise. You can set elastic quotas at different levels based on your business requirements. Each leaf node in the figure corresponds to a user group. To manage permissions and quotas separately, you can add users in a user group to one or more namespaces, and assign different roles to the users. This way, resources can be shared across user groups and users in the same user group can be isolated.

Prerequisites
- A Container Service for Kubernetes (ACK) Pro cluster is created. Make sure that Monitoring Agents and Log Service are enabled on the Component Configurations wizard page when you create the cluster. For more information, see Create a professional managed Kubernetes cluster.
- The Kubernetes version of the cluster is 1.18 or later.
Tasks
- Install the cloud-native AI component set.
- View resource dashboards.
- Set resource quotas by user groups.
- Manage users and user groups.
- Use idle resources to submit more workloads after the minimum amount of resources for each user is exhausted.
- Set the maximum amount of resources for each user.
- Set the minimum amount of resources for each user.
Step 1: Install the cloud-native AI component set
The cloud-native AI component set consists of components for task elasticity, data acceleration, AI task scheduling, AI task lifecycle management, AI Dashboard, and AI Developer Console. You can install the components based on your business requirements.
Deploy the cloud-native AI component set
Install and configure AI Dashboard
(Optional) Create a dataset
You can create and accelerate datasets based on the requirements of algorithm developers. The following section describes how to create a dataset in AI Dashboard or by using the CLI.
fashion-mnist dataset
Use kubectl to create a persistent volume (PV) and a persistent volume claim (PVC) of the Object Storage Service (OSS) type in the cluster.
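The following is a minimal sketch of such a PV and PVC, assuming that the OSS CSI plug-in (driver ossplugin.csi.alibabacloud.com) is installed in the cluster. The bucket name, the OSS endpoint, and the oss-secret Secret that stores the AccessKey pair are placeholders that you must replace with your own values.

```yaml
# Sketch only: replace the bucket, endpoint, and Secret with your own values.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fashion-demo-pv
  labels:
    alicloud-pvname: fashion-demo-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com   # OSS CSI plug-in (assumed to be installed)
    volumeHandle: fashion-demo-pv
    nodePublishSecretRef:                    # Secret that holds the AccessKey pair (created separately)
      name: oss-secret
      namespace: demo-ns
    volumeAttributes:
      bucket: "examplebucket"                # placeholder bucket that stores the fashion-mnist data
      url: "oss-cn-beijing.aliyuncs.com"     # placeholder OSS endpoint
      otherOpts: "-o ro"                     # mount the dataset as read-only
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fashion-demo-pvc
  namespace: demo-ns
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 10Gi
  selector:
    matchLabels:
      alicloud-pvname: fashion-demo-pv       # bind the PVC to the PV created above
```

Save the content to a file and run kubectl apply -f <file> to create the PV and PVC.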
Accelerate a dataset
You can accelerate datasets on AI Dashboard. The following example shows how to accelerate a dataset named fashion-demo-pvc in the demo-ns namespace.
Step 2: View resource dashboards
You can view the usage of cluster resources based on multiple dimensions on resource dashboards provided by AI Dashboard. This helps you optimize resource allocation and improve resource utilization.
Cluster dashboard
- GPU Summary Of Cluster: displays the total number of GPU-accelerated nodes, the number of allocated GPU-accelerated nodes, and the number of unhealthy GPU-accelerated nodes in the cluster.
- Total GPU Nodes: displays the total number of GPU-accelerated nodes in the cluster.
- Unhealthy GPU Nodes: displays the number of unhealthy GPU-accelerated nodes in the cluster.
- GPU Memory(Used/Total): displays the ratio of GPU memory used by the cluster to the total GPU memory.
- GPU Memory(Allocated/Total): displays the ratio of GPU memory allocated by the cluster to the total GPU memory.
- GPU Utilization: displays the average GPU utilization of the cluster.
- GPUs(Allocated/Total): displays the ratio of the number of GPUs that are allocated by the cluster to the total number of GPUs.
- Training Job Summary Of Cluster: displays the numbers of training jobs that are in the following states: Running, Pending, Succeeded, and Failed.
Node dashboard
- GPU Node Details: displays the following information about each cluster node in a table: the node name, IP address, role, GPU mode (exclusive or shared), number of GPUs, total GPU memory, number of allocated GPUs, amount of allocated GPU memory, amount of used GPU memory, and average GPU utilization.
- GPU Duty Cycle: displays the utilization of each GPU on each node.
- GPU Memory Usage: displays the memory usage of each GPU on each node.
- GPU Memory Usage Percentage: displays the percentage of memory usage per GPU on each node.
- Allocated GPUs Per Node: displays the number of GPUs allocated on each node.
- GPU Number Per Node: displays the total number of GPUs on each node.
- Total GPU Memory Per Node: displays the total amount of GPU memory on each node.
Training job dashboard
- Training Jobs: displays the following information about each training job in a table: the namespace, name, type, status, and duration of the job, the number of GPUs and amount of GPU memory requested by the job, the amount of GPU memory used by the job, and the average GPU utilization of the job.
- Job Instance Used GPU Memory: displays the amount of GPU memory that is used by each job instance.
- Job Instance Used GPU Memory Percentage: displays the percentage of GPU memory that is used by each job instance.
- Job Instance GPU Duty Cycle: displays the GPU utilization of each job instance.
Resource quota dashboard
- Elastic Quota Name: displays the name of the quota group.
- Namespace: displays the namespace to which resources belong.
- Resource Name: displays the type of resources.
- Max Quota: displays the maximum amount of resources that you can use in the specified namespace.
- Min Quota: displays the minimum amount of resources that you can use in the specified namespace when the cluster does not have sufficient resources.
- Used Quota: displays the amount of resources that are used in the specified namespace.
Step 3: Manage users and quotas

- Quota trees allow you to configure hierarchical resource quotas. Quota trees are used by the capacity scheduling plug-in. To optimize the overall utilization of cluster resources, you can allow users to share resources after you use quota trees to allocate resources to the users.
- Each user corresponds to a Kubernetes service account, which serves as the credential that is used to submit jobs and log on to the consoles. Permissions are granted to users based on user roles: the admin role can log on to AI Dashboard and perform O&M operations on the cluster, and the researcher role can submit jobs, use cluster resources, and log on to AI Developer Console. The admin role has all permissions that the researcher role has.
- User groups are the smallest unit in resource allocation. Each user group corresponds to a leaf node in quota trees. Users must be associated with user groups before the users can use resources that are associated with the user groups.
The following section describes how to use a quota tree to set hierarchical resource quotas, how to use user groups to allocate resources to users, and how to share and reclaim CPU resources by submitting a simple job.
Add a quota node and set resource quotas
You can set resource quotas by specifying the Min and Max parameters of each resource. The Min parameter specifies the minimum amount of resources that is guaranteed when the cluster does not have sufficient resources. The Max parameter specifies the maximum amount of resources that can be used. After you associate namespaces with a leaf node of a quota tree, the limits that are set on the nodes between the root node and the leaf node apply to the namespaces.
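If you manage quotas declaratively instead of in AI Dashboard, the quota tree can be described as an ElasticQuotaTree resource that is consumed by the capacity scheduling plug-in. The following is a minimal sketch based on the example in the next section; the apiVersion, the kube-system namespace, and the field layout are assumptions that follow the capacity scheduling plug-in, so verify them against the CRD that is installed in your cluster.

```yaml
# Minimal sketch of a quota tree; apply it with kubectl only if the
# ElasticQuotaTree CRD of the capacity scheduling plug-in is installed.
apiVersion: scheduling.sigs.k8s.io/v1beta1   # assumed CRD version
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system                     # assumed namespace for the quota tree
spec:
  root:
    name: root
    max:
      cpu: 40            # Max: upper limit for the whole tree
    min:
      cpu: 40            # Min: guaranteed amount for the whole tree
    children:
      - name: root.a
        max:
          cpu: 40
        min:
          cpu: 20
        children:
          - name: root.a.1
            max:
              cpu: 20
            min:
              cpu: 10
            namespaces:  # limits on root, root.a, and root.a.1 all apply to namespace1
              - namespace1
      # root.b and the remaining leaf nodes are defined in the same way.
```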
Create users and user groups
A user can belong to one or more user groups. A user group can contain one or more users. You can associate user groups by users or associate users by user groups. You can allocate resources and grant permissions based on projects by using quota trees and user groups.
- Create users. For more information, see Generate the kubeconfig file and logon token of the newly created user.
- Create user groups. For more information, see Add a user group.
Capacity scheduling example
- Set both the minimum amount of CPU resources and maximum amount of CPU resources to 40 for the root node. This ensures that the quota tree has 40 CPU cores available.
- Set the minimum amount of CPU resources to 20 and the maximum amount of CPU resources to 40 for root.a and root.b.
- Set the minimum amount of CPU resources to 10 and the maximum amount of CPU resources to 20 for root.a.1, root.a.2, root.b.1, and root.b.2.
- Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace1, as shown in the sketch after this list. The maximum amount of CPU resources is set to 20 for root.a.1. Therefore, four pods (4 pods x 5 cores/pod = 20 cores) can run as normal.
- Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace2. The maximum amount of CPU resources is set to 20 for root.a.2. Therefore, four pods (4 pods x 5 cores/pod = 20 cores) can run as normal.
- Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace3. The minimum amount of CPU resources is set to 10 for root.b.1. Therefore, two pods (2 pods x 5 cores/pod = 10 cores) can run as normal. The scheduler considers the priority, availability, and creation time of each task and reclaims CPU resources from root.a. The scheduler reclaims the resources of one pod (5 CPU cores) from root.a.1 and the resources of one pod (5 CPU cores) from root.a.2. As a result, three pods (3 pods x 5 cores/pod = 15 cores) are running in each of namespace1 and namespace2.
- Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace4. The minimum amount of CPU resources is set to 10 for root.b.2. Therefore, two pods (2 pods x 5 cores/pod = 10 cores) can run as normal. The scheduler considers the priority, availability, and creation time of each task and reclaims CPU resources from root.a again. The scheduler reclaims the resources of one more pod (5 CPU cores) from root.a.1 and the resources of one more pod (5 CPU cores) from root.a.2. As a result, two pods (2 pods x 5 cores/pod = 10 cores) are running in each of namespace1 and namespace2.
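The tasks in this example can be plain workloads that request CPU cores in the corresponding namespaces. The following is a minimal sketch of the first task: a Deployment in namespace1 that runs five pods, each requesting 5 CPU cores. The Deployment name and the nginx image are placeholders; submitting the same kind of workload to namespace2, namespace3, and namespace4 reproduces the sharing and reclaiming behavior described above.

```yaml
# Sketch of the first task: 5 pods x 5 CPU cores submitted to namespace1.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-demo-a1           # placeholder name
  namespace: namespace1       # namespace associated with root.a.1
spec:
  replicas: 5                 # 5 pods x 5 cores = 25 cores requested in total
  selector:
    matchLabels:
      app: cpu-demo-a1
  template:
    metadata:
      labels:
        app: cpu-demo-a1
    spec:
      containers:
        - name: nginx
          image: nginx:1.25   # placeholder image
          resources:
            requests:
              cpu: 5          # each pod requests 5 CPU cores
            limits:
              cpu: 5
```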
Perform the following operations: