After you deploy a cluster that has the cloud-native AI component set installed, you can allocate cluster resources and view resource usage based on multiple dimensions. This helps you optimize the utilization of cluster resources. This topic describes basic O&M operations that you can perform on the cloud-native AI component set. For example, you can install the cloud-native AI component set, view resource dashboards, and manage users and quotas.

Background information

If a cluster is used by multiple users, you must allocate a fixed amount of resources to each user to prevent the users from competing for resources. The traditional method is to use Kubernetes resource quotas to allocate a fixed amount of resources to each user. However, resource utilization varies by user group, and fixed quotas can leave resources idle. To improve the overall utilization of cluster resources, you can allow the users to share resources after you allocate cluster resources to them.

The following figure shows the organizational structure of an enterprise. You can set elastic quotas at different levels based on your business requirements. Each leaf node in the figure corresponds to a user group. To manage permissions and quotas separately, you can add users in a user group to one or more namespaces, and assign different roles to the users. This way, resources can be shared across user groups and users in the same user group can be isolated.

orgchart

Prerequisites

  • A Container Service for Kubernetes (ACK) Pro cluster is created. Make sure that Monitoring Agents and Log Service are enabled on the Component Configurations wizard page when you create the cluster. For more information, see Create a professional managed Kubernetes cluster.
  • The Kubernetes version of the cluster is 1.18 or later.
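
The version requirement can be checked from a terminal before you install the component set. The following is a minimal sketch, not an official check. The helper name version_at_least is hypothetical, it compares only the major and minor numbers, and the commented usage lines assume that kubectl is configured for the cluster and that jq is installed.

```shell
# Hypothetical helper: succeeds if version $1 is at least version $2.
# Only the major and minor numbers are compared. A leading "v" and
# suffixes such as "-aliyun.1" are ignored.
version_at_least() {
  have="${1#v}"
  want="${2#v}"
  have_major="${have%%.*}"
  want_major="${want%%.*}"
  have_rest="${have#*.}"
  want_rest="${want#*.}"
  # Keep only the leading digits of the minor component.
  have_minor="${have_rest%%[!0-9]*}"
  want_minor="${want_rest%%[!0-9]*}"
  if [ "$have_major" -ne "$want_major" ]; then
    [ "$have_major" -gt "$want_major" ]
    return
  fi
  [ "$have_minor" -ge "$want_minor" ]
}

# Example usage (assumes kubectl access to the cluster and jq on the PATH):
#   server=$(kubectl version -o json | jq -r '.serverVersion.gitVersion')
#   version_at_least "$server" 1.18 && echo "Kubernetes version is supported"
```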

Tasks

This topic describes how to complete the following tasks:
  • Install the cloud-native AI component set.
  • View resource dashboards.
  • Set resource quotas by user groups.
  • Manage users and user groups.
  • Use idle resources to submit more workloads after the minimum amount of resources for each user is exhausted.
  • Set the maximum amount of resources for each user.
  • Set the minimum amount of resources for each user.

Step 1: Install the cloud-native AI component set

The cloud-native AI component set consists of components for task elasticity, data acceleration, AI task scheduling, AI task lifecycle management, AI Dashboard, and AI Developer Console. You can install the components based on your business requirements.

Deploy the cloud-native AI component set

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, click Clusters.
  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
  4. In the left-side navigation pane of the details page, choose Applications > Cloud-native AI Component Set (Public Preview).
  5. On the Cloud-native AI Component Set page, click Deploy.
  6. On the Cloud-native AI Component Set page, select the components that you want to deploy and click Deploy Cloud-native AI Component Set. Then, the system checks the environment and dependencies, and automatically deploys the components after the precheck is completed.
    After the components are installed, you can view the following information in the Components list:
    • You can view the names and versions of the components that are installed in the cluster. You can deploy or uninstall components.
    • If a component is upgradable, you can also upgrade the component.
    • After you install ack-ai-dashboard and ack-ai-dev-console, you can find the hyperlinks of AI Dashboard and AI Developer Console in the upper-left corner of the Cloud-native AI Component Set page. You can click a hyperlink to log on to the corresponding console.

Install and configure AI Dashboard

  1. In the Interaction Mode section of the Cloud-native AI Component Set page, select AI Dashboard. Then, the Note dialog box appears, as shown in the following figure.
    K-AI-2
    Note
    • If you select Public Domain and use a public domain name to access AI Dashboard, you must configure access control policies.
    • If you select Internal Domain and use an internal domain name to access AI Dashboard, you must first install internal-facing Ingresses.
    • If you want to use an internal endpoint to access AI Dashboard, select Private IP in the Note dialog box.
    • If you select Pre-installed MySQL to store the data of AI Dashboard, data security is not guaranteed by service-level agreements (SLAs). Exercise caution when you select this option.
  2. In the Note dialog box, click the hyperlink to grant the cluster the required permissions.
    1. On the details page of the Resource Access Management (RAM) role, click the Permissions tab and click the name of the policy that you want to manage.
    2. Click Modify Policy Document on the Policy Document tab. In the Modify Policy Document panel, add the following content to the Action field in the Policy Document section:
      "ecs:DescribeInstances",
      "ecs:DescribeSpotPriceHistory",
      "ecs:DescribePrice",
      "eci:DescribeContainerGroups",
      "eci:DescribeContainerGroupPrice",
      "log:GetLogStoreLogs",
      "ims:CreateApplication",
      "ims:UpdateApplication",
      "ims:GetApplication",
      "ims:ListApplications",
      "ims:DeleteApplication",
      "ims:CreateAppSecret",
      "ims:GetAppSecret",
      "ims:ListAppSecretIds",
      "ims:ListUsers"
    3. Click OK.
  3. Select the method to store the data of AI Dashboard.
    In this example, Pre-installed MySQL is selected. You can select ApsaraDB RDS in production environments. For more information, see Install and configure AI Dashboard.
  4. Click Deploy Cloud-native AI Component Set.
    After the status of AI Dashboard changes to Ready, AI Dashboard is ready for use.

(Optional) Create a dataset

You can create and accelerate datasets based on the requirements of algorithm developers. The following section describes how to create a dataset in AI Dashboard or by using the CLI.

fashion-mnist dataset

Use kubectl to create a persistent volume (PV) and a persistent volume claim (PVC) of the Object Storage Service (OSS) type on a cluster node.

  1. Create a PV and PVC based on the following YAML template:
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: fashion-demo-pv
      # Label that is matched by the selector of the PVC below.
      labels:
        alicloud-pvname: fashion-demo-pv
    spec:
      accessModes:
      - ReadWriteMany
      capacity:
        storage: 10Gi
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeAttributes:
          bucket: fashion-mnist
          otherOpts: "-o max_stat_cache_size=0 -o allow_other"
          url: oss-cn-beijing.aliyuncs.com
          akId: "AKID"
          akSecret: "AKSECRET"
        volumeHandle: fashion-demo-pv
      persistentVolumeReclaimPolicy: Retain
      storageClassName: oss
      volumeMode: Filesystem
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: fashion-demo-pvc
      namespace: demo-ns
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 10Gi
      selector:
        matchLabels:
          alicloud-pvname: fashion-demo-pv
      storageClassName: oss
      volumeMode: Filesystem
      volumeName: fashion-demo-pv
  2. Check the status of the PV and PVC.
    1. Run the following command to query the status of the PV:
      kubectl get pv fashion-demo-pv

      Expected output:

      NAME              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                      STORAGECLASS   REASON   AGE
      fashion-demo-pv   10Gi       RWX            Retain           Bound    demo-ns/fashion-demo-pvc   oss                     8h
    2. Run the following command to query the status of the PVC:
      kubectl get pvc fashion-demo-pvc -n demo-ns
      Expected output:
      NAME               STATUS   VOLUME            CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      fashion-demo-pvc   Bound    fashion-demo-pv   10Gi       RWX            oss            8h
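
Steps 1 and 2 can also be scripted. The following sketch assumes that the template is saved as fashion-demo.yaml (a hypothetical file name), that the demo-ns namespace already exists, and that kubectl is configured for the cluster. wait_for_bound is a hypothetical helper that polls the phase of the PVC until the phase is Bound or the attempts are used up.

```shell
# Hypothetical helper: poll a PVC until its phase is Bound.
# Usage: wait_for_bound NAMESPACE PVC_NAME [ATTEMPTS] [SLEEP_SECONDS]
wait_for_bound() {
  ns="$1"
  pvc="$2"
  attempts="${3:-30}"
  delay="${4:-2}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # .status.phase is Pending until the claim is bound to a PV.
    phase=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.status.phase}')
    if [ "$phase" = "Bound" ]; then
      echo "PVC $ns/$pvc is Bound"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "PVC $ns/$pvc is not Bound (last phase: ${phase:-unknown})" >&2
  return 1
}

# Example usage:
#   kubectl apply -f fashion-demo.yaml
#   wait_for_bound demo-ns fashion-demo-pvc
```

Polling the phase instead of reading it once allows for the short delay between the creation of the PVC and the binding.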

Accelerate a dataset

You can accelerate datasets on AI Dashboard. The following example shows how to accelerate a dataset named fashion-demo-pvc in the demo-ns namespace.

  1. Access AI Dashboard as an administrator.
  2. In the left-side navigation pane of AI Dashboard, choose Dataset > Dataset List.
  3. On the Dataset List page, find the dataset that you want to accelerate and click Accelerate in the Actions column.
    After the dataset is accelerated, the page in the following figure appears.
    Accelerate Dataset

Step 2: View resource dashboards

You can view the usage of cluster resources based on multiple dimensions on resource dashboards provided by AI Dashboard. This helps you optimize resource allocation and improve resource utilization.

Cluster dashboard

After you log on to AI Dashboard, you are redirected to the cluster dashboard by default. You can view the following metrics on the cluster dashboard:
  • GPU Summary Of Cluster: displays the total number of GPU-accelerated nodes, the number of allocated GPU-accelerated nodes, and the number of unhealthy GPU-accelerated nodes in the cluster.
  • Total GPU Nodes: displays the total number of GPU-accelerated nodes in the cluster.
  • Unhealthy GPU Nodes: displays the number of unhealthy GPU-accelerated nodes in the cluster.
  • GPU Memory(Used/Total): displays the ratio of GPU memory used by the cluster to the total GPU memory.
  • GPU Memory(Allocated/Total): displays the ratio of GPU memory allocated by the cluster to the total GPU memory.
  • GPU Utilization: displays the average GPU utilization of the cluster.
  • GPUs(Allocated/Total): displays the ratio of the number of GPUs that are allocated by the cluster to the total number of GPUs.
  • Training Job Summary Of Cluster: displays the numbers of training jobs that are in the following states: Running, Pending, Succeeded, and Failed.

Node dashboard

On the Cluster page, click Nodes in the upper-right corner to navigate to the node dashboard. You can view the following metrics on the node dashboard:
  • GPU Node Details: displays information about the cluster nodes in a table. The following information is displayed: the name of each node, the IP address of each node, the role of each node, the GPU mode of each node (exclusive or shared), the number of GPUs provided by each node, the total amount of GPU memory provided by each node, the number of GPUs allocated on each node, the amount of GPU memory allocated on each node, the amount of GPU memory used on each node, and the average GPU utilization on each node.
  • GPU Duty Cycle: displays the utilization of each GPU on each node.
  • GPU Memory Usage: displays the memory usage of each GPU on each node.
  • GPU Memory Usage Percentage: displays the percentage of memory usage per GPU on each node.
  • Allocated GPUs Per Node: displays the number of GPUs allocated on each node.
  • GPU Number Per Node: displays the total number of GPUs on each node.
  • Total GPU Memory Per Node: displays the total amount of GPU memory on each node.

Training job dashboard

On the Nodes page, click TrainingJobs in the upper-right corner to navigate to the training job dashboard. You can view the following metrics on the training job dashboard:
  • Training Jobs: displays information about each training job in a table. The following information is displayed: the namespace of each training job, the name of each training job, the type of each training job, the status of each training job, the duration of each training job, the number of GPUs that are requested by each training job, the amount of GPU memory that is requested by each training job, the amount of GPU memory that is used by each training job, and the average GPU utilization of each training job.
  • Job Instance Used GPU Memory: displays the amount of GPU memory that is used by each job instance.
  • Job Instance Used GPU Memory Percentage: displays the percentage of GPU memory that is used by each job instance.
  • Job Instance GPU Duty Cycle: displays the GPU utilization of each job instance.

Resource quota dashboard

On the Training Jobs page, click Quota in the upper-right corner to navigate to the resource quota dashboard. You can view the following metrics on the resource quota dashboard: Quota (cpu), Quota (memory), Quota (nvidia.com/gpu), Quota (aliyun.com/gpu-mem), and Quota (aliyun.com/gpu). Each metric displays the information about resource quotas in a table. The following information is displayed:
  • Elastic Quota Name: displays the name of the quota group.
  • Namespace: displays the namespace to which resources belong.
  • Resource Name: displays the type of resources.
  • Max Quota: displays the maximum amount of resources that you can use in the specified namespace.
  • Min Quota: displays the minimum amount of resources that you can use in the specified namespace when the cluster does not have sufficient resources.
  • Used Quota: displays the amount of resources that are used in the specified namespace.

Step 3: Manage users and quotas

The cloud-native AI component set allows you to manage users and resource quotas by using the following resource objects: Users, User Groups, Quota Trees, Quota Nodes, and Kubernetes Namespaces. The following figure describes the relationships among these resource objects.

Terms
  • Quota trees allow you to configure hierarchical resource quotas. Quota trees are used by the capacity scheduling plug-in. To optimize the overall utilization of cluster resources, you can allow users to share resources after you use quota trees to allocate resources to the users.
  • Each user in Kubernetes owns a service account. The service account can be used as a credential to submit jobs and log on to the console. Permissions are granted to users based on user roles. For example, the admin role can log on to AI Dashboard and perform maintenance operations on a cluster. The researcher role can submit jobs, use cluster resources, and log on to AI Developer Console. The admin role has all permissions that the researcher role has.
  • User groups are the smallest unit in resource allocation. Each user group corresponds to a leaf node in quota trees. Users must be associated with user groups before the users can use resources that are associated with the user groups.

The following section describes how to use a quota tree to set hierarchical resource quotas and how to use a user group to allocate resources to users. The following section also describes how to share and reclaim CPU resources by submitting a simple job.

Add a quota node and set resource quotas

You can set resource quotas by specifying the Min and Max parameters of each resource. The Min parameter specifies the minimum amount of resources that can be used. The Max parameter specifies the maximum amount of resources that can be used. After you associate namespaces with a leaf node of a quota tree, limits that are set on nodes between the root node and the leaf node apply to the namespaces.

  1. If no namespace is available, you must first create namespaces. If namespaces are available, you must make sure that the namespace that you select does not contain pods in the Running state.
    kubectl create ns namespace1
    kubectl create ns namespace2
    kubectl create ns namespace3
    kubectl create ns namespace4
  2. Create a quota node and associate it with a namespace.
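
The requirement that a namespace contains no pods in the Running state can be checked before you associate the namespace with a quota node. The following is a minimal sketch that assumes kubectl is configured for the cluster. running_pod_count and check_namespaces are hypothetical helper names.

```shell
# Hypothetical helper: print the number of pods in the Running phase
# in the given namespace.
running_pod_count() {
  kubectl get pods -n "$1" \
    --field-selector=status.phase=Running \
    --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# Report for each namespace whether it still contains running pods.
check_namespaces() {
  for ns in "$@"; do
    count=$(running_pod_count "$ns")
    if [ "$count" -eq 0 ]; then
      echo "$ns: no running pods"
    else
      echo "$ns: $count running pod(s) found"
    fi
  done
}

# Example usage:
#   check_namespaces namespace1 namespace2 namespace3 namespace4
```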

Create users and user groups

A user can belong to one or more user groups. A user group can contain one or more users. You can associate a user with user groups or associate a user group with users. You can allocate resources and grant permissions based on projects by using quota trees and user groups.

  1. Create users. For more information, see Generate the kubeconfig file and logon token of the newly created user.
  2. Create user groups. For more information, see Add a user group.

Capacity scheduling example

The following example shows how capacity scheduling shares and reclaims resources. In the example, pods that request CPU cores are created, and each quota node is configured with a minimum amount and a maximum amount of CPU resources. The process is as follows:
  1. Set both the minimum amount of CPU resources and maximum amount of CPU resources to 40 for the root node. This ensures that the quota tree has 40 CPU cores available.
  2. Set the minimum amount of CPU resources to 20 and the maximum amount of CPU resources to 40 for root.a and root.b.
  3. Set the minimum amount of CPU resources to 10 and the maximum amount of CPU resources to 20 for root.a.1, root.a.2, root.b.1, and root.b.2.
  4. Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace1. The maximum amount of CPU resources is set to 20 for root.a.1. Therefore, four pods (4 pods x 5 cores/pod = 20 cores) can run as normal.
  5. Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace2. The maximum amount of CPU resources is set to 20 for root.a.2. Therefore, four pods (4 pods x 5 cores/pod = 20 cores) can run as normal.
  6. Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace3. The minimum amount of CPU resources is set to 10 for root.b.1. Therefore, two pods (2 pods x 5 cores/pod = 10 cores) can run as normal. The scheduler considers the priority, availability, and creation time of each task and reclaims CPU resources from root.a. The scheduler reclaims five CPU cores (one pod) from root.a.1 and five CPU cores (one pod) from root.a.2. As a result, three pods (3 pods x 5 cores/pod = 15 cores) are running in each of namespace1 and namespace2.
  7. Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace4. The minimum amount of CPU resources is set to 10 for root.b.2. Therefore, two pods (2 pods x 5 cores/pod = 10 cores) can run as normal. The scheduler considers the priority, availability, and creation time of each task and reclaims CPU resources from root.a. The scheduler reclaims five CPU cores (one pod) from root.a.1 and five CPU cores (one pod) from root.a.2. As a result, two pods (2 pods x 5 cores/pod = 10 cores) are running in each of namespace1 and namespace2.
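
The pod counts in the preceding process follow from integer division of the available quota by the CPU request of each pod. The following sketch only reproduces that arithmetic for the example values. It is not a model of the scheduler, and pods_that_fit is a hypothetical helper name.

```shell
# Hypothetical helper: number of pods that fit into a CPU quota,
# capped at the number of requested replicas.
# Usage: pods_that_fit QUOTA_CORES CORES_PER_POD REPLICAS
pods_that_fit() {
  fit=$(( $1 / $2 ))
  if [ "$fit" -gt "$3" ]; then
    fit=$3
  fi
  echo "$fit"
}

# A leaf node that borrows up to cpu.max=20 runs four of five pods.
echo "at max: $(pods_that_fit 20 5 5) pods"
# A leaf node that is limited to cpu.min=10 runs two of five pods.
echo "at min: $(pods_that_fit 10 5 5) pods"
# After 5 cores are reclaimed from a leaf node that used 20, three pods remain.
echo "after one reclaim: $(pods_that_fit 15 5 5) pods"
```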

Perform the following operations:

  1. Create namespaces and a quota tree.
    1. Run the following commands to create four namespaces:
      kubectl create ns namespace1
      kubectl create ns namespace2
      kubectl create ns namespace3
      kubectl create ns namespace4
    2. Create a quota tree based on the following figure.
      orgchart2
  2. Create a Deployment in namespace1 by using the following YAML template. The Deployment provisions five pods and each pod requests five CPU cores.
    If you do not set elastic quotas, users can use only 10 CPU cores because the minimum amount of CPU resources is set to 10, which means that two pods are created. After you set elastic quotas:
    • When 40 CPU cores are available in the cluster, four pods are created (4 pods x 5 cores/pod = 20 cores).
    • The last pod is in the Pending state because the maximum amount of resources (cpu.max=20) is reached.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx1
      namespace: namespace1
      labels:
        app: nginx1
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: nginx1
      template:
        metadata:
          name: nginx1
          labels:
            app: nginx1
        spec:
          containers:
          - name: nginx1
            image: nginx
            resources:
              limits:
                cpu: 5
              requests:
                cpu: 5
  3. Create another Deployment in namespace2 by using the following YAML template. The Deployment provisions five pods and each pod requests five CPU cores.
    If you do not set elastic quotas, users can use only 10 CPU cores because the minimum amount of CPU resources is set to 10, which means that two pods are created. After you set elastic quotas:
    • When 20 CPU cores (40 cores - 20 cores in namespace1) are available in the cluster, four pods (4 pods x 5 cores/pod = 20 cores) are created.
    • The last pod is in the Pending state because the maximum amount of resources (cpu.max=20) is reached.
    • After you create the preceding two Deployments, the pods in namespace1 and namespace2 have used 40 CPU cores, which is the maximum number of CPU cores for the root node.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx2
      namespace: namespace2
      labels:
        app: nginx2
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: nginx2
      template:
        metadata:
          name: nginx2
          labels:
            app: nginx2
        spec:
          containers:
          - name: nginx2
            image: nginx
            resources:
              limits:
                cpu: 5
              requests:
                cpu: 5
  4. Create a third Deployment in namespace3 by using the following YAML template. This Deployment provisions five pods and each pod requests five CPU cores.
    • The cluster does not have idle resources. The scheduler reclaims 10 CPU cores from root.a to guarantee the minimum amount of CPU resources for root.b.1.
    • Before the scheduler reclaims the temporarily used 10 CPU cores, it also considers other factors, such as the priority, availability, and creation time of the workloads of root.a. Therefore, after the pods of nginx3 are scheduled based on the reclaimed 10 CPU cores, two pods are in the Running state and the other three are in the Pending state.
    • After 10 CPU cores are reclaimed from root.a (five CPU cores from each leaf node), both namespace1 and namespace2 contain three pods that are in the Running state and two pods that are in the Pending state.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx3
      namespace: namespace3
      labels:
        app: nginx3
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: nginx3
      template:
        metadata:
          name: nginx3
          labels:
            app: nginx3
        spec:
          containers:
          - name: nginx3
            image: nginx
            resources:
              limits:
                cpu: 5
              requests:
                cpu: 5
    The result shows the benefits of capacity scheduling in resource allocation.
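
To observe the effect of each Deployment, you can compare the number of ready replicas in each namespace after every step. The following is a minimal sketch that assumes kubectl is configured for the cluster. ready_replicas is a hypothetical helper name, and the namespace and Deployment names in the usage lines are taken from the templates in this example.

```shell
# Hypothetical helper: print the number of ready replicas of a Deployment.
# Usage: ready_replicas NAMESPACE DEPLOYMENT
# The readyReplicas field is omitted when no replica is ready, so an
# empty result is reported as 0.
ready_replicas() {
  ready=$(kubectl get deployment "$2" -n "$1" \
    -o jsonpath='{.status.readyReplicas}' 2>/dev/null)
  echo "${ready:-0}"
}

# Example usage:
#   ready_replicas namespace1 nginx1
#   ready_replicas namespace2 nginx2
```

Pods that do not fit into the quota stay in the Pending state and are therefore not counted as ready.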