This topic describes how to create a Kubernetes cluster with GPU capabilities.

Prerequisites

You must have permission to create pay-as-you-go GN5 instances. To request this permission, submit a ticket to ECS technical support.

Background information

The Kubernetes deep learning solution supports Kubernetes clusters that use Elastic Compute Service (ECS) instances or Elastic GPU Service (EGS) instances as nodes.
Note For more information about how to create a Kubernetes cluster that contains ECS instances as nodes, see Create a Kubernetes cluster.

The GPU scheduling solution in Kubernetes is based on the official NVIDIA GPU device plug-in and nvidia-container-runtime. Compared with deploying the open-source community solution yourself, this solution requires less configuration work.

Based on this solution, you can use container technology to build images for applications. You can also integrate Kubernetes with GPU clusters to execute compute-intensive tasks, such as machine learning and image processing. This solution enables you to achieve quick deployment and auto scaling without the need to install NVIDIA drivers or CUDA.
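For example, a pod running on such a cluster requests GPU resources through the nvidia.com/gpu resource name that the NVIDIA device plug-in registers with the kubelet. The following manifest is a minimal sketch; the pod name and CUDA image tag are illustrative, and any CUDA-enabled image works:

    # gpu-test.yaml: minimal sketch of a pod that requests one GPU.
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test                   # illustrative name
    spec:
      restartPolicy: Never
      containers:
      - name: cuda
        image: nvidia/cuda:9.0-base    # illustrative CUDA-enabled image
        command: ["nvidia-smi"]        # prints the GPUs visible to the container
        resources:
          limits:
            nvidia.com/gpu: 1          # one GPU, allocated by the device plug-in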

Limits

  • Currently, GN5 instances with GPU capabilities support only VPC networks.
  • You must complete real-name verification. Otherwise, you cannot create pay-as-you-go ECS or SLB instances.
  • The Kubernetes version must be 1.9.3 or later. A quick way to verify this on an existing cluster is shown after this list.
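To confirm the version requirement, query the API server. A minimal sketch, assuming kubectl is already configured for the cluster; the output is abridged and illustrative, and the GitVersion field of the Server Version line shows the cluster version:

    $ kubectl version
    ...
    Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", ...}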

Procedure

  1. Log on to the Container Service console.
  2. In the upper-right corner, click Create Kubernetes Cluster.
  3. Set cluster parameters. For more information, see Create a Kubernetes cluster.
    • In the Cluster Configurations step, select the SSH Logon check box. This enables you to use SSH to log on to the master nodes of the cluster.
    • In the Worker Configurations step, select a GN5 instance with GPU capabilities.
  4. Set the other parameters and click Next: Confirm Order.
  5. Click Create Cluster. In a few minutes, the newly created cluster will be displayed on the Clusters page.
  6. Select the new cluster and click Manage in the Actions column. On the Basic Information page that appears, you can find the master node IP address for SSH logon.
  7. To query the GPU nodes in the cluster, use SSH to log on to a master node and run the following command. The labels in the output can also be used to pin workloads to a specific GPU model, as shown in the sketch after this procedure:
     $ kubectl get nodes -l 'aliyun.accelerator/nvidia_name' --show-labels
     ...                               
     NAME                                 STATUS    ROLES     AGE       VERSION   LABELS
     cn-hangzhou.i-bp12xvjjwqe6j7nca2q8   Ready     <none>    1h        v1.9.3    aliyun.accelerator/nvidia_count=1,aliyun.accelerator/nvidia_mem=16276MiB,aliyun.accelerator/nvidia_name=Tesla-P100-PCIE-16GB,..
  8. To query detailed information about a GPU node, use the following command:
    $ kubectl get node ${node_name} -o=yaml
    ...
    status:
      addresses:
      - address: 172.16.166.23
        type: InternalIP
      allocatable:
        cpu: "8"
        memory: 61578152Ki
        nvidia.com/gpu: "1"
        pods: "110"
      capacity:
        cpu: "8"
        memory: 61680552Ki
        nvidia.com/gpu: "1"
        pods: "110"
    ...
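The labels returned in step 7 can pin a workload to a particular GPU model through a nodeSelector. The following is a sketch using the Tesla-P100-PCIE-16GB value from the sample output; the pod name and image are illustrative:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-on-p100                # illustrative name
    spec:
      restartPolicy: Never
      nodeSelector:
        aliyun.accelerator/nvidia_name: Tesla-P100-PCIE-16GB   # label value from step 7
      containers:
      - name: cuda
        image: nvidia/cuda:9.0-base    # illustrative image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1          # still required; the selector only picks the node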

You have now created a Kubernetes cluster with GPU capabilities.
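As a quick verification, you can summarize the allocatable GPUs across all nodes and run the test pod sketched in the background information. This is a sketch: the custom-columns dot-escaping syntax may vary with your kubectl version, and the output shown mirrors the sample node above.

    $ kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
    NAME                                 GPU
    cn-hangzhou.i-bp12xvjjwqe6j7nca2q8   1

    $ kubectl apply -f gpu-test.yaml   # the sketch from the background information
    $ kubectl logs gpu-test            # after the pod completes, shows the nvidia-smi report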