The Kubernetes deep learning solution supports creating a Kubernetes cluster that contains Elastic Compute Service (ECS) instances or Elastic GPU Service (EGS) instances as nodes. This topic uses a Kubernetes GPU cluster as an example to describe how to create such a cluster.

Note For more information about how to create a Kubernetes cluster that contains ECS instances as nodes, see Create a Kubernetes cluster.

The Kubernetes GPU scheduling solution is based on the official device plug-in and the nvidia-container-runtime provided by NVIDIA. Compared with open source Kubernetes solutions, the GPU scheduling solution provided by Alibaba Cloud requires less configuration.

Based on this solution, you can use container technology to build images for containerized applications and integrate Kubernetes with GPU clusters to perform high-density computing tasks, such as deep learning and image processing. This solution simplifies deployment and auto scaling configuration because you do not need to install NVIDIA drivers or CUDA manually.
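
With this solution, a workload requests GPUs through the nvidia.com/gpu extended resource that the device plug-in exposes on each GPU node (you can see this resource in the node status later in this topic). The following pod manifest is a minimal sketch; the pod name and the nvidia/cuda image tag are examples only.

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-example                # example name
    spec:
      containers:
      - name: cuda-container
        image: nvidia/cuda:9.0-base    # example CUDA base image
        command: ["sleep", "3600"]
        resources:
          limits:
            nvidia.com/gpu: 1          # request one GPU; the pod is scheduled onto a GPU node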

The following sections describe how to create a Kubernetes GPU cluster in Alibaba Cloud Container Service.

Limits

  • Currently, Kubernetes GPU clusters support only GPU instances of the gn5 instance family and can be created only in VPCs.
  • To create a pay-as-you-go ECS instance or Server Load Balancer (SLB) instance, you must have an Alibaba Cloud account with a balance of at least CNY 100, and the account must have passed real-name verification.
  • The deep learning solution requires Kubernetes 1.9.3 or later.
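
After a cluster is created, you can check whether it meets this version requirement by querying the API server version with kubectl. This is a minimal check; the exact output format may vary with the kubectl version.

    $ kubectl version
    ...
    Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", ...}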

Prerequisites

To buy pay-as-you-go GPU instances of the gn5 instance family, you must submit a ticket and include the following reason.

I need to apply for pay-as-you-go GPU instances of the gn5 instance type. Please activate it. Thank you.

Procedure

  1. Log on to the Container Service console.
  2. At the top of the left-side navigation pane, switch to Container Service for Kubernetes, and then click Clusters to open the Clusters page.
  3. In the upper-right corner, click Create Kubernetes Cluster.


  4. Configure the basic information, as shown in the following figure.


  5. Select an instance type for the worker node. This example uses a gn5 instance type.


  6. To enable SSH logon for connecting to Kubernetes master nodes, select the Allow Using SSH to Log on to Clusters from the Internet check box. Note that you must enable public access by selecting Expose API Server with EIP before you enable SSH logon.


  7. For information about how to configure the other settings, see Create a Kubernetes cluster.
  8. After the cluster is created, click Clusters in the left-side navigation pane. The newly created GPU cluster appears in the cluster list.


  9. In the Action column of the cluster, click Manage to open the Basic Information page and view the IP address of the master node for SSH logon.


  10. To view the GPU nodes in the Kubernetes cluster, use an SSH client to connect to the master node and run the following command:
     $ kubectl get nodes -l 'aliyun.accelerator/nvidia_name' --show-labels
     ...                               
     NAME                                 STATUS    ROLES     AGE       VERSION   LABELS
     cn-hangzhou.i-bp12xvjjwqe6j7nca2q8   Ready     <none>    1h        v1.9.3    aliyun.accelerator/nvidia_count=1,aliyun.accelerator/nvidia_mem=16276MiB,aliyun.accelerator/nvidia_name=Tesla-P100-PCIE-16GB, ..
  11. View the status of the GPU node by running the following command. The nvidia.com/gpu values under allocatable and capacity indicate the number of GPUs that pods can request on this node; a sample workload that requests this resource is shown after this procedure.
     $ kubectl get node ${node_name} -o=yaml
     ...
     status:
       addresses:
       - address: 172.16.166.23
         type: InternalIP
       allocatable:
         cpu: "8"
         memory: 61578152Ki
         nvidia.com/gpu: "1"
         pods: "110"
       capacity:
         cpu: "8"
         memory: 61680552Ki
         nvidia.com/gpu: "1"
         pods: "110"
     ...

The Kubernetes GPU cluster is created.
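
To confirm that containers can access the GPUs, you can run a short test pod. The following manifest is a minimal sketch: the pod name and image tag are examples only, and the nodeSelector value must match the aliyun.accelerator/nvidia_name label shown in step 10. Because the node provides the NVIDIA drivers through nvidia-container-runtime, nvidia-smi is expected to be available inside the container without installing anything in the image.

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-smoke-test                 # example name
    spec:
      restartPolicy: Never
      nodeSelector:
        # Pin the pod to the GPU model reported by the node label in step 10.
        aliyun.accelerator/nvidia_name: Tesla-P100-PCIE-16GB
      containers:
      - name: nvidia-smi
        image: nvidia/cuda:9.0-base        # example CUDA base image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1              # request one GPU

Deploy the pod and check its output:

    $ kubectl apply -f gpu-smoke-test.yaml
    $ kubectl logs gpu-smoke-test

If the GPU is accessible, the log lists the Tesla P100 card.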