To run GPU computing tasks in a Kubernetes cluster, you can schedule applications to nodes that are equipped with GPUs. Adding labels to these nodes makes the scheduling process simple and efficient.

Background information

When nodes equipped with NVIDIA GPUs are deployed, the Kubernetes cluster discovers the GPU attributes of these nodes and exposes them as node labels. Node labels provide the following benefits:

  • You can quickly filter GPU nodes by labels.
  • You can use labels as scheduling conditions when you deploy applications.
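
For example, assuming GPU nodes carry the aliyun.accelerator labels described later in this topic, you can use a label existence selector to list all GPU nodes:

    # kubectl get nodes -l aliyun.accelerator/nvidia_name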

Procedure

  1. Log on to the Container Service console.
  2. In the left-side navigation pane, choose Clusters > Nodes and select a cluster to view nodes in the cluster.
    Note In this example, a cluster that contains three worker nodes is selected, and two of the worker nodes are equipped with GPUs. Take note of the node IP addresses.
    View a node
  3. Select a GPU node and choose More > Details in the Actions column to go to the Kubernetes dashboard, where you can view the labels on the node.
    Node details

    You can also log on to a master node and run the following command to view the labels on GPU nodes:

    # kubectl get nodes
    NAME                                STATUS    ROLES     AGE       VERSION
    cn-beijing.i-2ze2dy2h9w97v65uuaft   Ready     master    2d        v1.11.2
    cn-beijing.i-2ze8o1a45qdv5q8a7luz   Ready     <none>    2d        v1.11.2             # Compare the nodes listed here with those displayed in the console to identify the GPU nodes.
    cn-beijing.i-2ze8o1a45qdv5q8a7lv0   Ready     <none>    2d        v1.11.2
    cn-beijing.i-2ze9xylyn11vop7g5bwe   Ready     master    2d        v1.11.2
    cn-beijing.i-2zed5sw8snjniq6mf5e5   Ready     master    2d        v1.11.2
    cn-beijing.i-2zej9s0zijykp9pwf7lu   Ready     <none>    2d        v1.11.2

    Select a GPU node and run the following command to view its labels:

    # kubectl describe node cn-beijing.i-2ze8o1a45qdv5q8a7luz
    Name:               cn-beijing.i-2ze8o1a45qdv5q8a7luz
    Roles:              <none>
    Labels:             aliyun.accelerator/nvidia_count=1                          # This label is important.
                        aliyun.accelerator/nvidia_mem=12209MiB
                        aliyun.accelerator/nvidia_name=Tesla-M40
                        beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/instance-type=ecs.gn4-c4g1.xlarge
                        beta.kubernetes.io/os=linux
                        failure-domain.beta.kubernetes.io/region=cn-beijing
                        failure-domain.beta.kubernetes.io/zone=cn-beijing-a
                        kubernetes.io/hostname=cn-beijing.i-2ze8o1a45qdv5q8a7luz
     ......

    In this example, the GPU node has the following three labels:

    Key                               Description
    aliyun.accelerator/nvidia_count   The number of GPUs on the node.
    aliyun.accelerator/nvidia_mem     The GPU memory, in MiB.
    aliyun.accelerator/nvidia_name    The model name of the NVIDIA GPU.

    GPU nodes of the same instance type have the same GPU model. Therefore, you can use the aliyun.accelerator/nvidia_name label to filter nodes:

    # kubectl get no -l aliyun.accelerator/nvidia_name=Tesla-M40
    NAME                                STATUS    ROLES     AGE       VERSION
    cn-beijing.i-2ze8o1a45qdv5q8a7luz   Ready     <none>    2d        v1.11.2
    cn-beijing.i-2ze8o1a45qdv5q8a7lv0   Ready     <none>    2d        v1.11.2
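
    You can filter on the other GPU labels in the same way. For example, assuming the labels shown above, the following command lists nodes that have exactly one GPU:

    # kubectl get nodes -l aliyun.accelerator/nvidia_count=1
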
  4. Go to the homepage of the Container Service console. In the left-side navigation pane, choose Applications > Deployments and click Create from Template in the upper-right corner.
    1. Create a TensorFlow application and schedule this application to a GPU node.
      Create an application
      This example uses the following YAML template:
      ---
      # Define the tensorflow deployment
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: tf-notebook
        labels:
          app: tf-notebook
      spec:
        replicas: 1
        selector: # define how the Deployment finds the Pods it manages
          matchLabels:
            app: tf-notebook
        template: # define the Pod specification
          metadata:
            labels:
              app: tf-notebook
          spec:
            nodeSelector:                                                  # This field is important.
              aliyun.accelerator/nvidia_name: Tesla-M40
            containers:
            - name: tf-notebook
              image: tensorflow/tensorflow:1.4.1-gpu-py3
              resources:
                limits:
                  nvidia.com/gpu: 1                                        # This field is important.
              ports:
              - containerPort: 8888
                hostPort: 8888
              env:
                - name: PASSWORD
                  value: mypassw0rdv
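
      After the Deployment is created, you can check which node the Pod was scheduled to. The following command is a sketch; in its output, the NODE column should show one of the Tesla-M40 nodes identified above:

      # kubectl get pods -l app=tf-notebook -o wide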
    2. You can also prevent an application from being deployed to GPU nodes. The following example deploys an Nginx Pod and uses node affinity to schedule the Pod only to nodes without GPUs. For more information, see the node affinity section in Create deployments by using images.

      This example uses the following YAML template:

      apiVersion: v1
      kind: Pod
      metadata:
        name: not-in-gpu-node
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: aliyun.accelerator/nvidia_name
                  operator: DoesNotExist
        containers:
        - name: not-in-gpu-node
          image: nginx
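
      To confirm that the Pod was kept off the GPU nodes, check its node assignment in the same way. This sketch assumes the Pod name from the template above:

      # kubectl get pod not-in-gpu-node -o wide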
  5. In the left-side navigation pane, choose Applications > Pods. Select the cluster and namespace to go to the Pods page.
    View Pods

Result

On the Pods page, you can see that the two Pods from the preceding examples have been scheduled to the target nodes. This shows that node labels allow you to easily schedule Pods to specific GPU nodes.
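
You can also verify the result from the command line. Assuming both examples run in the default namespace, the NODE column in the output of the following command shows that tf-notebook runs on a Tesla-M40 node and not-in-gpu-node runs on a node without GPUs:

    # kubectl get pods -o wide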