To use Container Service for Kubernetes (ACK) clusters for GPU computing, you must schedule pods to GPU-accelerated nodes. ACK adds labels to GPU-accelerated nodes, and you can use these labels to schedule pods to specific GPU-accelerated nodes.
Before you begin, make sure that the following requirements are met:
- An ACK cluster is created and GPU-accelerated nodes are added to the cluster. For more information, see GPU scheduling for ACK clusters with GPU-accelerated nodes.
- You are connected to a master node. You can check information such as node labels on the master node. For more information, see Use kubectl to connect to an ACK cluster.
When ACK deploys nodes with NVIDIA GPUs, the attributes of these GPUs are discovered and exposed as node labels. These labels have the following benefits:
- You can use the labels to filter GPU-accelerated nodes.
- The labels can be used as conditions to schedule pods.
To view the labels of a GPU-accelerated node in the ACK console, perform the following steps:
- Log on to the Container Service for Kubernetes (ACK) console.
- In the left-side navigation pane, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
- In the left-side navigation pane of the details page, choose .
- On the Nodes page, find the GPU-accelerated node that you want to manage, and choose in the Actions column. Check the labels of the GPU-accelerated node.
You can also log on to a master node and use kubectl to view the labels of GPU-accelerated nodes. Run the following command to list the nodes in the cluster:
```
kubectl get nodes
```
Response:
```
NAME                                STATUS    ROLES     AGE       VERSION
cn-beijing.i-2ze2dy2h9w97v65uuaft   Ready     master    2d        v1.11.2
cn-beijing.i-2ze8o1a45qdv5q8a7luz   Ready     <none>    2d        v1.11.2
cn-beijing.i-2ze8o1a45qdv5q8a7lv0   Ready     <none>    2d        v1.11.2
cn-beijing.i-2ze9xylyn11vop7g5bwe   Ready     master    2d        v1.11.2
cn-beijing.i-2zed5sw8snjniq6mf5e5   Ready     master    2d        v1.11.2
cn-beijing.i-2zej9s0zijykp9pwf7lu   Ready     <none>    2d        v1.11.2
```
Compare these nodes with the nodes that are displayed in the ACK console to identify the GPU-accelerated nodes. Select a GPU-accelerated node and run the following command to query the labels of the node:
```
kubectl describe node cn-beijing.i-2ze8o1a45qdv5q8a7luz
```
Response:
```
Name:               cn-beijing.i-2ze8o1a45qdv5q8a7luz
Roles:              <none>
Labels:             aliyun.accelerator/nvidia_count=1          #Pay attention to this field.
                    aliyun.accelerator/nvidia_mem=12209MiB
                    aliyun.accelerator/nvidia_name=Tesla-M40
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=ecs.gn4-c4g1.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=cn-beijing
                    failure-domain.beta.kubernetes.io/zone=cn-beijing-a
                    kubernetes.io/hostname=cn-beijing.i-2ze8o1a45qdv5q8a7luz
......
```
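If you only have the human-readable `kubectl describe node` output, you can still pull the labels out of it. The following Python sketch does this. It assumes the layout shown above: the first label follows `Labels:` on the same line, later labels sit on indented continuation lines, and the next unindented field ends the section. The function name `parse_describe_labels` is hypothetical, and the sample text is copied from the response above.

```python
def parse_describe_labels(describe_output: str) -> dict[str, str]:
    """Return the key=value pairs listed under "Labels:" in `kubectl describe node` output."""
    labels: dict[str, str] = {}
    in_labels = False
    for raw_line in describe_output.splitlines():
        if raw_line.startswith("Labels:"):
            in_labels = True
            entry = raw_line[len("Labels:"):].strip()
        elif in_labels and raw_line[:1].isspace():
            # Continuation lines of the Labels field are indented.
            entry = raw_line.strip()
        else:
            # Any other top-level field (or a blank line) ends the Labels section.
            in_labels = False
            continue
        if "=" in entry:  # skips placeholders such as "<none>"
            key, value = entry.split("=", 1)
            labels[key] = value
    return labels


# Sample copied from the describe output above (without the trailing "......").
SAMPLE = """\
Name:               cn-beijing.i-2ze8o1a45qdv5q8a7luz
Roles:              <none>
Labels:             aliyun.accelerator/nvidia_count=1
                    aliyun.accelerator/nvidia_mem=12209MiB
                    aliyun.accelerator/nvidia_name=Tesla-M40
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=ecs.gn4-c4g1.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=cn-beijing
                    failure-domain.beta.kubernetes.io/zone=cn-beijing-a
                    kubernetes.io/hostname=cn-beijing.i-2ze8o1a45qdv5q8a7luz
"""

if __name__ == "__main__":
    for key, value in parse_describe_labels(SAMPLE).items():
        print(f"{key} = {value}")
```

For scripts, `kubectl get node <name> -o json` or `kubectl get nodes --show-labels` returns the labels in a more stable form. Those commands are usually a safer input than describe output, which is formatted for people to read.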
In this example, the following labels are added to the GPU-accelerated node:
- aliyun.accelerator/nvidia_count: The number of GPUs.
- aliyun.accelerator/nvidia_mem: The size of the GPU memory. Unit: MiB.
- aliyun.accelerator/nvidia_name: The name of the NVIDIA GPU.
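All label values are strings, so a script that reasons about GPU capacity has to convert them first. The following Python sketch does that conversion. It assumes the value formats shown in the example output: an integer count, and a memory size that ends with the `MiB` suffix. The names `GpuInfo` and `gpu_info_from_labels` are hypothetical. Nodes that lack the `aliyun.accelerator/nvidia_name` label are treated as non-GPU nodes.

```python
from dataclasses import dataclass
from typing import Optional

COUNT_KEY = "aliyun.accelerator/nvidia_count"
MEM_KEY = "aliyun.accelerator/nvidia_mem"
NAME_KEY = "aliyun.accelerator/nvidia_name"


@dataclass(frozen=True)
class GpuInfo:
    count: int       # value of aliyun.accelerator/nvidia_count
    memory_mib: int  # value of aliyun.accelerator/nvidia_mem without the "MiB" suffix
    name: str        # value of aliyun.accelerator/nvidia_name


def gpu_info_from_labels(labels: dict[str, str]) -> Optional[GpuInfo]:
    """Return the GPU attributes of a node, or None if the node has no NVIDIA GPU name label."""
    if NAME_KEY not in labels:
        return None
    memory = labels.get(MEM_KEY, "0MiB")
    if not memory.endswith("MiB"):
        # Assumption: ACK always reports this label in MiB, as documented above.
        raise ValueError(f"expected a value in MiB, got {memory!r}")
    return GpuInfo(
        count=int(labels.get(COUNT_KEY, "0")),
        memory_mib=int(memory[: -len("MiB")]),
        name=labels[NAME_KEY],
    )


if __name__ == "__main__":
    node_labels = {COUNT_KEY: "1", MEM_KEY: "12209MiB", NAME_KEY: "Tesla-M40"}
    print(gpu_info_from_labels(node_labels))
```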
GPU-accelerated nodes of the same type have the same GPU name, so you can use the aliyun.accelerator/nvidia_name label to find all nodes that use a specific GPU model. Run the following command:
```
kubectl get no -l aliyun.accelerator/nvidia_name=Tesla-M40
```
Response:
```
NAME                                STATUS    ROLES     AGE       VERSION
cn-beijing.i-2ze8o1a45qdv5q8a7luz   Ready     <none>    2d        v1.11.2
cn-beijing.i-2ze8o1a45qdv5q8a7lv0   Ready     <none>    2d        v1.11.2
```
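The `-l` flag passes a label selector. The API server returns only the nodes whose labels satisfy every comma-separated requirement in that selector. The following Python sketch models that filtering for equality-based requirements (`key=value`, `key==value`, `key!=value`) and existence requirements (`key`, `!key`). It deliberately omits set-based requirements such as `in` and `notin`. The node names and label maps in `NODES` are hypothetical examples, not the nodes of the cluster above.

```python
def matches_selector(labels: dict[str, str], selector: str) -> bool:
    """Evaluate a comma-separated, equality-based label selector against one label map."""
    for requirement in (r.strip() for r in selector.split(",")):
        if not requirement:
            continue
        if "!=" in requirement:
            key, value = (part.strip() for part in requirement.split("!=", 1))
            # "key!=value" also matches objects that do not have the key at all.
            if labels.get(key) == value:
                return False
        elif "=" in requirement:
            key, value = (part.strip() for part in requirement.replace("==", "=", 1).split("=", 1))
            if labels.get(key) != value:
                return False
        elif requirement.startswith("!"):
            # "!key" requires the key to be absent.
            if requirement[1:].strip() in labels:
                return False
        elif requirement not in labels:
            # A bare "key" requires the key to be present with any value.
            return False
    return True


def select_nodes(nodes: dict[str, dict[str, str]], selector: str) -> list[str]:
    """Return the sorted names of the nodes whose labels satisfy the selector."""
    return sorted(name for name, labels in nodes.items() if matches_selector(labels, selector))


# Hypothetical nodes used only to illustrate the selector semantics.
NODES = {
    "gpu-node-1": {"aliyun.accelerator/nvidia_name": "Tesla-M40", "aliyun.accelerator/nvidia_count": "1"},
    "gpu-node-2": {"aliyun.accelerator/nvidia_name": "Tesla-M40", "aliyun.accelerator/nvidia_count": "1"},
    "cpu-node-1": {"kubernetes.io/os": "linux"},
}

if __name__ == "__main__":
    print(select_nodes(NODES, "aliyun.accelerator/nvidia_name=Tesla-M40"))
    print(select_nodes(NODES, "!aliyun.accelerator/nvidia_name"))
```

With the same semantics, `kubectl get nodes -l '!aliyun.accelerator/nvidia_name'` lists the nodes that do not have the GPU name label. In bash, keep the single quotes, because an unquoted `!` triggers history expansion.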
To schedule pods to GPU-accelerated nodes by using these labels, perform the following steps:
- In the left-side navigation pane of the details page, choose . On the Deployments page, click Create from YAML.
- Create a Deployment for a TensorFlow job. The Deployment schedules pods to a GPU-accelerated node. The following YAML template is used to create the Deployment:
```yaml
---
# Define the TensorFlow Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  replicas: 1
  selector: # Define how the Deployment finds the pods that it manages.
    matchLabels:
      app: tf-notebook
  template: # Define the pod specifications.
    metadata:
      labels:
        app: tf-notebook
    spec:
      nodeSelector: # Schedule the pod only to nodes that have this GPU model.
        aliyun.accelerator/nvidia_name: Tesla-M40
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            nvidia.com/gpu: 1 # Request one GPU.
        ports:
        - containerPort: 8888
          hostPort: 8888
        env:
        - name: PASSWORD
          value: mypassw0rdv
```
- You can also keep an application off GPU-accelerated nodes. The following example uses node affinity to schedule an NGINX pod only to nodes that do not have the aliyun.accelerator/nvidia_name label. For more information, see the section that describes node affinity in Use a Deployment to create a stateless application.
The following YAML template is used to schedule the pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: not-in-gpu-node
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: aliyun.accelerator/nvidia_name
            operator: DoesNotExist
  containers:
  - name: not-in-gpu-node
    image: nginx
```
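- In the left-side navigation pane of the details page, choose . On the Pods page, verify that the pods in the preceding examples are scheduled to the expected nodes. The tf-notebook pod runs on a node that has the aliyun.accelerator/nvidia_name=Tesla-M40 label, and the not-in-gpu-node pod runs on a node that does not have the aliyun.accelerator/nvidia_name label. This shows that you can use labels either to schedule pods to GPU-accelerated nodes or to keep pods off them.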
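The placement in the preceding examples comes from two sets of scheduler checks. The tf-notebook pod is constrained by its nodeSelector and its `nvidia.com/gpu` limit. The not-in-gpu-node pod is constrained by its required node affinity. The following Python sketch models only those checks, which lets you reason about which nodes each pod could use. It is a simplified, hypothetical model and not the kube-scheduler implementation. The `Node` class, the pod dictionaries, and node names such as `gpu-node-1` are illustrative. The model ignores taints, hostPort conflicts, and CPU and memory requests, and it does not handle the `Gt` and `Lt` operators.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    labels: dict[str, str]
    allocatable_gpus: int = 0  # allocatable nvidia.com/gpu on the node
    requested_gpus: int = 0    # nvidia.com/gpu already requested by pods on the node


def matches_expression(labels: dict[str, str], expr: dict) -> bool:
    """Evaluate one node affinity matchExpressions entry against a node's labels."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator {op!r} is not handled in this sketch")


def feasible_nodes(nodes: list[Node], pod: dict) -> list[str]:
    """Return the names of the nodes that pass the modeled checks, in input order."""
    selector = pod.get("nodeSelector", {})
    terms = pod.get("requiredNodeSelectorTerms")  # None means no required node affinity
    gpus = pod.get("gpuLimit", 0)
    result = []
    for node in nodes:
        # nodeSelector: every key/value pair must appear on the node.
        if any(node.labels.get(k) != v for k, v in selector.items()):
            continue
        # Required node affinity: terms are ORed, expressions in a term are ANDed,
        # and an empty term matches no nodes.
        if terms is not None and not any(
            term.get("matchExpressions")
            and all(matches_expression(node.labels, e) for e in term["matchExpressions"])
            for term in terms
        ):
            continue
        # Extended resource: free GPUs must cover the request, which defaults to the limit.
        if node.allocatable_gpus - node.requested_gpus < gpus:
            continue
        result.append(node.name)
    return result


# Pod constraints mirroring the two YAML templates above.
TF_NOTEBOOK = {"nodeSelector": {"aliyun.accelerator/nvidia_name": "Tesla-M40"}, "gpuLimit": 1}
NOT_IN_GPU_NODE = {
    "requiredNodeSelectorTerms": [
        {"matchExpressions": [{"key": "aliyun.accelerator/nvidia_name", "operator": "DoesNotExist"}]},
    ],
}

# Hypothetical cluster: gpu-node-2 already has its only GPU in use.
NODES = [
    Node("gpu-node-1", {"aliyun.accelerator/nvidia_name": "Tesla-M40"}, allocatable_gpus=1),
    Node("gpu-node-2", {"aliyun.accelerator/nvidia_name": "Tesla-M40"}, allocatable_gpus=1, requested_gpus=1),
    Node("cpu-node-1", {"kubernetes.io/os": "linux"}),
]

if __name__ == "__main__":
    print("tf-notebook:", feasible_nodes(NODES, TF_NOTEBOOK))
    print("not-in-gpu-node:", feasible_nodes(NODES, NOT_IN_GPU_NODE))
```

The sketch makes one point visible: a matching label alone is not enough for the TensorFlow pod. If every Tesla-M40 node already has its GPUs allocated, the pod stays Pending even though its nodeSelector matches.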