Beginning with version 1.8, Kubernetes supports hardware acceleration devices such as NVIDIA GPUs, InfiniBand, and FPGAs through device plugins. In addition, the open source Kubernetes community's GPU solution is deprecated in version 1.10 and is removed from the master code in version 1.11.
We recommend that you use an Alibaba Cloud Kubernetes cluster combined with GPU instances to run compute-intensive tasks such as machine learning and image processing. With this method, you can implement one-click deployment, elastic scaling, and other functions without installing NVIDIA drivers or the Compute Unified Device Architecture (CUDA) toolkit beforehand.
During cluster creation, Container Service performs the following operations:
- Creates Elastic Compute Service (ECS) instances, sets the public key used for SSH logon from the management node to other nodes, and installs and configures the Kubernetes cluster by using CloudInit.
- Creates a security group that allows inbound ICMP access in the VPC.
- Creates a new VPC and VSwitch if you do not use an existing VPC, and also creates an SNAT entry for the VSwitch.
- Creates VPC routing rules.
- Creates a NAT gateway and Elastic IP (EIP).
- Creates a Resource Access Management (RAM) user and AccessKey (AK). This RAM user has the permissions to query, create, and delete ECS instances, add and delete cloud disks, and all relevant access permissions for Server Load Balancer (SLB) instances, CloudMonitor, VPC, Log Service, and Network Attached Storage (NAS) services. The Kubernetes cluster dynamically creates the SLB instances, cloud disks, and VPC routing rules according to your configurations.
- Creates an intranet SLB instance and exposes port 6443.
- Creates an Internet SLB instance and exposes ports 6443, 8443, and 22. (If you enable the SSH logon for Internet access when creating the cluster, port 22 is exposed. Otherwise, port 22 is not exposed.)
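After the cluster is created, a quick way to confirm that the Internet SLB instance exposes the expected ports is a plain TCP reachability check. The following is a minimal sketch, not part of any Alibaba Cloud tooling; the SLB address in the commented call is a placeholder that you must replace with your own:

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Replace the placeholder with the address of the Internet SLB instance
# created for your cluster, then check the API server port:
# print(port_open("your-slb-address", 6443))
```

If the check returns False for port 22, verify whether you enabled SSH logon for Internet access when you created the cluster.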
Before you begin, make sure that you have activated Container Service, Resource Orchestration Service (ROS), and Resource Access Management (RAM).
- The SLB instance created with the Kubernetes cluster supports only the Pay-As-You-Go billing method.
- The Kubernetes cluster supports only Virtual Private Cloud (VPC).
- By default, each account has a quota on the number of cloud resources that it can create. If the quota is reached, the account cannot create a cluster. Make sure that you have sufficient resource quota before creating a cluster. You can open a ticket to increase your quota.
- By default, each account can create up to 5 clusters across all regions and add up to 40 nodes to each cluster. You can open a ticket to create more clusters or nodes.
- By default, each account can create up to 100 security groups.
- By default, each account can create up to 60 Pay-As-You-Go SLB instances.
- By default, each account can create up to 20 EIPs.
- The limits for ECS instances are as follows:
- Only the CentOS operating system is supported.
- Only Pay-As-You-Go ECS instances can be created.
Note: After creating an instance, you can switch from Pay-As-You-Go to Subscription billing in the ECS console.
Create a GN5 Kubernetes cluster
- Log on to the Container Service console.
- In the left-side navigation pane under Kubernetes, click Clusters.
- Click Create Kubernetes Cluster in the upper-right corner.
By default, the Create Kubernetes Cluster page is displayed.
Note: In this example, Worker nodes use GPU ECS instances to create a GPU cluster. For information about other parameter settings, see Create a Kubernetes cluster.
- Set the Worker nodes. In this example, the gn5 GPU instance type is selected to set Worker nodes as GPU worker nodes.
- If you choose to create Worker instances, you must select the instance type and the number of Worker nodes. In this example, two GPU nodes are created.
- If you choose to add existing instances, you need to have already created GPU cloud servers in the same region where the cluster is to be created.
- After you have completed all required settings, click Create to start cluster deployment.
- After the cluster is created, choose in the left-side navigation pane.
- To view the GPU devices mounted to either of the created nodes, select the created cluster from the clusters drop-down list, select one of the created Worker nodes, and choose in the action column.
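Besides the console, you can confirm from the command line that the device plugin has registered the GPUs on each node, for example by inspecting the output of `kubectl get nodes -o json`. The following sketch parses a trimmed sample of that output (the node names, field subset, and values shown are illustrative, not from a real cluster) and reports the `nvidia.com/gpu` capacity per node:

```python
import json

# Trimmed sample of the fields returned by `kubectl get nodes -o json`;
# only the fields read by the check below are kept, and values are illustrative.
SAMPLE = """
{
  "items": [
    {
      "metadata": {"name": "cn-hangzhou.worker-1"},
      "status": {"capacity": {"cpu": "4", "nvidia.com/gpu": "1"}}
    },
    {
      "metadata": {"name": "cn-hangzhou.worker-2"},
      "status": {"capacity": {"cpu": "4", "nvidia.com/gpu": "1"}}
    }
  ]
}
"""

def gpu_capacity(nodes_json):
    """Map node name -> number of GPUs advertised by the device plugin."""
    nodes = json.loads(nodes_json)["items"]
    return {
        n["metadata"]["name"]: int(n["status"]["capacity"].get("nvidia.com/gpu", "0"))
        for n in nodes
    }

print(gpu_capacity(SAMPLE))
```

A node that reports a `nvidia.com/gpu` capacity of 0 (or omits the field) has no GPUs registered, which usually means the device plugin is not running there.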
Create a GPU experimental environment to run TensorFlow
Jupyter is a popular tool that data scientists use to build experimental environments for TensorFlow. This topic describes an example of how to deploy a Jupyter application.
- Log on to the Container Service console.
- In the left-side navigation pane under Kubernetes, choose .
- Click Create by Template in the upper-right corner.
- Select the target cluster and namespace, and then select a sample template or a custom template from the resource type drop-down list. Then, orchestrate your template. In this example, a Jupyter application template is orchestrated. The template includes a deployment and a service.
```yaml
---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            nvidia.com/gpu: 1 # specify the number of NVIDIA GPUs that are called by the application
        ports:
        - containerPort: 8888
          hostPort: 8888
        env:
        - name: PASSWORD # specify the password used to access the Jupyter service. You can modify the password as needed.
          value: mypassw0rd
# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook
  type: LoadBalancer # set Alibaba Cloud SLB service for the application so that its services are accessible from the Internet
```
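Before pasting the template into the console, you can sanity-check it locally. The sketch below assumes the PyYAML package is installed; the `check_manifest` helper and the trimmed copy of the manifest are ours for illustration, not part of any Alibaba Cloud tooling. It parses the multi-document manifest and confirms the resource kinds and the `nvidia.com/gpu` limit:

```python
import yaml  # PyYAML, assumed installed (pip install pyyaml)

# Trimmed copy of the template above, kept to the fields the check reads.
MANIFEST = """
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook
spec:
  template:
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
spec:
  type: LoadBalancer
"""

def check_manifest(text):
    """Parse a multi-document manifest; return (resource kinds, GPU limit)."""
    docs = [d for d in yaml.safe_load_all(text) if d]
    kinds = [d["kind"] for d in docs]
    deployment = next(d for d in docs if d["kind"] == "Deployment")
    container = deployment["spec"]["template"]["spec"]["containers"][0]
    gpu_limit = container["resources"]["limits"]["nvidia.com/gpu"]
    return kinds, gpu_limit

kinds, gpu = check_manifest(MANIFEST)
print(kinds, gpu)
```

If the `nvidia.com/gpu` limit is missing, the pod is scheduled without any GPU devices, so this is the first field to verify when the notebook later reports no GPUs.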
If you use a GPU deployment solution of Kubernetes earlier than 1.9.3, you must define the following volumes in which the NVIDIA drivers reside:
```yaml
volumes:
- hostPath:
    path: /usr/lib/nvidia-375/bin
  name: bin
- hostPath:
    path: /usr/lib/nvidia-375
  name: lib
```
When you orchestrate a deployment template for a cluster that uses a Kubernetes GPU solution earlier than 1.9.3, the template is tightly coupled to that cluster and is therefore not portable. In Kubernetes 1.9.3 and later, you do not need to specify these hostPaths because the NVIDIA device plugin automatically discovers the library links and execution files required by the drivers.
- In the left-side navigation pane under Container Service-Kubernetes, choose . Then, select the target cluster and namespace, and then view the external endpoint of the tf-notebook service.
- Access the Jupyter application in a browser. The access address is http://EXTERNAL-IP. You must enter the password set in the template.
- Run the following program to verify that this Jupyter application can use the GPU. The program lists all devices that TensorFlow can use:

```python
from tensorflow.python.client import device_lib

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

print(get_available_devices())
```