Alibaba Cloud provides Alibaba Cloud AI Containers (AC2) images that are deeply integrated and optimized for use with Alibaba Cloud infrastructure, such as Elastic Compute Service (ECS) and Container Service for Kubernetes (ACK). AC2 images help you greatly reduce the deployment costs of AI application environments. This topic describes how to train PyTorch-based models by using AC2 containers in an ACK cluster.
Create an ACK cluster
To use an AC2 image in ACK, you must first create an ACK cluster that contains at least one available node. For more information, see Create an ACK managed cluster.
When you create a node pool, the default node operating system is Alibaba Cloud Linux 3.2104. For CPU-based nodes that use the x86-64 architecture, you can instead select ContainerOS to achieve faster startup and lower performance overhead.
Connect to and manage the ACK cluster
Use one of the following methods to connect to and manage an ACK cluster:
Use kubectl to connect to the ACK cluster. kubectl is a command-line tool provided by Kubernetes to manage Kubernetes clusters. To connect to a Kubernetes cluster by using kubectl, you must install a kubectl client on your machine and determine whether to connect to the cluster over the Internet or the internal network. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
Use kubectl on Cloud Shell to connect to the ACK cluster. Cloud Shell is a web-based command-line tool provided by Alibaba Cloud. When you start Cloud Shell for an ACK cluster in the ACK console, Cloud Shell installs kubectl and loads the kubeconfig file of the cluster. For more information, see Use kubectl on Cloud Shell to manage ACK clusters.
Use AC2 images to train models
AC2 provides images for multiple training frameworks. The images integrate different AI runtime frameworks, such as PyTorch and TensorFlow, and ship with verified drivers, such as NVIDIA and Compute Unified Device Architecture (CUDA) drivers, and with acceleration libraries for different runtime platforms. This reduces the time that developers spend on environment deployment.
PyTorch framework images
Run a PyTorch CPU image to train a model
On the client machine on which kubectl is installed, create a pod manifest file.
Create and open a file.
vim pytorch-training-cpu.yaml
Press the I key to enter Insert mode and paste the following content into the file. In the file, specify an AC2 image and the commands that you want to run after the image is pulled.
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-training-cpu
  namespace: default
spec:
  restartPolicy: OnFailure
  containers:
  - name: pytorch-training
    image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.3.0-alinux3.2304
    command:
    - "/bin/sh"
    - "-c"
    args:
    - "git clone https://github.com/pytorch/examples.git && python3 examples/mnist/main.py --no-cuda"
    workingDir: /root
Press the Esc key to exit Insert mode, enter :wq, and then press the Enter key to save and close the file.
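As an alternative to editing the file by hand, the same pod specification can be generated with a short script. The following is a minimal sketch, assuming Python 3 is available on the client machine. It relies on the fact that kubectl accepts JSON manifests as well as YAML, so no YAML library is needed. The helper name build_training_pod is hypothetical and is not part of AC2 or ACK.

```python
import json


def build_training_pod(name, image, script_args=""):
    """Return a Pod manifest equivalent to the YAML file shown above.

    build_training_pod is a hypothetical helper used only for illustration.
    """
    command = (
        "git clone https://github.com/pytorch/examples.git && "
        "python3 examples/mnist/main.py " + script_args
    ).strip()
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": "pytorch-training",
                    "image": image,
                    "command": ["/bin/sh", "-c"],
                    "args": [command],
                    "workingDir": "/root",
                }
            ],
        },
    }


pod = build_training_pod(
    "pytorch-training-cpu",
    "ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.3.0-alinux3.2304",
    "--no-cuda",
)

# Print the manifest as JSON so that it can be saved or piped to kubectl.
print(json.dumps(pod, indent=2))
```

If the script is saved as make_pod.py (a hypothetical file name), running python3 make_pod.py | kubectl create -f - submits the manifest without an intermediate file.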
Use kubectl to create a training pod in the ACK cluster that you created.
After the pod starts, it downloads the PyTorch sample code and runs it to train a model on the MNIST dataset.
kubectl create -f pytorch-training-cpu.yaml
The following command output indicates that the pod is created:
pod/pytorch-training-cpu created
Use kubectl to check the status of the pod.
kubectl get pods
The following command output is returned. In the command output, the name of the created training pod is pytorch-training-cpu. The first time a container starts, the image is pulled from the AC2 registry. In this case, ContainerCreating is displayed in the STATUS column. Repeat the preceding command until Running is displayed in the STATUS column.
NAME                   READY   STATUS    RESTARTS   AGE
pytorch-training-cpu   1/1     Running   0          5m42s
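Instead of repeating the command manually, the check can be automated. The following is a rough sketch, assuming Python 3 and a configured kubectl on the client machine. It polls the pod phase rather than the STATUS column: the STATUS column is a human-readable summary, in which ContainerCreating corresponds to the Pending phase and Completed corresponds to the Succeeded phase. The function names, the 5-second interval, and the 600-second timeout are illustrative assumptions.

```python
import subprocess
import time


def get_pod_phase(name, namespace="default"):
    """Ask kubectl for the pod phase (Pending, Running, Succeeded, Failed, Unknown)."""
    result = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace,
         "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_phase(name, targets=("Running", "Succeeded"), interval=5.0,
                   timeout=600.0, get_phase=get_pod_phase, sleep=time.sleep):
    """Poll until the pod reaches one of the target phases.

    Raises RuntimeError if the pod fails and TimeoutError if the
    accumulated waiting time reaches the timeout first.
    """
    waited = 0.0
    while True:
        phase = get_phase(name)
        if phase in targets:
            return phase
        if phase == "Failed":
            raise RuntimeError(f"pod {name} failed")
        if waited >= timeout:
            raise TimeoutError(f"pod {name} still {phase!r} after {timeout}s")
        sleep(interval)
        waited += interval
```

On a machine that can reach the cluster, wait_for_phase("pytorch-training-cpu") returns once the pod is running or has already finished. The get_phase and sleep parameters exist only so that the loop can be exercised without a cluster.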
Run the following command to view the training outputs of the pod:
kubectl logs pytorch-training-cpu
The following sample logs are displayed:
Cloning into 'examples'...
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
...
Train Epoch: 1 [0/60000 (0%)] Loss: 2.305400
Train Epoch: 1 [640/60000 (1%)] Loss: 1.359776
...
Train Epoch: 14 [58880/60000 (98%)] Loss: 0.011213
Train Epoch: 14 [59520/60000 (99%)] Loss: 0.000181
Test set: Average loss: 0.0271, Accuracy: 9912/10000 (99%)
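To track the loss and accuracy without reading the log by eye, the log text can be parsed. The following is a small sketch, assuming the log lines follow the format of the sample logs above. The function name parse_training_log and the regular expressions are illustrative and are not part of the PyTorch examples.

```python
import re

# Patterns modeled on the sample log lines; \s+ tolerates tabs or spaces.
TRAIN_RE = re.compile(
    r"Train Epoch: (\d+) \[(\d+)/(\d+) \((\d+)%\)\]\s+Loss: ([\d.]+)"
)
TEST_RE = re.compile(r"Test set: Average loss: ([\d.]+), Accuracy: (\d+)/(\d+)")


def parse_training_log(text):
    """Return (steps, tests) extracted from the training log text.

    steps is a list of (epoch, samples_seen, loss) tuples.
    tests is a list of dicts with the average loss and accuracy ratio.
    """
    steps = [
        (int(m.group(1)), int(m.group(2)), float(m.group(5)))
        for m in TRAIN_RE.finditer(text)
    ]
    tests = [
        {"avg_loss": float(m.group(1)),
         "accuracy": int(m.group(2)) / int(m.group(3))}
        for m in TEST_RE.finditer(text)
    ]
    return steps, tests


# Lines copied from the sample logs above.
sample = """\
Train Epoch: 1 [0/60000 (0%)] Loss: 2.305400
Train Epoch: 14 [59520/60000 (99%)] Loss: 0.000181
Test set: Average loss: 0.0271, Accuracy: 9912/10000 (99%)
"""
steps, tests = parse_training_log(sample)
print("last training step:", steps[-1])
print("last test result:", tests[-1])
```

To apply this to a real run, save the log with kubectl logs pytorch-training-cpu > train.log (train.log is a hypothetical file name) and pass the file contents to parse_training_log.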
After the training is complete, check the status of the pod. If Completed is displayed in the STATUS column, run the following command to delete the pod from the ACK cluster:
kubectl delete pods pytorch-training-cpu
The following command output indicates that the pod is deleted:
pod "pytorch-training-cpu" deleted
Run a PyTorch GPU image to train a model
On the client machine on which kubectl is installed, create a pod manifest file.
Create and open a file.
vim pytorch-training-gpu.yaml
Press the I key to enter Insert mode and paste the following content into the file. In the file, add the configurations to apply for GPU resources, specify an AC2 image, and specify the commands that you want to run after the image is pulled.
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-training-gpu
  namespace: default
spec:
  restartPolicy: OnFailure
  containers:
  - name: pytorch-training
    image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.3.0-cuda12.1.1-alinux3.2304
    command:
    - "/bin/sh"
    - "-c"
    args:
    - "git clone https://github.com/pytorch/examples.git && python3 examples/mnist/main.py"
    resources:
      limits:
        nvidia.com/gpu: 1
    workingDir: /root
Press the Esc key to exit Insert mode, enter :wq, and then press the Enter key to save and close the file.
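The nvidia.com/gpu: 1 limit can be satisfied only by a node that advertises that resource. If no node does, the pod remains in the Pending state. The following is a minimal sketch of a pre-check, assuming Python 3 on the client machine. It reads the text printed by kubectl get nodes -o json and reports the allocatable GPU count of each node. The function name and the node names in the sample are hypothetical, and the count reflects total allocatable GPUs, not GPUs that are currently free.

```python
import json


def gpu_capacity_by_node(nodes_json, resource="nvidia.com/gpu"):
    """Map node name -> allocatable count of the given extended resource.

    nodes_json is the text printed by `kubectl get nodes -o json`.
    Nodes that do not advertise the resource are omitted.
    """
    capacity = {}
    for node in json.loads(nodes_json).get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports resource quantities as strings, for example "1".
        count = int(allocatable.get(resource, "0"))
        if count > 0:
            capacity[node["metadata"]["name"]] = count
    return capacity


# Hypothetical, heavily trimmed sample of `kubectl get nodes -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "4"}}},
    ]
})
print(gpu_capacity_by_node(sample))
```

An empty result for a real cluster suggests that a GPU-accelerated node should be added to the node pool before the GPU training pod is created.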
Use kubectl to create a training pod in the ACK cluster that you created.
After the pod starts, it downloads the PyTorch sample code and runs it to train a model on the MNIST dataset.
kubectl create -f pytorch-training-gpu.yaml
The following command output indicates that the pod is created:
pod/pytorch-training-gpu created
Use kubectl to check the status of the pod.
kubectl get pods
The following command output is returned. In the command output, the name of the created training pod is pytorch-training-gpu. The first time a container starts, the image is pulled from the AC2 registry. In this case, ContainerCreating is displayed in the STATUS column. Repeat the preceding command until Running is displayed in the STATUS column.
NAME                   READY   STATUS    RESTARTS   AGE
pytorch-training-gpu   1/1     Running   0          5m42s
Run the following command to view the training outputs of the pod:
kubectl logs pytorch-training-gpu
The following sample logs are displayed:
Cloning into 'examples'...
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
...
Train Epoch: 1 [0/60000 (0%)] Loss: 2.282550
Train Epoch: 1 [640/60000 (1%)] Loss: 1.384815
...
Train Epoch: 14 [58880/60000 (98%)] Loss: 0.001355
Train Epoch: 14 [59520/60000 (99%)] Loss: 0.002194
Test set: Average loss: 0.0273, Accuracy: 9915/10000 (99%)
After the training is complete, check the status of the pod. If Completed is displayed in the STATUS column, run the following command to delete the pod from the ACK cluster:
kubectl delete pods pytorch-training-gpu
The following command output indicates that the pod is deleted:
pod "pytorch-training-gpu" deleted