Alibaba Cloud Linux: Use AC2 images in ACK clusters

Last Updated: Dec 21, 2024

Alibaba Cloud provides Alibaba Cloud AI Containers (AC2) images that are deeply integrated with and optimized for Alibaba Cloud infrastructure, such as Elastic Compute Service (ECS) and Container Service for Kubernetes (ACK). AC2 images greatly reduce the cost of deploying AI application environments. This topic describes how to train PyTorch-based models by using AC2 containers in an ACK cluster.

Create an ACK cluster

To use an AC2 image in ACK, you must first create an ACK cluster that contains at least one available node. For more information, see Create an ACK managed cluster.

Note

When you create a node pool, the default operating system for nodes is Alibaba Cloud Linux 3.2104. For CPU-based nodes that use the x86-64 architecture, you can use ContainerOS as the operating system to achieve faster startup and lower performance overhead.

Connect to and manage the ACK cluster

Use one of the following methods to connect to and manage an ACK cluster. A quick way to verify the connection is shown after this list:

  • Use kubectl to connect to the ACK cluster. kubectl is a command-line tool provided by Kubernetes to manage Kubernetes clusters. To connect to a Kubernetes cluster by using kubectl, you must install a kubectl client on your machine and determine whether to connect to the cluster over the Internet or the internal network. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.

  • Use kubectl on Cloud Shell to connect to the ACK cluster. Cloud Shell is a web-based command-line tool provided by Alibaba Cloud. When you start Cloud Shell for an ACK cluster in the ACK console, Cloud Shell installs kubectl and loads the kubeconfig file of the cluster. For more information, see Use kubectl on Cloud Shell to manage ACK clusters.
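
After you connect by using either method, you can run standard kubectl commands to confirm that the connection works and that the cluster contains available nodes. For example:

  # Display the endpoint of the cluster that kubectl is connected to
  kubectl cluster-info

  # List the nodes in the cluster and confirm that at least one node is in the Ready state
  kubectl get nodes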

Use AC2 images to train models

AC2 provides images for multiple training frameworks. The images integrate AI runtime frameworks such as PyTorch and TensorFlow and come with verified NVIDIA and Compute Unified Device Architecture (CUDA) drivers and acceleration libraries for different runtime platforms. This reduces the amount of time that developers spend on environment deployment.
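
Before you create a training pod, you can optionally start a one-off pod to confirm that the AC2 PyTorch image can be pulled and to check the PyTorch version that it ships. The following is a minimal sketch that uses the CPU image tag from the examples later in this topic and assumes that python3 and PyTorch are on the default path inside the image, as in those examples:

  # Start a temporary pod from the AC2 PyTorch CPU image, print the PyTorch
  # version, and delete the pod after the command finishes
  kubectl run ac2-pytorch-check --rm -it --restart=Never \
    --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.3.0-alinux3.2304 \
    -- python3 -c "import torch; print(torch.__version__)"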

PyTorch framework images

Run a PyTorch CPU image to train a model

  1. On the client machine on which kubectl is installed, create a pod file.

    1. Create and open a file.

      vim pytorch-training-cpu.yaml
    2. Press the I key to enter Insert mode and paste the following content into the file.

      In the file, specify an AC2 image and the commands that you want to run after the image is pulled.

      apiVersion: v1
      kind: Pod
      metadata:
        name: pytorch-training-cpu
        namespace: default
      spec:
        restartPolicy: OnFailure
        containers:
        - name: pytorch-training
          # AC2 PyTorch CPU image based on Alibaba Cloud Linux 3
          image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.3.0-alinux3.2304
          command:
            - "/bin/sh"
            - "-c"
          args:
          # Clone the PyTorch examples repository and train the MNIST model on the CPU
          - "git clone https://github.com/pytorch/examples.git && python3 examples/mnist/main.py --no-cuda"
          workingDir: /root
    3. Press the Esc key to exit Insert mode, enter :wq, and then press the Enter key to save and close the file.

  2. Use kubectl to create a training pod in the ACK cluster that you created.

    After the pod starts, it downloads the PyTorch example code and runs it to train the MNIST model.

    kubectl create -f pytorch-training-cpu.yaml

    The following command output indicates that the pod is created:

    pod/pytorch-training-cpu created
  3. Use kubectl to check the status of the pod.

    kubectl get pods

    The following command output is returned. The name of the created training pod is pytorch-training-cpu. The first time the container starts, the image is pulled from the AC2 image repository, so ContainerCreating is displayed in the STATUS column. Repeat the preceding command until Running is displayed in the STATUS column, or wait for the pod automatically as shown in the sketch after this procedure.

    NAME                   READY   STATUS    RESTARTS   AGE
    pytorch-training-cpu   1/1     Running   0          5m42s
  4. Run the following command to view the training outputs of the pod:

    kubectl logs pytorch-training-cpu

    The following sample logs are displayed:

    Cloning into 'examples'...
    Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
    ...
    
    Train Epoch: 1 [0/60000 (0%)]   Loss: 2.305400
    Train Epoch: 1 [640/60000 (1%)] Loss: 1.359776
    ...
    
    Train Epoch: 14 [58880/60000 (98%)]     Loss: 0.011213
    Train Epoch: 14 [59520/60000 (99%)]     Loss: 0.000181
    
    Test set: Average loss: 0.0271, Accuracy: 9912/10000 (99%)
  5. After the training is complete, check the status of the pod. If Completed is displayed in the STATUS column, run the following command to delete the pod from the ACK cluster:

    kubectl delete pods pytorch-training-cpu

    The following command output indicates that the pod is deleted:

    pod "pytorch-training-cpu" deleted

Run a PyTorch GPU image to train a model

  1. On the client machine on which kubectl is installed, create a pod file.

    1. Create and open a file.

      vim pytorch-training-gpu.yaml
    2. Press the I key to enter Insert mode and paste the following content into the file.

      In the file, add the configuration that requests GPU resources, specify an AC2 image, and specify the commands that you want to run after the image is pulled.

      apiVersion: v1
      kind: Pod
      metadata:
        name: pytorch-training-gpu
        namespace: default
      spec:
        restartPolicy: OnFailure
        containers:
        - name: pytorch-training
          # AC2 PyTorch image with CUDA 12.1.1, based on Alibaba Cloud Linux 3
          image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/pytorch:2.3.0-cuda12.1.1-alinux3.2304
          command:
            - "/bin/sh"
            - "-c"
          args:
          # Clone the PyTorch examples repository and train the MNIST model on the GPU
          - "git clone https://github.com/pytorch/examples.git && python3 examples/mnist/main.py"
          resources:
            limits:
              # Request one GPU for the container
              nvidia.com/gpu: 1
          workingDir: /root
    3. Press the Esc key to exit Insert mode, enter :wq, and then press the Enter key to save and close the file.

  2. Use kubectl to create a training pod in the ACK cluster that you created.

    After the pod starts, it downloads the PyTorch example code and runs it to train the MNIST model.

    kubectl create -f pytorch-training-gpu.yaml

    The following command output indicates that the pod is created:

    pod/pytorch-training-gpu created
  3. Use kubectl to check the status of the pod.

    kubectl get pods

    The following command output is returned. The name of the created training pod is pytorch-training-gpu. The first time the container starts, the image is pulled from the AC2 image repository, so ContainerCreating is displayed in the STATUS column. Repeat the preceding command until Running is displayed in the STATUS column. While the pod is in the Running state, you can confirm that PyTorch can see the allocated GPU, as shown in the sketch after this procedure.

    NAME                   READY   STATUS    RESTARTS   AGE
    pytorch-training-gpu   1/1     Running   0          5m42s
  4. Run the following command to view the training outputs of the pod:

    kubectl logs pytorch-training-gpu

    The following sample logs are displayed:

    Cloning into 'examples'...
    Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
    ...
    
    Train Epoch: 1 [0/60000 (0%)]   Loss: 2.282550
    Train Epoch: 1 [640/60000 (1%)] Loss: 1.384815
    ...
    
    Train Epoch: 14 [58880/60000 (98%)]     Loss: 0.001355
    Train Epoch: 14 [59520/60000 (99%)]     Loss: 0.002194
    
    Test set: Average loss: 0.0273, Accuracy: 9915/10000 (99%)
  5. After the training is complete, check the status of the pod. If Completed is displayed in the STATUS column, run the following command to delete the pod from the ACK cluster:

    kubectl delete pods pytorch-training-gpu

    The following command output indicates that the pod is deleted:

    pod "pytorch-training-gpu" deleted