
Container Service for Kubernetes: Use Arena to submit distributed PyTorch training jobs

Last Updated: Mar 03, 2025

PyTorch is an open source deep learning framework that is widely used to train deep learning models. This topic describes how to use Arena to submit distributed PyTorch training jobs that run on multiple GPUs and how to use TensorBoard to visualize the training results.

Prerequisites

  • A Container Service for Kubernetes (ACK) cluster that contains GPU-accelerated nodes is created.

  • The Arena client is installed.

  • A PVC that provides shared storage, such as the training-data PVC based on a NAS file system, is created. For more information, see Configure a shared NAS volume.

Background information

In this example, the training code is pulled from a remote Git repository, and the training data is read from shared storage that is provided by persistent volumes (PVs) and persistent volume claims (PVCs) based on File Storage NAS (NAS). torchrun is a command-line tool provided by PyTorch that simplifies the launch and management of distributed training jobs. In this example, torchrun is used to run PyTorch training jobs that use multiple GPUs. For more information about the training code, see main.py.
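torchrun starts the number of processes specified by --nproc-per-node on each node and sets environment variables such as RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for every process. The following is a minimal sketch of how a torchrun-launched script typically initializes distributed training with the nccl backend. It only illustrates the pattern and is not the actual main.py:

# Minimal sketch of torchrun-style distributed initialization (not the actual main.py).
# torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT per process.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    # The nccl backend matches the --backend nccl argument used in this topic.
    # MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are read from the environment.
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(784, 10).to(device)      # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])  # used by the training loop
    # ... training loop that uses ddp_model ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()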

Example

Step 1: View GPU resources

Run the following command to query the GPU resources available in the cluster:

arena top node

Expected output:

NAME                        IPADDRESS        ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/4 (0.0%)

The output shows that the cluster contains two GPU-accelerated nodes. Each GPU-accelerated node contains two idle GPUs that can be used to run training jobs.

Step 2: Submit a PyTorch training job

Run the arena submit pytorch command to submit a PyTorch training job that uses multiple GPUs. The job runs on two pods (one master and one worker), and each pod uses two GPUs.

arena submit pytorch \
    --name=pytorch-mnist \
    --namespace=default \
    --workers=2 \
    --gpus=2 \
    --nproc-per-node=2 \
    --clean-task-policy=None \
    --working-dir=/root \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/pytorch-with-tensorboard:2.5.1-cuda12.4-cudnn9-runtime \
    --sync-mode=git \
    --sync-source=https://github.com/kubeflow/arena.git \
    --env=GIT_SYNC_BRANCH=v0.13.1 \
    --data=training-data:/mnt \
    --tensorboard \
    --logdir=/mnt/pytorch_data/logs \
    "torchrun /root/code/arena/examples/pytorch/mnist/main.py --epochs 10 --backend nccl --data /mnt/pytorch_data  --dir /mnt/pytorch_data/logs"

Expected output:

service/pytorch-mnist-tensorboard created
deployment.apps/pytorch-mnist-tensorboard created
pytorchjob.kubeflow.org/pytorch-mnist created
INFO[0002] The Job pytorch-mnist has been submitted successfully
INFO[0002] You can run `arena get pytorch-mnist --type pytorchjob -n default` to check the job status
Note

Compared with standalone training, distributed PyTorch training jobs require the additional --workers and --nproc-per-node parameters, which specify the number of pods that participate in the distributed training job and the number of processes that are started on each node. A distributed training job consists of multiple nodes. Each node is named in the <job_name>-<role_name>-<index> format, where <job_name> is the job name, <role_name> is the role of the node in distributed training (master or worker), and <index> is the sequence number of the node. For example, if you configure --workers=3 and --nproc-per-node=2 for a training job named pytorch-mnist, three training nodes are created and two processes are started on each node. The node names are pytorch-mnist-master-0, pytorch-mnist-worker-0, and pytorch-mnist-worker-1. The corresponding environment variables are injected into each node, as shown in the following table. A short sketch that derives these values follows the table.

Environment variable  pytorch-mnist-master-0  pytorch-mnist-worker-0  pytorch-mnist-worker-1
MASTER_ADDR           pytorch-mnist-master-0  pytorch-mnist-master-0  pytorch-mnist-master-0
MASTER_PORT           23456                   23456                   23456
WORLD_SIZE            6                       6                       6
RANK                  0                       1                       2
PET_MASTER_ADDR       pytorch-mnist-master-0  pytorch-mnist-master-0  pytorch-mnist-master-0
PET_MASTER_PORT       23456                   23456                   23456
PET_NNODES            3                       3                       3
PET_NODE_RANK         0                       1                       2
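As a quick check of the values in the table, WORLD_SIZE is the number of nodes multiplied by the number of processes per node, and the global rank of each process is derived from the rank of its node and its local rank. The following minimal sketch reproduces this arithmetic for the --workers=3 and --nproc-per-node=2 example; it mirrors the table and is not Arena code:

# Sketch: derive the distributed topology for --workers=3 and --nproc-per-node=2
# (the example described in the note above, not the job submitted in this topic).
nnodes = 3           # --workers (the master node is included)
nproc_per_node = 2   # --nproc-per-node

world_size = nnodes * nproc_per_node
print("WORLD_SIZE =", world_size)      # 6, as injected into every node

for node_rank in range(nnodes):        # 0, 1, 2 -> RANK / PET_NODE_RANK of each node
    for local_rank in range(nproc_per_node):
        global_rank = node_rank * nproc_per_node + local_rank
        print(f"node {node_rank}, local rank {local_rank} -> global rank {global_rank}")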

Note

If you are using a non-public Git repository, you can specify the Git username and password by configuring the GIT_SYNC_USERNAME and GIT_SYNC_PASSWORD environment variables.

  arena submit pytorch \
        ...
        --sync-mode=git \
        --sync-source=https://github.com/kubeflow/arena.git \
        --env=GIT_SYNC_BRANCH=v0.13.1 \
        --env=GIT_SYNC_USERNAME=<username> \
        --env=GIT_SYNC_PASSWORD=<password> \
        "torchrun /root/code/arena/examples/pytorch/mnist/main.py --epochs 10 --backend nccl --data /mnt/pytorch_data  --dir /mnt/pytorch_data/logs"

The arena command uses git-sync to synchronize the source code. This allows you to use the environment variables defined in the git-sync project.

Important

In this example, the source code is pulled from a GitHub repository. If the code cannot be pulled due to network-related reasons, you can manually download the code to the shared storage system. After you download the code to the /code/github.com/kubeflow/arena path of the NAS file system, which is mounted to /mnt in the training pods, you can submit a training job by running the following command:

arena submit pytorch \
    --name=pytorch-mnist \
    --namespace=default \
    --workers=2 \
    --gpus=2 \
    --nproc-per-node=2 \
    --clean-task-policy=None \
    --working-dir=/root \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/pytorch-with-tensorboard:2.5.1-cuda12.4-cudnn9-runtime \
    --data=training-data:/mnt \
    --tensorboard \
    --logdir=/mnt/pytorch_data/logs \
    "torchrun /mnt/code/github.com/kubeflow/arena/examples/pytorch/mnist/main.py --epochs 10 --backend nccl --data /mnt/pytorch_data  --dir /mnt/pytorch_data/logs"

The parameters are described as follows.

--name
  Required: Yes
  Description: The name of the job. The name must be globally unique.
  Default value: N/A

--namespace
  Required: No
  Description: The namespace to which the pods of the job belong.
  Default value: default

--workers
  Required: No
  Description: The number of nodes on which the training job runs. The master node is included. For example, a value of 3 indicates that the training job runs on one master node and two worker nodes.
  Default value: 0

--gpus
  Required: No
  Description: The number of GPUs that are used by each node on which the training job runs.
  Default value: 0

--working-dir
  Required: No
  Description: The directory in which the command is executed.
  Default value: /root

--image
  Required: Yes
  Description: The address of the image that is used to deploy the runtime.
  Default value: N/A

--sync-mode
  Required: No
  Description: The synchronization mode. Valid values: git and rsync. In this example, the git mode is used.
  Default value: N/A

--sync-source
  Required: No
  Description: The address of the repository from which the source code is synchronized. This parameter is used together with the --sync-mode parameter. In this example, the git mode is used, so you must specify an address that supports Git, such as a GitHub project or an Alibaba Cloud Code project. The project code is downloaded to the code/ subdirectory of --working-dir. In this example, the code is placed in /root/code/arena.
  Default value: N/A

--data
  Required: No
  Description: Mounts a shared PV to the runtime in which the training job runs. The value of this parameter consists of two parts that are separated by a colon (:). Specify the name of the PVC on the left side of the colon and the path to which the PV claimed by the PVC is mounted on the right side of the colon. This way, your training job can access the data stored in the PV claimed by the PVC. A sketch that shows how the training code uses this mount follows the parameter descriptions.

  Note
  Run the arena data list command to query the PVCs that are available in the cluster:

  NAME           ACCESSMODE     DESCRIPTION  OWNER  AGE
  training-data  ReadWriteMany                      35m

  If no PVC is available, you can create a PVC. For more information, see Configure a shared NAS volume.

  Default value: N/A

--tensorboard
  Required: No
  Description: Specifies that TensorBoard is used to visualize the training results. You can configure the --logdir parameter to specify the path from which TensorBoard reads event files. If you do not configure this parameter, TensorBoard is not used.
  Default value: N/A

--logdir
  Required: No
  Description: The path from which TensorBoard reads event files. This parameter must be used together with the --tensorboard parameter.
  Default value: /training_logs
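In this topic, the --data mount makes the contents of the training-data PV visible at /mnt inside every training pod, so both the dataset path and the TensorBoard log directory that are passed to main.py point into shared storage. The following minimal sketch shows how training code can read the dataset from the mounted path. The MNIST layout under /mnt/pytorch_data is an assumption made for illustration and is not guaranteed by this topic:

# Sketch: read training data from the path where the PV is mounted (--data=training-data:/mnt).
# The MNIST layout under /mnt/pytorch_data is assumed for illustration; this is not main.py.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.MNIST("/mnt/pytorch_data", train=True, download=False,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=64, shuffle=True)  # pass loader to the training loop
print(len(train_set))  # 60000, which matches the "[.../60000]" progress shown in the Step 5 logs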

Step 3: View the PyTorch training job

  1. Run the following command to query all training jobs submitted by using Arena:

    arena list -n default

    Expected output:

    NAME           STATUS   TRAINER     DURATION  GPU(Requested)  GPU(Allocated)  NODE
    pytorch-mnist  RUNNING  PYTORCHJOB  48s       4               4               192.168.xxx.xxx
  2. Run the following command to query the GPU resources that are used by the jobs:

    arena top job -n default

    Expected output:

    NAME           STATUS   TRAINER     AGE  GPU(Requested)  GPU(Allocated)  NODE
    pytorch-mnist  RUNNING  PYTORCHJOB  55s  4               4               192.168.xxx.xxx
    
    Total Allocated/Requested GPUs of Training Jobs: 4/4
  3. Run the following command to query the GPU resources in the cluster:

    arena top node

    Expected output:

    NAME                        IPADDRESS        ROLE    STATUS  GPU(Total)  GPU(Allocated)
    cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
    cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
    cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           2
    cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           2
    ---------------------------------------------------------------------------------------------------
    Allocated/Total GPUs In Cluster:
    4/4 (100.0%)

    The command output indicates that all four GPUs are allocated.

  4. Run the following command to query detailed information about a job:

    arena get pytorch-mnist -n default

    Expected output:

    Name:        pytorch-mnist
    Status:      RUNNING
    Namespace:   default
    Priority:    N/A
    Trainer:     PYTORCHJOB
    Duration:    1m
    CreateTime:  2025-02-12 13:54:51
    EndTime:
    
    Instances:
      NAME                    STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
      ----                    ------   ---  --------  --------------  ----
      pytorch-mnist-master-0  Running  1m   true      2               cn-beijing.192.168.xxx.xxx
      pytorch-mnist-worker-0  Running  1m   false     2               cn-beijing.192.168.xxx.xxx
    
    Tensorboard:
      Your tensorboard will be available on:
      http://192.168.xxx.xxx:32084

    The command output indicates that the job runs a master pod named pytorch-mnist-master-0 and a worker pod named pytorch-mnist-worker-0, each of which requests two GPUs for the training job.

    Note

    If TensorBoard is used in a training job, the URL of the TensorBoard instance is displayed in the job details. Otherwise, the URL is not displayed.

Step 4: View the TensorBoard instance

  1. Run the following command on your on-premises machine to map port 6006 of TensorBoard in the cluster to the on-premises port 9090:

    Important

    Port forwarding set up by using kubectl port-forward is not reliable, secure, or extensible in production environments. It is only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress management.

    kubectl port-forward -n default svc/pytorch-mnist-tensorboard 9090:6006
  2. Enter http://127.0.0.1:9090 into the address bar of the web browser to access TensorBoard.


    Note

    In this example, the training code writes training results to TensorBoard event files every 10 epochs. If you want to modify the value of --epochs, set the value to a multiple of 10. Otherwise, the training results cannot be visualized in TensorBoard.
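    The corresponding logging condition in the training loop is assumed to look roughly like the following sketch. The writer path matches the --logdir value in the submit command, and the loss value is a placeholder:

    # Sketch: write TensorBoard events only every 10 epochs (assumed logging behavior, not main.py).
    from torch.utils.tensorboard import SummaryWriter

    epochs = 10  # matches --epochs 10 in the submit command
    writer = SummaryWriter(log_dir="/mnt/pytorch_data/logs")  # matches --logdir
    for epoch in range(1, epochs + 1):
        loss = 1.0 / epoch       # placeholder that stands in for the real training loss
        if epoch % 10 == 0:      # events are written only on multiples of 10
            writer.add_scalar("train/loss", loss, epoch)
    writer.close()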

Step 5: View training job logs

  1. Run the following command to query the log of the master pod of the training job. The per-process dictionaries at the top of the log are explained in a sketch after this list:

    arena logs -n default pytorch-mnist 

    Expected output:

    {'PID': 40, 'MASTER_ADDR': 'pytorch-mnist-master-0', 'MASTER_PORT': '23456', 'LOCAL_RANK': 0, 'RANK': 0, 'GROUP_RANK': 0, 'ROLE_RANK': 0, 'LOCAL_WORLD_SIZE': 2, 'WORLD_SIZE': 4, 'ROLE_WORLD_SIZE': 4}
    {'PID': 41, 'MASTER_ADDR': 'pytorch-mnist-master-0', 'MASTER_PORT': '23456', 'LOCAL_RANK': 1, 'RANK': 1, 'GROUP_RANK': 0, 'ROLE_RANK': 1, 'LOCAL_WORLD_SIZE': 2, 'WORLD_SIZE': 4, 'ROLE_WORLD_SIZE': 4}
    Using cuda:0.
    Using cuda:1.
    Train Epoch: 1 [0/60000 (0%)]   Loss: 2.283599
    Train Epoch: 1 [0/60000 (0%)]   Loss: 2.283599
    ...
    Train Epoch: 10 [59520/60000 (99%)]     Loss: 0.007343
    Train Epoch: 10 [59520/60000 (99%)]     Loss: 0.007343
    
    Accuracy: 9919/10000 (99.19%)
    
    
    Accuracy: 9919/10000 (99.19%)
  2. View the log of the worker pod whose index is 0:

    arena logs -n default -i pytorch-mnist-worker-0 pytorch-mnist

    Expected output:

    {'PID': 39, 'MASTER_ADDR': 'pytorch-mnist-master-0', 'MASTER_PORT': '23456', 'LOCAL_RANK': 0, 'RANK': 2, 'GROUP_RANK': 1, 'ROLE_RANK': 2, 'LOCAL_WORLD_SIZE': 2, 'WORLD_SIZE': 4, 'ROLE_WORLD_SIZE': 4}
    {'PID': 40, 'MASTER_ADDR': 'pytorch-mnist-master-0', 'MASTER_PORT': '23456', 'LOCAL_RANK': 1, 'RANK': 3, 'GROUP_RANK': 1, 'ROLE_RANK': 3, 'LOCAL_WORLD_SIZE': 2, 'WORLD_SIZE': 4, 'ROLE_WORLD_SIZE': 4}
    Using cuda:0.
    Using cuda:1.
    Train Epoch: 1 [0/60000 (0%)]   Loss: 2.283599
    Train Epoch: 1 [0/60000 (0%)]   Loss: 2.283599
    ...
    Train Epoch: 10 [58880/60000 (98%)]     Loss: 0.051877Train Epoch: 10 [58880/60000 (98%)]       Loss: 0.051877
    
    Train Epoch: 10 [59520/60000 (99%)]     Loss: 0.007343Train Epoch: 10 [59520/60000 (99%)]       Loss: 0.007343
    
    
    Accuracy: 9919/10000 (99.19%)
    
    
    Accuracy: 9919/10000 (99.19%)
    Note
    • If you want to view job logs in real time, add the -f parameter.

    • If you want to view only the last N lines of the log, add the -t N or --tail N parameter.

    • For more information, see arena logs --help.
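
    The dictionaries at the top of the master and worker logs show the per-process torchrun environment. The following minimal sketch reproduces this kind of log line. It is an assumption about how main.py collects the values and is shown only to explain what the fields mean:

    # Sketch: print the per-process torchrun environment, producing lines similar to the
    # dicts at the top of the logs above (illustrative, not the actual main.py).
    import os

    keys = ["MASTER_ADDR", "MASTER_PORT", "LOCAL_RANK", "RANK", "GROUP_RANK",
            "ROLE_RANK", "LOCAL_WORLD_SIZE", "WORLD_SIZE", "ROLE_WORLD_SIZE"]
    info = {"PID": os.getpid()}
    info.update({k: os.environ.get(k) for k in keys})
    print(info)
    # In this job, RANK = GROUP_RANK * LOCAL_WORLD_SIZE + LOCAL_RANK, which is why the
    # worker-0 processes report RANK 2 and 3 while the master processes report RANK 0 and 1.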

(Optional) Step 6: Clear the environment

If you no longer require the training job, run the following command to delete the training job:

arena delete pytorch-mnist -n default

Expected output:

INFO[0001] The training job pytorch-mnist has been deleted successfully