PyTorch is an open source deep learning framework that is widely used to train deep learning models. This topic describes how to use Arena to submit standalone PyTorch training jobs that use a single GPU or multiple GPUs, and how to use TensorBoard to visualize the training jobs.
Prerequisites
A Container Service for Kubernetes (ACK) cluster that contains GPU-accelerated nodes is created. For more information, see Create an ACK cluster that contains GPU-accelerated nodes.
Internet access is enabled for nodes in the cluster. For more information, see Enable an existing ACK cluster to access the Internet.
The Arena component is installed. For more information, see Configure the Arena client.
A persistent volume claim (PVC) named training-data is created, and the MNIST dataset is stored in the /pytorch_data path. For more information, see Configure a shared NAS volume.
Background information
In this example, the training code is downloaded from a remote Git repository, and the training data is read from a shared storage system that consists of persistent volumes (PVs) and persistent volume claims (PVCs) based on File Storage NAS (NAS). torchrun is a command-line tool provided by PyTorch that simplifies launching and managing distributed training jobs. In this example, torchrun is used to run standalone PyTorch training jobs that use a single GPU or multiple GPUs. For more information about the training code, see main.py.
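torchrun launches one training process per GPU (controlled by --nproc-per-node) and passes rendezvous information to each process through environment variables. The following minimal sketch shows the pattern that a torchrun-compatible script typically follows. It is not the actual main.py from the Arena repository; the model and training loop here are placeholders:

# A minimal sketch of a torchrun-compatible training script (not the actual
# main.py). It illustrates the pattern torchrun relies on: each process reads
# the LOCAL_RANK environment variable set by torchrun, binds to one GPU, and
# joins the process group.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT
    # for every process it launches, so init_process_group() needs no explicit
    # rendezvous arguments.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; the real script builds a CNN for MNIST.
    model = torch.nn.Linear(784, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ... training loop: each process trains on its shard of the data ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

With --nproc-per-node=1, torchrun starts a single process that uses one GPU. With --nproc-per-node=2, it starts two such processes on the worker, each bound to its own GPU, which is why Step 2 pairs --gpus=2 with --nproc-per-node=2.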
Procedure
Step 1: View GPU resources
Run the following command to query the GPU resources available in the cluster:
arena top node
Expected output:
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-beijing.192.168.xxx.xxx 192.168.xxx.xxx <none> Ready 0 0
cn-beijing.192.168.xxx.xxx 192.168.xxx.xxx <none> Ready 0 0
cn-beijing.192.168.xxx.xxx 192.168.xxx.xxx <none> Ready 2 0
cn-beijing.192.168.xxx.xxx 192.168.xxx.xxx <none> Ready 2 0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/4 (0.0%)
The output shows that the cluster contains two GPU-accelerated nodes. Each GPU-accelerated node has two idle GPUs that can be used to run training jobs.
Step 2: Submit a PyTorch training job
Run the following command to submit a standalone PyTorch training job that uses a single GPU:
arena submit pytorch \
--name=pytorch-mnist \
--namespace=default \
--workers=1 \
--gpus=1 \
--working-dir=/root \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/pytorch-with-tensorboard:2.5.1-cuda12.4-cudnn9-runtime \
--sync-mode=git \
--sync-source=https://github.com/kubeflow/arena.git \
--env=GIT_SYNC_BRANCH=v0.13.1 \
--data=training-data:/mnt \
--tensorboard \
--logdir=/mnt/pytorch_data/logs \
"torchrun /root/code/arena/examples/pytorch/mnist/main.py --epochs 10 --backend nccl --data /mnt/pytorch_data --dir /mnt/pytorch_data/logs"Expected output:
service/pytorch-mnist-tensorboard created
deployment.apps/pytorch-mnist-tensorboard created
pytorchjob.kubeflow.org/pytorch-mnist created
INFO[0002] The Job pytorch-mnist has been submitted successfully
INFO[0002] You can run `arena get pytorch-mnist --type pytorchjob -n default` to check the job status
To run a standalone PyTorch training job that uses multiple GPUs, such as two GPUs, specify --gpus=2 and --nproc-per-node=2 to assign two GPUs to the worker and start one training process per GPU:
arena submit pytorch \
--name=pytorch-mnist \
--namespace=default \
--workers=1 \
--gpus=2 \
--nproc-per-node=2 \
--working-dir=/root \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/pytorch-with-tensorboard:2.5.1-cuda12.4-cudnn9-runtime \
--sync-mode=git \
--sync-source=https://github.com/kubeflow/arena.git \
--env=GIT_SYNC_BRANCH=v0.13.1 \
--data=training-data:/mnt \
--tensorboard \
--logdir=/mnt/pytorch_data/logs \
"torchrun /root/code/arena/examples/pytorch/mnist/main.py --epochs 10 --backend nccl --data /mnt/pytorch_data --dir /mnt/pytorch_data/logs"If you are using a non-public Git repository, you can specify the Git username and password by configuring the GIT_SYNC_USERNAME and GIT_SYNC_PASSWORD environment variables.
arena submit pytorch \
...
--sync-mode=git \
--sync-source=https://github.com/kubeflow/arena.git \
--env=GIT_SYNC_BRANCH=v0.13.1 \
--env=GIT_SYNC_USERNAME=<username> \
--env=GIT_SYNC_PASSWORD=<password> \
"torchrun /root/code/arena/examples/pytorch/mnist/main.py --epochs 10 --backend nccl --data /mnt/pytorch_data --dir /mnt/pytorch_data/logs"The arena command uses git-sync to synchronize the source code. This allows you to use the environment variables defined in the git-sync project.
In this example, the source code is pulled from a GitHub repository. If the code cannot be pulled because of network issues, you can manually download the code to the shared storage system. After you download the code to the /code/github.com/kubeflow/arena path in NAS, submit the training job by using the following command:
arena submit pytorch \
--name=pytorch-mnist \
--namespace=default \
--workers=1 \
--gpus=1 \
--working-dir=/root \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/pytorch-with-tensorboard:2.5.1-cuda12.4-cudnn9-runtime \
--data=training-data:/mnt \
--tensorboard \
--logdir=/mnt/pytorch_data/logs \
"torchrun /mnt/code/github.com/kubeflow/arena/examples/pytorch/mnist/main.py --epochs 10 --backend nccl --data /mnt/pytorch_data --dir /mnt/pytorch_data/logs"The following table describes the parameters in the preceding sample code block.
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| --name | Yes | The name of the job, which must be globally unique. | N/A |
| --namespace | No | The namespace to which the job belongs. | default |
| --workers | No | The number of worker nodes, including the master node. For example, a value of 3 indicates that the training job runs on one master node and two worker nodes. | 1 |
| --gpus | No | The number of GPUs that are used by each worker node on which the training job runs. | 0 |
| --working-dir | No | The directory in which the command is executed. | /root |
| --image | Yes | The address of the image that is used to deploy the runtime. | N/A |
| --sync-mode | No | The synchronization mode. Valid values: git and rsync. In this example, the git mode is used. | N/A |
| --sync-source | No | The address of the repository from which the source code is synchronized. This parameter is used together with --sync-mode. Because the git mode is used in this example, you must specify an address that supports Git, such as a GitHub project or an Alibaba Cloud Code project. The project code is downloaded to the code/ directory under the path specified by --working-dir (/root/code in this example). | N/A |
| --data | No | Mounts a shared persistent volume (PV) to the runtime in which the training job runs. The value of this parameter consists of two parts separated by a colon (:): the name of the PVC on the left and the mount path in the container on the right. Note: Run the arena data list command to query the PVCs that are available in the cluster. If no PVC is available, you can create a PVC. For more information, see Configure a shared NAS volume. | N/A |
| --tensorboard | No | Specifies that TensorBoard is used to visualize training results. You can configure the --logdir parameter to specify the path from which TensorBoard reads event files. If you do not configure this parameter, TensorBoard is not used. | N/A |
| --logdir | No | The path from which TensorBoard reads event files. Use this parameter together with --tensorboard. | /training_logs |
Step 3: View the PyTorch training job
Run the following command to view the training job that is submitted by using Arena:
arena list -n default
Expected output:
NAME           STATUS   TRAINER     DURATION  GPU(Requested)  GPU(Allocated)  NODE
pytorch-mnist  RUNNING  PYTORCHJOB  11s       1               1               192.168.xxx.xxx
Run the following command to query the GPU resources used by the training job:
arena top job -n default
Expected output:
NAME           STATUS   TRAINER     AGE  GPU(Requested)  GPU(Allocated)  NODE
pytorch-mnist  RUNNING  PYTORCHJOB  18s  1               1               192.168.xxx.xxx

Total Allocated/Requested GPUs of Training Jobs: 1/1
Run the following command to query the GPU resource usage of the cluster:
arena top nodeExpected output:
NAME                        IPADDRESS        ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           1
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
1/4 (25.0%)
The output shows that the cluster has four GPUs, one of which is allocated.
Run the following command to view the training job details:
arena get pytorch-mnist -n default
Expected output:
Name:        pytorch-mnist
Status:      RUNNING
Namespace:   default
Priority:    N/A
Trainer:     PYTORCHJOB
Duration:    45s
CreateTime:  2025-02-12 11:20:10
EndTime:

Instances:
  NAME                    STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                    ------   ---  --------  --------------  ----
  pytorch-mnist-master-0  Running  45s  true      1               cn-beijing.192.168.xxx.xxx

Tensorboard:
  Your tensorboard will be available on:
  http://192.168.xxx.xxx:31949
Note: If TensorBoard is used in a training job, the URL of the TensorBoard instance is displayed in the job details. Otherwise, the URL is not displayed.
Step 4: View the TensorBoard instance
Run the following command on your on-premises machine to map port 6006 of TensorBoard in the cluster to the on-premises port 9090:
Important: Port forwarding set up by using kubectl port-forward is not reliable, secure, or scalable, and is suitable only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions for production in ACK clusters, see Ingress management.
kubectl port-forward -n default svc/pytorch-mnist-tensorboard 9090:6006
Enter http://127.0.0.1:9090 into the address bar of your web browser to access TensorBoard.
Note: The source code used in this example writes training results to event files every 10 epochs. If you modify the value of --epochs, set it to a multiple of 10. Otherwise, the training results cannot be visualized in TensorBoard.
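For reference, the following hypothetical Python sketch illustrates the pattern described in the preceding note: a training loop that writes a scalar to the TensorBoard event files only every 10 epochs. It is not the actual main.py; train_one_epoch and the scalar tag are placeholders:

# A hypothetical sketch of how a training loop emits TensorBoard event files
# every 10 epochs; the actual main.py may differ in detail. The log directory
# must match the --logdir value passed to arena (and the --dir argument passed
# to the script) so that TensorBoard can find the event files.
import random

from torch.utils.tensorboard import SummaryWriter

def train_one_epoch():
    # Placeholder for the real training step; returns a fake loss value.
    return random.random()

writer = SummaryWriter(log_dir="/mnt/pytorch_data/logs")

epochs = 10
for epoch in range(1, epochs + 1):
    loss = train_one_epoch()
    if epoch % 10 == 0:
        # Only epochs that are multiples of 10 produce data points, which is
        # why --epochs should be set to a multiple of 10 in this example.
        writer.add_scalar("loss/train", loss, epoch)

writer.close()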
Step 5: View training job logs
Run the following commands to view the training job logs:
arena logs pytorch-mnist -n default
Expected output:
Train Epoch: 10 [55680/60000 (93%)] Loss: 0.025778
Train Epoch: 10 [56320/60000 (94%)] Loss: 0.086488
Train Epoch: 10 [56960/60000 (95%)] Loss: 0.003240
Train Epoch: 10 [57600/60000 (96%)] Loss: 0.046731
Train Epoch: 10 [58240/60000 (97%)] Loss: 0.010752
Train Epoch: 10 [58880/60000 (98%)] Loss: 0.010934
Train Epoch: 10 [59520/60000 (99%)] Loss: 0.065813
Accuracy: 9921/10000 (99.21%)
To view job logs in real time, add the -f parameter. To view only the last N lines of the log, add the -t N or --tail N parameter. For more information, run arena logs --help.
(Optional) Step 6: Clear the environment
If you no longer require the training job, run the following command to delete the training job:
arena delete pytorch-mnist -n default
Expected output:
INFO[0001] The training job pytorch-mnist has been deleted successfully