This document walks you through submitting a TensorFlow distributed training job in PS-Worker mode using Arena and monitoring training progress with TensorBoard.
Prerequisites
Before you begin, ensure that you have:
- A Kubernetes cluster with GPU nodes. See Create a Kubernetes cluster that contains GPUs.
- Cluster nodes with public network access. See Enable Internet access for a cluster.
- The Arena client installed. See Configure the Arena client.
- A Persistent Volume Claim (PVC) named `training-data`, with the MNIST dataset stored under the `tf_data` path. See Configure NAS shared storage.
Step 1: Check available GPU resources
Run the following command to view GPU availability across all nodes:
```shell
arena top node
```

Expected output:

```
NAME                        IPADDRESS        ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/4 (0.0%)
```

The cluster has two GPU nodes, each with two idle GPUs available for training.
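If you want to script this check, the idle-GPU count can be derived from the tabular output with `awk`. A minimal sketch, assuming the column layout shown above (GPU(Total) and GPU(Allocated) as the fifth and sixth columns); `count_idle_gpus` is a hypothetical helper name, not part of Arena:

```shell
# Sum idle GPUs from `arena top node` output read on stdin.
# Data rows look like: NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
count_idle_gpus() {
  # Skip the header (NR > 1) and any separator/summary lines whose
  # fifth field is not a number, then sum total minus allocated.
  awk 'NR > 1 && $5 ~ /^[0-9]+$/ { idle += $5 - $6 } END { print idle + 0 }'
}

# Against a live cluster:
# arena top node | count_idle_gpus
```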
Step 2: Submit a TensorFlow distributed training job
The general command format is `arena submit tf [--flags] "<training command>"` (`tfjob` is an alias for `tf`).
For distributed training in PS-Worker mode, the following parameters are required:
- `--name`: a globally unique job name.
- `--ps`: number of parameter server (PS) nodes.
- `--workers`: number of worker nodes.
- `--ps-image`: container image for the PS nodes (or `--image` if PS and workers share the same image).
- `--worker-image`: container image for worker nodes (or `--image` if PS and workers share the same image).
- `--gpus`: number of GPUs per worker (required for GPU workloads).
Submit a distributed training job with one PS node and two worker nodes:
```shell
arena submit tf \
  --name=tf-mnist-dist \
  --namespace=default \
  --working-dir=/root \
  --ps=1 \
  --ps-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
  --workers=2 \
  --worker-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
  --gpus=1 \
  --sync-mode=git \
  --sync-source=https://github.com/kubeflow/arena.git \
  --env=GIT_SYNC_BRANCH=master \
  --data=training-data:/mnt \
  --tensorboard \
  --logdir=/mnt/tf_data/logs \
  "python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"
```

Expected output:

```
service/tf-mnist-dist-tensorboard created
deployment.apps/tf-mnist-dist-tensorboard created
tfjob.kubeflow.org/tf-mnist-dist created
INFO[0004] The Job tf-mnist-dist has been submitted successfully
INFO[0004] You can run `arena get tf-mnist-dist --type tfjob -n default` to check the job status
```

The following table describes the key parameters. For a full parameter reference, run `arena submit tf --help`.
| Parameter | Required | Description | Default |
|---|---|---|---|
| `--name` | Yes | Unique job name within the cluster. | None |
| `--working-dir` | No | Directory where the training command runs. Synced code is placed under `code/` inside this directory. | `/root` |
| `--gpus` | No | Number of GPUs allocated per worker node. | 0 |
| `--workers` | No | Number of worker nodes. | 1 |
| `--image` | Required if `--worker-image` and `--ps-image` are not specified separately | Container image for both worker and PS nodes. Overridden by `--worker-image` or `--ps-image` if specified. | None |
| `--worker-image` | Required if `--image` is not specified | Container image for worker nodes. Takes precedence over `--image`. | None |
| `--ps` | Yes (distributed jobs) | Number of PS nodes. | 0 |
| `--ps-image` | Required if `--image` is not specified | Container image for PS nodes. Takes precedence over `--image`. | None |
| `--sync-mode` | No | Code synchronization mode: `git` or `rsync`. | None |
| `--sync-source` | No | Source repository URL for code synchronization. Use with `--sync-mode`. The code is downloaded to `code/` under `--working-dir`. | None |
| `--data` | No | Mounts a PVC into the training environment. Format: `<pvc-name>:<mount-path>`. Run `arena data list` to see available PVCs. | None |
| `--tensorboard` | No | Enables a TensorBoard service for the training job. Use `--logdir` to specify where TensorBoard reads event data. | None |
| `--logdir` | No | Path where TensorBoard reads event data. Use with `--tensorboard`. | `/training_logs` |
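Because the PS and worker nodes in this example run the same image, the two image flags can be collapsed into a single `--image` flag, per the `--image` semantics in the table above. A sketch of the equivalent submission:

```shell
# Same job as above, but with one --image flag covering both PS and workers.
arena submit tf \
  --name=tf-mnist-dist \
  --namespace=default \
  --working-dir=/root \
  --ps=1 \
  --workers=2 \
  --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
  --gpus=1 \
  --sync-mode=git \
  --sync-source=https://github.com/kubeflow/arena.git \
  --env=GIT_SYNC_BRANCH=master \
  --data=training-data:/mnt \
  --tensorboard \
  --logdir=/mnt/tf_data/logs \
  "python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"
```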
Using a private Git repository
Arena uses git-sync to pull source code. For private repositories, pass credentials as environment variables:
```shell
arena submit tf \
  --name=tf-mnist-dist \
  --namespace=default \
  --working-dir=/root \
  --ps=1 \
  --ps-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
  --workers=2 \
  --worker-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
  --gpus=1 \
  --sync-mode=git \
  --sync-source=https://github.com/kubeflow/arena.git \
  --env=GIT_SYNC_BRANCH=master \
  --env=GIT_SYNC_USERNAME=<your-username> \
  --env=GIT_SYNC_PASSWORD=<your-password> \
  --data=training-data:/mnt \
  --tensorboard \
  --logdir=/mnt/tf_data/logs \
  "python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"
```

You can set any environment variable defined in the git-sync project using `--env`.
This example pulls source code from GitHub. If the pull fails due to network issues, the demo image already contains the example code at `/code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py`. Submit the job without `--sync-mode` and `--sync-source`:
```shell
arena submit tf \
  --name=tf-mnist-dist \
  --namespace=default \
  --working-dir=/root \
  --ps=1 \
  --ps-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
  --workers=2 \
  --worker-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
  --gpus=1 \
  --data=training-data:/mnt \
  --tensorboard \
  --logdir=/mnt/tf_data/logs \
  "python /code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"
```

Step 3: Monitor the training job
List all submitted jobs:
```shell
arena list
```

Expected output:

```
NAME           STATUS   TRAINER  AGE  NODE
tf-mnist-dist  RUNNING  TFJOB    58s  192.1xx.x.xx
```

Check job details and locate each instance:
```shell
arena get -n default tf-mnist-dist
```

Expected output:

```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m

NAME           STATUS   TRAINER  AGE  INSTANCE                NODE
tf-mnist-dist  RUNNING  TFJOB    1m   tf-mnist-dist-ps-0      192.1xx.x.xx
tf-mnist-dist  RUNNING  TFJOB    1m   tf-mnist-dist-worker-0  192.1xx.x.xx
tf-mnist-dist  RUNNING  TFJOB    1m   tf-mnist-dist-worker-1  192.1xx.x.xx

Your tensorboard will be available on:
http://192.1xx.x.xx:31870
```

The TensorBoard endpoint lines appear only if `--tensorboard` was enabled during submission.
Check GPU allocation by job:
```shell
arena top job
```

Expected output:

```
NAME           GPU(Requests)  GPU(Allocated)  STATUS     TRAINER  AGE  NODE
tf-mnist-dist  2              2               RUNNING    tfjob    1m   192.1xx.x.x
tf-git         1              0               SUCCEEDED  tfjob    2h   N/A

Total Allocated GPUs of Training Job:
2

Total Requested GPUs of Training Job:
3
```

Check GPU allocation across all nodes:
```shell
arena top node
```

Expected output:

```
NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  master  ready   0           0
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  master  ready   0           0
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  master  ready   0           0
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  <none>  ready   2           1
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  <none>  ready   2           1
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  <none>  ready   2           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
2/6 (33%)
```

Step 4: View TensorBoard
Note: `kubectl port-forward` is intended for development and debugging only. It does not provide production-grade reliability or security. For production-ready networking, see Ingress management.
Forward the TensorBoard service to local port 9090:
```shell
kubectl port-forward -n default svc/tf-mnist-dist-tensorboard 9090:6006
```

Open `localhost:9090` in your browser to access TensorBoard.

Step 5: View training job logs
By default, `arena logs` streams logs from the `worker-0` instance. To view logs from a specific instance, get the instance list from `arena get`, then specify the instance name with `-i`.
View default logs (worker-0):
```shell
arena logs -n default tf-mnist-dist
```

Expected output:

```
WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:120: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
...
Accuracy at step 960: 0.9691
Accuracy at step 970: 0.9677
Accuracy at step 980: 0.9687
Accuracy at step 990: 0.968
Adding run metadata for 999
Total Train-accuracy=0.968
```

View logs from a specific instance:
Get the instance list:

```shell
arena get tf-mnist-dist
```

Expected output:

```
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m

NAME           STATUS     TRAINER  AGE  INSTANCE                NODE
tf-mnist-dist  SUCCEEDED  TFJOB    5m   tf-mnist-dist-ps-0      192.16x.x.xx
tf-mnist-dist  SUCCEEDED  TFJOB    5m   tf-mnist-dist-worker-0  192.16x.x.xx
tf-mnist-dist  SUCCEEDED  TFJOB    5m   tf-mnist-dist-worker-1  192.16x.x.xx

Your tensorboard will be available on:
http://192.16x.x.xx:31870
```

Then view logs from a specific worker instance:
```shell
arena logs tf-mnist-dist -i tf-mnist-dist-worker-1
```

Expected output:

```
WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:120: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
...
Accuracy at step 970: 0.9676
Accuracy at step 980: 0.968
Accuracy at step 990: 0.967
Adding run metadata for 999
Total Train-accuracy=0.967
```

Other log commands:
| Command | Description |
|---|---|
| `arena logs <job> -f` | Stream real-time log output |
| `arena logs <job> -t N` | Show the last N lines of logs |
| `arena logs --help` | List all available log options |
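To inspect every instance in one pass, the instance names can be parsed out of the `arena get` output and fed back into `arena logs -i`. A sketch, assuming the six-column row layout shown above (INSTANCE is the fifth column); `list_instances` is a hypothetical helper name, not part of Arena:

```shell
# Print the INSTANCE column (5th field) of rows belonging to a job,
# from `arena get <job>` output read on stdin.
list_instances() {
  awk -v job="$1" '$1 == job && NF >= 6 { print $5 }'
}

# Against a live cluster: tail the last 5 log lines of each instance.
# arena get tf-mnist-dist | list_instances tf-mnist-dist | while read -r inst; do
#   echo "=== $inst ==="
#   arena logs tf-mnist-dist -i "$inst" -t 5
# done
```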
For example, to view the last 5 lines:
```shell
arena logs tf-mnist-dist -t 5
```

Expected output:

```
Accuracy at step 9970: 0.9834
Accuracy at step 9980: 0.9828
Accuracy at step 9990: 0.9816
Adding run metadata for 9999
Total Train-accuracy=0.9816
```

(Optional) Step 6: Clean up
After the job completes, delete it to release resources:
```shell
arena delete -n default tf-mnist-dist
```

Expected output:

```
INFO[0002] The training job tf-mnist-dist has been deleted successfully
```