This document walks you through submitting a TensorFlow distributed training job in PS-Worker mode using Arena and monitoring training progress with TensorBoard.
Prerequisites
Before you begin, ensure that you have:
- A Kubernetes cluster with GPU nodes. See Create a Kubernetes cluster that contains GPUs.
- Cluster nodes with public network access. See Enable Internet access for a cluster.
- The Arena client installed. See Configure the Arena client.
- A Persistent Volume Claim (PVC) named `training-data`, with the MNIST dataset stored under the `tf_data` path. See Configure NAS shared storage.
Step 1: Check available GPU resources
Run the following command to view GPU availability across all nodes:
```shell
arena top node
```

Expected output:

```
NAME                        IPADDRESS        ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/4 (0.0%)
```

The cluster has two GPU nodes, each with two idle GPUs available for training.
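If you want to script this check, the idle-GPU count can be derived from the tabular output with `awk`. A minimal sketch, assuming the column layout shown above (GPU(Total) and GPU(Allocated) as the fifth and sixth columns); `count_idle_gpus` is a hypothetical helper name, not part of Arena:

```shell
# Sum idle GPUs from `arena top node` output read on stdin.
# Data rows look like: NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
count_idle_gpus() {
  # Skip the header (NR > 1) and any separator/summary lines whose
  # fifth field is not a number, then sum total minus allocated.
  awk 'NR > 1 && $5 ~ /^[0-9]+$/ { idle += $5 - $6 } END { print idle + 0 }'
}

# Against a live cluster:
# arena top node | count_idle_gpus
```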
Step 2: Submit a TensorFlow distributed training job
The general command format is `arena submit tf [--flags] "<training command>"` (`tfjob` is an alias for `tf`).
For distributed training in PS-Worker mode, the following parameters are required:
- `--name`: a globally unique job name.
- `--ps`: number of parameter server (PS) nodes.
- `--workers`: number of worker nodes.
- `--ps-image`: container image for the PS nodes (or `--image` if PS and workers share the same image).
- `--worker-image`: container image for worker nodes (or `--image` if PS and workers share the same image).
- `--gpus`: number of GPUs per worker (required for GPU workloads).
Submit a distributed training job with one PS node and two worker nodes:
```shell
arena submit tf \
  --name=tf-mnist-dist \
  --namespace=default \
  --working-dir=/root \
  --ps=1 \
  --ps-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
  --workers=2 \
  --worker-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
  --gpus=1 \
  --sync-mode=git \
  --sync-source=https://github.com/kubeflow/arena.git \
  --env=GIT_SYNC_BRANCH=master \
  --data=training-data:/mnt \
  --tensorboard \
  --logdir=/mnt/tf_data/logs \
  "python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"
```

Expected output:

```
service/tf-mnist-dist-tensorboard created
deployment.apps/tf-mnist-dist-tensorboard created
tfjob.kubeflow.org/tf-mnist-dist created
INFO[0004] The Job tf-mnist-dist has been submitted successfully
INFO[0004] You can run `arena get tf-mnist-dist --type tfjob -n default` to check the job status
```

The following table describes the key parameters. For a full parameter reference, run `arena submit tf --help`.
| Parameter | Required | Description | Default |
|---|---|---|---|
| `--name` | Yes | Unique job name within the cluster. | None |
| `--working-dir` | No | Directory where the training command runs. Synced code is placed under `code/` inside this directory. | `/root` |
| `--gpus` | No | Number of GPUs allocated per worker node. | 0 |
| `--workers` | No | Number of worker nodes. | 1 |
| `--image` | Required if `--worker-image` and `--ps-image` are not specified separately | Container image for both worker and PS nodes. Overridden by `--worker-image` or `--ps-image` if specified. | None |
| `--worker-image` | Required if `--image` is not specified | Container image for worker nodes. Takes precedence over `--image`. | None |
| `--ps` | Yes (distributed jobs) | Number of PS nodes. | 0 |
| `--ps-image` | Required if `--image` is not specified | Container image for PS nodes. Takes precedence over `--image`. | None |
| `--sync-mode` | No | Code synchronization mode: `git` or `rsync`. | None |
| `--sync-source` | No | Source repository URL for code synchronization. Use with `--sync-mode`. The code is downloaded to `code/` under `--working-dir`. | None |
| `--data` | No | Mounts a PVC into the training environment. Format: `<pvc-name>:<mount-path>`. Run `arena data list` to see available PVCs. | None |
| `--tensorboard` | No | Enables a TensorBoard service for the training job. Use `--logdir` to specify where TensorBoard reads event data. | None |
| `--logdir` | No | Path where TensorBoard reads event data. Use with `--tensorboard`. | `/training_logs` |
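Because the PS and worker nodes in this example run the same image, the two image flags can be collapsed into a single `--image` flag, per the `--image` semantics in the table above. A sketch of the equivalent submission:

```shell
# Same job as above, but with one --image flag covering both PS and workers.
arena submit tf \
  --name=tf-mnist-dist \
  --namespace=default \
  --working-dir=/root \
  --ps=1 \
  --workers=2 \
  --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
  --gpus=1 \
  --sync-mode=git \
  --sync-source=https://github.com/kubeflow/arena.git \
  --env=GIT_SYNC_BRANCH=master \
  --data=training-data:/mnt \
  --tensorboard \
  --logdir=/mnt/tf_data/logs \
  "python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"
```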
Using a private Git repository
Arena uses git-sync to pull source code. For private repositories, pass credentials as environment variables:
```shell
arena submit tf \
  --name=tf-mnist-dist \
  --namespace=default \
  --working-dir=/root \
  --ps=1 \
  --ps-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
  --workers=2 \
  --worker-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
  --gpus=1 \
  --sync-mode=git \
  --sync-source=https://github.com/kubeflow/arena.git \
  --env=GIT_SYNC_BRANCH=master \
  --env=GIT_SYNC_USERNAME=<your-username> \
  --env=GIT_SYNC_PASSWORD=<your-password> \
  --data=training-data:/mnt \
  --tensorboard \
  --logdir=/mnt/tf_data/logs \
  "python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"
```

You can set any environment variable defined in the git-sync project using `--env`.
This example pulls source code from GitHub. If the pull fails due to network issues, the demo image already contains the example code at `/code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py`. Submit the job without `--sync-mode` and `--sync-source`:
```shell
arena submit tf \
  --name=tf-mnist-dist \
  --namespace=default \
  --working-dir=/root \
  --ps=1 \
  --ps-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
  --workers=2 \
  --worker-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
  --gpus=1 \
  --data=training-data:/mnt \
  --tensorboard \
  --logdir=/mnt/tf_data/logs \
  "python /code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"
```

Step 3: Monitor the training job
List all submitted jobs:
```shell
arena list
```

Expected output:

```
NAME           STATUS   TRAINER  AGE  NODE
tf-mnist-dist  RUNNING  TFJOB    58s  192.1xx.x.xx
```

Check job details and locate each instance:
```shell
arena get -n default tf-mnist-dist
```

Expected output:

```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m

NAME           STATUS   TRAINER  AGE  INSTANCE                NODE
tf-mnist-dist  RUNNING  TFJOB    1m   tf-mnist-dist-ps-0      192.1xx.x.xx
tf-mnist-dist  RUNNING  TFJOB    1m   tf-mnist-dist-worker-0  192.1xx.x.xx
tf-mnist-dist  RUNNING  TFJOB    1m   tf-mnist-dist-worker-1  192.1xx.x.xx

Your tensorboard will be available on:
http://192.1xx.x.xx:31870
```

The TensorBoard endpoint lines appear only if `--tensorboard` was enabled during submission.
Check GPU allocation by job:
```shell
arena top job
```

Expected output:

```
NAME           GPU(Requests)  GPU(Allocated)  STATUS     TRAINER  AGE  NODE
tf-mnist-dist  2              2               RUNNING    tfjob    1m   192.1xx.x.x
tf-git         1              0               SUCCEEDED  tfjob    2h   N/A

Total Allocated GPUs of Training Job:
2

Total Requested GPUs of Training Job:
3
```

Check GPU allocation across all nodes:
```shell
arena top node
```

Expected output:

```
NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  master  ready   0           0
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  master  ready   0           0
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  master  ready   0           0
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  <none>  ready   2           1
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  <none>  ready   2           1
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  <none>  ready   2           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
2/6 (33%)
```

Step 4: View TensorBoard
Note: `kubectl port-forward` is intended for development and debugging only. It does not provide production-grade reliability or security. For production-ready networking, see Ingress management.
Forward the TensorBoard service to local port 9090:
```shell
kubectl port-forward -n default svc/tf-mnist-dist-tensorboard 9090:6006
```

Open `localhost:9090` in your browser to access TensorBoard.

Step 5: View training job logs
By default, `arena logs` streams logs from the `worker-0` instance. To view logs from a specific instance, get the instance list from `arena get`, then specify the instance name with `-i`.
View default logs (worker-0):
```shell
arena logs -n default tf-mnist-dist
```

Expected output:

```
WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:120: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
...
Accuracy at step 960: 0.9691
Accuracy at step 970: 0.9677
Accuracy at step 980: 0.9687
Accuracy at step 990: 0.968
Adding run metadata for 999
Total Train-accuracy=0.968
```

View logs from a specific instance:
Get the instance list:

```shell
arena get tf-mnist-dist
```

Expected output:

```
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m

NAME           STATUS     TRAINER  AGE  INSTANCE                NODE
tf-mnist-dist  SUCCEEDED  TFJOB    5m   tf-mnist-dist-ps-0      192.16x.x.xx
tf-mnist-dist  SUCCEEDED  TFJOB    5m   tf-mnist-dist-worker-0  192.16x.x.xx
tf-mnist-dist  SUCCEEDED  TFJOB    5m   tf-mnist-dist-worker-1  192.16x.x.xx

Your tensorboard will be available on:
http://192.16x.x.xx:31870
```

Then view logs from a specific worker instance:
```shell
arena logs tf-mnist-dist -i tf-mnist-dist-worker-1
```

Expected output:

```
WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:120: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
...
Accuracy at step 970: 0.9676
Accuracy at step 980: 0.968
Accuracy at step 990: 0.967
Adding run metadata for 999
Total Train-accuracy=0.967
```

Other log commands:
| Command | Description |
|---|---|
| `arena logs <job> -f` | Stream real-time log output |
| `arena logs <job> -t N` | Show the last N lines of logs |
| `arena logs --help` | List all available log options |
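To inspect every instance in one pass, the instance names can be parsed out of the `arena get` output and fed back into `arena logs -i`. A sketch, assuming the six-column row layout shown above (INSTANCE is the fifth column); `list_instances` is a hypothetical helper name, not part of Arena:

```shell
# Print the INSTANCE column (5th field) of rows belonging to a job,
# from `arena get <job>` output read on stdin.
list_instances() {
  awk -v job="$1" '$1 == job && NF >= 6 { print $5 }'
}

# Against a live cluster: tail the last 5 log lines of each instance.
# arena get tf-mnist-dist | list_instances tf-mnist-dist | while read -r inst; do
#   echo "=== $inst ==="
#   arena logs tf-mnist-dist -i "$inst" -t 5
# done
```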
For example, to view the last 5 lines:
```shell
arena logs tf-mnist-dist -t 5
```

Expected output:

```
Accuracy at step 9970: 0.9834
Accuracy at step 9980: 0.9828
Accuracy at step 9990: 0.9816
Adding run metadata for 9999
Total Train-accuracy=0.9816
```

(Optional) Step 6: Clean up
After the job completes, delete it to release resources:
```shell
arena delete -n default tf-mnist-dist
```

Expected output:

```
INFO[0002] The training job tf-mnist-dist has been deleted successfully
```