
Container Service for Kubernetes: Submit a TensorFlow distributed training job using Arena

Last Updated: Mar 26, 2026

This document walks you through submitting a TensorFlow distributed training job in PS-Worker mode using Arena and monitoring training progress with TensorBoard.

Prerequisites

Before you begin, ensure that you have:

  • A Container Service for Kubernetes (ACK) cluster with GPU-accelerated nodes.

  • The Arena client installed and configured against the cluster.

  • A persistent volume claim (PVC) named training-data for storing training data and logs.

Step 1: Check available GPU resources

Run the following command to view GPU availability across all nodes:

arena top node

Expected output:

NAME                        IPADDRESS        ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/4 (0.0%)

The cluster has two GPU nodes, each with two idle GPUs available for training.
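
If you want the summary figure without eyeballing the table, the two GPU columns can be totted up with a small shell pipeline. This is an illustrative sketch that inlines sample node rows; in practice, pipe the node lines of the real arena top node output through the awk stage:

```shell
# Sum GPU(Total) (field 5) and GPU(Allocated) (field 6) across node rows.
# The two printf lines stand in for real `arena top node` node rows.
summary=$(printf '%s\n' \
  'cn-beijing.192.168.0.1  192.168.0.1  <none>  Ready  2  0' \
  'cn-beijing.192.168.0.2  192.168.0.2  <none>  Ready  2  0' |
  awk '{alloc += $6; total += $5} END {printf "%d/%d", alloc, total}')
echo "Allocated/Total GPUs: $summary"   # prints: Allocated/Total GPUs: 0/4
```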

Step 2: Submit a TensorFlow distributed training job

The general command format is arena submit tf [flags] "command", where tf is shorthand for the tfjob type.

Important

For distributed training in PS-Worker mode, the following parameters are required:

  • --name — a globally unique job name

  • --ps — number of parameter server (PS) nodes

  • --workers — number of worker nodes

  • --ps-image — container image for the PS node (or --image if PS and workers share the same image)

  • --worker-image — container image for worker nodes (or --image if PS and workers share the same image)

  • --gpus — number of GPUs per worker (required for GPU workloads)
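
As a sanity check before running anything, the required flags can be assembled into a command string and printed for review. This sketch uses hypothetical job and image names (my-tf-dist-job, registry.example.com/my-team/tf-train:latest), not values from this guide:

```shell
# Assemble a minimal PS-Worker submission command from the required flags.
# JOB and IMAGE are placeholders; substitute your own job name and image.
JOB=my-tf-dist-job
IMAGE=registry.example.com/my-team/tf-train:latest
cmd="arena submit tf --name=${JOB} --ps=1 --workers=2 --image=${IMAGE} --gpus=1 \"python /root/train.py\""
echo "$cmd"
```

Because the PS and worker nodes share one image in this sketch, a single --image stands in for the --ps-image/--worker-image pair.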

Submit a distributed training job with one PS node and two worker nodes:

arena submit tf \
    --name=tf-mnist-dist \
    --namespace=default \
    --working-dir=/root \
    --ps=1 \
    --ps-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --workers=2 \
    --worker-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --gpus=1 \
    --sync-mode=git \
    --sync-source=https://github.com/kubeflow/arena.git \
    --env=GIT_SYNC_BRANCH=master \
    --data=training-data:/mnt \
    --tensorboard \
    --logdir=/mnt/tf_data/logs \
    "python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"

Expected output:

service/tf-mnist-dist-tensorboard created
deployment.apps/tf-mnist-dist-tensorboard created
tfjob.kubeflow.org/tf-mnist-dist created
INFO[0004] The Job tf-mnist-dist has been submitted successfully
INFO[0004] You can run `arena get tf-mnist-dist --type tfjob -n default` to check the job status

The following table describes the key parameters. For a full parameter reference, run arena submit tf --help.

| Parameter | Required | Description | Default |
|---|---|---|---|
| --name | Yes | Unique job name within the cluster. | None |
| --working-dir | No | Directory where the training command runs. Synced code is placed under code/ inside this directory. | /root |
| --gpus | No | Number of GPUs allocated per worker node. | 0 |
| --workers | No | Number of worker nodes. | 1 |
| --image | Required if --worker-image and --ps-image are not specified separately | Container image for both worker and PS nodes. Overridden by --worker-image or --ps-image if specified. | None |
| --worker-image | Required if --image is not specified | Container image for worker nodes. Takes precedence over --image. | None |
| --ps | Yes (distributed jobs) | Number of PS nodes. | 0 |
| --ps-image | Required if --image is not specified | Container image for PS nodes. Takes precedence over --image. | None |
| --sync-mode | No | Code synchronization mode: git or rsync. | None |
| --sync-source | No | Source repository URL for code synchronization. Use with --sync-mode. The code is downloaded to code/ under --working-dir. | None |
| --data | No | Mounts a PVC into the training environment. Format: <pvc-name>:<mount-path>. Run arena data list to see available PVCs. | None |
| --tensorboard | No | Enables a TensorBoard service for the training job. Use --logdir to specify where TensorBoard reads event data. | None |
| --logdir | No | Path where TensorBoard reads event data. Use with --tensorboard. | /training_logs |

Using a private Git repository

Arena uses git-sync to pull source code. For private repositories, pass credentials as environment variables:

arena submit tf \
    --name=tf-mnist-dist \
    --namespace=default \
    --working-dir=/root \
    --ps=1 \
    --ps-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --workers=2 \
    --worker-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --gpus=1 \
    --sync-mode=git \
    --sync-source=https://github.com/kubeflow/arena.git \
    --env=GIT_SYNC_BRANCH=master \
    --env=GIT_SYNC_USERNAME=<your-username> \
    --env=GIT_SYNC_PASSWORD=<your-password> \
    --data=training-data:/mnt \
    --tensorboard \
    --logdir=/mnt/tf_data/logs \
    "python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"

You can set any environment variable defined in the git-sync project using --env.
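
For example, to pin the sync to a specific revision or keep the clone shallow, flags like the following can be appended to the arena submit command above. GIT_SYNC_REV and GIT_SYNC_DEPTH are variables from the git-sync project, but confirm they are supported by the git-sync version your Arena deployment ships; the values are placeholders:

```shell
    --env=GIT_SYNC_REV=<commit-or-tag> \   # check out a specific revision
    --env=GIT_SYNC_DEPTH=1 \               # shallow clone for faster syncs
```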

Important

This example pulls source code from GitHub. If the pull fails due to network issues, the demo image already contains the example code at /code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py. Submit the job without --sync-mode and --sync-source:

arena submit tf \
    --name=tf-mnist-dist \
    --namespace=default \
    --working-dir=/root \
    --ps=1 \
    --ps-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --workers=2 \
    --worker-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --gpus=1 \
    --data=training-data:/mnt \
    --tensorboard \
    --logdir=/mnt/tf_data/logs \
    "python /code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"

Step 3: Monitor the training job

List all submitted jobs:

arena list

Expected output:

NAME           STATUS   TRAINER  AGE  NODE
tf-mnist-dist  RUNNING  TFJOB    58s  192.1xx.x.xx

Check job details and locate each instance:

arena get -n default tf-mnist-dist

Expected output:

STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m

NAME           STATUS   TRAINER  AGE  INSTANCE                NODE
tf-mnist-dist  RUNNING  TFJOB    1m   tf-mnist-dist-ps-0      192.1xx.x.xx
tf-mnist-dist  RUNNING  TFJOB    1m   tf-mnist-dist-worker-0  192.1xx.x.xx
tf-mnist-dist  RUNNING  TFJOB    1m   tf-mnist-dist-worker-1  192.1xx.x.xx

Your tensorboard will be available on:
http://192.1xx.x.xx:31870

Note

The TensorBoard endpoint lines appear only if --tensorboard was enabled during submission.

Check GPU allocation by job:

arena top job

Expected output:

NAME           GPU(Requests)  GPU(Allocated)  STATUS     TRAINER  AGE  NODE
tf-mnist-dist  2              2               RUNNING    tfjob    1m   192.1xx.x.x
tf-git         1              0               SUCCEEDED  tfjob    2h   N/A

Total Allocated GPUs of Training Job:
2

Total Requested GPUs of Training Job:
3

The totals differ because the completed job tf-git still counts toward requested GPUs, while its GPU has already been released and no longer counts as allocated.

Check GPU allocation across all nodes:

arena top node

Expected output:

NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  master  ready   0           0
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  master  ready   0           0
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  master  ready   0           0
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  <none>  ready   2           1
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  <none>  ready   2           1
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  <none>  ready   2           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
2/6 (33%)

Step 4: View TensorBoard

Important

kubectl port-forward is intended for development and debugging only. It does not provide production-level reliability or security. For production-ready networking, see Ingress management.

  1. Forward the TensorBoard service to local port 9090:

    kubectl port-forward -n default svc/tf-mnist-dist-tensorboard 9090:6006

  2. Open http://localhost:9090 in your browser to access TensorBoard.


Step 5: View training job logs

By default, arena logs streams logs from the worker-0 instance. To view logs from a specific instance, get the instance list from arena get, then specify the instance name with -i.

View default logs (worker-0):

arena logs -n default tf-mnist-dist

Expected output:

WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:120: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
...
Accuracy at step 960: 0.9691
Accuracy at step 970: 0.9677
Accuracy at step 980: 0.9687
Accuracy at step 990: 0.968
Adding run metadata for 999
Total Train-accuracy=0.968
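
When only the headline metric matters, the final accuracy can be extracted from the log text with standard tools. The sketch below inlines a two-line sample; in practice, pipe the output of arena logs into the awk stage:

```shell
# Pull the value after "Total Train-accuracy=" out of training logs.
# $logs stands in for real `arena logs` output.
logs='Adding run metadata for 999
Total Train-accuracy=0.968'
acc=$(printf '%s\n' "$logs" | awk -F= '/^Total Train-accuracy/ {print $2}')
echo "Final accuracy: $acc"   # prints: Final accuracy: 0.968
```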

View logs from a specific instance:

# Get the instance list
arena get -n default tf-mnist-dist

Expected output:

STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m

NAME           STATUS     TRAINER  AGE  INSTANCE                NODE
tf-mnist-dist  SUCCEEDED  TFJOB    5m   tf-mnist-dist-ps-0      192.16x.x.xx
tf-mnist-dist  SUCCEEDED  TFJOB    5m   tf-mnist-dist-worker-0  192.16x.x.xx
tf-mnist-dist  SUCCEEDED  TFJOB    5m   tf-mnist-dist-worker-1  192.16x.x.xx

Your tensorboard will be available on:
http://192.16x.x.xx:31870

# View logs from a specific worker instance
arena logs tf-mnist-dist -i tf-mnist-dist-worker-1

Expected output:

WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:120: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
...
Accuracy at step 970: 0.9676
Accuracy at step 980: 0.968
Accuracy at step 990: 0.967
Adding run metadata for 999
Total Train-accuracy=0.967

Other log commands:

| Command | Description |
|---|---|
| arena logs <job> -f | Stream real-time log output |
| arena logs <job> -t N | Show the last N lines of logs |
| arena logs --help | List all available log options |

For example, to view the last 5 lines:

arena logs tf-mnist-dist -t 5

Expected output:

Accuracy at step 9970: 0.9834
Accuracy at step 9980: 0.9828
Accuracy at step 9990: 0.9816
Adding run metadata for 9999
Total Train-accuracy=0.9816

(Optional) Step 6: Clean up

After the job completes, delete it to release resources:

arena delete -n default tf-mnist-dist

Expected output:

INFO[0002] The training job tf-mnist-dist has been deleted successfully