Container Service for Kubernetes: Submit a single-node TensorFlow training job using Arena

Last Updated: Mar 26, 2026

This topic describes how to submit a single-node TensorFlow training job using Arena and monitor training progress with TensorBoard.

Prerequisites

Before you begin, ensure that you have:

Background

This example pulls training code from a GitHub repository using git-sync. The MNIST dataset is stored in shared storage backed by NAS (Network Attached Storage) and is accessed through a persistent volume claim (PVC) named training-data. The dataset is in the tf_data directory inside the PVC.

When you run arena submit tf, Arena creates a TFJob custom resource in your cluster. If you enable TensorBoard, Arena also creates a TensorBoard deployment and service. Understanding these Kubernetes objects helps when you need to debug failed jobs or inspect cluster resources directly.

Step 1: Check GPU availability

Before submitting a job, confirm that GPU resources are available in the cluster:

arena top node

Expected output:

NAME                        IPADDRESS        ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/4 (0.0%)

The cluster has two GPU nodes, each with two idle GPUs available.
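
The check above can also be scripted. The following is a minimal sketch, not part of Arena: idle_gpus is a hypothetical helper name, and it assumes that GPU(Total) and GPU(Allocated) are the last two columns of the arena top node output, as in the sample shown here. Other Arena versions may print a different layout.

```shell
# Sum idle GPUs (total minus allocated) from `arena top node` output on stdin.
# Assumption: GPU(Total) and GPU(Allocated) are the last two columns.
idle_gpus() {
  awk '
    $1 == "NAME" { next }                     # skip the header row
    $1 ~ /^-+$/  { exit }                     # stop at the summary separator
    NF >= 2      { idle += $(NF - 1) - $NF }  # total minus allocated
    END          { print idle + 0 }
  '
}

# Usage (requires the arena client):
#   arena top node | idle_gpus
```

For the sample output above, the helper prints 4. A wrapper script could refuse to submit a job when the result is lower than the value passed to --gpus.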

Step 2: Submit a TensorFlow training job

Submit a single-node, single-GPU TensorFlow training job:

arena submit tf \
    --name=tf-mnist \
    --working-dir=/root \
    --workers=1 \
    --gpus=1 \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --sync-mode=git \
    --sync-source=https://github.com/kubeflow/arena.git \
    --env=GIT_SYNC_BRANCH=master \
    --data=training-data:/mnt \
    --tensorboard \
    --logdir=/mnt/tf_data/logs \
    "python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"

Expected output:

service/tf-mnist-tensorboard created
deployment.apps/tf-mnist-tensorboard created
tfjob.kubeflow.org/tf-mnist created
INFO[0005] The Job tf-mnist has been submitted successfully
INFO[0005] You can run `arena get tf-mnist --type tfjob -n default` to check the job status
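
The three objects reported in the output above can be inspected directly with kubectl when a job needs debugging. The sketch below only derives their names from the job name. arena_tf_objects is a hypothetical helper, and the <job>-tensorboard naming is inferred from this sample output rather than taken from a documented guarantee.

```shell
# Print the Kubernetes objects that `arena submit tf --tensorboard` reported
# creating for a job. Assumption: the TensorBoard deployment and service are
# named "<job>-tensorboard", as in the sample output.
arena_tf_objects() {
  printf '%s\n' \
    "tfjob.kubeflow.org/$1" \
    "deployment.apps/$1-tensorboard" \
    "service/$1-tensorboard"
}

# Usage (requires kubectl access to the cluster):
#   kubectl get -n default $(arena_tf_objects tf-mnist)
#   kubectl describe -n default tfjob.kubeflow.org/tf-mnist
```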

Parameters

The command supports the following categories of configuration:

  • Job identity and compute: --name, --workers, --gpus, --image

  • Code synchronization: --sync-mode, --sync-source, --env

  • Data access: --data

  • Monitoring: --tensorboard, --logdir

Parameter | Required | Description | Default
--name | Required | The job name. Must be globally unique. | None
--working-dir | Optional | The directory where the training command runs. | /root
--gpus | Optional | The number of GPUs allocated to each worker node. | 0
--image | Required | The address of the container image for the training environment. | None
--sync-mode | Optional | The code synchronization mode. Valid values: git, rsync. | None
--sync-source | Optional | The repository URL for code synchronization. Use with --sync-mode. The code is cloned to the code/ directory under --working-dir. In this example, the cloned path is /root/code/arena. | None
--data | Optional | Mounts a PVC to the training environment. Format: <pvc-name>:<mount-path>. Run arena data list to see available PVCs in the current cluster. | None
--tensorboard | Optional | Starts a TensorBoard service for the training job. If omitted, TensorBoard is not started. | None
--logdir | Optional | The path TensorBoard reads event data from. Use with --tensorboard. | /training_logs
Note

To verify that the training-data PVC is available, run arena data list. Expected output:

NAME           ACCESSMODE    DESCRIPTION  OWNER  AGE
training-data  ReadWriteMany              35m

If no PVC is listed, create one. For details, see Configure NAS shared storage.
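
The paths in the training command follow from --working-dir, --sync-source, and --data. The sketch below shows the derivation. The helper names are hypothetical, and using the last URL segment without .git as the directory name is an assumption that matches the /root/code/arena path in this example.

```shell
# Where the synchronized repository ends up: <working-dir>/code/<repo-name>.
# Assumption: <repo-name> is the last URL segment without a ".git" suffix.
synced_code_dir() (
  working_dir=$1
  repo=${2##*/}        # last path segment of the URL, for example arena.git
  repo=${repo%.git}    # drop the optional .git suffix
  printf '%s/code/%s\n' "${working_dir%/}" "$repo"
)

# Split a --data value of the form <pvc-name>:<mount-path>.
data_pvc()   { printf '%s\n' "${1%%:*}"; }
data_mount() { printf '%s\n' "${1#*:}"; }

# Usage:
#   synced_code_dir /root https://github.com/kubeflow/arena.git
#   data_mount training-data:/mnt
```

With the values from this topic, the first call yields /root/code/arena and the second yields /mnt, which is why the training command reads the script from /root/code/arena/examples/tensorflow/mnist/main.py and the dataset from /mnt/tf_data.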

Use a private Git repository

To pull code from a private repository, set the GIT_SYNC_USERNAME and GIT_SYNC_PASSWORD environment variables:

arena submit tf \
    --name=tf-mnist \
    --working-dir=/root \
    --workers=1 \
    --gpus=1 \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --sync-mode=git \
    --sync-source=https://github.com/kubeflow/arena.git \
    --env=GIT_SYNC_BRANCH=master \
    --env=GIT_SYNC_USERNAME=yourname \
    --env=GIT_SYNC_PASSWORD=yourpwd \
    --data=training-data:/mnt \
    --tensorboard \
    --logdir=/mnt/tf_data/logs \
    "python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"

Arena uses git-sync internally for code synchronization. All environment variables defined in the git-sync project are supported.
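
To avoid typing credentials into the command itself, the two --env flags can be generated from variables that are already set in the shell. This is a sketch only: git_sync_env_flags, GIT_USER, and GIT_TOKEN are hypothetical names, and the values must not contain whitespace because the unquoted command substitution splits on it. Values passed with --env become environment variables of the job, so anyone who can read the job definition can likely still see them.

```shell
# Build the git-sync credential flags from the caller's environment.
# GIT_USER and GIT_TOKEN are hypothetical variable names for this sketch.
git_sync_env_flags() {
  if [ -z "${GIT_USER:-}" ] || [ -z "${GIT_TOKEN:-}" ]; then
    echo "GIT_USER and GIT_TOKEN must be set" >&2
    return 1
  fi
  printf '%s\n' \
    "--env=GIT_SYNC_USERNAME=$GIT_USER" \
    "--env=GIT_SYNC_PASSWORD=$GIT_TOKEN"
}

# Usage: place the substitution among the other flags of the command above.
#   arena submit tf --name=tf-mnist ... $(git_sync_env_flags) ...
```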

Important

This example pulls source code from GitHub. If the pull fails due to network issues, use the copy of the sample code that the demo image already includes at /code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py and submit the job without code synchronization:

arena submit tf \
    --name=tf-mnist \
    --working-dir=/root \
    --workers=1 \
    --gpus=1 \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --data=training-data:/mnt \
    --tensorboard \
    --logdir=/mnt/tf_data/logs \
    "python /code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"

Step 3: Monitor the training job

Use the following commands to track job status and GPU usage after submission.

  1. List all Arena-managed jobs:

    arena list

    Expected output:

    NAME      STATUS   TRAINER  DURATION  GPU(Requested)  GPU(Allocated)  NODE
    tf-mnist  RUNNING  TFJOB    3s        1               1               192.168.xxx.xxx
  2. Check GPU usage for the job:

    arena top job

    Expected output:

    NAME      STATUS   TRAINER  AGE  GPU(Requested)  GPU(Allocated)  NODE
    tf-mnist  RUNNING  TFJOB    29s  1               1               192.168.xxx.xxx
    
    Total Allocated/Requested GPUs of Training Jobs: 1/1
  3. Check cluster-level GPU usage:

    arena top node

    Expected output:

    NAME                        IPADDRESS        ROLE    STATUS  GPU(Total)  GPU(Allocated)
    cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
    cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
    cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           1
    cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
    ---------------------------------------------------------------------------------------------------
    Allocated/Total GPUs In Cluster:
    1/4 (25.0%)
  4. View the job details, including instance status and TensorBoard endpoint:

    arena get -n default tf-mnist

    Expected output:

    Name:        tf-mnist
    Status:      RUNNING
    Namespace:   default
    Priority:    N/A
    Trainer:     TFJOB
    Duration:    22s
    CreateTime:  2026-01-26 16:01:42
    EndTime:
    
    Instances:
      NAME              STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
      ----              ------   ---  --------  --------------  ----
      tf-mnist-chief-0  Running  45s  true      1               cn-beijing.192.168.xxx.xxx
    
    Tensorboard:
      Your tensorboard will be available on:
      http://192.168.xxx.xxx:31243
    Note

    The TensorBoard endpoint is shown only when --tensorboard is enabled during job submission.
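
The outputs in this step are plain text, so a script can pull single values out of them. The following sketch assumes the column layout shown above. job_status and tensorboard_url are hypothetical helper names, not Arena subcommands.

```shell
# Print the STATUS column for one job from `arena list` output on stdin.
# Exits non-zero if the job is not listed.
job_status() {
  awk -v job="$1" '
    $1 == job { print $2; found = 1; exit }
    END       { if (!found) exit 1 }
  '
}

# Print the first http(s) URL from `arena get` output on stdin
# (the TensorBoard endpoint in the sample output).
tensorboard_url() {
  awk '$1 ~ /^https?:\/\// { print $1; exit }'
}

# Usage (requires the arena client):
#   arena list | job_status tf-mnist
#   arena get -n default tf-mnist | tensorboard_url
#   # Block while the job is still running:
#   while [ "$(arena list | job_status tf-mnist)" = "RUNNING" ]; do sleep 10; done
```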

Step 4: View TensorBoard

After the job starts, forward the TensorBoard service port to your local machine and open it in a browser.

  1. Map the TensorBoard service to port 9090 on your local machine:

    Important

    Port forwarding with kubectl port-forward is not suitable for production environments because it is not reliable, secure, or extensible. Use it only for development and debugging. For production-grade networking in ACK clusters, see Ingress management.

    kubectl port-forward -n default svc/tf-mnist-tensorboard 9090:6006
  2. In a browser, go to http://localhost:9090 to view the training metrics.

Step 5: View the training job log

View the job log:

arena logs -n default tf-mnist

Expected output:

Train Epoch: 14 [55680/60000 (93%)]     Loss: 0.029811
Train Epoch: 14 [56320/60000 (94%)]     Loss: 0.029721
Train Epoch: 14 [56960/60000 (95%)]     Loss: 0.029682
Train Epoch: 14 [57600/60000 (96%)]     Loss: 0.029781
Train Epoch: 14 [58240/60000 (97%)]     Loss: 0.029708
Train Epoch: 14 [58880/60000 (98%)]     Loss: 0.029761
Train Epoch: 14 [59520/60000 (99%)]     Loss: 0.029684

Test Accuracy: 9842/10000 (98.42%)

938/938 - 3s - loss: 0.0299 - accuracy: 0.9924 - val_loss: 0.0446 - val_accuracy: 0.9842 - lr: 0.0068 - 3s/epoch - 3ms/step
Note
  • Stream logs in real time: add -f

  • Show only the last N lines: add -t N or --tail N

  • See all options: run arena logs --help
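
To record the result of a run, the final accuracy can be extracted from the log. This is a sketch that assumes the log contains a line in the Test Accuracy format shown above. The format comes from the sample training script and may differ for other scripts. final_accuracy is a hypothetical helper name.

```shell
# Print the percentage from the last "Test Accuracy: N/M (P%)" line on stdin.
# Exits non-zero if no such line is present.
final_accuracy() {
  awk '
    /Test Accuracy:/ { acc = $NF; gsub(/[()%]/, "", acc) }
    END { if (acc == "") exit 1; print acc }
  '
}

# Usage (requires the arena client):
#   arena logs -n default tf-mnist | final_accuracy
```

For the sample log above, the helper prints 98.42.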

(Optional) Step 6: Clean up

After the training job completes, delete it to free up cluster resources:

arena delete -n default tf-mnist

Expected output:

INFO[0002] The training job tf-mnist has been deleted successfully