
Container Service for Kubernetes: Submit a TensorFlow distributed training job using Arena

Last Updated: Mar 11, 2026

This topic describes how to submit a TensorFlow distributed training job based on the PS-Worker model using Arena and visualize the training job using TensorBoard.

Prerequisites

Background information

This example downloads source code from a Git URL. The dataset is stored in shared storage (a PV and PVC backed by NAS). The example assumes that you have a PersistentVolumeClaim (PVC) named training-data that points to this shared storage, and that it contains a directory tf_data holding the dataset used in this example.
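The linked Configure NAS shared storage guide covers creating this PVC. Purely as an illustration, a NAS-backed PVC of the kind this example assumes might look like the following sketch. The storage size and the empty storageClassName (used here to bind a pre-created NAS PV) are placeholder assumptions, not values from this topic:

```yaml
# Illustrative only: a ReadWriteMany PVC named training-data, as this
# example assumes. Size and binding details are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
  namespace: default
spec:
  accessModes:
    - ReadWriteMany          # required so PS and Worker pods can share the volume
  storageClassName: ""       # bind to a pre-created NAS-backed PV
  resources:
    requests:
      storage: 10Gi
```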

Procedure

Step 1: View GPU resources

arena top node

Expected output:

NAME                        IPADDRESS        ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/4 (0.0%)

The output shows that the cluster has two GPU nodes. Each node has two idle GPUs available for training jobs.
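The node table can be filtered with standard text tools to list only the GPU nodes. A minimal sketch, with the sample output above embedded as a here-document standing in for a live `arena top node` call:

```shell
# Sample `arena top node` output, embedded so the sketch runs anywhere;
# on a real cluster, pipe `arena top node` directly instead.
arena_top_node_sample() {
cat <<'EOF'
NAME                        IPADDRESS        ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
EOF
}

# Keep rows where GPU(Total) > 0 and report free GPUs (total - allocated).
arena_top_node_sample | awk 'NR > 1 && $5 > 0 { print $1, $5 - $6, "free" }'
```

Run against the sample, this prints the two GPU nodes, each with 2 free GPUs.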

Step 2: Submit a TensorFlow training job

Submit a TensorFlow job by running a command in the format arena submit tf [--flags] "command".

The following code example submits a TensorFlow distributed training job in PS-Worker mode with one PS node and two Worker nodes.

arena submit tf \
    --name=tf-mnist-dist \
    --namespace=default \
    --working-dir=/root \
    --ps=1 \
    --ps-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --workers=2 \
    --worker-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --gpus=1 \
    --sync-mode=git \
    --sync-source=https://github.com/kubeflow/arena.git \
    --env=GIT_SYNC_BRANCH=master \
    --data=training-data:/mnt \
    --tensorboard \
    --logdir=/mnt/tf_data/logs \
    "python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"

Expected output:

service/tf-mnist-dist-tensorboard created
deployment.apps/tf-mnist-dist-tensorboard created
tfjob.kubeflow.org/tf-mnist-dist created
INFO[0004] The Job tf-mnist-dist has been submitted successfully
INFO[0004] You can run `arena get tf-mnist-dist --type tfjob -n default` to check the job status

The following table describes the parameters.

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| --name | Required | The name of the submitted job. The name must be globally unique. | None |
| --working-dir | Optional | The directory in which the command runs. | /root |
| --gpus | Optional | The number of GPUs used by each Worker node of the job. | 0 |
| --workers | Optional | The number of Worker nodes of the job. | 1 |
| --image | Required if --worker-image and --ps-image are not specified separately | The address of the training environment image. If neither --worker-image nor --ps-image is specified, both Worker nodes and PS nodes use this image. | None |
| --worker-image | Required if --image is not specified | The address of the image used by the Worker nodes. If both parameters are specified, --worker-image takes precedence over --image. | None |
| --sync-mode | Optional | The code synchronization mode. Valid values: git and rsync. This topic uses the git mode. | None |
| --sync-source | Optional | The repository address for code synchronization. Use this parameter together with --sync-mode. In git mode, this can be the address of any GitHub project or another Git-supported code hosting service, such as Alibaba Cloud Code. The project code is downloaded to the code/ directory under --working-dir. In this example, the resulting path is /root/code/arena. | None |
| --ps | Required for distributed jobs | The number of parameter server (PS) nodes. | 0 |
| --ps-image | Required if --image is not specified | The address of the image used by the PS nodes. If both parameters are specified, --ps-image takes precedence over --image. | None |
| --data | Optional | Mounts a shared storage volume (PVC) into the runtime environment. The value consists of two parts separated by a colon (:). The part before the colon is the name of the PVC you prepared; run arena data list to view the PVCs available in the current cluster. The part after the colon is the path where the PVC is mounted in the runtime environment, which is also the local path from which your training code reads data. | None |
| --tensorboard | Optional | Enables a TensorBoard service for the training job to visualize data. Use --logdir to specify the event path that TensorBoard reads. If this parameter is not specified, no TensorBoard service is deployed. | None |
| --logdir | Optional | The path from which TensorBoard reads event data. Use this parameter together with --tensorboard. | /training_logs |

Note

Run arena data list to view the PVCs available in the current cluster for this example.

NAME           ACCESSMODE     DESCRIPTION  OWNER  AGE
training-data  ReadWriteMany                      35m

If no PVCs are available, create one. For more information, see Configure NAS shared storage.
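As the --data row above notes, the value splits on the first colon into a PVC name and a mount path. A quick illustration using shell parameter expansion (this mirrors the flag's format, not Arena's internal implementation):

```shell
# --data takes the form <pvc-name>:<mount-path>.
data_flag="training-data:/mnt"
pvc_name=${data_flag%%:*}     # text before the first colon -> PVC name
mount_path=${data_flag#*:}    # text after the first colon  -> mount path
echo "PVC: $pvc_name, mounted at: $mount_path"
```

With this example's value, the script prints PVC: training-data, mounted at: /mnt; the training code then reads its data under /mnt.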

Note

If you use a private Git repository, set the Git username and password by configuring the environment variables GIT_SYNC_USERNAME and GIT_SYNC_PASSWORD.

 arena submit tf \
    --name=tf-mnist-dist \
    --namespace=default \
    --working-dir=/root \
    --ps=1 \
    --ps-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --workers=2 \
    --worker-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --gpus=1 \
    --sync-mode=git \
    --sync-source=https://github.com/kubeflow/arena.git \
    --env=GIT_SYNC_BRANCH=master \
    --env=GIT_SYNC_USERNAME=yourname \
    --env=GIT_SYNC_PASSWORD=yourpwd \
    --data=training-data:/mnt \
    --tensorboard \
    --logdir=/mnt/tf_data/logs \
    "python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"

The arena command uses git-sync to synchronize source code. You can set the environment variables defined in the git-sync project.

Important

This example pulls source code from a GitHub repository. If the code fails to pull due to network issues or other reasons, manually download the code to the shared storage system. The demo image provided in this topic already contains the example code /code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py. You can submit the training job directly as follows:

arena submit tf \
    --name=tf-mnist-dist \
    --namespace=default \
    --working-dir=/root \
    --ps=1 \
    --ps-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --workers=2 \
    --worker-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --gpus=1 \
    --data=training-data:/mnt \
    --tensorboard \
    --logdir=/mnt/tf_data/logs \
    "python /code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"

Step 3: View TensorFlow training jobs

  1. View all training jobs submitted through Arena.

    arena list

    Expected output:

    NAME           STATUS   TRAINER  AGE  NODE
    tf-mnist-dist  RUNNING  TFJOB    58s  192.1xx.x.xx
  2. Run the following command to check the GPU resources used by the job.

    arena top job

    Expected output:

    NAME           GPU(Requests)  GPU(Allocated)  STATUS     TRAINER  AGE  NODE
    tf-mnist-dist  2              2               RUNNING    tfjob    1m   192.1xx.x.x
    tf-git         1              0               SUCCEEDED  tfjob    2h   N/A
    
    Total Allocated GPUs of Training Job:
    2
    
    Total Requested GPUs of Training Job:
    3
  3. Run the following command to check the GPU resources used by the cluster.

    arena top node

    Expected output:

    NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
    cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  master  ready   0           0
    cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  master  ready   0           0
    cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  master  ready   0           0
    cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  <none>  ready   2           1
    cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  <none>  ready   2           1
    cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  <none>  ready   2           0
    -----------------------------------------------------------------------------------------
    Allocated/Total GPUs In Cluster:
    2/6 (33%)
  4. Run the following command to get the job details.

    arena get -n default tf-mnist-dist

    Expected output:

    STATUS: RUNNING
    NAMESPACE: default
    PRIORITY: N/A
    TRAINING DURATION: 1m
    
    NAME           STATUS   TRAINER  AGE  INSTANCE                NODE
    tf-mnist-dist  RUNNING  TFJOB    1m   tf-mnist-dist-ps-0      192.1xx.x.xx
    tf-mnist-dist  RUNNING  TFJOB    1m   tf-mnist-dist-worker-0  192.1xx.x.xx
    tf-mnist-dist  RUNNING  TFJOB    1m   tf-mnist-dist-worker-1  192.1xx.x.xx
    
    Your tensorboard will be available on:
    http://192.1xx.x.xx:31870

    Note

    This topic shows an example that enables TensorBoard. In the job details above, the last two lines show the TensorBoard web endpoint. If you do not enable TensorBoard, these two lines do not appear.
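The INSTANCE column of the job details table lists the names you can later pass to arena logs -i. A sketch that pulls that column out of the table (embedded here as a here-document for illustration; in practice, pipe the arena get output instead):

```shell
# Sample `arena get` table, embedded so the sketch runs anywhere.
arena_get_sample() {
cat <<'EOF'
NAME           STATUS   TRAINER  AGE  INSTANCE                NODE
tf-mnist-dist  RUNNING  TFJOB    1m   tf-mnist-dist-ps-0      192.1xx.x.xx
tf-mnist-dist  RUNNING  TFJOB    1m   tf-mnist-dist-worker-0  192.1xx.x.xx
tf-mnist-dist  RUNNING  TFJOB    1m   tf-mnist-dist-worker-1  192.1xx.x.xx
EOF
}

# Skip the header and print the fifth column (INSTANCE).
arena_get_sample | awk 'NR > 1 { print $5 }'
```

This prints the three instance names: one PS node and two Worker nodes.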

Step 4: View TensorBoard

View TensorBoard in a browser.

  1. Run the following command locally to forward the TensorBoard service in the cluster to local port 9090.

    kubectl port-forward -n default svc/tf-mnist-dist-tensorboard 9090:6006

    Important

    Port forwarding established by kubectl port-forward does not provide production-grade reliability, security, or scalability. It is suitable only for development and debugging, not for production environments. For more information about production-ready networking solutions in Kubernetes clusters, see Ingress management.
  2. Access localhost:9090 in your browser to view TensorBoard.

Step 5: View training job logs

Run the following command to get job log information.

arena logs -n default tf-mnist-dist

Expected output:

WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:120: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
...
Accuracy at step 960: 0.9691
Accuracy at step 970: 0.9677
Accuracy at step 980: 0.9687
Accuracy at step 990: 0.968
Adding run metadata for 999
Total Train-accuracy=0.968

When you use the preceding command to get job log information, logs of the worker-0 node are output by default. To view the logs of a specific node in a distributed training task, first view the job details to get the list of job nodes, then use the command arena logs $job_name -i $instance_name to view the logs of a specific instance.

The example code is as follows.

arena get tf-mnist-dist

Expected output:

STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m

NAME           STATUS     TRAINER  AGE  INSTANCE                NODE
tf-mnist-dist  SUCCEEDED  TFJOB    5m   tf-mnist-dist-ps-0      192.16x.x.xx
tf-mnist-dist  SUCCEEDED  TFJOB    5m   tf-mnist-dist-worker-0  192.16x.x.xx
tf-mnist-dist  SUCCEEDED  TFJOB    5m   tf-mnist-dist-worker-1  192.16x.x.xx

Your tensorboard will be available on:
http://192.16x.x.xx:31870

Run the following command to get the logs of a specific instance, in this case the second Worker node.

arena logs tf-mnist-dist -i tf-mnist-dist-worker-1

Expected output:

WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:120: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
...
Accuracy at step 970: 0.9676
Accuracy at step 980: 0.968
Accuracy at step 990: 0.967
Adding run metadata for 999
Total Train-accuracy=0.967
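To pull logs from every instance of this example's job in turn, the per-instance command can be scripted. A sketch (the leading echo makes the loop only print the commands it would run; drop it to execute them against a real cluster):

```shell
# Print the per-instance log command for each of the three instances
# in this example: one PS node and two Worker nodes.
for suffix in ps-0 worker-0 worker-1; do
  echo arena logs tf-mnist-dist -i "tf-mnist-dist-$suffix"
done
```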

You can also view the real-time log output of the job by running the command arena logs $job_name -f. View the last N lines of logs by running the command arena logs $job_name -t N. Query more parameter usage by running arena logs --help.

The example code for viewing the last N lines of logs is as follows.

arena logs tf-mnist-dist -t 5

Expected output:

Accuracy at step 9970: 0.9834
Accuracy at step 9980: 0.9828
Accuracy at step 9990: 0.9816
Adding run metadata for 9999
Total Train-accuracy=0.9816

(Optional) Step 6: Clean up the environment

If you no longer need the training job after it finishes, run the following command to delete it:

arena delete -n default tf-mnist-dist

Expected output:

INFO[0002] The training job tf-mnist-dist has been deleted successfully