Container Service for Kubernetes: Submit a single-node TensorFlow training job using Arena

Last Updated: Jan 28, 2026

TensorFlow is an open source deep learning framework that is widely used for training tasks. This topic describes how to submit a single-node TensorFlow training job using Arena and how to visualize the training job using TensorBoard.

Prerequisites

Background information

This example downloads the source code from a Git repository. The dataset is stored on a NAS file system that is shared with the cluster through a Persistent Volume (PV) and a Persistent Volume Claim (PVC). The example assumes that a PVC named training-data already exists and contains the dataset in a directory named tf_data.

Procedure

Step 1: View GPU resources

arena top node

Expected output:

NAME                        IPADDRESS        ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/4 (0.0%)

The output shows that the cluster has two GPU nodes. Each node has two idle GPUs available for training jobs.
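The same conclusion can be derived from the output with a short script. The following is a minimal sketch, not part of Arena: it runs against the sample output above, and it assumes that the ROLE column is always a single token so that GPU(Total) and GPU(Allocated) are the fifth and sixth fields. The variable names top_output and gpu_summary are illustrative.

```shell
# Sample `arena top node` output copied from this step. In a live cluster,
# replace the here-document with: top_output=$(arena top node)
top_output=$(cat <<'EOF'
NAME                        IPADDRESS        ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
EOF
)

# Field 5 is GPU(Total) and field 6 is GPU(Allocated).
# Rows whose fields 5 and 6 are not numbers (header, separator, summary) are skipped.
gpu_summary=$(printf '%s\n' "$top_output" | awk '
    $5 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ {
        if ($5 > 0) nodes++      # count nodes that have at least one GPU
        idle += $5 - $6          # idle GPUs = total - allocated
    }
    END { printf "gpu_nodes=%d idle_gpus=%d\n", nodes, idle }')

echo "$gpu_summary"
```

For the sample output, the script prints gpu_nodes=2 idle_gpus=4, which matches the description above.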

Step 2: Submit a TensorFlow training job

Run the arena submit tf [--flag] command to submit a TensorFlow job. You can also write the subcommand as tfjob.

The following code provides an example of how to submit a single-node, single-GPU TensorFlow job.

arena submit tf \
    --name=tf-mnist \
    --working-dir=/root \
    --workers=1 \
    --gpus=1 \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --sync-mode=git \
    --sync-source=https://github.com/kubeflow/arena.git \
    --env=GIT_SYNC_BRANCH=master \
    --data=training-data:/mnt \
    --tensorboard \
    --logdir=/mnt/tf_data/logs \
    "python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"

Expected output:

service/tf-mnist-tensorboard created
deployment.apps/tf-mnist-tensorboard created
tfjob.kubeflow.org/tf-mnist created
INFO[0005] The Job tf-mnist has been submitted successfully
INFO[0005] You can run `arena get tf-mnist --type tfjob -n default` to check the job status

The following list describes the parameters. Each entry shows the parameter, whether it is required, and its default value, followed by the description.

--name (Required. Default: None)
    The name of the job to submit. The name must be globally unique.

--working-dir (Optional. Default: /root)
    The directory in which the command is executed.

--gpus (Optional. Default: 0)
    The number of GPUs required by the worker node of the job.

--image (Required. Default: None)
    The registry address of the image for the training environment.

--sync-mode (Optional. Default: None)
    The code synchronization mode. You can specify git or rsync. This example uses the Git mode.

--sync-source (Optional. Default: None)
    The repository address for code synchronization. This parameter must be used with --sync-mode. This example uses the Git mode, so the value can be the address of any GitHub project or of a project on another Git-based code hosting service, such as Alibaba Cloud Code. The project code is downloaded to the code/ directory under --working-dir. In this example, the path is /root/code/arena.

--data (Optional. Default: None)
    Mounts a shared storage volume (PVC) into the running environment. The value consists of two parts separated by a colon (:). The part to the left of the colon is the name of the PVC that you have prepared. The part to the right of the colon is the path in the running environment at which the PVC is mounted. Your training code reads the data on the PVC from this path.

    Note

    Run arena data list to view the PVCs that are available in the current cluster. Output for this example:

    NAME           ACCESSMODE     DESCRIPTION  OWNER  AGE
    training-data  ReadWriteMany                      35m

    If no PVC is available, create one. For more information, see Configure NAS shared storage.

--tensorboard (Optional. Default: None)
    Starts a TensorBoard Service for the training task for data visualization. You can use this parameter with --logdir to specify the event path that TensorBoard reads. If you do not specify this parameter, the TensorBoard Service is not started.

--logdir (Optional. Default: /training_logs)
    The path from which TensorBoard reads event data. This parameter must be used with --tensorboard.
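To make the relationship between the flags and the paths in the training command concrete, the following sketch derives the paths from the flag values used in this example. It is illustrative only. It assumes that the checkout directory under code/ is named after the repository, which is consistent with the /root/code/arena path stated above but is not otherwise confirmed here. The variable names are hypothetical.

```shell
# Flag values taken from the submit command in this step.
working_dir=/root
sync_source=https://github.com/kubeflow/arena.git
data_flag=training-data:/mnt

# --data has the form <PVC name>:<mount path>; split it at the first colon.
pvc_name=${data_flag%%:*}
mount_path=${data_flag#*:}

# Assumption: the code lands in code/<repository name> under --working-dir.
repo_name=$(basename "$sync_source" .git)
code_dir=$working_dir/code/$repo_name

# The dataset directory tf_data sits at the root of the PVC.
dataset_path=$mount_path/tf_data/mnist.npz

echo "PVC:        $pvc_name"
echo "Mount path: $mount_path"
echo "Code dir:   $code_dir"
echo "Dataset:    $dataset_path"
```

The resulting code directory and dataset path are the ones that appear in the quoted python command at the end of the submit example.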

Note

If you use a private Git repository, you can set the GIT_SYNC_USERNAME and GIT_SYNC_PASSWORD environment variables to specify the Git username and password.

arena submit tf \
    --name=tf-mnist \
    --working-dir=/root \
    --workers=1 \
    --gpus=1 \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --sync-mode=git \
    --sync-source=https://github.com/kubeflow/arena.git \
    --env=GIT_SYNC_BRANCH=master \
    --env=GIT_SYNC_USERNAME=yourname \
    --env=GIT_SYNC_PASSWORD=yourpwd \
    --data=training-data:/mnt \
    --tensorboard \
    --logdir=/mnt/tf_data/logs \
    "python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"

Arena uses git-sync to synchronize the source code, so you can also set the other environment variables that are defined in the git-sync project.

Important

This example pulls the source code from a GitHub repository. If the pull fails because of network issues, you can manually download the code to the shared storage system. Alternatively, the demo image provided in this topic already contains the sample code at /code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py, so you can submit the training job without code synchronization:

arena submit tf \
    --name=tf-mnist \
    --working-dir=/root \
    --workers=1 \
    --gpus=1 \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
    --data=training-data:/mnt \
    --tensorboard \
    --logdir=/mnt/tf_data/logs \
    "python /code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"

Step 3: View the TensorFlow training job

  1. Run the following command to view all jobs submitted using Arena.

    arena list

    Expected output:

    NAME      STATUS   TRAINER  DURATION  GPU(Requested)  GPU(Allocated)  NODE
    tf-mnist  RUNNING  TFJOB    3s        1               1               192.168.xxx.xxx
  2. Run the following command to check the GPU resources used by the job.

    arena top job

    Expected output:

    NAME      STATUS   TRAINER  AGE  GPU(Requested)  GPU(Allocated)  NODE
    tf-mnist  RUNNING  TFJOB    29s  1               1               192.168.xxx.xxx
    
    Total Allocated/Requested GPUs of Training Jobs: 1/1
  3. Run the following command to check the GPU resources used by the cluster.

    arena top node

    Expected output:

    NAME                        IPADDRESS        ROLE    STATUS  GPU(Total)  GPU(Allocated)
    cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
    cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
    cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           1
    cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
    ---------------------------------------------------------------------------------------------------
    Allocated/Total GPUs In Cluster:
    1/4 (25.0%)
  4. Run the following command to view the details of the training job.

    arena get -n default tf-mnist

    Expected output:

    Name:        tf-mnist
    Status:      RUNNING
    Namespace:   default
    Priority:    N/A
    Trainer:     TFJOB
    Duration:    22s
    CreateTime:  2026-01-26 16:01:42
    EndTime:
    
    Instances:
      NAME              STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
      ----              ------   ---  --------  --------------  ----
      tf-mnist-chief-0  Running  45s  true      1               cn-beijing.192.168.xxx.xxx
    
    Tensorboard:
      Your tensorboard will be available on:
      http://192.168.xxx.xxx:31243

    Note

    Because TensorBoard is enabled in this example, the last two lines of the job details show the web endpoint for TensorBoard. If TensorBoard is not enabled, this information is not displayed.
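If you want to use the job status or the TensorBoard endpoint in a script, you can extract them from the job details. The following is a minimal sketch that parses an abridged copy of the sample output above. It assumes the plain-text layout shown in this step, and the variable names are illustrative.

```shell
# Abridged sample of the `arena get -n default tf-mnist` output from this step.
# In a live cluster, replace the here-document with:
#   job_details=$(arena get -n default tf-mnist)
job_details=$(cat <<'EOF'
Name:        tf-mnist
Status:      RUNNING
Namespace:   default
Trainer:     TFJOB

Tensorboard:
  Your tensorboard will be available on:
  http://192.168.xxx.xxx:31243
EOF
)

# The job status is the second field of the line whose first field is "Status:".
job_status=$(printf '%s\n' "$job_details" | awk '$1 == "Status:" { print $2; exit }')

# The TensorBoard endpoint is the first line whose first field is an http(s) URL.
# The variable stays empty if TensorBoard is not enabled for the job.
tensorboard_url=$(printf '%s\n' "$job_details" | awk '$1 ~ /^https?:\/\// { print $1; exit }')

echo "status=$job_status"
echo "tensorboard=${tensorboard_url:-not enabled}"
```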

Step 4: View TensorBoard

View TensorBoard in a browser.

  1. Run the following command on your local machine to map the TensorBoard Service in the cluster to port 9090 on your local machine.

    Important

    Port forwarding with kubectl port-forward is intended only for development and debugging. It is not reliable, secure, or scalable enough for production environments, so do not use it there. For more information about networking solutions used for production in ACK clusters, see Ingress management.

    kubectl port-forward -n default svc/tf-mnist-tensorboard 9090:6006
  2. In a browser, go to http://localhost:9090 to view TensorBoard, as shown in the following figure.

    [Figure: TensorBoard]

Step 5: View the training job log

Run the following command to view the job log.

arena logs -n default tf-mnist

Expected output:

Train Epoch: 14 [55680/60000 (93%)]     Loss: 0.029811
Train Epoch: 14 [56320/60000 (94%)]     Loss: 0.029721
Train Epoch: 14 [56960/60000 (95%)]     Loss: 0.029682
Train Epoch: 14 [57600/60000 (96%)]     Loss: 0.029781
Train Epoch: 14 [58240/60000 (97%)]     Loss: 0.029708
Train Epoch: 14 [58880/60000 (98%)]     Loss: 0.029761
Train Epoch: 14 [59520/60000 (99%)]     Loss: 0.029684

Test Accuracy: 9842/10000 (98.42%)

938/938 - 3s - loss: 0.0299 - accuracy: 0.9924 - val_loss: 0.0446 - val_accuracy: 0.9842 - lr: 0.0068 - 3s/epoch - 3ms/step

Note
  • To view the job log in real time, add the -f parameter.

  • To view only the last N lines of the log, add the -t N or --tail N parameter.

  • For more usage information, run arena logs --help.
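When you check many runs, you may want to read the final accuracy from the log instead of scanning it by eye. The following sketch extracts the percentage from the "Test Accuracy" line shown in the sample output above and compares it with a threshold. The threshold of 98 and the variable names are hypothetical, and the log format of your own training script may differ.

```shell
# Sample tail of the `arena logs -n default tf-mnist` output from this step.
# In a live cluster, replace the here-document with:
#   job_log=$(arena logs -n default tf-mnist)
job_log=$(cat <<'EOF'
Train Epoch: 14 [59520/60000 (99%)]     Loss: 0.029684

Test Accuracy: 9842/10000 (98.42%)
EOF
)

# Keep the percentage inside the parentheses of the last "Test Accuracy" line.
accuracy=$(printf '%s\n' "$job_log" \
    | sed -n 's/^Test Accuracy: .*(\([0-9.]*\)%)$/\1/p' \
    | tail -n 1)

# Hypothetical acceptance threshold; awk performs the floating-point comparison.
threshold=98
if awk -v a="$accuracy" -v t="$threshold" 'BEGIN { exit !(a >= t) }'; then
    verdict=pass
else
    verdict=fail
fi

echo "accuracy=$accuracy verdict=$verdict"
```

For the sample log, the script prints accuracy=98.42 verdict=pass.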

(Optional) Step 6: Clean up the environment

After the training job is complete, if you no longer need it, run the following command to delete it:

arena delete -n default tf-mnist

Expected output:

INFO[0002] The training job tf-mnist has been deleted successfully