TensorFlow is an open source deep learning framework that is widely used for various deep learning training tasks. This topic describes how to submit a single-node TensorFlow training job using Arena and view the training job using TensorBoard for visualization.
Prerequisites
A Kubernetes cluster that contains GPUs is created. For more information, see Create a Kubernetes cluster that contains GPUs.
The cluster nodes can access the public network. For more information, see Enable Internet access for a cluster.
The Arena client is installed. For more information, see Configure the Arena client.
A Persistent Volume Claim (PVC) named training-data is created, and the MNIST dataset is stored in the tf_data directory. For more information, see Configure NAS shared storage.
Background information
This example downloads the source code from a Git URL. The dataset is stored in a shared storage system that uses a Persistent Volume (PV) and a Persistent Volume Claim (PVC) on NAS. This example assumes that you have a PVC named training-data that contains the dataset in a directory named tf_data.
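The training script is assumed to read the dataset from an mnist.npz archive in the tf_data directory. The following Python sketch illustrates the expected archive layout, assuming the standard Keras-style format with x_train, y_train, x_test, and y_test arrays; the array shapes are toy placeholders, not the real dataset sizes.

```python
import os
import tempfile

import numpy as np

# Hedged sketch: the mnist.npz file under tf_data is assumed to follow the
# standard Keras layout with four named arrays. The tiny arrays below only
# illustrate the structure; the real archive holds 60,000 training and
# 10,000 test images of shape 28x28.
path = os.path.join(tempfile.mkdtemp(), "mnist.npz")
np.savez(
    path,
    x_train=np.zeros((4, 28, 28), dtype=np.uint8),
    y_train=np.zeros((4,), dtype=np.uint8),
    x_test=np.zeros((2, 28, 28), dtype=np.uint8),
    y_test=np.zeros((2,), dtype=np.uint8),
)

with np.load(path) as archive:
    print(sorted(archive.files))  # → ['x_test', 'x_train', 'y_test', 'y_train']
```

If the archive in your PVC uses different key names, the training script's data-loading code must match them.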
Procedure
Step 1: View GPU resources
arena top node
Expected output:
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-beijing.192.168.xxx.xxx 192.168.xxx.xxx <none> Ready 0 0
cn-beijing.192.168.xxx.xxx 192.168.xxx.xxx <none> Ready 0 0
cn-beijing.192.168.xxx.xxx 192.168.xxx.xxx <none> Ready 2 0
cn-beijing.192.168.xxx.xxx 192.168.xxx.xxx <none> Ready 2 0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/4 (0.0%)
The output shows that the cluster has two GPU nodes. Each node has two idle GPUs available for training jobs.
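The cluster totals can be recomputed from the per-node rows. As a small illustration, the following Python sketch (a hypothetical helper, not part of Arena) parses output of the shape shown above and sums the GPU columns; the node names and IP addresses in the sample are placeholders.

```python
# Hypothetical helper: sum the GPU columns of `arena top node` output
# (format as shown above) to recompute the cluster totals.
sample = """\
NAME                    IPADDRESS    ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.1.1  192.168.1.1  <none>  Ready   0           0
cn-beijing.192.168.1.2  192.168.1.2  <none>  Ready   0           0
cn-beijing.192.168.1.3  192.168.1.3  <none>  Ready   2           0
cn-beijing.192.168.1.4  192.168.1.4  <none>  Ready   2           0
"""

def gpu_summary(text: str) -> tuple[int, int]:
    """Return (allocated, total) GPU counts from the tabular output."""
    allocated = total = 0
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 6:
            total += int(fields[4])       # GPU(Total) column
            allocated += int(fields[5])   # GPU(Allocated) column
    return allocated, total

print(gpu_summary(sample))  # → (0, 4)
```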
Step 2: Submit a TensorFlow training job
Run the arena submit tfjob [--flags] command (tf is a shorthand for tfjob) to submit a TensorFlow job.
The following code provides an example of how to submit a single-node, single-GPU TensorFlow task.
arena submit tf \
--name=tf-mnist \
--working-dir=/root \
--workers=1 \
--gpus=1 \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
--sync-mode=git \
--sync-source=https://github.com/kubeflow/arena.git \
--env=GIT_SYNC_BRANCH=master \
--data=training-data:/mnt \
--tensorboard \
--logdir=/mnt/tf_data/logs \
"python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"
Expected output:
service/tf-mnist-tensorboard created
deployment.apps/tf-mnist-tensorboard created
tfjob.kubeflow.org/tf-mnist created
INFO[0005] The Job tf-mnist has been submitted successfully
INFO[0005] You can run `arena get tf-mnist --type tfjob -n default` to check the job status
The following table describes the parameters.
Parameter | Required | Description | Default |
--name | Required | The name of the job to submit. The name must be globally unique. | None |
--working-dir | Optional | The directory where the command is executed. | /root |
--gpus | Optional | The number of GPUs that the worker node of the job requires. | 0 |
--image | Required | The registry address of the image used for the training environment. | None |
--sync-mode | Optional | The code synchronization mode. You can specify git or rsync. This example uses the Git mode. | None |
--sync-source | Optional | The repository address for code synchronization. This parameter must be used with --sync-mode. This example uses the Git mode. The value of this parameter can be the address of any GitHub project or other Git-based code hosting service, such as an Alibaba Cloud Code project. The project code is downloaded to the code/ directory under --working-dir. In this example, the path is /root/code/arena. | None |
--data | Optional | Mounts a shared storage volume (PVC) into the runtime environment. The value consists of two parts separated by a colon (:): the name of the PVC on the left and the mount path inside the container on the right. In this example, training-data:/mnt mounts the PVC named training-data to /mnt. Note: If no PVC is available, create one. For more information, see Configure NAS shared storage. | None |
--tensorboard | Optional | Starts a TensorBoard Service for the training task for data visualization. You can use this parameter with --logdir to specify the event path that TensorBoard reads. If you do not specify this parameter, the TensorBoard Service is not started. | None |
--logdir | Optional | This parameter must be used with --tensorboard. It specifies the path from which TensorBoard reads event data. | /training_logs |
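The flags in the table combine into a single arena submit command line. The following Python sketch is an illustrative helper, not part of Arena: it assembles such a command from keyword arguments, assuming the flag names shown in the table above; registry.example.com is a placeholder image address.

```python
import shlex

# Hypothetical helper: build an `arena submit tf` command line from the flags
# described in the table above. Illustrative only, not part of Arena itself.
def build_submit_command(name: str, image: str, script: str, **flags: str) -> str:
    parts = ["arena", "submit", "tf", f"--name={name}", f"--image={image}"]
    # Turn keyword arguments into --key=value flags (underscores become dashes).
    parts += [f"--{key.replace('_', '-')}={value}" for key, value in sorted(flags.items())]
    return " ".join(parts) + " " + shlex.quote(script)

cmd = build_submit_command(
    "tf-mnist",
    "registry.example.com/tensorflow-mnist-example:2.15.0-gpu",
    "python /root/code/arena/examples/tensorflow/mnist/main.py",
    gpus="1",
    workers="1",
    data="training-data:/mnt",
)
print(cmd)
```

shlex.quote keeps the trailing training command as a single shell argument, matching the quoted script string in the examples in this topic.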
If you use a private Git repository, you can set the GIT_SYNC_USERNAME and GIT_SYNC_PASSWORD environment variables to specify the Git username and password.
arena submit tf \
--name=tf-mnist \
--working-dir=/root \
--workers=1 \
--gpus=1 \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
--sync-mode=git \
--sync-source=https://github.com/kubeflow/arena.git \
--env=GIT_SYNC_BRANCH=master \
--env=GIT_SYNC_USERNAME=yourname \
--env=GIT_SYNC_PASSWORD=yourpwd \
--data=training-data:/mnt \
--tensorboard \
--logdir=/mnt/tf_data/logs \
"python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"
The arena command uses git-sync to synchronize the source code. You can set the environment variables defined in the git-sync project.
This example pulls the source code from a GitHub repository. If the code cannot be pulled due to network issues, you can manually download it to the shared storage system. The demo image provided in this topic already contains the sample code at /code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py, so you can submit the training job directly as follows:
arena submit tf \
--name=tf-mnist \
--working-dir=/root \
--workers=1 \
--gpus=1 \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
--data=training-data:/mnt \
--tensorboard \
--logdir=/mnt/tf_data/logs \
"python /code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"
Step 3: View the TensorFlow training job
Run the following command to view all jobs submitted using Arena.
arena list
Expected output:
NAME      STATUS   TRAINER  DURATION  GPU(Requested)  GPU(Allocated)  NODE
tf-mnist  RUNNING  TFJOB    3s        1               1               192.168.xxx.xxx
You can run the following command to check the GPU resources used by the job.
arena top job
Expected output:
NAME      STATUS   TRAINER  AGE  GPU(Requested)  GPU(Allocated)  NODE
tf-mnist  RUNNING  TFJOB    29s  1               1               192.168.xxx.xxx

Total Allocated/Requested GPUs of Training Jobs: 1/1
Run the following command to check the GPU resources used by the cluster.
arena top node
Expected output:
NAME                        IPADDRESS        ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   0           0
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           1
cn-beijing.192.168.xxx.xxx  192.168.xxx.xxx  <none>  Ready   2           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
1/4 (25.0%)
Run the following command to view the details of the training job.
arena get -n default tf-mnist
Expected output:
Name:        tf-mnist
Status:      RUNNING
Namespace:   default
Priority:    N/A
Trainer:     TFJOB
Duration:    22s
CreateTime:  2026-01-26 16:01:42
EndTime:

Instances:
  NAME              STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----              ------   ---  --------  --------------  ----
  tf-mnist-chief-0  Running  45s  true      1               cn-beijing.192.168.xxx.xxx

Tensorboard:
  Your tensorboard will be available on:
  http://192.168.xxx.xxx:31243
Note: Because TensorBoard is enabled in this example, the last lines of the job details show the web endpoint for TensorBoard. If TensorBoard is not enabled, this information is not displayed.
Step 4: View TensorBoard
View TensorBoard in a browser.
Run the following command on your local machine to forward the in-cluster TensorBoard Service to local port 9090.
Important: Port forwarding set up by using kubectl port-forward is not reliable, secure, or scalable in production environments. It is intended only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress management.
kubectl port-forward -n default svc/tf-mnist-tensorboard 9090:6006
In a browser, go to http://localhost:9090 to view TensorBoard, as shown in the following figure.
Step 5: View the training job log
Run the following command to view the job log.
arena logs -n default tf-mnist
Expected output:
Train Epoch: 14 [55680/60000 (93%)] Loss: 0.029811
Train Epoch: 14 [56320/60000 (94%)] Loss: 0.029721
Train Epoch: 14 [56960/60000 (95%)] Loss: 0.029682
Train Epoch: 14 [57600/60000 (96%)] Loss: 0.029781
Train Epoch: 14 [58240/60000 (97%)] Loss: 0.029708
Train Epoch: 14 [58880/60000 (98%)] Loss: 0.029761
Train Epoch: 14 [59520/60000 (99%)] Loss: 0.029684
Test Accuracy: 9842/10000 (98.42%)
938/938 - 3s - loss: 0.0299 - accuracy: 0.9924 - val_loss: 0.0446 - val_accuracy: 0.9842 - lr: 0.0068 - 3s/epoch - 3ms/step
To view the job log in real time, add the -f parameter. To view only the last N lines of the log, add the -t N or --tail N parameter. For more usage information, run arena logs --help.
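The log above reports the final test accuracy as a fraction and a percentage. As a small illustration, the following Python sketch (a hypothetical helper, not part of Arena) extracts the fraction from a log line of the shape shown above and recomputes the percentage:

```python
import re

# Hypothetical helper: recompute the percentage from a log line of the shape
# "Test Accuracy: 9842/10000 (98.42%)" shown in the expected output above.
line = "Test Accuracy: 9842/10000 (98.42%)"
match = re.search(r"Test Accuracy: (\d+)/(\d+)", line)
correct, total = int(match.group(1)), int(match.group(2))
accuracy = 100 * correct / total
print(f"{accuracy:.2f}%")  # → 98.42%
```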
(Optional) Step 6: Clean up the environment
After the training job is complete, if you no longer need it, run the following command to delete it:
arena delete -n default tf-mnist
Expected output:
INFO[0002] The training job tf-mnist has been deleted successfully