This topic describes how to submit a TensorFlow distributed training job based on the PS-Worker model using Arena and visualize the training job using TensorBoard.
Prerequisites
A Kubernetes cluster that contains GPUs is created. For more information, see Create a Kubernetes cluster that contains GPUs.
The cluster nodes can access the public network. For more information, see Enable Internet access for a cluster.
The Arena client is installed. For more information, see Configure the Arena client.
A Persistent Volume Claim (PVC) named training-data is created, and the MNIST dataset is stored in the tf_data path. For more information, see Configure NAS shared storage.
Background information
This example downloads source code from a Git URL. The dataset is stored in a shared storage system (a PV and PVC based on NAS). The example assumes that a PVC named training-data exists and contains a directory tf_data, which stores the dataset used in this example.
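In the PS-Worker model, parameter server (PS) nodes store and update model parameters while Worker nodes compute gradients. The TFJob operator tells each node its role through the TF_CONFIG environment variable. The following is a minimal, illustrative sketch of how training code can read that variable; the cluster addresses and the injected value here are placeholders, not values produced by this example:

```python
import json
import os

# Illustrative TF_CONFIG for a 1-PS, 2-Worker job. In a real TFJob the
# operator injects this environment variable automatically; we set a
# placeholder value here only so the sketch is self-contained.
os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {
        "ps": ["tf-mnist-dist-ps-0:2222"],
        "worker": ["tf-mnist-dist-worker-0:2222",
                   "tf-mnist-dist-worker-1:2222"],
    },
    "task": {"type": "worker", "index": 1},
}))

tf_config = json.loads(os.environ["TF_CONFIG"])
role = tf_config["task"]["type"]     # "ps" or "worker"
index = tf_config["task"]["index"]   # this node's index within its role
print(f"Running as {role} #{index} in a cluster with "
      f"{len(tf_config['cluster']['worker'])} workers")
```

TensorFlow's distribution strategies read the same variable internally; the sketch only shows the topology that the submitted job (one PS node, two Worker nodes) would expose to each pod.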
Procedure
Step 1: View GPU resources
arena top node
Expected output:
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-beijing.192.168.xxx.xxx 192.168.xxx.xxx <none> Ready 0 0
cn-beijing.192.168.xxx.xxx 192.168.xxx.xxx <none> Ready 0 0
cn-beijing.192.168.xxx.xxx 192.168.xxx.xxx <none> Ready 2 0
cn-beijing.192.168.xxx.xxx 192.168.xxx.xxx <none> Ready 2 0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/4 (0.0%)
The output shows that the cluster has two GPU nodes. Each node has two idle GPUs available for training jobs.
Step 2: Submit a TensorFlow training job
Submit a TensorFlow job by running a command in the format arena submit tf [--flags] "command" (tf is an alias of tfjob).
Submit a TensorFlow distributed training job in PS-Worker mode using the following code example. It includes one PS node and two Worker nodes.
arena submit tf \
--name=tf-mnist-dist \
--namespace=default \
--working-dir=/root \
--ps=1 \
--ps-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
--workers=2 \
--worker-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
--gpus=1 \
--sync-mode=git \
--sync-source=https://github.com/kubeflow/arena.git \
--env=GIT_SYNC_BRANCH=master \
--data=training-data:/mnt \
--tensorboard \
--logdir=/mnt/tf_data/logs \
"python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"Expected output:
service/tf-mnist-dist-tensorboard created
deployment.apps/tf-mnist-dist-tensorboard created
tfjob.kubeflow.org/tf-mnist-dist created
INFO[0004] The Job tf-mnist-dist has been submitted successfully
INFO[0004] You can run `arena get tf-mnist-dist --type tfjob -n default` to check the job status
The following table describes the parameters.
Parameter | Required | Description | Default value |
--name | Required | Specify the name of the submitted job. The name must be globally unique. | None |
--working-dir | Optional | Specify the directory where the current command runs. | /root |
--gpus | Optional | Specify the number of GPUs used by each Worker node of the job. | 0 |
--workers | Optional | Specify the number of Worker nodes of the job. | 1 |
--image | Required if --worker-image and --ps-image are not specified separately. | Specify the image address of the training environment. If --worker-image or --ps-image is not specified, both Worker nodes and PS nodes use this image address. | None |
--worker-image | Required if --image is not specified. | Specify the image address used by the Worker nodes of the job. If --image is also specified, --worker-image takes precedence for Worker nodes. | None |
--sync-mode | Optional | The code synchronization mode. Valid values: git and rsync. This topic uses Git mode. | None |
--sync-source | Optional | The repository address for code synchronization. Use this parameter together with --sync-mode. This example uses Git mode. The value can be the address of any GitHub project or of another code hosting service that supports Git, such as an Alibaba Cloud Code project. The project code is downloaded to the code/ directory under --working-dir. For this example, the path is /root/code/arena. | None |
--ps | Required for distributed jobs | Specify the number of parameter server (PS) nodes. | 0 |
--ps-image | Required if --image is not specified. | Specify the image address used by the PS nodes. If --image is also specified, --ps-image takes precedence for PS nodes. | None |
--data | Optional | Mount the shared storage volume PVC to the running environment. The value consists of two parts separated by a colon (:), in the format pvc-name:container-path, where pvc-name is the name of the PVC and container-path is the mount path inside the container. Note: Run arena data list to query the PVCs available in the cluster. If no PVCs are available, create one. For more information, see Configure NAS shared storage. | None |
--tensorboard | Optional | Enable a TensorBoard service for the training job to visualize data. Use --logdir to specify the event path that TensorBoard reads. If you do not specify this parameter, the TensorBoard service is not enabled. | None |
--logdir | Optional | Use this parameter together with --tensorboard. It specifies the path from which TensorBoard reads event data. | /training_logs |
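The --data value in this example, training-data:/mnt, therefore mounts the PVC training-data at /mnt inside each container. A minimal sketch of how the two parts of the value relate to the paths used by the training command:

```python
# The --data value has the form <pvc-name>:<path-in-container>.
data_flag = "training-data:/mnt"
pvc_name, mount_path = data_flag.split(":", 1)

# Files stored in the PVC appear under the mount path, which is why the
# training command reads the dataset from /mnt/tf_data/mnist.npz.
dataset = f"{mount_path}/tf_data/mnist.npz"
print(pvc_name)   # training-data
print(dataset)    # /mnt/tf_data/mnist.npz
```

The same mount also explains --logdir=/mnt/tf_data/logs: the job writes TensorBoard events into the shared volume so that the TensorBoard service can read them.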
If you use a private Git repository, set the Git username and password by configuring the environment variables GIT_SYNC_USERNAME and GIT_SYNC_PASSWORD.
arena submit tf \
--name=tf-mnist-dist \
--namespace=default \
--working-dir=/root \
--ps=1 \
--ps-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
--workers=2 \
--worker-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
--gpus=1 \
--sync-mode=git \
--sync-source=https://github.com/kubeflow/arena.git \
--env=GIT_SYNC_BRANCH=master \
--env=GIT_SYNC_USERNAME=yourname \
--env=GIT_SYNC_PASSWORD=yourpwd \
--data=training-data:/mnt \
--tensorboard \
--logdir=/mnt/tf_data/logs \
"python /root/code/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"The arena command uses git-sync to synchronize source code. You can set the environment variables defined in the git-sync project.
This example pulls source code from a GitHub repository. If the code cannot be pulled because of network issues or other reasons, manually download the code to the shared storage system. The demo image provided in this topic already contains the example code at /code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py, so you can submit the training job directly as follows:
arena submit tf \
--name=tf-mnist-dist \
--namespace=default \
--working-dir=/root \
--ps=1 \
--ps-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
--workers=2 \
--worker-image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tensorflow-mnist-example:2.15.0-gpu \
--gpus=1 \
--data=training-data:/mnt \
--tensorboard \
--logdir=/mnt/tf_data/logs \
"python /code/github.com/kubeflow/arena/examples/tensorflow/mnist/main.py --data /mnt/tf_data/mnist.npz --dir /mnt/tf_data/logs"Step 3: View TensorFlow training jobs
View all training jobs submitted through Arena.
arena list
Expected output:
NAME     STATUS   TRAINER  AGE  NODE
tf-dist  RUNNING  TFJOB    58s  192.1xx.x.xx
Run the following command to check the GPU resources used by the job.
arena top job
Expected output:
NAME     GPU(Requests)  GPU(Allocated)  STATUS     TRAINER  AGE  NODE
tf-dist  2              2               RUNNING    tfjob    1m   192.1xx.x.x
tf-git   1              0               SUCCEEDED  tfjob    2h   N/A
Total Allocated GPUs of Training Job:
2
Total Requested GPUs of Training Job:
3
Run the following command to check the GPU resources used by the cluster.
arena top node
Expected output:
NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  master  ready   0           0
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  master  ready   0           0
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  master  ready   0           0
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  <none>  ready   2           1
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  <none>  ready   2           1
cn-huhehaote.192.1xx.x.xx  192.1xx.x.xx  <none>  ready   2           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
2/6 (33%)
Execute the following command to get task details.
arena get -n default tf-mnist-dist
Expected output:
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m
NAME     STATUS   TRAINER  AGE  INSTANCE          NODE
tf-dist  RUNNING  TFJOB    1m   tf-dist-ps-0      192.1xx.x.xx
tf-dist  RUNNING  TFJOB    1m   tf-dist-worker-0  192.1xx.x.xx
tf-dist  RUNNING  TFJOB    1m   tf-dist-worker-1  192.1xx.x.xx
Your tensorboard will be available on:
http://192.1xx.x.xx:31870
Note: This topic enables TensorBoard. In the job details above, the last two lines show the TensorBoard web endpoint. If you do not enable TensorBoard, these two lines do not appear.
Step 4: View TensorBoard
View TensorBoard in a browser.
Run the following command locally to map the TensorBoard service in the cluster to local port 9090.
kubectl port-forward -n default svc/tf-mnist-dist-tensorboard 9090:6006
Access localhost:9090 in your browser to view TensorBoard.
Note that port forwarding established by kubectl port-forward does not provide production-level reliability, security, or scalability. It is suitable only for development and debugging, not for production environments. For more information about production-ready networking solutions in Kubernetes clusters, see Ingress management.
Step 5: View training job logs
Run the following command to get job log information.
arena logs -n default tf-dist
Expected output:
WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:120: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
...
Accuracy at step 960: 0.9691
Accuracy at step 970: 0.9677
Accuracy at step 980: 0.9687
Accuracy at step 990: 0.968
Adding run metadata for 999
Total Train-accuracy=0.968
When you use the preceding command to get job log information, the logs of the worker-0 node are returned by default. To view the logs of a specific node in a distributed training job, first view the job details to get the list of job instances, and then run arena logs $job_name -i $instance_name to view the logs of a specific instance.
The example code is as follows.
arena get tf-dist
Expected output:
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m
NAME STATUS TRAINER AGE INSTANCE NODE
tf-dist SUCCEEDED TFJOB 5m tf-dist-ps-0 192.16x.x.xx
tf-dist SUCCEEDED TFJOB 5m tf-dist-worker-0 192.16x.x.xx
tf-dist SUCCEEDED TFJOB 5m tf-dist-worker-1 192.16x.x.xx
Your tensorboard will be available on:
http://192.16x.x.xx:31870
Run the following command to get job logs.
arena logs tf-dist -i tf-dist-worker-1
Expected output:
WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:120: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
...
Accuracy at step 970: 0.9676
Accuracy at step 980: 0.968
Accuracy at step 990: 0.967
Adding run metadata for 999
Total Train-accuracy=0.967
You can also view the real-time log output of the job by running arena logs $job_name -f, or view the last N lines of logs by running arena logs $job_name -t N. For more parameter usage, run arena logs --help.
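The -t N flag returns the last N lines of the log stream. Conceptually it behaves like tail; the following is a simplified sketch of that behavior, not Arena's actual implementation:

```python
def tail(lines, n):
    """Return the last n lines of a log, like `arena logs $job_name -t N`."""
    return lines[-n:] if n > 0 else []

# A hypothetical 100-line log stream for illustration.
log = ["line %d" % i for i in range(1, 101)]
print(tail(log, 5))   # the last 5 lines: "line 96" through "line 100"
```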
The example code for viewing the last N lines of logs is as follows.
arena logs tf-dist -t 5
Expected output:
Accuracy at step 9970: 0.9834
Accuracy at step 9980: 0.9828
Accuracy at step 9990: 0.9816
Adding run metadata for 9999
Total Train-accuracy=0.9816
(Optional) Step 6: Clean up the environment
If you no longer need the training job after it finishes, run the following command to delete it:
arena delete -n default tf-mnist-dist
Expected output:
INFO[0002] The training job tf-mnist-dist has been deleted successfully