TensorBoard visualizes training metrics, model weights, and sample outputs from Deep Learning Containers (DLC) jobs.
Prerequisites
Before you begin, make sure that you have:
- A DLC training job with at least one dataset mounted (verify on the job's Overview tab). Supported storage types: Object Storage Service (OSS) and Apsara File Storage NAS.
- TensorFlow's SummaryWriter integrated into the training script to write logs to a directory within the mounted dataset.
Write training logs
Add a SummaryWriter to the training script. Write logs to a path inside the mounted dataset, and close the writer after training completes to prevent data loss.
Example setup:
- OSS dataset endpoint: oss://your-bucket-name.oss-region-internal.aliyuncs.com/dlc-training-data/
- Dataset mount path in the container: /mnt/data/
- Log output directory: /mnt/data/output/runs
import tensorflow as tf

# Initialize the SummaryWriter with the logging directory
log_dir = "/mnt/data/output/runs"
writer = tf.summary.create_file_writer(log_dir)

# Training loop with metric logging.
# num_epochs, calculate_loss(), calculate_accuracy(), model, and
# sample_predictions are placeholders for your own training code.
for epoch in range(num_epochs):
    train_loss = calculate_loss()
    train_accuracy = calculate_accuracy()
    with writer.as_default():
        # Log scalar metrics
        tf.summary.scalar("train_loss", train_loss, step=epoch)
        tf.summary.scalar("train_accuracy", train_accuracy, step=epoch)
        # Log histograms and images every 10 epochs
        if epoch % 10 == 0:
            # model.weights is a list of differently shaped tensors, so log each one separately
            for i, weight in enumerate(model.weights):
                tf.summary.histogram(f"model_weights/{i}", weight, step=epoch)
            # sample_predictions must be a 4-D tensor: [batch, height, width, channels]
            tf.summary.image("prediction_samples", sample_predictions, step=epoch)

# Close the writer to ensure all logs are saved
writer.close()
print(f"Training completed successfully. Logs saved to: {log_dir}")
The SummaryWriter path (/mnt/data/output/runs) is the Summary Path to specify when creating the TensorBoard instance.
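Before creating the TensorBoard instance, it can help to confirm that the training job actually wrote event files to the log directory. The following is a minimal sketch that relies on TensorFlow's `events.out.tfevents.` file-name prefix; the helper name `find_event_files` is hypothetical and not part of any PAI or TensorFlow API.

```python
import os

# File-name prefix that TensorFlow summary writers use for event files
EVENT_FILE_PREFIX = "events.out.tfevents."


def find_event_files(log_dir):
    """Return a sorted list of event files found anywhere under log_dir."""
    matches = []
    # os.walk yields nothing if log_dir does not exist, so the result is []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            if name.startswith(EVENT_FILE_PREFIX):
                matches.append(os.path.join(root, name))
    return sorted(matches)


if __name__ == "__main__":
    log_dir = "/mnt/data/output/runs"  # same path as in the example setup
    event_files = find_event_files(log_dir)
    if event_files:
        print(f"Found {len(event_files)} event file(s) under {log_dir}")
    else:
        print(f"No event files under {log_dir}; check the SummaryWriter path")
```

Run the check inside the training container (or anywhere the dataset is mounted at the same path). An empty result usually means the writer pointed at a directory outside the mounted dataset or was never flushed.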
Create a TensorBoard instance
- Log on to the PAI console. Select the region in the top navigation bar and select a workspace. Click Enter Deep Learning Containers (DLC).
- In the Actions column of the target job, click TensorBoard. In the TensorBoard panel, click Create TensorBoard.
- On the Create TensorBoard page, configure the parameters described in the following sections, then click OK.
Basic information
| Parameter | Description |
|---|---|
| Name | Name of the TensorBoard instance. |
| Datasets | Select a Configuration Type and specify the Summary Path (the directory containing TensorBoard summary logs defined in the SummaryWriter class of the training code). |
The following table describes the Configuration Type options.
| Option | Description |
|---|---|
| Mount Dataset | Select a dataset and enter the relative path of the summary directory in the dataset. This is the default option. |
| Mount OSS | Select an OSS storage path and enter the relative path of the summary directory in OSS. |
| By Task | Select a DLC job and enter the complete path of the log files in the container. |
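The Mount Dataset and Mount OSS options ask for a path relative to the dataset or OSS root, whereas the training script works with the absolute path inside the container. As a rough sketch, the relative part can be derived from the mount path and the log directory. The helper name `summary_relative_path` is hypothetical, and whether the console expects a leading slash is not stated here, so treat the exact format as an assumption to verify in the console.

```python
import posixpath


def summary_relative_path(mount_path, log_dir):
    """Return log_dir relative to mount_path.

    Raises ValueError if log_dir is not located inside mount_path.
    """
    mount = posixpath.normpath(mount_path)
    target = posixpath.normpath(log_dir)
    inside = target == mount or target.startswith(mount.rstrip("/") + "/")
    if not inside:
        raise ValueError(f"{log_dir} is not inside the mount path {mount_path}")
    return posixpath.relpath(target, mount)


# Values from the example setup in this topic
print(summary_relative_path("/mnt/data/", "/mnt/data/output/runs"))  # output/runs
```

For the By Task option, the complete container path (here `/mnt/data/output/runs`) is used instead, so no conversion is needed.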
Resource configuration
| Resource type | Description |
|---|---|
| Free Quota | Free resources provided by the system. Each instance can use up to 2 vCPUs and 4 GiB of memory. To free up quota, stop instances that run on the free quota; the released resources can then be used for a new instance. |
| General Computing | Choose between Public Resources and Resource Quota. See the following table for details. |
The following table describes the General Computing options.
| Option | Description |
|---|---|
| Public Resources | Pay-as-you-go billing. Only general computing uses public resources. Select an instance type based on workload requirements. |
| Resource Quota | Subscription billing. Before selecting this option, purchase computing resources and create quotas. This option is available only to users in the whitelist. Contact your account manager to enable it. |
When using Resource Quota, configure these additional parameters:
| Parameter | Description |
|---|---|
| Priority | Priority of the TensorBoard instance. Valid values: 1 to 9. The value 1 indicates the lowest priority. |
| Job Resource | Resources allocated to the TensorBoard instance: vCPUs and Memory (GiB). |
VPC configuration
VPC parameters are available only when using Public Resources.
If the TensorBoard instance uses a dataset that requires a VPC, such as a Cloud Parallel File Storage (CPFS) dataset or a NAS dataset with a mount target in the VPC, configure a VPC.
Without a VPC, the system uses an Internet connection. This may cause delays during TensorBoard startup or when viewing reports due to limited bandwidth. To get sufficient bandwidth and stable performance, configure a VPC.
Select a VPC, a vSwitch, and a security group in the current region. After configuration, the cluster running the TensorBoard instance can access services in the selected VPC and uses the specified security group for access control.
View TensorBoard reports
- In the left-side navigation pane of the workspace page, choose .
- On the TensorBoard tab, check the Status column. If the status is Running, click View TensorBoard in the Actions column.
The TensorBoard visualization page opens, displaying training metrics and logged data.
Manage TensorBoard instances
- Log on to the PAI console. Select the region in the top navigation bar and select a workspace. Click Enter Jobs.
- On the TensorBoard tab, perform any of the following operations.
| Operation | Steps |
|---|---|
| Start an instance | Click Start in the Actions column to restart a stopped instance. |
| View instance details | Click the instance name to open the details page. The page shows Basic Information and Configuration Information. |
| View associated DLC jobs | In the Associated Task column, hover over the icon to view the job ID. Click the ID to go to the job details page. |
| View associated datasets | In the Associated Dataset column, hover over the icon to view the dataset ID. Click the ID to go to the dataset details page. |
| View running duration | The Running Duration column shows how long the instance has been running since the last start. The duration resets when the instance is stopped. |
| Stop an instance | Click Stop in the Actions column. To schedule an automatic stop time, click Auto-stop Settings in the Actions column. |
Troubleshooting
| Issue | Solution |
|---|---|
| TensorBoard fails to start | Verify dataset mounting on the job's Overview tab. Confirm the log path exists within the mounted storage. Check Resource Access Management (RAM) permissions for dataset access. |
| Reports load slowly | Reduce log frequency for high-volume experiments. Consider using incremental log processing or organizing logs into smaller subdirectories. |
| Access denied or connectivity errors | Validate network connectivity and RAM permissions. Check security group configurations. If using a VPC-dependent dataset, make sure the TensorBoard instance is configured with the same VPC. |
| High CPU or memory usage | Monitor resource utilization. If using Free Quota, consider switching to General Computing with more resources. |
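When reports load slowly, one way to find out which experiments produce the most log data is to total the file sizes per run directory. The sketch below assumes each immediate subdirectory of the log root is one run; the helper name `run_sizes` is hypothetical.

```python
import os


def run_sizes(log_root):
    """Map each immediate subdirectory of log_root to the total bytes stored under it."""
    sizes = {}
    if not os.path.isdir(log_root):
        return sizes
    for entry in sorted(os.listdir(log_root)):
        run_dir = os.path.join(log_root, entry)
        if not os.path.isdir(run_dir):
            continue  # skip loose files at the top level
        total = 0
        for root, _dirs, files in os.walk(run_dir):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        sizes[entry] = total
    return sizes


if __name__ == "__main__":
    # Print runs from largest to smallest
    for run, size in sorted(run_sizes("/mnt/data/output/runs").items(),
                            key=lambda item: item[1], reverse=True):
        print(f"{run}: {size / (1024 * 1024):.1f} MiB")
```

Runs that dominate the listing are candidates for a lower logging frequency or for being pointed at with a narrower Summary Path.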
Best practices
- Start with 2 to 4 vCPUs and 4 to 8 GiB of memory for typical workloads.
- Organize logs by experiment type and timestamp (for example, /mnt/data/output/runs/experiment_type/date_time/).
- Clean up obsolete training logs regularly to reduce storage costs and improve performance.
- Use RAM role assignments to restrict TensorBoard instance access to authorized team members.
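The log layout recommended above (experiment type, then timestamp) could be generated in the training script along these lines. This is only a sketch: the helper name `build_log_dir`, the experiment name `resnet_baseline`, and the `YYYYMMDD_HHMMSS` timestamp format are illustrative assumptions, not conventions required by TensorBoard or PAI.

```python
import posixpath
from datetime import datetime


def build_log_dir(base_dir, experiment_type, started_at=None):
    """Return base_dir/experiment_type/YYYYMMDD_HHMMSS for one training run."""
    started_at = started_at or datetime.now()
    stamp = started_at.strftime("%Y%m%d_%H%M%S")
    return posixpath.join(base_dir, experiment_type, stamp)


# Fixed timestamp so the result is reproducible
example = build_log_dir(
    "/mnt/data/output/runs", "resnet_baseline", datetime(2024, 1, 2, 3, 4, 5)
)
print(example)  # /mnt/data/output/runs/resnet_baseline/20240102_030405
```

Passing the returned path to `tf.summary.create_file_writer` keeps each run in its own subdirectory, so TensorBoard lists the runs separately and old runs can be removed one directory at a time.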