Use experiment management in Model Gallery - Platform For AI

Prerequisites

TensorBoard metric visualization requires an OSS bucket. For instructions, see Create a bucket in the console.

Billing

Experiment Management is free of charge. However, training a model in Model Gallery and associating the task with an experiment incurs DLC training and OSS storage fees. For more information, see DLC billing and OSS billing overview.
Up to five TensorBoard instances are free. Additional instances incur charges.

Associate a model fine-tuning task with a new or existing experiment when creating the task in Model Gallery.

On the model details page, click Train.
On the fine-tuning details page, in the Experiment Configuration section, configure the experiment association.
1. To associate the task with a new experiment, select Create Experiment and specify the Experiment Name and Experiment Output Path.
  
  Note
  Experiment Output Path sets the default path for all output data from associated tasks, including models and TensorBoard logs.
  
  To customize the task output path, configure it in the Output Data Configuration section. However, modifying the default TensorBoard path prevents cross-task metric comparison in the experiment's TensorBoard instance. We recommend keeping the default path.
2. To associate the task with an existing experiment, select Existing Experiments.
  
  Select an experiment from the drop-down list, for example, exp-bert.
Configure the remaining fine-tuning task parameters. For more information, see Model deployment and training.
Click Train.

The page redirects to the Task details page, where the associated experiment name and task metadata such as hyperparameters are displayed.
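The recommendation to keep the default TensorBoard path can be illustrated with a minimal sketch. TensorBoard treats each subdirectory of its log directory that contains event files as a separate run, so only tasks whose logs sit under the experiment's path appear together in one instance. The bucket name, the task names, and the `<task>/tensorboard` layout below are assumptions made for illustration, not the layout that Model Gallery actually uses.

```python
import posixpath


def task_log_dir(experiment_output_path: str, task_name: str) -> str:
    """Build a hypothetical per-task TensorBoard log directory.

    ASSUMPTION: the "<task>/tensorboard" layout is illustrative only.
    """
    root = experiment_output_path.rstrip("/")
    return posixpath.join(root, task_name, "tensorboard")


def is_under_experiment(experiment_output_path: str, log_dir: str) -> bool:
    """Return True if log_dir is located below the experiment output path."""
    root = experiment_output_path.rstrip("/") + "/"
    return log_dir.startswith(root)


# Hypothetical paths, not real resources.
experiment_path = "oss://my-bucket/experiments/exp-bert"
default_dir = task_log_dir(experiment_path, "task-a")
custom_dir = "oss://my-bucket/custom/task-b/logs"

# A task that keeps the default path stays visible to the experiment's
# TensorBoard instance; a task with a customized path does not.
print(is_under_experiment(experiment_path, default_dir))  # True
print(is_under_experiment(experiment_path, custom_dir))   # False
```

A check of this kind is only a way to reason about the path relationship. The actual association between tasks and experiments is handled by the platform.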

Compare training metrics such as train_loss and total_flos across tasks in the same experiment by using TensorBoard.

On the Model Gallery homepage, click Job Management.
On the Job Management page, under All Experiments, find your target experiment and click Tensorboard in the Operation column.

A TensorBoard instance launches automatically.

The View TensorBoard dialog box appears, showing the instance's Name, Output Path, Status, and URL. While the instance is being created, the Status is Creating and the URL field shows a hyphen (-). The URL is generated after the instance is created. You can manage the instance by using the Delete, Stop, or Close buttons at the bottom of the dialog box.
After the TensorBoard status changes to In operation, click Go to. The TensorBoard interface opens in a new browser tab.

This page displays the metrics for all training jobs associated with the experiment. The metrics logged can vary depending on the model.

Change the horizontal axis of the chart by selecting an option under Horizontal Axis.
- STEP: Training step number.
- RELATIVE: Elapsed time since training started, in hours. Example: 0.5 hours.
- WALL: Absolute wall-clock time. Example: 10:00 AM on April 2, 2024.
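The three axis options are different views of the same logged points. The following sketch shows how each x-value could be derived, assuming every point carries a step number and a wall-clock timestamp in Unix seconds. The sample values and the `x_value` helper are hypothetical and are not part of TensorBoard.

```python
from datetime import datetime, timezone

# Hypothetical logged points: (step, wall-clock time in Unix seconds, value).
points = [
    (0, 1712052000.0, 2.31),
    (100, 1712052900.0, 1.87),
    (200, 1712053800.0, 1.52),
]


def x_value(point, axis, start_time):
    """Return the x-coordinate of a point for the chosen horizontal axis."""
    step, wall_time, _value = point
    if axis == "STEP":
        return step  # training step number
    if axis == "RELATIVE":
        return (wall_time - start_time) / 3600.0  # hours since training started
    if axis == "WALL":
        return datetime.fromtimestamp(wall_time, tz=timezone.utc)  # absolute time
    raise ValueError(f"unknown axis: {axis}")


start = points[0][1]  # timestamp of the first logged point
steps = [x_value(p, "STEP", start) for p in points]
relative_hours = [x_value(p, "RELATIVE", start) for p in points]
wall_times = [x_value(p, "WALL", start) for p in points]

print(steps)           # [0, 100, 200]
print(relative_hours)  # [0.0, 0.25, 0.5]
print(wall_times[0])   # 2024-04-02 10:00:00+00:00
```

STEP is usually the most convenient axis for comparing tasks that started at different times, because RELATIVE and WALL depend on how fast each task ran.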
Common metrics:
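- loss: A measure of the difference between model predictions and ground truth. Lower values generally indicate a better fit.
- accuracy/precision/recall: Classification performance metrics. Accuracy is the fraction of all predictions that are correct, precision is the fraction of predicted positives that are correct, and recall is the fraction of actual positives that the model finds.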
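As a worked example of the classification metrics, the following sketch computes accuracy, precision, and recall for a binary task by using their standard definitions. The labels and predictions are made-up sample data, and the metrics that a given model actually logs may be computed differently.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up sample data for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)  # {'accuracy': 0.625, 'precision': 0.6, 'recall': 0.75}
```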
Select or clear a task's checkbox to include or exclude it from the comparison.
If metric values are similar across tasks, click the button at the bottom center of the chart to zoom in on the area with the most significant differences.
Click the leftmost button to view the chart in full screen.