You can run your model training code on the elastic computing resources and storage services provided by Alibaba Cloud to quickly iterate on standalone training. During training, you can check logs and monitor the training status at any time.
Before running a model training task, make sure you have performed the following operations:
- Create a container cluster that contains a certain number of elastic computing resources (Elastic Compute Service (ECS) or EGS). For more information, see Create a container cluster.
- To use Object Storage Service (OSS) to store data for model training, use the same account to create an OSS bucket. Then, create data volumes in the preceding container cluster to mount the OSS bucket as a local directory to the container in which you want to run the training task. For more information, see Create a data volume.
- Synchronize the model training code to an available GitHub code repository.
To make it easier for your training code to read training data, output training logs, and store training iteration state (checkpoints), note the following conventions:
- The training data is automatically stored in the /input directory, with the same path structure as in OSS. Your code reads the training data from this directory.
- All data output by your code, including logs and checkpoint files, must be written to the /output directory. All files in the /output directory are automatically synchronized to your OSS bucket with the same directory structure.
- If the training code requires special Python dependencies, write all dependencies to a requirements.txt configuration file and store the file in the root directory of the code in the GitHub repository.
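The directory conventions above can be sketched as a small Python helper. This is an illustrative sketch, not code the service requires; the file names train.tfrecords and model.ckpt are hypothetical examples.

```python
import os

# Directory conventions from the service: training data is mounted under
# /input (mirroring the OSS bucket), and anything written to /output is
# synchronized back to the OSS bucket after the job finishes.
INPUT_DIR = "/input"
OUTPUT_DIR = "/output"

def resolve_paths(input_dir=INPUT_DIR, output_dir=OUTPUT_DIR):
    """Return the conventional data, log, and checkpoint paths.

    The file names used here are illustrative only; your training code
    can use any names under these directories.
    """
    return {
        "data": os.path.join(input_dir, "train.tfrecords"),
        "logs": os.path.join(output_dir, "training_logs"),
        "checkpoint": os.path.join(output_dir, "model.ckpt"),
    }
```

Because everything under /output is synchronized to OSS, writing logs and checkpoints there is all that is needed to persist them.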
- Log on to the Container Service console.
- Click in the left-side navigation pane.
- Click Launch in Training.
- Configure the basic information for the standalone training task by completing the following settings.
- Cluster: Select the container cluster in which the standalone model training task will be run.
- Application Name: The name of the application created for running the standalone model training task. The name must be 1–64 characters long and can contain numbers, letters, and hyphens (-). For example, tf-train-standalone.
- Framework: Select the framework used for model training. The supported frameworks are TensorFlow, Keras, Python, and customized images. If you select Customized Image, enter a valid image address in the Image field.
- Distributed Training: Select whether to perform distributed training. In this example, clear this check box to perform standalone training.
- GPUs Per Worker: Specify the number of GPUs to use. Set this parameter to 0 to train on CPU instead of GPU.
- Data Source: Select the data source used to store the training data. Select an OSS data volume created in the cluster, or select Local Directory and enter an absolute path. You can also select No Data Source. In this example, the tfoss data volume is selected.
- Git URL: Specify the address of the GitHub code repository in which the training code is stored.
Note Currently, HTTP and HTTPS are supported, while SSH is not supported.
- Private Git Information: Select this check box if you use a private code repository. Then, configure the Git Username (your GitHub account) and Git Password.
- Command: Specify the command used to run the preceding code to perform model training.
Note If you select a framework that supports Python 3, call python3, instead of python, in the command line.
- Enable Monitor: Select whether to use TensorBoard to monitor the training status. If you select this check box, enter a valid path in the Log Directory field, where the training logs will be stored, and make sure that this path is the same as the log output path in the training code. For example, if you set Log Directory to /output/training_logs, your code must output the logs to that same path. As shown in the preceding figure, path consistency is guaranteed by passing the --log_dir command line parameter to the code.
Note Here, the training logs are the event files output by the TensorFlow API for TensorBoard, and the checkpoint files that store the model state.
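One way to keep the Command and the Log Directory field consistent is to have the training code accept the log path as a flag. A minimal sketch is below; the entry-point name train.py and the --data_dir flag are hypothetical, so a Command such as python3 train.py --log_dir=/output/training_logs is only an example.

```python
import argparse

def parse_args(argv=None):
    """Parse the flags a hypothetical train.py entry point might accept.

    --log_dir must match the Log Directory field so that TensorBoard
    event files and checkpoints land where the service expects them.
    """
    parser = argparse.ArgumentParser(description="Standalone training sketch")
    parser.add_argument("--log_dir", default="/output/training_logs",
                        help="Where event files and checkpoints are written; "
                             "keep consistent with the Log Directory field")
    parser.add_argument("--data_dir", default="/input",
                        help="Training data mounted from the OSS bucket")
    return parser.parse_args(argv)
```

With this structure, changing the Log Directory field only requires updating the --log_dir value in the Command, not the code itself.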
- After completing the configurations, click OK to create the application used to perform model training tasks.
- Click Applications in the left-side navigation pane. Select the cluster from the Cluster drop-down list and then click the name of the created application (tf-train-standalone in this example).
- Click the Routes tab. A link starting with tensorboard is displayed.
Click this link to open the TensorBoard monitoring page. You can view the training monitoring data.
- Click the Logs tab to view the logs that are output to stdout/stderr by your codes when the training tasks are being performed.
Note Generally, when the training task application runs, multiple service containers are automatically created to run different parts of the program. For example, the worker container generally runs your model code, while the tensorboard container runs TensorBoard training monitoring. You can click the Services or Containers tab to view the created services and containers.
You can filter the logs by service container name, for example, tf-train-standalone_worker_1, and view the detailed logs that the container's program outputs to stdout/stderr. You can also filter the logs by time and display quantity, and download the log file to a local directory.
Note The stdout/stderr logs mentioned here are different from the preceding TensorBoard event files and checkpoint files.
- Click the Services tab. If the status of the worker service is Stopped, the training task is completed.
- In the OSS client, you can check that the training results are automatically stored in the OSS bucket.
After the training is completed, the system automatically copies all files stored in the local /output directory to the OSS bucket corresponding to your specified data volume. In the OSS client, you can check that the event files and checkpoint files written to the /output directory have been backed up to the OSS bucket.
- Manage the training task.
To stop, restart, or delete a training task (for example, to release computing resources for a new training task), click Applications in the left-side navigation pane to return to the Application List page, find the corresponding application, and perform the operations on the right.
Note By using the model training service described in this document, you can not only train a model from scratch, but also use new data to continue training (for example, fine-tuning) based on an existing model (checkpoint). You can adjust the hyperparameters and perform iterative training by continuously updating the configurations of an existing application.