Prerequisites

Before you run a model training task, make sure that you have completed the following operations:

  • Create a container cluster that contains the elastic computing resources you need, either Elastic Compute Service (ECS) or EGS instances. For more information, see Create a container cluster.
  • To store training data in Object Storage Service (OSS), use the same account to create an OSS bucket. Then create a data volume in that container cluster to mount the OSS bucket as a local directory in the container that runs the training task. For more information, see Create a data volume.

Conventions

To make it easier for your application code to read training data and output training logs, the data in the training data volume is stored in the /input directory. Your code reads data from this directory.
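
As a minimal sketch of this convention, the following Python helper lists the files that a training script would find under the data directory. Only the /input path comes from the convention above. The helper name list_training_files and its root parameter are hypothetical and exist only for illustration.

```python
import os

# /input is the directory named in the convention above. Everything else
# in this sketch (function name, parameter) is an illustrative assumption.
INPUT_DIR = "/input"


def list_training_files(root=INPUT_DIR):
    """Return the sorted paths of all files under root, relative to root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)


if __name__ == "__main__":
    if not os.path.isdir(INPUT_DIR):
        # Most likely no data source was selected for the application.
        print(f"{INPUT_DIR} not found; check that a data source is configured")
    else:
        for path in list_training_files():
            print(path)
```

Running a check like this at the start of a notebook or training script is one way to confirm that the data volume is mounted before training begins.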

Procedure

  1. Log on to the Container Service console.
  2. Click Swarm > Images and Templates > Solutions in the left-side navigation pane.
  3. Click Launch in DevBox.


  4. Configure the basic information for creating a Jupyter environment.
    • Cluster: Select the cluster in which to deploy the model development application. In this example, EGS-cluster is selected.
    • Application Name: The name of the application. The name must be 1–64 characters long, can contain digits, English letters, and hyphens (-), and cannot start with a hyphen (-).
    • Framework: The supported frameworks include TensorFlow, Keras, and Python.
    • GPUs: The number of GPUs in use. If this field is set to 0, no GPU is used.
    • Data Source: Select the data source used to store training data. Select an OSS data volume created in the cluster, or select Local Directory and enter the absolute path. You can also select No Data Source.
    • Jupyter Password: The password used to log on to Jupyter.
    • Enable Monitor: Select whether to use TensorBoard to monitor the training status. If you select this check box, enter the path of the training logs in the Log Directory field and make sure that it matches the log output path in your training code.
    • Enable SSH: Select whether to allow access to services over SSH. If you select this check box, enter your SSH password in the SSH Password field.
    Note
    For information about how to access services by using SSH, see Access Jupyter services by using SSH.
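
The Application Name rule above can be checked locally before the form is submitted. The following is a minimal sketch that assumes the rule is exactly as stated (1–64 characters, digits, English letters, and hyphens, no leading hyphen). The function name is_valid_app_name is hypothetical, and the console may apply additional checks that this sketch does not cover.

```python
import re

# Rule as stated for the Application Name field: 1-64 characters drawn from
# digits, English letters, and hyphens, with no leading hyphen.
# Assumption: no other constraints apply.
_NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]{0,63}")


def is_valid_app_name(name):
    """Return True if name satisfies the naming rule described above."""
    return _NAME_RE.fullmatch(name) is not None
```

For example, is_valid_app_name("tf-devbox-1") returns True, while is_valid_app_name("-devbox") returns False because the name starts with a hyphen.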


  5. Click OK after completing the configurations.
  6. On the Application List page, click the name of the created application.


  7. Click the Routes tab. Two links starting with jupyter and tensorboard respectively are displayed.


  8. Click the link starting with jupyter and enter the Jupyter password to access the Jupyter environment.
  9. Click the link starting with tensorboard to view the training results.


  10. The training data in the distributed storage is available in the local /input folder. Read your data from this folder.