After you add Alibaba Cloud Container Service for Kubernetes (ACK) clusters to Deep Learning Containers (DLC), DLC Dashboard is installed on the clusters. You can use DLC Dashboard to manage deep learning jobs.

Prerequisites

An ACK cluster is added to DLC. For more information, see Add an ACK cluster.

Background information

DLC Dashboard allows you to manage only TensorFlow jobs that are created based on public images. To manage other types of jobs, you must use Arena. For more information, see the official Arena documentation.

Submit a job

  1. Log on to DLC Dashboard.
    1. Log on to the Machine Learning Platform for AI console.
    2. In the left-side navigation pane, choose Model Training > DLC-Cloud-native Deep Learning Model Training.
    3. On the DLC page, find the target cluster, and click Cluster Console in the Actions column.
  2. In the left-side navigation pane of DLC Dashboard, click Submit Job.
  3. On the Submit Job page, set the following parameters.
    Basic Information
    • Job Name: The name of the training job. The name must be 2 to 30 characters in length and start with a lowercase letter.
    • Type: TensorFlow is selected by default and cannot be changed.
    Job Details
    • Source Code: Set the following parameters based on where your source code is stored:
      • If your source code is stored in a repository, select Repository and specify Repository Address and Branch.
        Note DLC automatically downloads the source code to the /workspace directory. Therefore, your account must be granted the permissions to access the repository.
      • If your source code is stored in a volume mounted to the ACK cluster, select Mounted Volume. Then, select the volume from the PVCs list.
    • Command: Python commands are supported. You can pass the path of the training dataset (such as data_dir) as a parameter to the entry function in the source code, as shown in the sketch after these steps.
    Worker
    • Number of Workers: Specify the number of nodes that run the job:
      • For a standalone job, you can use the default Worker node.
      • For a distributed job, click Add Node on the right side of Worker and select PS from the drop-down list.
        Note Parameter Server (PS) nodes do not support GPU resources.
    • Image: Select a public image. The images for the PS and Worker nodes must use the same TensorFlow and Python versions. However, the images can use different CPU and GPU resources.
    • Resource: Specify CPU vCores, Memory, and GPUs.
      Note The value of GPUs cannot be greater than the number of GPUs provided by the ACK cluster.
  4. Optional: On the Submit Job page, configure the following parameters as needed.
    • Environment Variables: Define environment variables as key-value pairs. You can reference these environment variables in the source code, as shown in the sketch after these steps.
    • Storage Configuration: If your training data is stored in a volume mounted to your ACK cluster, add the logic for reading the training data to your source code. DLC Dashboard allows you to bind the volume mounted to your ACK cluster to the training job. You can then pass the path where the training data is stored as a parameter to the entry function of the training job.
      Note The volume mounted to your ACK cluster must be in the pai-dlc-user namespace. Otherwise, the volume is not displayed in the Storage Configuration list.
  5. Click Submit in the lower-right corner of the page.
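
The Command, Environment Variables, and Storage Configuration parameters come together in the entry script: the script receives the dataset path as a command-line argument and reads the variables that you define from its process environment. The following minimal sketch illustrates this wiring; the file name train.py, the --data_dir flag, and the LEARNING_RATE variable are hypothetical names chosen for illustration, not values that DLC defines.

```python
import argparse
import glob
import json
import os


def main():
    parser = argparse.ArgumentParser()
    # Receives the path that you pass in the Command field, for example
    # the mount path of a volume bound in Storage Configuration.
    parser.add_argument("--data_dir", default="/workspace/data")
    args = parser.parse_args()

    # Environment variables defined on the Submit Job page are exposed
    # to the job as ordinary process environment variables.
    learning_rate = float(os.environ.get("LEARNING_RATE", "0.001"))

    # Distributed TensorFlow jobs conventionally receive the cluster
    # layout and the role of the current node (PS or Worker) through
    # the TF_CONFIG environment variable.
    task = json.loads(os.environ.get("TF_CONFIG", "{}")).get("task", {})
    print("role=%s index=%s" % (task.get("type", "worker"), task.get("index", 0)))

    # Load the training data from the mounted path and train as usual.
    files = glob.glob(os.path.join(args.data_dir, "*.tfrecord"))
    print("found %d training files, learning rate %s" % (len(files), learning_rate))


if __name__ == "__main__":
    main()
```

With a script like this, the Command field would contain something like python /workspace/train.py --data_dir=/mnt/data, where /mnt/data stands for the mount path of the bound volume.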

Query jobs

DLC Dashboard allows you to query jobs by name, time range, and status.

  1. Log on to DLC Dashboard.
    1. Log on to the Machine Learning Platform for AI console.
    2. In the left-side navigation pane, choose Model Training > DLC-Cloud-native Deep Learning Model Training.
    3. On the DLC page, find the target cluster, and click Cluster Console in the Actions column.
  2. In the left-side navigation pane of DLC Dashboard, click Jobs.
  3. On the Jobs page, specify the filter conditions, such as Time Range, and click Search.
  4. In the Job List on the Jobs page, click the name of a job in the Name column.
  5. On the Job Details page, view the details of the job.
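
If you prefer to inspect jobs programmatically rather than in DLC Dashboard, the following sketch lists them with the official Kubernetes Python client. It rests on two assumptions that you should verify for your cluster: that your local kubeconfig points at the ACK cluster, and that the TensorFlow jobs are stored as TFJob custom resources (group kubeflow.org, version v1) in the pai-dlc-user namespace mentioned above.

```python
from kubernetes import client, config


def list_tf_jobs(namespace="pai-dlc-user"):
    # Load credentials from the local kubeconfig that points at the ACK cluster.
    config.load_kube_config()
    api = client.CustomObjectsApi()

    # TFJob is a custom resource; adjust group and version if your
    # cluster runs a different operator release.
    jobs = api.list_namespaced_custom_object(
        group="kubeflow.org",
        version="v1",
        namespace=namespace,
        plural="tfjobs",
    )
    for job in jobs.get("items", []):
        name = job["metadata"]["name"]
        conditions = job.get("status", {}).get("conditions", [])
        # The last condition reflects the most recent state of the job.
        state = conditions[-1]["type"] if conditions else "Unknown"
        print("%s: %s" % (name, state))


if __name__ == "__main__":
    list_tf_jobs()
```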