Platform for AI: Create and manage container training jobs

Last Updated: Dec 27, 2023

The Distributed Training Jobs page in the Platform for AI (PAI) console allows you to manage container training jobs in a visualized and centralized manner. The training jobs are powered by the Deep Learning Containers (DLC) module of PAI. This topic describes how to create and manage container training jobs.

Account and permission requirements

  • Alibaba Cloud account: You can use an Alibaba Cloud account to complete all operations without additional authorization.

  • RAM user: You must add the Resource Access Management (RAM) user to the workspace as a member that has specific roles, and assign permissions to those roles. For more information, go to the Roles and Permissions page.

Create a container training job

You can create a DLC training job on the Distributed Training Jobs page.

  1. Go to the Distributed Training Jobs page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose AI Computing Asset Management > Jobs to go to the Distributed Training Jobs page.

  2. On the Distributed Training Jobs tab, click Create Job.

  3. On the Create Job page, configure the parameters and click Submit.

    For more information about how to configure the parameters, see Submit jobs by using the console.

Manage container training jobs

The Distributed Training Jobs page displays the distributed training jobs that are submitted on the DLC tab or by using DLC CLIs, and the pipeline tasks that are submitted by using Machine Learning Designer. The following figure shows how to manage the jobs and tasks displayed on the page.

Warning

You cannot restore deleted DLC jobs. Proceed with caution.

  • ①: You can search for the training job that you want to manage by Job Name, Job ID, Running Duration, Job Type, or Status.

  • ②: You can click the name of the job to go to the job details page, on which you can view the details of the job status, instance execution state, resource view, and logs.

  • ③: You can also move the pointer over the icon next to the state of the job to view the job state, as shown in the ③ section of the preceding figure.

  • ④: You can find the job that you want to manage and click Clone in the Actions column to duplicate the job. You can also click TensorBoard in the Actions column to create a TensorBoard instance for the job and view the visualized training results of the job on the TensorBoard page.

Search for aggregated logs by keyword

Procedure

Perform the following steps to search for log events by keywords on the Aggregated Logs tab.

  1. In the left-side navigation pane, choose AI Computing Asset Management > Jobs. On the Distributed Training Jobs page, click the job name.

  2. Click the Aggregated Logs tab and configure the parameters.

    1. In the Job Information section, select a time range for log collection.

      Note

      Logs may be collected after the end time of the job. Select the time range based on the actual situation.

    2. In the Instances section, select an instance.

    3. Enter a keyword in the input box on the right to search for related logs or events.

Basic search rules

DLC requires you to enter complete words as keywords to search for aggregated logs. Simple Log Service (SLS), which performs the search, uses word segmentation to query logs, so a phrase cannot be matched exactly as a whole.

For example, if you use the keyword phrase abc def, the search results include all logs that contain abc or def. The search cannot be restricted to logs that contain the exact phrase abc def.
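The following Python snippet is a minimal local sketch of the behavior described in this section. It is not the SLS implementation or any PAI API: the function names are hypothetical, and splitting on whitespace only is a simplifying assumption. It shows that a phrase is split into separate terms and that a log matches if it contains any one term as a complete word.

```python
def split_keyword(keyword: str) -> list[str]:
    """Split a keyword phrase on whitespace into individual search terms."""
    return keyword.split()


def matches_any_term(log_line: str, keyword: str) -> bool:
    """Return True if the log line contains at least one term as a complete word."""
    words = set(log_line.split())
    return any(term in words for term in split_keyword(keyword))


logs = [
    "step 1 abc done",
    "step 2 def done",
    "step 3 abc def done",
    "step 4 xyz done",
]

# The phrase "abc def" behaves like "abc OR def" in this model.
hits = [line for line in logs if matches_any_term(line, "abc def")]
for line in hits:
    print(line)
```

In this sketch, the first three log lines are returned, and a line that contains only abcdef is not matched by the keyword abc because abc is not a complete word in that line.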

Fuzzy search rules

When you search for aggregated logs by keywords, you can use asterisks (*) and question marks (?) to perform fuzzy search. Other special characters are invalid. The following section describes the rule details:

  • An asterisk (*) indicates zero or more occurrences of characters. A question mark (?) indicates one occurrence of a character.

  • You can add an asterisk (*) or a question mark (?) as a wildcard character to the middle or end of a keyword to perform a fuzzy search. A keyword that starts with a wildcard character is invalid.

For example, you can use the keyword abc* to search for words that start with abc, and the keyword ab?d to search for words that start with ab, end with d, and contain a single character in the middle.
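The wildcard rules can be approximated locally with a regular expression. The following sketch is an illustration under assumptions, not the SLS matching engine: the function names are hypothetical, and the rule that other special characters are invalid is not modeled. It rejects keywords that start with a wildcard and translates * and ? as described in the rules.

```python
import re


def is_valid_keyword(keyword: str) -> bool:
    """A keyword must be non-empty and must not start with a wildcard."""
    return bool(keyword) and keyword[0] not in "*?"


def keyword_to_regex(keyword: str) -> re.Pattern:
    """Translate * to zero or more characters and ? to exactly one character."""
    if not is_valid_keyword(keyword):
        raise ValueError(f"invalid keyword: {keyword!r}")
    parts = []
    for ch in keyword:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # all other characters match literally
    return re.compile("".join(parts))


def word_matches(word: str, keyword: str) -> bool:
    """Return True if the whole word matches the wildcard keyword."""
    return keyword_to_regex(keyword).fullmatch(word) is not None


print(word_matches("abcdef", "abc*"))  # word starts with abc
print(word_matches("abxd", "ab?d"))    # exactly one character between ab and d
print(is_valid_keyword("*abc"))        # leading wildcard is rejected
```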

Note

SLS searches all logs and obtains up to 100 strings that meet the specified conditions. Then, SLS returns the logs that contain one or more of the 100 strings and meet the search conditions. If the prefix is short, the number of matched words may exceed 100. In this case, only some of the matched logs are returned. The more specific a keyword is, the more accurate the search results are.
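The effect of the 100-string limit can be illustrated with a small local model. This is a sketch under assumptions: how SLS chooses which 100 strings to keep is not documented here, so the snippet simply keeps the first 100 matches in vocabulary order. The function name and sample vocabulary are hypothetical, and fnmatchcase is used only as an approximation of the wildcard rules (it also treats square brackets specially).

```python
from fnmatch import fnmatchcase

MAX_EXPANSIONS = 100  # limit stated in the note


def expand_keyword(keyword: str, vocabulary: list[str],
                   limit: int = MAX_EXPANSIONS) -> list[str]:
    """Collect at most `limit` words that match the wildcard keyword."""
    matched = []
    for word in vocabulary:
        if fnmatchcase(word, keyword):
            matched.append(word)
            if len(matched) == limit:
                break  # assumption: later matches are simply dropped
    return matched


# 150 distinct words that all start with "ab".
vocabulary = [f"ab{i}" for i in range(150)]

short_prefix = expand_keyword("ab*", vocabulary)    # hits the limit
long_prefix = expand_keyword("ab14*", vocabulary)   # stays well below the limit

print(len(short_prefix), len(long_prefix))
```

With the short prefix, 50 of the 150 matching words are dropped in this model, so logs that contain only those words would not be returned. The longer prefix matches few enough words that none are dropped.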

Limits on delimiters

SLS for DLC uses the following common characters as delimiters: , '";=()[\",\"]{}?@&<>/:\n\t\r.

The delimiters are used to split the content of a log into multiple strings. Therefore, a string that contains only delimiters cannot be used as a keyword: no results are returned for such a keyword.

Example 1: The string &&& cannot be used as a keyword, and no logs are returned. We recommend that you use another keyword based on the context of the content that you want to search for.

Example 2: If you want to search for logs that contain a&b, we recommend that you set the keyword to a&b instead of &. Because & is a delimiter, the keyword a&b is split into a and b, and logs that contain a or b are returned. The more detailed the keyword is, the more accurate the results are.
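The two examples can be reproduced with a small tokenizer. The following is a local sketch, not the SLS tokenizer. The delimiter set is an assumption based on the list in this section: the square brackets are read as plain [ and ], the escape sequences are read as line feed, tab, and carriage return, and the trailing period is treated as sentence punctuation rather than a delimiter. The function names are hypothetical.

```python
# Assumed delimiter set (see the assumptions stated in the lead-in).
DELIMITERS = set(", '\";=()[]{}?@&<>/:\n\t\r")


def tokenize(text: str) -> list[str]:
    """Split text into strings at every delimiter and drop empty pieces."""
    tokens, current = [], []
    for ch in text:
        if ch in DELIMITERS:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def is_searchable(keyword: str) -> bool:
    """A keyword that consists only of delimiters yields no search terms."""
    return bool(tokenize(keyword))


print(tokenize("a&b"))        # the keyword is split at &
print(is_searchable("&&&"))   # only delimiters, nothing left to search for
print(tokenize("loss=0.25, acc=0.9"))
```

Under these assumptions, the keyword a&b yields the two terms a and b, and the keyword &&& yields no terms at all, which matches the behavior described in the examples.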

Example keywords

Requirement                                     | Example keyword
Search for logs that contain Error.             | Error
Search for logs that contain loss and acc.      | loss acc
Fuzzy search for logs that contain Traceback.   | Traceback*
Search for logs that contain abc&def.           | abc&def