
Platform For AI: Create and manage distributed training jobs

Last Updated: Apr 01, 2026

Create distributed training jobs in Deep Learning Containers (DLC), monitor execution status, search logs by keyword, and clone or delete jobs.

Prerequisites

Before you begin, ensure that you have:

  • An Alibaba Cloud account (no additional authorization required)

  • A RAM user added as a workspace member with a role that has the required permissions. For each role's permissions, see Appendix: Roles and permissions list.

Create a training job

  1. Log on to the PAI console.

  2. In the left navigation pane, click Workspace List, and then click the workspace name.

  3. In the left navigation pane of the workspace page, choose AI Asset Management > Jobs.

  4. On the Deep Learning Containers (DLC) tab, click Create Job.

  5. Configure the parameters and click OK.

For parameter descriptions, see Create a training task.

Manage training jobs

The job list aggregates jobs from DLC, Designer algorithm nodes running on DLC, and the DLC command line interface (CLI).

Warning

Deleted jobs cannot be recovered.

You can perform the following actions on the job list:

  • Search jobs by name, ID, time range, framework, or status.

  • Click a job name to view execution status, instance status, resource view, and logs.

  • Hover over the status icon to view execution status.

  • Clone a job, or click TensorBoard in the Actions column to create a TensorBoard instance for viewing training results.

Search logs by keyword

Run a log search

  1. In the left navigation pane, choose AI Asset Management > Jobs. On the Deep Learning Containers (DLC) page, click the job name.

  2. Click the Log tab.

  3. Above Job Information, select a time range for log collection.

    Note

    Log collection may extend beyond the job end time. Select a time range that covers the full period you want to inspect.

  4. In Instance List, select the instances to include.

  5. In the search box, enter keywords to search for logs or events.

Basic query rules

SLS tokenizes log content before indexing. A multi-word keyword like loss acc matches logs that contain both loss and acc, in any order. It does not search for the exact phrase loss acc.

SLS does not support exact phrase matching. Use the most specific single token that identifies the log entry you are looking for.

Example — multi-word keyword behavior:

  • loss acc: returns logs that contain both loss and acc, in any order. The search is not limited to the exact phrase loss acc.

  • Error 404: returns logs that contain both Error and 404, in any order. The search is not limited to the exact phrase Error 404.
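The matching rule above can be sketched as a small Python model. This is an illustration of the rule, not the SLS implementation; the delimiter set is the one listed under Tokenizer limitations below, with whitespace added as an assumption.

```python
import re

# Delimiters SLS uses to split log content into tokens (from this article),
# plus whitespace, which also separates tokens (assumption for this sketch).
DELIMITERS = ",'\";=()[]{}?@&<>/:\n\t\r "

def tokenize(text):
    """Split text on the delimiter set; delimiter characters are discarded."""
    return [t for t in re.split("[" + re.escape(DELIMITERS) + "]+", text) if t]

def matches(log_line, keyword):
    """A multi-token keyword matches a log line only if every token of the
    keyword appears in the line, in any order -- not as an exact phrase."""
    log_tokens = set(tokenize(log_line))
    return all(tok in log_tokens for tok in tokenize(keyword))

print(matches("epoch 3: loss=0.42 acc=0.91", "loss acc"))  # True: both tokens present
print(matches("acc then loss", "loss acc"))                # True: order does not matter
print(matches("loss only", "loss acc"))                    # False: acc is missing
```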

Fuzzy query rules

Use * (asterisk) or ? (question mark) for wildcard searches. Other special characters are not supported.

  • * matches multiple characters. For example, abc* matches abcd and abcdef.

  • ? matches exactly one character. For example, ab?d matches abcd but not abcdef.

Wildcards must appear in the middle or at the end of a keyword — not at the start. *abc and ?abc return no results.

Note

A fuzzy query searches up to 100 matching terms in the Logstore. If a short prefix like a* matches more than 100 terms, results may be inaccurate. Use a more specific prefix to narrow the match.
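The wildcard rules can be modeled with Python's fnmatch module. This is a sketch of the matching semantics for a single token, not the SLS implementation; rejecting a leading wildcard mirrors the rule above.

```python
import fnmatch

def fuzzy_match(term, keyword):
    """Model SLS fuzzy matching of one keyword against one indexed token:
    '*' matches multiple characters and '?' matches exactly one.
    A keyword that starts with a wildcard returns no results."""
    if keyword[:1] in ("*", "?"):
        return False  # *abc and ?abc are rejected, as in SLS
    return fnmatch.fnmatchcase(term, keyword)

print(fuzzy_match("abcdef", "abc*"))  # True
print(fuzzy_match("abcd", "ab?d"))    # True
print(fuzzy_match("abcdef", "ab?d"))  # False: '?' matches exactly one character
print(fuzzy_match("abc", "*abc"))     # False: leading wildcard is not allowed
```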

Tokenizer limitations

SLS splits log content into tokens using these delimiter characters:

, ' " ; = ( ) [ ] { } ? @ & < > / : \n \t \r

Keywords made up entirely of delimiters return no results. Build keywords from the surrounding content instead.

  • To find logs containing & &&: no direct match is possible, because & && consists only of delimiters.

  • To find logs containing a &b: search for a &b, which returns logs containing both a and b.

More specific keywords produce more accurate results.
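The delimiter behavior above can be demonstrated with a short tokenization sketch (illustrative only; whitespace is added to the delimiter set as an assumption, since spaces also separate tokens):

```python
import re

# Delimiter characters listed in this article, plus whitespace (assumption).
DELIMITERS = ",'\";=()[]{}?@&<>/:\n\t\r "

def tokenize(text):
    """Split text on the delimiter set; delimiters themselves are never indexed."""
    return [t for t in re.split("[" + re.escape(DELIMITERS) + "]+", text) if t]

print(tokenize("& &&"))              # []: only delimiters, so the keyword matches nothing
print(tokenize("a &b"))              # ['a', 'b']: matches logs containing both a and b
print(tokenize("loss=0.1,acc=0.9"))  # ['loss', '0.1', 'acc', '0.9']
```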

Query examples

  • Logs containing Error: search for Error.

  • Logs containing both loss and acc: search for loss acc.

  • All logs related to Traceback: search for Traceback*.

  • Logs containing abc &def: search for abc &def.