Platform for AI: View training jobs

Last Updated: Jan 08, 2024

After you submit a training job, you can obtain the job status by viewing the basic information and configurations, events, resource views, and job logs.

View the basic information and configurations of a job

  1. Go to the Distributed Training Jobs page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose Model Development and Training > Deep Learning Containers (DLC).

  2. Find the job that you want to manage and click Monitoring in the Actions column. The job details page appears.

  3. On the Details page, you can view the basic information and related configurations of the job, including the execution configuration and resource configuration.

View job events

You can view the job scheduling logs and resource-related logs in DLC and use the logs when you troubleshoot issues.

  • View the job event logs:

    On the Event tab of the Details page, view the event logs of the job.

  • View node event logs:

    On the Instance tab of the Details page, find the instance that you want to manage and click Log in the Actions column. On the Event tab of the page that appears, view the event logs of the instance.

View the resource view

On the Resource View tab, you can view the following metrics: GPU usage, GPU memory usage, CPU usage, memory usage, and network I/O. The tab shows the resource usage of the job in real time, which helps you understand the resource requirements of the job and allocate resources in a cost-effective manner.

View the resource views on the Resource View tab of the Details page.
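
If you note down readings from the Resource View charts, a short script can help you judge whether the job is over-provisioned. The following is a minimal sketch, not part of PAI: the `samples` layout, the metric names, the sample values, and the 50% threshold are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical utilization readings (percent, 0-100), for example noted down
# from the Resource View charts. The keys and values are illustrative only.
samples = {
    "gpu_usage": [35.0, 42.0, 38.0, 40.0],
    "gpu_memory_usage": [71.0, 73.0, 72.0, 74.0],
    "cpu_usage": [12.0, 15.0, 11.0, 14.0],
    "memory_usage": [55.0, 56.0, 58.0, 57.0],
}


def summarize(metric_samples):
    """Return {metric: (mean, peak)} for every metric that has samples."""
    return {
        name: (mean(values), max(values))
        for name, values in metric_samples.items()
        if values
    }


def underused(summary, peak_threshold=50.0):
    """Return the metrics whose peak utilization stays below the threshold."""
    return sorted(
        name for name, (_, peak) in summary.items() if peak < peak_threshold
    )


summary = summarize(samples)
low = underused(summary)
print(low)
```

A metric that appears in `low` never reached the threshold during the sampled period, which may indicate that a smaller resource specification is sufficient for that dimension.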

View the job logs

When a job behaves unexpectedly or you need to review its execution history, you can view the job logs to obtain key information about the job execution. You can view the logs by using one of the following methods:

  • On the Instance tab of the Details page, find the instance that you want to manage and click Log in the Actions column. On the Log tab of the page that appears, view the logs of the instance.

  • On the Aggregated Logs tab of the Details page, use a keyword to search for log events. For more information, see the "Search for aggregated logs by keyword" section in the Create and manage container training jobs topic.
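
The keyword search idea can also be applied to log text that you have copied out of the console. The following is a minimal sketch under that assumption; `search_logs` is a hypothetical helper, the sample log lines are made up, and the code does not call any PAI API.

```python
def search_logs(lines, keyword, ignore_case=True):
    """Return (line_number, text) pairs for the lines that contain keyword."""
    needle = keyword.lower() if ignore_case else keyword
    hits = []
    for number, line in enumerate(lines, start=1):
        haystack = line.lower() if ignore_case else line
        if needle in haystack:
            hits.append((number, line.rstrip("\n")))
    return hits


# Made-up log text that stands in for logs copied from the Log tab.
log_text = """\
step 100 loss=0.52
WARNING: slow data loader
step 200 loss=0.47
RuntimeError: CUDA out of memory
"""

hits = search_logs(log_text.splitlines(), "error")
for number, text in hits:
    print(f"{number}: {text}")
```

The match is case-insensitive by default, so the keyword `error` also finds `RuntimeError`. Pass `ignore_case=False` for an exact-case match.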

Query behavior events

Platform for AI (PAI) is integrated with ActionTrail. You can view and search for the DLC behavior events of your Alibaba Cloud account in the last 90 days in ActionTrail.
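
The 90-day window matters when you plan an audit: events older than that are no longer searchable as described above. The following minimal sketch checks whether a timestamp still falls inside the window. The function name, the event records, and their field names are hypothetical and do not reflect the ActionTrail event schema.

```python
from datetime import datetime, timedelta, timezone

# Query window stated in this topic.
RETENTION = timedelta(days=90)


def within_retention(event_time, now):
    """Return True if event_time is at most 90 days before now."""
    age = now - event_time
    return timedelta(0) <= age <= RETENTION


# Hypothetical records; the field names are assumptions for illustration.
events = [
    {"name": "CreateJob", "time": datetime(2023, 12, 20, tzinfo=timezone.utc)},
    {"name": "StopJob", "time": datetime(2023, 9, 1, tzinfo=timezone.utc)},
]

now = datetime(2024, 1, 8, tzinfo=timezone.utc)
queryable = [e["name"] for e in events if within_retention(e["time"], now)]
print(queryable)
```

Timezone-aware timestamps are used on both sides of the comparison so that the subtraction does not mix naive and aware values.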

References

You can manage training jobs based on the job status. For more information, see Manage training jobs.