Platform For AI: Commands used to query logs or jobs

Last Updated: Jan 12, 2024

You can use the Deep Learning Containers (DLC) client to view the logs, list, and details of DLC jobs. This topic describes the syntax and parameters of the commands that are used to query logs and jobs, and provides examples.

The logs command

  • Feature description

    The command is used to query the logs of a training job.

  • Syntax

    ./dlc logs <yourJobId> <yourPodId> [--max_events_num <yourMaxNum>] [--start_time <yourStartTime>] [--end_time <yourEndTime>]
  • Parameters

    | Parameter | Required | Description | Type |
    |---|---|---|---|
    | <yourJobId> | Yes | The ID of the training job that you want to query. | STRING |
    | <yourPodId> | Yes | The ID of the pod whose logs you want to view. A distributed job runs multiple pods, so you must specify which pod to query. | STRING |
    | --max_events_num <yourMaxNum> | No | The maximum number of log entries to return. Default value: 2000. | INT |
    | --start_time <yourStartTime> | No | The start time of the query. The default value is 7 days before the current time. Example: --start_time 2020-11-08T16:00:00Z. | STRING |
    | --end_time <yourEndTime> | No | The end time of the query. The default value is the current time. Example: --end_time 2020-11-08T17:00:00Z. | STRING |

  • Examples

    Query at most 10 log entries from pod worker-0 of a distributed training job.

    ./dlc logs dlcdys3r9jlu**** dlcdys3r********-worker-0 --max_events_num 10

    The system returns information similar to the following output:

    WARN: ./requirements.txt not found, skip installing requirements.
    ================================================
    |  PAI Tensorflow powered by Aliyun PAI Team.  |
    ================================================
    Network is under initialization...
    Network successfully initialized.
    [2021-04-16 12:27:56.368026] [INFO] [7#7] [tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
    [2021-04-16 12:27:56.375586] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:80] ====================CPU Architecture=====================
    [2021-04-16 12:27:56.375600] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:84] Disable AVX512.
    [2021-04-16 12:27:56.375605] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:87] CPU Vendor ID: GenuineIntel
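
The time parameters described above take timestamps in the form 2020-11-08T16:00:00Z. The following Python sketch shows one way to assemble a logs invocation with a time window. The helper names to_dlc_time and build_logs_command are hypothetical and are not part of the DLC client, and the sketch assumes that the trailing Z means the client expects UTC.

```python
from datetime import datetime, timedelta, timezone


def to_dlc_time(moment):
    """Format a timezone-aware datetime like the examples above (assumed to be UTC)."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_logs_command(job_id, pod_id, max_events_num=None,
                       start_time=None, end_time=None):
    """Return the argument list for ./dlc logs.

    Options left as None are omitted, so the client defaults apply.
    """
    args = ["./dlc", "logs", job_id, pod_id]
    if max_events_num is not None:
        args += ["--max_events_num", str(max_events_num)]
    if start_time is not None:
        args += ["--start_time", to_dlc_time(start_time)]
    if end_time is not None:
        args += ["--end_time", to_dlc_time(end_time)]
    return args


if __name__ == "__main__":
    # Query a one-hour window; the masked IDs are the placeholders used in this topic.
    end = datetime(2020, 11, 8, 17, 0, 0, tzinfo=timezone.utc)
    start = end - timedelta(hours=1)
    cmd = build_logs_command("dlcdys3r9jlu****", "dlcdys3r********-worker-0",
                             max_events_num=10, start_time=start, end_time=end)
    print(" ".join(cmd))
```

On a machine where the DLC client is installed and configured, the returned list can be passed to subprocess.run(cmd, capture_output=True, text=True) to capture the log text.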

The get job command

  • Feature description

    The command is used to query training jobs. If you specify a job ID, only that job is queried. If you do not specify a job ID, all jobs are queried.

  • Syntax

    ./dlc get job [JOB_ID] [--workspace_id <yourWorkspaceId>] [--display_name <yourJobName>] [--job_type <yourJobType>] [--status <yourJobStatus>] [--start_time <yourStartTime>] [--end_time <yourEndTime>] [--page_num <yourPageNum>] [--page_size <yourPageSize>] [--max_events_num <yourMaxNum>] [--events] [--events_only]
  • Parameters

    | Parameter | Required | Description | Type |
    |---|---|---|---|
    | JOB_ID | No | The ID of the training job that you want to query. | STRING |
    | --workspace_id <yourWorkspaceId> | No | The workspace ID. | STRING |
    | --display_name <yourJobName> | No | The name of the job. Fuzzy match is supported and is case-insensitive. Wildcards are not supported. | STRING |
    | --job_type <yourJobType> | No | The type of the job. This parameter is empty by default, which indicates all types. | STRING |
    | --status <yourJobStatus> | No | The status of the job. This parameter is empty by default, which indicates all states. | STRING |
    | --start_time <yourStartTime> | No | The start time of the query. Example: --start_time 2022-08-04T02:09:32Z. | STRING |
    | --end_time <yourEndTime> | No | The end time of the query. Example: --end_time 2022-08-04T02:09:32Z. | STRING |
    | --page_num <yourPageNum> | No | The number of the page to return. Page numbers start from 1. Default value: 1. | INT |
    | --page_size <yourPageSize> | No | The number of entries to return on each page. Default value: 10. | INT |
    | --max_events_num <yourMaxNum> | No | The maximum number of rows of system events to return. Default value: 2000. | INT |
    | --events | No | Specifies whether to query the system events of a job. This parameter takes effect only when a single job is queried. Default value: false. | BOOL |
    | --events_only | No | Specifies whether to query only the system events of a job. This parameter takes effect only when a single job is queried. Default value: false. | BOOL |

  • Examples

    • Query training jobs by name based on fuzzy match.

      ./dlc get job --display_name epl

      The system returns information similar to the following output:

      +--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
      |        Name        |      JobId       | WorkspaceId |  WorkspaceName   | ResourceId |  ResourceName  | JobType | Priority | JobStatus |      UserId      |      CreateTime      |    SubmittedTime     |     RunningTime      |    SuccessedTime     | StoppedTime | FailedTime |      FinishTime      | Duration(seconds) |
      +--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
      | test_epl_test-**** | dlc02xipvt5z**** | 23****      | doc_test_****    |            | public-cluster | TFJob   | 1        | Succeeded | 144963168668**** | 2022-08-01T06:41:05Z | 2022-08-01T06:45:08Z | 2022-08-01T06:48:57Z | 2022-08-01T06:53:21Z |             |            | 2022-08-01T06:53:21Z | 736               |
      | test_epl_****      | dlc1iyv3szl2**** | 23****      | doc_test_****    |            | public-cluster | TFJob   | 1        | Succeeded | 144963168668**** | 2022-08-01T03:23:51Z | 2022-08-01T03:27:22Z | 2022-08-01T03:27:50Z | 2022-08-01T03:33:48Z |             |            | 2022-08-01T03:33:48Z | 597               |
      +--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
    • Query a specified training job.

      ./dlc get job dlc02xipvt5z****

      The system returns information similar to the following output:

      {
         "ClusterId": "",
         "CodeSource": {
            "Branch": "main",
            "CodeSourceId": "code-29****c****c4****ae0c9ec75a5****",
            "MountPath": ""
         },
         "DataSources": [
            {
               "DataSourceId": "d-ya7gc2p2iqq240****",
               "MountPath": ""
            }
         ],
         "DisplayName": "test_epl_test-****",
         "Duration": 736,
         "ElasticSpec": {
            "AIMasterType": "",
            "EnableElasticTraining": false,
            "MaxParallelism": 0,
            "MinParallelism": 0
         },
         "EnabledDebugger": false,
         "GmtCreateTime": "2022-08-01T06:41:05Z",
         "GmtFinishTime": "2022-08-01T06:53:21Z",
         "GmtRunningTime": "2022-08-01T06:48:57Z",
         "GmtSubmittedTime": "2022-08-01T06:45:08Z",
         "GmtSuccessedTime": "2022-08-01T06:53:21Z",
         "JobId": "dlc02xipvt5z****",
         "JobSpecs": [
            {
               "AssignNodeSpec": {
                  "EnableAssignNode": false,
                  "NodeNames": ""
               },
               "EcsSpec": "ecs.gn6v-c8g1.2xlarge",
               "Image": "registry.cn-shanghai.aliyuncs.com/pai-dlc/tensorflow-training:1.15-gpu-py36-cu100-ubuntu1****",
               "PodCount": 2,
               "ResourceConfig": {
                  "CPU": "",
                  "GPU": "",
                  "GPUType": "",
                  "Memory": "",
                  "SharedMemory": ""
               },
               "Type": "Worker",
               "UseSpotInstance": false
            }
         ],
         "JobType": "TFJob",
         "Pods": [
            {
               "GmtCreateTime": "2022-08-01T06:45:08Z",
               "GmtFinishTime": "2022-08-01T06:53:20Z",
               "GmtStartTime": "2022-08-01T06:52:06Z",
               "Ip": "10.224.xx.xx",
               "PodId": "dlc02xipvt5z****-worker-0",
               "PodUid": "",
               "Status": "Succeeded",
               "Type": "worker"
            },
            {
               "GmtCreateTime": "2022-08-01T06:45:08Z",
               "GmtFinishTime": "2022-08-01T06:53:20Z",
               "GmtStartTime": "2022-08-01T06:48:57Z",
               "Ip": "10.224.xx.xx",
               "PodId": "dlc02xipvt5z****-worker-1",
               "PodUid": "",
               "Status": "Succeeded",
               "Type": "worker"
            }
         ],
         "ReasonCode": "JobSucceeded",
         "ReasonMessage": "TFJob dlc02xipvt5z**** successfully completed.",
         "RequestId": "76FC3500-xxxx-533F-B24A-AC9B2A72****",
         "ResourceId": "",
         "Priority": 1,
         "ResourceLevel": "",
         "Settings": {
            "BusinessUserId": "",
            "Caller": "",
            "EnableErrorMonitoringInAIMaster": false,
            "EnableTideResource": false,
            "ErrorMonitoringArgs": "",
            "PipelineId": ""
         },
         "Status": "Succeeded",
         "ThirdpartyLibDir": "",
         "UserCommand": "cd /root/xxxx/xxxx/\npip install .\ncd examples/resnet\nbash scripts/xxxx_dp.sh",
         "UserId": "144963168668****",
         "WorkspaceId": "23****",
         "WorkspaceName": "doc_test_****"
      }
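
The Pods array in the job details lists the pod IDs that the logs command requires. The following Python sketch reads those IDs from the details of a single job. It assumes that the text passed in is exactly a JSON document shaped like the example above. The function names pod_ids and pods_with_status are hypothetical, and the sample is an abbreviated copy of the example output.

```python
import json

# Abbreviated copy of the example output above; IDs are masked as in this topic.
SAMPLE = """
{
   "JobId": "dlc02xipvt5z****",
   "Status": "Succeeded",
   "Pods": [
      {"PodId": "dlc02xipvt5z****-worker-0", "Status": "Succeeded", "Type": "worker"},
      {"PodId": "dlc02xipvt5z****-worker-1", "Status": "Succeeded", "Type": "worker"}
   ]
}
"""


def pod_ids(job_detail_text):
    """Return the PodId of every entry in the "Pods" array, in order."""
    detail = json.loads(job_detail_text)
    return [pod["PodId"] for pod in detail.get("Pods", [])]


def pods_with_status(job_detail_text, status):
    """Return the PodIds whose "Status" field equals the given status."""
    detail = json.loads(job_detail_text)
    return [pod["PodId"] for pod in detail.get("Pods", [])
            if pod.get("Status") == status]


if __name__ == "__main__":
    for pod_id in pod_ids(SAMPLE):
        # Each ID can be used as <yourPodId> in the logs command.
        print(pod_id)
```

If the client prints additional text around the JSON, json.loads raises an error, so the input may need to be trimmed first.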

References

You can view job details in the console. For more information, see View training details.