Enterprises usually want job results as early as possible so that they can make business decisions based on the results without delay. Job developers must therefore monitor job status to identify and optimize jobs that run slowly. You can use Logview of MaxCompute to diagnose such jobs. This topic describes the common causes of slow jobs, how to view information about these jobs, and the related solutions.

Background information

Logview of MaxCompute records all logs of jobs and provides guidance for you to view and debug jobs. You can obtain the Logview URL below Log view in the job result. MaxCompute provides two versions of Logview. We recommend that you use Logview V2.0 because it provides faster page loading and a better design style. For more information about Logview V2.0, see Logview V2.0.

Common causes of slow jobs:
  • Insufficient CUs

    If the MaxCompute project uses the subscription billing method and a large number of jobs are submitted or a large number of small files are generated within a specific period of time, all the purchased compute units (CUs) are occupied and the jobs become queued.

  • Data skew

    If some Fuxi instances process much more data than others or are dedicated to special data, long tails may occur: these instances continue to run after most of the other instances are complete.

  • Inefficient code logic

    If the SQL or user-defined function (UDF) logic is inefficient or parameter settings are not optimal, a Fuxi task may run for a long period of time. Unlike data skew, each Fuxi instance of the task runs for approximately the same amount of time. For more information about the relationships between jobs, Fuxi tasks, and Fuxi instances, see Job details.

Insufficient CUs

Problem description
If the CUs are insufficient, the following issues may occur after you submit a job:
  • Issue 1: Job Queueing... is displayed.

    The job may be queued because other jobs occupy the resources of the resource group. You can perform the following steps to view the duration for which the job is queued:

    1. Obtain the Logview URL in the job result and open the URL in a browser.
    2. On the SubStatusHistory tab of Logview, find Waiting for scheduling in the Description column and view the value in the Latency column. The value indicates the duration for which the job is queued.

       Job queuing duration
  • Issue 2: The job runs slowly.
    The job requires a large number of CUs, but the resource group cannot start all Fuxi instances at the same time. As a result, the job runs slowly. You can perform the following steps to view the job status:
    1. Obtain the Logview URL in the job result and open the URL in a browser.
    2. In the Fuxi Instance section of the Job Details tab, click Latency chart to view the job status diagram.
      The following figure shows the status of a job that has sufficient resources. The lower blue part in the diagram remains at approximately the same height, which indicates that all Fuxi instances of the job start at approximately the same time.

      Job status-Sufficient resources

      The following figure shows the status of a job that does not have sufficient resources. The diagram shows an upward trend, which indicates that the Fuxi instances of the job are gradually scheduled.

      Job status-Insufficient resources
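
The difference between the two diagrams can also be expressed as a number. The following Python sketch is only a rough illustration and is not a Logview feature. It assumes that you have copied the start time of each Fuxi instance (in seconds after the job started) from Logview by hand. The function names and the 20% threshold are hypothetical choices that you should adjust for your own jobs.

```python
def start_spread_ratio(start_offsets_s, job_duration_s):
    """Return the fraction of the job duration over which instances started.

    start_offsets_s: start time of each Fuxi instance, in seconds after the
    job started (assumed to be collected manually from Logview).
    """
    if not start_offsets_s or job_duration_s <= 0:
        raise ValueError("need at least one start offset and a positive duration")
    return (max(start_offsets_s) - min(start_offsets_s)) / job_duration_s


def looks_resource_starved(start_offsets_s, job_duration_s, threshold=0.2):
    """Guess whether instances were scheduled gradually.

    The 0.2 threshold is a hypothetical rule of thumb, not a MaxCompute value.
    """
    return start_spread_ratio(start_offsets_s, job_duration_s) > threshold


# Instances that start within 2 seconds of each other in a 10-minute job.
print(looks_resource_starved([0, 1, 2, 1], 600))        # False
# Instances that start in batches every 2 minutes in the same job.
print(looks_resource_starved([0, 120, 240, 360], 600))  # True
```

A small ratio corresponds to the flat diagram (all instances start together), and a large ratio corresponds to the upward trend (instances wait for CUs).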
Causes
To locate the causes of the preceding issues, perform the following steps:
  1. Go to MaxCompute Management.
  2. In the left-side navigation pane, click Quotas.
  3. In the Subscription Quota Groups section, click the quota group that corresponds to the MaxCompute project.
  4. In the Usage Trend of Reserved CUs chart on the Resource Consumption tab, click the point at which the CU usage is the highest and record the point in time.
  5. In the left-side navigation pane, click Jobs. On the right part of the page, click the Job Management tab.
  6. On the Job Management tab, configure Time Range based on the point in time that you recorded, select Running from the Job Status drop-down list, and then click OK.
  7. In the job list, click the Descending icon next to CPU Utilization (%) to sort jobs by CPU utilization in descending order.
    • If the CPU utilization of a job is excessively high, click Logview in the Actions column and view I/O Bytes in the Fuxi Instance section. If I/O Bytes is only 1 MB or tens of KB and multiple Fuxi instances are running in the job, a large number of small files are generated when the job is run. In this case, you need to merge the small files or adjust the parallelism.
    • If the values of CPU Utilization (%) are almost the same, multiple large jobs are submitted at the same time and the jobs consume all CUs. In this case, you must purchase additional CUs or use pay-as-you-go resources to run jobs.
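
The small-file check in step 7 can be sketched in Python as follows. This is an illustration under stated assumptions, not part of MaxCompute: the per-instance I/O Bytes values are assumed to be copied from the Fuxi Instance section by hand, the function name is hypothetical, and the 1 MB limit simply mirrors the figure mentioned above.

```python
from statistics import median

ONE_MB = 1024 * 1024


def suggests_small_files(io_bytes_per_instance, max_bytes=ONE_MB, min_instances=2):
    """Return True if many instances each handle only a small amount of data.

    io_bytes_per_instance: I/O Bytes of each Fuxi instance, in bytes.
    max_bytes and min_instances are hypothetical thresholds.
    """
    if len(io_bytes_per_instance) < min_instances:
        return False
    # The median is used so that a few large instances do not hide the pattern.
    return median(io_bytes_per_instance) <= max_bytes


# 500 instances that each process about 40 KB: likely a small-file problem.
print(suggests_small_files([40 * 1024] * 500))    # True
# 500 instances that each process 256 MB: normal split sizes.
print(suggests_small_files([256 * ONE_MB] * 500)) # False
```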
Solutions
  • Merge small files.
  • Adjust the parallelism.

    The parallelism of MaxCompute jobs is automatically estimated based on the amount of input data and the job complexity. In most cases, you do not need to manually adjust the parallelism. A higher parallelism can increase the job processing speed, but it can also fully occupy the subscription resource group. In this case, jobs are queued to wait for resources and therefore run slowly. You can configure the odps.stage.mapper.split.size, odps.stage.reducer.num, odps.stage.joiner.num, or odps.stage.num parameter to adjust the parallelism. For more information, see SET operations.

  • Purchase CUs.

    For more information about how to purchase CUs, see Upgrade resource configurations.

  • Use pay-as-you-go resources.

    Purchase pay-as-you-go resources and use MaxCompute Management to allow subscription projects to use the pay-as-you-go resources.
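
To see why a higher parallelism does not always make a job faster, consider the following Python sketch of the trade-off described under "Adjust the parallelism". It is a simplified model, not the MaxCompute scheduler. It assumes that the number of mapper instances is roughly the input size divided by the split size, that the split size is given in MB with a default of 256 (verify the unit and default in SET operations), and that each instance occupies one CU. The function names are hypothetical.

```python
import math


def estimate_mapper_instances(input_mb, split_size_mb=256):
    """Roughly estimate the number of mapper instances for a given split size.

    Assumption: one instance per split; 256 MB is assumed to be the default.
    """
    if input_mb < 0 or split_size_mb <= 0:
        raise ValueError("input_mb must be >= 0 and split_size_mb must be > 0")
    return max(1, math.ceil(input_mb / split_size_mb))


def scheduling_waves(instances, available_cus, cus_per_instance=1):
    """Return how many rounds are needed to run all instances.

    Assumption: each instance holds cus_per_instance CUs while it runs.
    """
    slots = available_cus // cus_per_instance
    if slots <= 0:
        raise ValueError("not enough CUs to run a single instance")
    return math.ceil(instances / slots)


input_mb = 100 * 1024  # 100 GB of input data
for split in (256, 64):
    n = estimate_mapper_instances(input_mb, split)
    print(split, n, scheduling_waves(n, available_cus=100))
# 256 400 4
# 64 1600 16
```

In this model, reducing the split size from 256 to 64 quadruples the number of instances. With only 100 CUs available, the instances run in 16 rounds instead of 4, so the extra parallelism is spent waiting for resources.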

Data skew

Problem description

Some Fuxi instances in a Fuxi task continue to run after most Fuxi instances of the task have stopped. As a result, long tails occur.

In the Fuxi Instance section of the Job Details tab of Logview, you can click Long-Tails to view the Fuxi instances that have a long tail.

Long tail

Cause

The Fuxi instances that continue to run process large amounts of data or are dedicated to special data.
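
As an illustration of what a long tail looks like in numbers, the following Python sketch flags instances whose running time is far above the average of their Fuxi task. It is an assumption-based example: the durations are assumed to be copied from Logview by hand, the function name is hypothetical, and the "more than twice the average" rule is a hypothetical threshold that may differ from the rule that the Long-Tails view applies.

```python
from statistics import mean


def find_long_tails(durations_s, factor=2.0):
    """Return (index, duration) pairs for instances slower than factor x mean.

    durations_s: running time of each Fuxi instance in one Fuxi task, in seconds.
    factor: hypothetical threshold; 2.0 means "more than twice the average".
    """
    if not durations_s:
        return []
    avg = mean(durations_s)
    return [(i, d) for i, d in enumerate(durations_s) if d > factor * avg]


# Four instances finish in about 30 seconds and one runs for 400 seconds.
print(find_long_tails([30, 32, 29, 31, 400]))  # [(4, 400)]
# Evenly loaded instances produce no long tail.
print(find_long_tails([30, 31, 29]))           # []
```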

Solution

For more information about how to resolve data skew, see Reduce impacts of data skew.

Inefficient code logic

Problem description
If the code logic is inefficient, the following issues may occur after you submit a job:
  • Issue 1: Data expansion occurs. The amount of output data of a Fuxi task is significantly greater than the amount of input data.

    You can view I/O Record and I/O Bytes in the Fuxi Task section to check the amounts of input and output data of a Fuxi task. In the following figure, 1 GB of data expands to 1 TB after the data is processed. A single Fuxi instance then has to process 1 TB of data, which significantly reduces data processing efficiency.

    Amounts of input and output data
  • Issue 2: The UDF execution efficiency is low.
    A Fuxi task that contains UDFs runs slowly. If a UDF times out, the following error is returned: "Fuxi job failed - WorkerRestart errCode:252,errMsg:kInstanceMonitorTimeout, usually caused by bad udf performance". You can perform the following steps to view the location and execution speed of the UDF:
    1. Obtain the Logview URL in the job result and open the URL in a browser.
    2. In the progress chart, double-click the Fuxi task that runs slowly or fails to run. In the operator graph, view the location of the UDF, as shown in the following figure.

       View the location of the UDF
    3. In the Fuxi Instance section, click StdOut to view the execution speed of the UDF.

      In normal cases, the value of Speed(records/s) is in the range of hundreds of thousands to millions of records per second. A much lower value indicates that the UDF runs slowly.

      StdOut
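
Both issues above can be checked with simple arithmetic. The following Python sketch is an illustration, not a MaxCompute API: the input values are assumed to be read manually from I/O Bytes in the Fuxi Task section and from Speed(records/s) in StdOut, the function names are hypothetical, and the thresholds (a 10x expansion ratio and a floor of 100,000 records per second) are hypothetical rules of thumb derived from the figures in this topic.

```python
GB = 1024 ** 3
TB = 1024 ** 4


def expansion_ratio(input_bytes, output_bytes):
    """Return how many times larger the output of a Fuxi task is than its input."""
    if input_bytes <= 0:
        raise ValueError("input_bytes must be positive")
    return output_bytes / input_bytes


def is_data_expansion(input_bytes, output_bytes, threshold=10.0):
    """Flag a Fuxi task whose output is at least threshold times its input.

    The 10x threshold is a hypothetical choice.
    """
    return expansion_ratio(input_bytes, output_bytes) >= threshold


def udf_speed_is_low(records_per_second, floor=100_000):
    """Flag a UDF that processes fewer records per second than the floor.

    The floor is a hypothetical value based on the normal range in this topic.
    """
    return records_per_second < floor


# Issue 1: 1 GB of input becomes 1 TB of output.
print(expansion_ratio(GB, TB))      # 1024.0
print(is_data_expansion(GB, TB))    # True
# Issue 2: a UDF that processes only 350 records per second.
print(udf_speed_is_low(350))        # True
print(udf_speed_is_low(1_200_000))  # False
```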
Causes
  • Issue 1: The business processing logic causes data expansion. In this case, check whether the business logic meets your business requirements.
  • Issue 2: The UDF code logic is inefficient. In this case, optimize the code logic.
Solutions
  • Issue 1: Check whether the business logic has a defect. If the logic has a defect, modify the code. If the logic does not have a defect, configure the odps.stage.mapper.split.size, odps.stage.reducer.num, odps.stage.joiner.num, or odps.stage.num parameter to adjust the parallelism. For more information, see SET operations.
  • Issue 2: Check and modify the UDF code logic. We recommend that you use built-in functions where possible and use UDFs only if built-in functions cannot meet your business requirements. For more information about built-in functions, see Built-in functions.