Long tails may occur in JOIN operations and other computing jobs. This topic describes the scenarios in which long tails occur and the solutions.

Long tails are one of the common issues in distributed computing. The main cause of a long tail is uneven data distribution. As a result, the workloads of individual nodes differ. The entire job can be completed only after the slowest node processes all its data.

To prevent one worker from running a large number of jobs, the jobs must be distributed to multiple workers.

GROUP BY long tail

Causes

The computing workloads for the key of a GROUP BY clause are heavy.

Solution

You can use one of the following methods to handle this issue:
  • Rewrite the SQL statement and add random numbers to split the key. Example:
    SELECT Key,COUNT(*) AS Cnt FROM TableName GROUP BY Key;

    Regardless of combiners, a mapper shuffles data to a reducer, and the reducer performs the count operation. The execution plan is in the following sequence: Mapper > Reducer. However, if the jobs of the long-tailed key are distributed again, use the following statement:

    -- Assume that the long-tailed key is KEY001.
    SELECT a.Key
      , SUM(a.Cnt) AS Cnt
    FROM (
      SELECT Key
        , COUNT(*) AS Cnt
    FROM TableName
    GROUP BY Key, 
      CASE 
        WHEN Key = 'KEY001' THEN Hash(Random()) % 50
        ELSE 0
       END
    ) a
    GROUP BY a.Key;

    The execution plan for this statement is in the following sequence: Mapper > Reducer > Reducer. Although more steps are required for the execution, the jobs of the long-tailed key are processed in two steps, and the time required may be shorter.

    Note If you use this method to add a reducer execution step to handle a long tail that has slight impacts on your jobs, the time required may be longer.
  • Specify system parameters.
    set odps.sql.groupby.skewindata=true 

    This configuration is used for general optimization instead of business-specific optimization. Therefore, the optimization effect may not be optimal. You can rewrite SQL statements in a more efficient way based on your data.

DISTINCT long tail

If a long tail occurs for the DISTINCT keyword, the key splitting method does not apply. In this case, you must seek for other methods.

Solution

-- The original SQL statement, regardless of the case where uid is not specified.
SELECT COUNT(uid) AS Pv
    , COUNT(DISTINCT uid) AS Uv
FROM UserLog;
The preceding statement can be rewritten into the following statement:
SELECT SUM(PV) AS Pv
    , COUNT(*) AS UV
FROM (
    SELECT COUNT(*) AS Pv
      , uid
    FROM UserLog
    GROUP BY uid
) a;

This method is to change DISTINCT to COUNT. This way, the computing workloads are distributed to different reducers. After you rewrite the statement, you can use the optimization method for GROUP BY, and the combiner is involved in the computation. This greatly improves the performance.

Long tail of dynamic partitions

Causes

  • To sort the data of small files, the dynamic partition feature starts a reducer at the final stage of execution. If data written by using the dynamic partition feature is skewed, a long tail occurs.
  • In general, the incorrect use of the dynamic partition feature causes long tails.

Solution

If you are sure about the partition to which data is written, you can specify the partition before you insert the data instead of using dynamic partitions.

Use combiners to handle long tails

Combiners are frequently used to handle long tails in MapReduce jobs. Combiners can be used to reduce the amount of data that needs to be shuffled from mappers to reducers. This greatly reduces the overheads of network transmission. This optimization is automatically implemented in MaxCompute SQL.

Note Combiners only optimize execution in the map stages. Make sure that the results of the execution during which combiners are used are the same as those of the execution during which combiners are not used. WordCount is used in this example. The result of passing (KEY,1) twice is the same as that of passing (KEY,2) once. For more information, see WordCount. However, when you calculate the average value, you cannot use a combiner to directly combine (KEY,1) and (KEY,2) to obtain (KEY,1.5).

Optimize the system to handle long tails

In addition to combiners, MaxCompute is also optimized. For example, the following logs (+N backups) are generated during the running of a job.

M1_Stg1_job0:0/521/521[100%] M2_Stg1_job0:0/1/1[100%] J9_1_2_Stg5_job0:0/523/523[100%] J3_1_2_Stg1_job0:0/523/523[100%] R6_3_9_Stg2_job0:1/1046/1047[100%] 
M1_Stg1_job0:0/521/521[100%] M2_Stg1_job0:0/1/1[100%] J9_1_2_Stg5_job0:0/523/523[100%] J3_1_2_Stg1_job0:0/523/523[100%] R6_3_9_Stg2_job0:1/1046/1047[100%] 
M1_Stg1_job0:0/521/521[100%] M2_Stg1_job0:0/1/1[100%] J9_1_2_Stg5_job0:0/523/523[100%] J3_1_2_Stg1_job0:0/523/523[100%] R6_3_9_Stg2_job0:1/1046/1047(+1 backups)[100%] 
M1_Stg1_job0:0/521/521[100%] M2_Stg1_job0:0/1/1[100%] J9_1_2_Stg5_job0:0/523/523[100%] J3_1_2_Stg1_job0:0/523/523[100%] R6_3_9_Stg2_job0:1/1046/1047(+1 backups)[100%]

A total of 1,047 reducers are used. Among these reducers, 1,046 reducers have completed their calculations, but the last one has not. After MaxCompute detects this issue, it automatically starts a new reducer, calculates the same data, and then aggregates the results of the reducer that completed the calculation earlier to the final result set.

Optimize business logic to handle long tails

The aforementioned optimization methods cannot handle all the long tails. In some cases, you must optimize your business logic to handle long tails.

  • A large amount of noisy data may exist in calculations. For example, you need to calculate the data based on visitor IDs to check the access records of each user. In this case, you must filter out crawler data. Otherwise, a long tail may occur due to the crawler data during calculation. It is increasingly difficult to identify crawler data. Similarly, if you want to use the xxid field for associations, you must check whether the associated field is empty.
  • Long tails may occur in some special business scenarios. For example, the operation records of independent software vendors (ISVs) are greatly different from those of individuals in terms of the amount of data and behavior. In this case, you must use specific analysis methods to handle the issues of important customers.
  • If data is unevenly distributed, we recommend that you do not use constants as the key of DISTRIBUTE BY to sort all the data records.