This topic describes common scenarios where you can optimize SQL statements to achieve better performance.

Reduce impacts of data skew

Data skew causes a severe imbalance of work: when data is skewed, some workers must process much more data than others and therefore take much longer to complete their tasks. This prolongs the overall processing time and may lead to high latency.
  • Skewed joins
    An imbalance of work may occur when you join tables based on a key that is not evenly distributed. For example, execute the following statement to join a large table named A and a small table named B:
    select * from A join B on A.value = B.value;
    Copy the Logview URL of the query and open it in a browser to go to the Logview page. Double-click the Job Scheduler job that performs the JOIN operation. On the Long-tails tab, you can check whether long tails exist. If long tails exist, data is skewed.
    To optimize the preceding statement, you can use one of the following methods:
    • Use the MAPJOIN hint. Table B is a small table that does not exceed 512 MB in size. In this case, you can replace the preceding statement with the following statement:
      select /*+ MAPJOIN(B) */ * from A join B on A.value = B.value;
    • Handle skewed keys separately. If data skew occurs because both tables contain a large number of null key values, filter out the null values or replace them with random numbers before you perform the JOIN operation, so that the null keys are no longer assigned to a single worker. For example, you can replace the preceding statement with the following statement:
      select * from A join B
      on case when A.value is null then concat('value', rand()) else A.value end = B.value;
      The following example describes how to identify the key values that cause data skew:
      -- Data skew leads to an imbalance of work when the following statement is executed:
      select * from a join b on a.key=b.key;  
      -- Execute the following statement to view the distribution of key values and identify the key values that cause data skew:
      select left.key, left.cnt * right.cnt from 
      (select key, count(*) as cnt from a group by key) left 
      join
      (select key, count(*) as cnt from b group by key) right
      on left.key=right.key;
  • Skewed GROUP BY operations

    An imbalance of work may occur when you perform a GROUP BY operation based on a key that is not evenly distributed.

    Assume that table A has two fields: Key and Value. The table contains a large amount of data, and the values of the Key field are not evenly distributed. Execute the following statement to perform a GROUP BY operation on the table:
    select key,count(value) from A group by key;

    If the table contains a large enough amount of data, you may find long tails on the Logview page of the query. To resolve this issue, add set odps.sql.groupby.skewindata=true; before the query to enable the anti-skew optimization, as shown in the example after this list.

  • Skewed reduce tasks caused by improper use of dynamic partitioning
    If you use dynamic partitioning in MaxCompute, one or more reduce tasks are assigned to each partition to aggregate data by partition. This brings the following benefits:
    • Reduces the number of small files generated by MaxCompute and improves processing efficiency.
    • Avoids high memory usage when a worker needs to write many files to a partition.
    If the data to be written to the partitions is skewed, long tails may occur during the reduce stage, because each partition can be assigned a maximum of 10 reduce tasks: a partition that receives much more data than the other partitions keeps its workers busy for much longer. If you already know the partition to which the data is to be written, we recommend that you do not use dynamic partitioning. For example, long tails may occur if you execute the following statement to write data from a specific partition of a table to another table:
    insert overwrite table A2 partition(dt) select split_part(value,'\t',1) as field1, split_part(value,'\t',2) as field2, dt from A where dt='20151010';
    In this case, you can replace the preceding statement with the following statement:
    insert overwrite table A2 partition(dt='20151010') 
    select
    split_part(value,'\t',1) as field1,
    split_part(value,'\t',2) as field2
    from A 
    where dt='20151010';
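
For the skewed GROUP BY scenario described in the list above, the following minimal sketch shows how the flag is submitted together with the query. Table A with the Key and Value fields is assumed from that example:
set odps.sql.groupby.skewindata=true;
select key, count(value) from A group by key;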

For more information about how to reduce impacts of data skew, see Optimize long tail computing.

Optimize window functions

If window functions are used in an SQL statement, a reduce task is assigned to each window function, so a large number of window functions consumes a large amount of resources. You can optimize window functions that meet both of the following conditions:
  • The window functions use the same OVER clause, which defines how rows in the table are partitioned and sorted.
  • The window functions are executed at the same level of nesting in the SQL statement.
Window functions that meet the preceding conditions are merged and executed by a single reduce task. The following SQL statement provides an example:
select
rank() over (partition by A order by B desc) as rank,
row_number() over (partition by A order by B desc) as row_num
from MyTable;
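
For contrast, the following minimal sketch shows window functions that do not meet the conditions: the OVER clauses differ, so the functions are not merged and each is executed by its own reduce task. The columns C and D are hypothetical and stand for a different partition and sort key:
select
rank() over (partition by A order by B desc) as rank,
row_number() over (partition by C order by D desc) as row_num
from MyTable;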

Optimize subqueries

The following statement contains a subquery:
SELECT * FROM table_a a WHERE a.col1 IN (SELECT col1 FROM table_b b WHERE xxx);
If the subquery on the table_b table returns more than 1,000 values from the col1 column, the system reports the following error: records returned from subquery exceeded limit of 1000. In this case, you can replace the preceding statement with the following statement:
SELECT a.* FROM table_a a JOIN (SELECT DISTINCT col1 FROM table_b b WHERE xxx) c ON (a.col1 = c.col1);
Note
  • If the DISTINCT keyword is not used, the subquery result table c may contain duplicate values in the col1 column. In this case, the query on table_a returns more rows than the original IN query.
  • If the DISTINCT keyword is used, the subquery is executed by a single worker. If the subquery involves a large amount of data, the whole query slows down.
  • If you are sure that the col1 values that meet the subquery conditions are unique, you can remove the DISTINCT keyword to improve query performance, as shown in the example after this note.
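
If you remove the DISTINCT keyword as described in the last item of the note, the statement becomes the following minimal sketch. It reuses the tables and the placeholder condition xxx from the examples above and returns correct results only when the col1 values produced by the subquery are unique:
SELECT a.* FROM table_a a JOIN (SELECT col1 FROM table_b b WHERE xxx) c ON (a.col1 = c.col1);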

Optimize joins

When you join two tables, we recommend that you use the WHERE clause based on the following rules:
  • Specify the partition limits of the primary table in the WHERE clause. We recommend that you define a subquery for the primary table to obtain the required data first.
  • Write the WHERE clause of the primary table at the end of the statement.
  • Specify the partition limits of the secondary table in the ON clause or a subquery.
Examples:
select * from A join (select * from B where dt=20150301) B on B.id=A.id where A.dt=20150301; -- Recommended.
select * from A join B on B.id=A.id where B.dt=20150301; -- We recommend that you do not use this statement. The system performs the JOIN operation before it performs partition pruning. This can result in a large amount of intermediate data and deteriorate query performance.
select * from (select * from A where dt=20150301) A join (select * from B where dt=20150301) B on B.id=A.id; -- Recommended.