• Where condition in Join statement
    When you join two tables, the Where condition of the primary table can be written at the end of the statement, but the restriction condition of the partition in the secondary table cannot be written in the Where condition. We recommend that you write it in the ON condition or subquery. The partition restrictions of the primary table can be written in the Where condition (it is better to filter by subquery first).  Several SQL examples are as follows:
    select * from A join (select * from B where dt=20150301)B on B.id=A.id where A.dt=20150301; 
    select * from A join B on B.id=A.id where B.dt=20150301; --Not allowed. 
    select * from (select * from A where dt=20150301)A join (select * from B where dt=20150301)B on B.id=A.id;

    The Join operation in the second statement runs first, data volume becomes larger and the performance can be decreased. Therefore, the second statement must be avoided.

  • Data skew

    The root cause of data skew is that the amount of data processed by some Workers is much larger than that of other Workers. This means running hours of some Workers are higher than the average, which leads to job delay.

    .

    • Data skew caused by Join
      Data can be skewed by a Join operation when the Join key distribution is uneven. For the preceding example, to join a large table A and a small table B, run the following statement: For the preceding example, to join a large table A and a small table B, run the following statement:
      select * from A join B on A.value= B.value;
      Copy the logview link to enter the web console page, and double-click the Fuxi job that runs the Join operation. You can see a long tail in the Long-Tails tab, which indicates that the data has been skewed, as shown in the following figure: 

      You can optimize the statement by the following methods:
      • Since table B is a small table and does not exceed 512 MB, you can optimize the preceding statement into mapjoin statement.
        select /*+ MAPJOIN(B) */ * from A join B on A.value= B.value;
      • Handle the skewed key with a separate logic. For example, a large number of null key values in both tables will usually cause data skew.  It is necessary to filter out the null data or add a random number before the Join operation, for example:
        select * from A join B
        on case when A.value is null then concat('value',rand() ) else A.value end = B.value;
    If you know that the data is skewed, but you cannot work out what is causing it, a general solution can be used to test the data skew. See the following example:
    
    select * from a join b on a.key=b.key; --This Leads to data skew. 
    Now you can run the following statements:  
    ```sql
    select left.key, left.cnt * right.cnt from 
    (select key, count(*) as cnt from a group by key) left 
    join
    (select key, count(*) as cnt from b group by key) right
    on left.key=right.key;

    Check the distribution of keys to discover whether data skew happens when A joins B.

  • Group by skew

    Group by skewing can be caused when the key distribution of group by is uneven.

    Suppose a table A has two fields: key and value. The data volume in the table is large enough, and the value distribution of key is uneven. Run the following statement:
    select key,count(value) from A group by key;

    You can see the long tail on the web console page. To solve this problem, you must set the anti-skew parameters before running SQL statement set odps.sql.groupby.skewindata=true must be added into the SQL statement.

  • Data skew caused by incorrect use of dynamic partitions
    Dynamic partitions of SQL in MaxCompute add a Reduce function by default, which is used to merge the same partition data. The benefits are as following.
    • Reduce small files generated by MaxCompute and improve the efficiency of processing.
    • Reduce the memory occupied when a Worker outputs many files.
    When partition data is skewed, using the Reduce function lead to the appearance of long tails. The same data can only be processed by a maximum of 10 Workers, so large volume of data results in long tails, for example:
    
    insert overwrite table A2 partition(dt) 
    select
    split_part(value,'\t',1) as field1,
    split_part(value,'\t',2) as field2, 
    dt 
    from A 
    where dt='20151010';
    In this case, we recommend that you do not use dynamic partition, and modify the statement in the following way:
    
    insert overwrite table A2 partition(dt='20151010') 
    select
    split_part(value,'\t',1) as field1,
    split_part(value,'\t',2) as field2
    from A 
    where dt='20151010';
  • Window function optimization
    If you use window functions in your SQL statement, each window function typically forms a Reduce job. If window functions are too many, they consume resources. In some specific scenarios, you can optimize window functions.
    • The content after the over keyword must be the same, with the similar grouping and sorting conditions.
    • Multiple window functions must run on the same SQL layer.
    Window functions that meet these two conditions merge into Reduce implementation. An SQL example is as follows:
    
    select
    rank()over(partition by A order by B desc) as rank,
    row_number()over(partition by A order by B desc) as row_num
    from MyTable;
  • Convert the subquery to Join
    A subquery is shown as follows:
    SELECT * FROM table_a a WHERE a.col1 IN (SELECT col1 FROM table_b b WHERE xxx);
    If the number of col1 returned by the table_b subquery in this statement exceeds 1,000, the system reports an error: rrecords returned from subquery exceeded limit of 1,000. In this case, you can use the Join statement instead:
    SELECT a. * FROM table_a a JOIN (SELECT DISTINCT col1 FROM table_b b WHERE xxx) c ON (a.col1 = c.col1)
    Note
    • If there is no Distinct keyword in the statement, and the result of the subquery c returns the same col1 value, it may cause the larger number of results of table_a.
    • The Distinct subquery can lead the whole query to fall into one Worker. If the subquery data is large, it may cause the whole query to be slower.If you have already made sure the col1 values are distinct in the subquery from the business, for example, by querying the primary key field, then performance can only be improved by removing the Distinct keyword.
    • If you have already made sure the col1 values are distinct in the subquery from the business, for example, querying by the primary key field, to improve performance the Distinct keyword can only be removed.