This topic describes common scenarios where you can optimize SQL statements to achieve better performance.
Reduce impacts of data skew
- Skewed joins
An imbalance of work may occur when you join tables based on a key that is not evenly distributed. For example, execute the following statement to join a large table named A and a small table named B:
select * from A join B on A.value= B.value;
Copy the Logview URL of the query and open it in a browser to go to the Logview page. Double-click the Job Scheduler job that performs the JOIN operation. On the Long-tails tab, you can see that long tails exist, as shown in the following figure. This indicates that data is skewed.To optimize the preceding statement, you can use one of the following methods:- Use a MAPJOIN statement. Table B is a small table which does not exceed 512 MB in
size. In this case, you can replace the preceding statement with the following statement:
select /*+ MAPJOIN(B) */ * from A join B on A.value= B.value;
- Handle the skewed key separately. If data skew occurs because a large number of null
key values exist in both tables, you must filter out these null values or generate
random numbers to replace them before you perform the JOIN operation. For example,
you can replace the preceding statement with the following statement:
select * from A join B on case when A.value is null then concat('value',rand() ) else A.value end = B.value;
The following example describes how to identify the key values that cause data skew:-- Data skew leads to an imbalance of work when the following statement is executed: select * from a join b on a.key=b.key; -- Execute the following statement to view the distribution of key values and identify the key values that cause data skew: select left.key, left.cnt * right.cnt from (select key, count(*) as cnt from a group by key) left join (select key, count(*) as cnt from b group by key) right on left.key=right.key;
- Use a MAPJOIN statement. Table B is a small table which does not exceed 512 MB in
size. In this case, you can replace the preceding statement with the following statement:
- Skewed GROUP BY operations
An imbalance of work may occur when you perform a GROUP BY operation based on a key that is not evenly distributed.
Assume that Table A has two fields, which are Key and Value. The table contains a large amount of data and the values of the Key field are not evenly distributed. Execute the following statement to perform a GROUP BY operation on the table:select key,count(value) from A group by key;
When the amount of the data in the table is large enough, you may find long trails on the Logview page of the query. To resolve the issue, add
set odps.sql.groupby.skewindata=true
before the preceding statement to enable anti-skew before the query is performed. - Skewed reduce tasks during improper use of dynamic partitioning
If you use dynamic partitioning in MaxCompute, one or more reduce tasks are assigned to each partition to aggregate data by partition. This brings the following benefits:
- Reduce the number of small files generated by MaxCompute and improve the efficiency of processing.
- Avoid high memory usage when a worker needs to write many files to a partition.
If the data to be written to partitions is skewed, long tails may occur during the reduce stage. Each partition can be assigned a maximum of 10 map tasks. If a larger amount of data is to be written to a partition than the other partitions, long tails may occur. If you can determine the partition to which data is to be written, we recommend that you do not use dynamic partitioning. For example, long tails may occur if you execute the following statement to write data from a specific partition in a table to another table:insert overwrite table A2 partition(dt) select split_part(value,'\t',1) as field1, split_part(value,'\t',2) as field2, dt from A where dt='20151010';
In this case, you can replace the preceding statement with the following statement:insert overwrite table A2 partition(dt='20151010') select split_part(value,'\t',1) as field1, split_part(value,'\t',2) as field2 from A where dt='20151010';
For more information about how to reduce impacts of data skew, see Optimize long tail computing.
Optimize window functions
- The OVER clause which defines how to partition and sort rows in a table must be the same.
- Multiple window functions must be executed at the same level of nesting in an SQL statement.
select
rank()over(partition by A order by B desc) as rank,
row_number()over(partition by A order by B desc) as row_num
from MyTable;
Optimize subqueries
SELECT * FROM table_a a WHERE a.col1 IN (SELECT col1 FROM table_b b WHERE xxx);
records returned from subquery exceeded limit of 1000
. In this case, you can replace the preceding statement with the following statement:
SELECT a. * FROM table_a a JOIN (SELECT DISTINCT col1 FROM table_b b WHERE xxx) c ON (a.col1 = c.col1)
- If the DISTINCT keyword is not used, the subquery result table c may contain duplicate values in the col1 column. In this case, the query on the a table returns more results.
- If the DISTINCT keyword is used, only one worker is assigned to perform the subquery. If the subquery involves a large amount of data, the whole query slows down.
- If you are sure that the values that meet the subquery conditions in the col1 column are unique, you can delete the DISTINCT keyword to improve the query performance.
Optimize joins
- Specify the partition limits of the primary table in the WHERE clause. We recommend that you define a subquery for the primary table to obtain the required data first.
- Write the WHERE clause of the primary table at the end of the statement.
- Specify the partition limits of the secondary table in the ON clause or a subquery.
select * from A join (select * from B where dt=20150301)B on B.id=A.id where A.dt=20150301;
select * from A join B on B.id=A.id where B.dt=20150301; -- We recommend that you do not use this statement. The system performs the JOIN operation before it performs partition pruning. This can result in a large amount of data and deteriorate the query performance.
select * from (select * from A where dt=20150301)A join (select * from B where dt=20150301)B on B.id=A.id;