SkewJoin Hint的使用方法和实现原理 - MaxCompute

If two tables that you want to join contain hot key values, a long tail issue may occur. You can extract hot key values from the two tables, separately calculate the join result of hot key values and the join result of non-hot key values, and then join the calculated data. In this case, SKEWJOIN HINT can be used to automatically or manually extract hot key values from the two tables, separately calculate the join results of the hot key values and non-hot key values, and then join the calculated data. This way, the JOIN operation is accelerated.

Usage

SKEWJOIN HINT can be performed only after you add the skew join hint /*+ skewJoin(<table_name>[(<column1_name>[,<column2_name>,...])][((<value11>,<value12>)[,(<value21>,<value22>)...])]*/ to the SELECT statement. In this hint, table_name indicates the name of a skewed table, column_name indicates the name of a skewed column, and value indicates a skewed key value.

-- Method 1: Include the alias of the table in SKEWJOIN HINT. 
select /*+ skewjoin(a) */ * from T0 a join T1 b on a.c0 = b.c0 and a.c1 = b.c1;
-- Method 2: Add a hint to specify the name of the table and the names of possibly skewed columns. For example, the following statement shows that Columns c0 and c1 in Table a are skewed columns. 
select /*+ skewjoin(a(c0, c1)) */ * from T0 a join T1 b on a.c0 = b.c0 and a.c1 = b.c1 and a.c2 = b.c2;
-- Method 3: Add a hint to specify the name of table and the names of columns and provide the skewed key values. If skewed key values are of the STRING type, enclose each value with double quotation marks ("). In the following statement, (a.c0=1 and a.c1="2") and (a.c0=3 and a.c1="4") contain skewed key values. 
select /*+ skewjoin(a(c0, c1)((1, "2"), (3, "4"))) */ * from T0 a join T1 b on a.c0 = b.c0 and a.c1 = b.c1 and a.c2 = b.c2;

Note

Method 3 is more efficient than Method 1 and Method 2.

Implementation

Hot key values are the key values that appear multiple times in a table. In the following figure, the red part has 10,000 records of a.c0=1 and a.c1=2 and 9,000 records of a.c0=3 and a.c1 = 4. 热值key

If SKEWJOIN HINT is not used and the T0 and T1 tables contain large amounts of data, only the MERGE JOIN statement can be executed on the two tables. In this case, the same hot key values are shuffled to a single node. As a result, data skew occurs. After SKEWJOIN HINT is used, the optimizer runs an Aggregate to dynamically obtain top 20 hot key values based on the number of their records. The optimizer separately extracts a hot key value (data A) and a non-hot key value (data B) from the T0 table, and separately extracts a value (data C) that can be joined with data A and a value (data D) that cannot be joined with data A from the T1 table. Then, the MAP JOIN statement is executed on data A and data C (with a small data volume) and the MERGE JOIN statement is executed on data B and data D. The UNION statement is executed on the results of the MAP JOIN and MERGE JOIN statements to generate the final result, as shown in the following figure. SkewJoin Hint

Precautions

When you use SKEWJOIN HINT for JOIN statements, take note of the following limits:
- INNER JOIN: Skew join hints can be specified for the left or right table in the INNER JOIN statement.
- LEFT JOIN, SEMI JOIN, and ANTI JOIN: Skew join hints can be specified only for the left table.
- RIGHT JOIN: Skew join hints can be specified only for the right table.
- FULL JOIN: Skew join hints are not supported.
An Aggregate is run after a skew join hint is added, which slows down the join operation. Therefore, we recommend that you add a skew join hint only to the JOIN statements that cause data skew. If a skew join hint is added to a query of A JOIN B for Table A, a physical execution plan similar to MapJoin union all MergeJoin is forcefully generated. The expanded MAPJOIN subplan is similar to: Top 20(A) MapJoin (B Semi Join Top20(A)). If Table B still contains a large amount of data after data is filtered by using Top20(A), an out of memory (OOM) error may occur when you use MAPJOIN to create a hash join and a hash table.
The data type of Left Side Join Key must be the same as that of Right Side Join Key for the JOIN statement to which SKEWJOIN HINT is added. If the data types are different, SKEWJOIN HINT becomes ineffective. In the preceding example, the data types of a.c0 and b.c0 must be the same and the data types of a.c1 and b.c1 must be the same. To ensure data type consistency, you can use the CAST function to convert the join keys in subqueries. The following code shows an example:
```
create table T0(c0 int, c1 int, c2 int, c3 int);
create table T1(c0 string, c1 int, c2 int);
-- Method 1:
select /*+ skewjoin(a) */ * from T0 a join T1 b on cast(a.c0 as string) = cast(b.c0 as string) and a.c1 = b.c1;
-- Method 2:
select /*+ skewjoin(b) */ * from (select cast(a.c0 as string) as c00 from T0 a) b join T1 c on b.c00 = c.c0;
```
After a skew join hint is added, the optimizer runs an Aggregate to obtain the first N hot key values. You can run the set odps.optimizer.skew.join.topk.num = xx; command to specify the number of hot key values that the optimizer can obtain.
Skew join hints allow you to add a hint only for the left or right table involved in a JOIN statement.
In the JOIN statement to which SKEWJOIN HINT is added, left key = right key must be included. SKEWJOIN HINT cannot be added to the CARTESIAN JOIN statement.
The following statement shows how to use SKEWJOIN HINT with other hints. Note that SKEWJOIN HINT cannot be added to the JOIN statement to which MAPJOIN HINT is added.
```
select /*+ mapjoin(c), skewjoin(a) */ * from T0 a join T1 b on a.c0 = b.c3 join T2 c on a.c0 = c.c7;
```
On the Json Summary tab of LogView, you can search for the topk_agg field. If such a field exists, as shown in the following figure, SKEWJOIN HINT has taken effect.