All Products
Search
Document Center

MaxCompute:SKEWJOIN HINT

Last Updated:Mar 26, 2026

When a JOIN query processes two large tables that share hot key values — values that appear thousands of times — the optimizer shuffles all matching rows to a single node. That node becomes a bottleneck while others sit idle, a pattern called a long tail issue. SKEWJOIN HINT resolves this by splitting the data: hot key rows go through MAP JOIN on a small dataset, and non-hot key rows go through MERGE JOIN in parallel. The results are merged with UNION into a final output.

How it works

Without SKEWJOIN HINT, joining two large tables falls back to MERGE JOIN, which routes all rows sharing the same key to one reducer. When a key appears 10,000 or 9,000 times (as shown below), that reducer takes far longer than the rest.

Hot key values diagram

With SKEWJOIN HINT, the optimizer runs an Aggregate to identify the top N hot key values, then splits each input table:

  • Data A — hot key rows from the left table

  • Data B — non-hot key rows from the left table

  • Data C — rows from the right table that join with data A

  • Data D — rows from the right table that do not join with data A

MAP JOIN runs on data A × data C (small volume). MERGE JOIN runs on data B × data D. UNION combines both results.

SKEWJOIN HINT execution flow

Syntax

Add the hint inside the SELECT statement:

SELECT /*+ skewJoin(<table_name>[(<col1>[,<col2>...])][((<val11>,<val12>)[,(<val21>,<val22>)...])] */ ...
ParameterDescription
table_nameAlias or name of the skewed table
col_nameName of a skewed column
valueA skewed key value. Enclose STRING values in double quotation marks (")

Choose a method

Three methods are available. Use the method that matches how much you know about your skewed data.

Method 1 — table alias only (automatic detection)

The optimizer detects the top 20 hot key values at runtime.

-- The optimizer identifies skewed keys in table a automatically.
SELECT /*+ skewjoin(a) */ *
FROM T0 a JOIN T1 b ON a.c0 = b.c0 AND a.c1 = b.c1;

Use this when you have not profiled the data and do not know which specific values are skewed.

Method 2 — table alias + column names

The optimizer limits detection to the specified columns.

-- Columns c0 and c1 in table a are the skewed columns.
SELECT /*+ skewjoin(a(c0, c1)) */ *
FROM T0 a JOIN T1 b ON a.c0 = b.c0 AND a.c1 = b.c1 AND a.c2 = b.c2;

Use this when you know which columns are skewed but not the exact values.

Method 3 — table alias + column names + values (most efficient)

You supply the skewed key values directly.

-- (a.c0=1, a.c1="2") and (a.c0=3, a.c1="4") are the skewed key combinations.
SELECT /*+ skewjoin(a(c0, c1)((1, "2"), (3, "4"))) */ *
FROM T0 a JOIN T1 b ON a.c0 = b.c0 AND a.c1 = b.c1 AND a.c2 = b.c2;

Use this when you have already identified the skewed values. Method 3 is more efficient than Methods 1 and 2.

Limitations

Supported JOIN types

JOIN typeSupported table
INNER JOINLeft or right table
LEFT JOINLeft table only
SEMI JOINLeft table only
ANTI JOINLeft table only
RIGHT JOINRight table only
FULL JOINNot supported

Other constraints

  • One table per hint: Specify a hint for either the left or the right table — not both — in a single JOIN clause.

  • Equi-join required: The JOIN condition must include at least one left key = right key predicate. SKEWJOIN HINT cannot be used with CARTESIAN JOIN.

  • No MAPJOIN HINT on the same table: SKEWJOIN HINT and MAPJOIN HINT cannot target the same table. Applying both to different tables in the same query is valid:

    SELECT /*+ mapjoin(c), skewjoin(a) */ *
    FROM T0 a JOIN T1 b ON a.c0 = b.c3
              JOIN T2 c ON a.c0 = c.c7;
  • Matching join key data types: The data type of the left join key must match the data type of the right join key. If they differ, SKEWJOIN HINT has no effect. Use the CAST function to align types:

    CREATE TABLE T0 (c0 INT, c1 INT, c2 INT, c3 INT);
    CREATE TABLE T1 (c0 STRING, c1 INT, c2 INT);
    
    -- Option 1: cast both sides inline
    SELECT /*+ skewjoin(a) */ *
    FROM T0 a JOIN T1 b
      ON CAST(a.c0 AS STRING) = CAST(b.c0 AS STRING) AND a.c1 = b.c1;
    
    -- Option 2: cast in a subquery
    SELECT /*+ skewjoin(b) */ *
    FROM (SELECT CAST(a.c0 AS STRING) AS c00 FROM T0 a) b
    JOIN T1 c ON b.c00 = c.c0;

Usage notes

Performance trade-off

Adding SKEWJOIN HINT introduces a runtime Aggregate, which adds overhead. Apply the hint only to JOIN statements that exhibit data skew; adding it to balanced queries slows them down without benefit.

The physical execution plan takes the form MapJoin UNION ALL MergeJoin. The expanded MAP JOIN subplan is:

Top N(A) MapJoin (B Semi Join Top N(A))

If table B remains large after filtering by the top N hot keys, the MAP JOIN hash table build may exhaust memory and produce an out of memory (OOM) error.

Configure the top-N threshold

By default, the optimizer identifies the top 20 hot key values. To change this limit, run:

SET odps.optimizer.skew.join.topk.num = <N>;
ParameterDefaultDescription
odps.optimizer.skew.join.topk.num20Number of hot key values the optimizer identifies when Methods 1 or 2 are used

Verify SKEWJOIN HINT took effect

After the query completes, open LogView and navigate to the Json Summary tab. Search for the topk_agg field. If the field is present, SKEWJOIN HINT is active.

LogView Json Summary topk_agg field

FAQ

Why didn't SKEWJOIN HINT improve my query?

The most likely cause is a join key data type mismatch. If the left and right join keys have different data types, the hint silently has no effect. Check that all paired join keys have the same type, and use CAST to align them if needed.

A second possibility is that the hint is applied to a JOIN type that does not support it, such as FULL JOIN, or that MAPJOIN HINT is already applied to the same table.

The query is slower after adding the hint. What happened?

The Aggregate step that identifies hot keys adds overhead. If the JOIN was not actually skewed, the extra work outweighs any benefit. Remove the hint from queries where data is evenly distributed, and add it only where you observe a long tail.

What's next

  • Query optimizer — learn how the MaxCompute optimizer generates physical execution plans