Add a MAPJOIN hint to a SELECT statement to force a join to execute in the map stage, bypassing the shuffle and reduce stages. This reduces data transmission overhead and improves query performance when joining a large table with one or more small tables. Use MAPJOIN when the optimizer does not automatically apply map-stage join to your query.
How it works
A standard JOIN in MaxCompute runs through three stages: map, shuffle, and reduce. The actual join logic executes in the reduce stage, which requires shuffling data across nodes.
MAPJOIN changes this: it loads the entire contents of the specified small table into memory during the map stage. Each mapper then performs the join locally against the in-memory data, eliminating the shuffle and reduce stages entirely.
The hint names the small table to load into memory. For every mapper reading rows from the large table, the small table is read completely from memory:
SELECT /*+ mapjoin(a) */ a.shop_name, a.total_price, b.total_price
FROM sale_detail_sj a JOIN sale_detail b
ON a.total_price < b.total_price OR a.total_price + b.total_price < 500;In this example, a (aliased from sale_detail_sj) is the small table. The hint specifies a, so a is what gets loaded into memory on each mapper.
Limits
Memory and table count
| Constraint | Limit | Notes |
|---|---|---|
| Total memory for all small tables | 512 MB | Measured after data is loaded into memory and decompressed, not the compressed size stored in MaxCompute |
| Maximum number of small tables | 128 | Specifying more than 128 returns a syntax error |
Supported JOIN types
| JOIN type | Supported | Requirement |
|---|---|---|
| INNER JOIN | Yes | Either the left or right table can be the large table |
| LEFT OUTER JOIN | Yes | The left table must be the large table |
| RIGHT OUTER JOIN | Yes | The right table must be the large table |
| FULL OUTER JOIN | No | — |
Usage notes
Add /*+ mapjoin(<table_name>) */ immediately after SELECT. Keep the following in mind:
Reference aliases, not original table names. When the small table or subquery has an alias, use the alias inside the hint.
Subqueries are supported as small tables. Use a subquery in place of a table reference, and reference its alias in the hint.
Separate multiple small tables with commas:
/*+ mapjoin(a,b,c) */Non-equi joins and OR conditions are supported. Standard MaxCompute SQL does not allow non-equi joins or OR logic in the ON condition, but MAPJOIN does.
Cartesian products are supported using
ON 1 = 1(for example,SELECT /*+ mapjoin(a) */ a.id FROM shop a JOIN table_name b ON 1=1), but this can significantly increase output data volume.
Subquery types such as SCALAR, IN, NOT IN, EXISTS, and NOT EXISTS may be converted to JOIN operations at execution time. If the subquery result qualifies as a small table, add a MAPJOIN hint to the subquery statement to explicitly apply the map-stage join algorithm.
Sample data
The examples in this topic use the sale_detail and sale_detail_sj tables. Run the following statements to create the tables and insert sample data.
-- Create a partitioned table named sale_detail.
CREATE TABLE IF NOT EXISTS sale_detail
(
shop_name STRING,
customer_id STRING,
total_price DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING);
CREATE TABLE IF NOT EXISTS sale_detail_sj
(
shop_name STRING,
customer_id STRING,
total_price DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING);
-- Add partitions.
ALTER TABLE sale_detail ADD PARTITION (sale_date='2013', region='china');
ALTER TABLE sale_detail_sj ADD PARTITION (sale_date='2013', region='china');
-- Insert sample data.
INSERT INTO sale_detail PARTITION (sale_date='2013', region='china')
VALUES ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);
INSERT INTO sale_detail_sj PARTITION (sale_date='2013', region='china')
VALUES ('s1','c1',100.1),('s2','c2',100.2),('s5','c2',100.2),('s2','c2',100.2);Example
Join sale_detail_sj (small table, aliased as a) with sale_detail (large table, aliased as b) using a non-equi condition. Return rows where either the total_price from a is less than the total_price from b, or the sum of both prices is less than 500.
-- Allow a full scan on the partitioned table.
SET odps.sql.allow.fullscan=true;
-- Use MAPJOIN with a non-equi join condition.
SELECT /*+ mapjoin(a) */
a.shop_name,
a.total_price,
b.total_price
FROM sale_detail_sj a JOIN sale_detail b
ON a.total_price < b.total_price OR a.total_price + b.total_price < 500;The query returns the following result:
+-----------+-------------+--------------+
| shop_name | total_price | total_price2 |
+-----------+-------------+--------------+
| s1 | 100.1 | 100.1 |
| s2 | 100.2 | 100.1 |
| s5 | 100.2 | 100.1 |
| s2 | 100.2 | 100.1 |
| s1 | 100.1 | 100.2 |
| s2 | 100.2 | 100.2 |
| s5 | 100.2 | 100.2 |
| s2 | 100.2 | 100.2 |
| s1 | 100.1 | 100.3 |
| s2 | 100.2 | 100.3 |
| s5 | 100.2 | 100.3 |
| s2 | 100.2 | 100.3 |
+-----------+-------------+--------------+