This topic describes how to explicitly specify a MAPJOIN
hint in a SELECT
statement to join
a large table with one or more small tables. The MAPJOIN hint
speeds up your data queries.
Description
A JOIN operation involves three stages: map, shuffle, and reduce. In most cases, tables
are joined
in the reduce stage.
MAPJOIN
joins tables in the map stage instead of the reduce stage. This accelerates data
transmission, reduces system resource consumption, and optimizes the performance of
jobs.
In the map stage, MAPJOIN
loads all data in the specified small tables into the memory of the program that
performs the JOIN
operation. This improves the efficiency of the JOIN
operation.
In MaxCompute SQL, you cannot use non-equi joins
or the OR
logic in the ON
condition. However, you can do this in MAPJOIN
.
Limits
MAPJOIN
:
- In the map stage,
MAPJOIN
loads all data in the specified tables into the memory of the program that performs the JOIN operation. The tables specified for MAPJOIN must be small tables, and the total memory occupied by the table data cannot exceed 512 MB. MaxCompute compresses your data before storage. After the tables are loaded into the memory, the data volume of small tables sharply increases. 512 MB indicates the maximum data volume after small tables are loaded into the memory. - Limits on
JOIN
operations inMAPJOIN
:- The left table in a
LEFT OUTER JOIN
operation must be a large table. - The right table in a
RIGHT OUTER JOIN
operation must be a large table. - MAPJOIN cannot be used in a
FULL OUTER JOIN
operation. - The left or right table in an
INNER JOIN
operation can be a large table.
- The left table in a
- MaxCompute allows you to specify a maximum of 128 small tables for
MAPJOIN
. If you specify more than 128 small tables, a syntax error is returned.
Usage notes
You can execute MAPJOIN
only after you add the /*+ mapjoin(<table_name>) */
hint to a SELECT
statement. Take note of the following items:
- When you reference a small table or a subquery, you must reference the alias of the table or subquery.
- In
MAPJOIN
, you can use subqueries as small tables. - In
MAPJOIN
, you can use non-equi joins or combine conditions by usingOR
. You can calculate the Cartesian product by usingMAPJOIN ON 1 = 1
without specifying theON
condition, for example,select /*+ mapjoin(a) */ a.id from shop a join table_name b on 1=1;
. However, this calculation method may increase the data volume. - Separate multiple small tables in
MAPJOIN
with commas (,), for example,/*+ mapjoin(a,b,c)*/
.
Sample data
-- Create the sale_detail and sale_detail_sj tables. The two tables are partitioned tables.
create table if not exists sale_detail
(
shop_name string,
customer_id string,
total_price double
)
partitioned by (sale_date string, region string);
create table if not exists sale_detail_sj
(
shop_name string,
customer_id string,
total_price double
)
partitioned by (sale_date string, region string);
-- Add one partition to each table.
alter table sale_detail add partition (sale_date='2013', region='china');
alter table sale_detail_sj add partition (sale_date='2013', region='china');
-- Insert data into the two tables.
insert into sale_detail partition (sale_date='2013', region='china') values ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);
insert into sale_detail_sj partition (sale_date='2013', region='china') values ('s1','c1',100.1),('s2','c2',100.2),('s5','c2',100.2),('s2','c2',100.2);
Example
JOIN
operations on the sale_detail and the sale_detail_sj tables. The operations must
meet one of the following conditions: 1. The total values of the total_price column
in the sale_detail_sj table are less than the total values of the total_price column
in the sale_detail table. 2. The sum of values of the total_price column in the sale_detail_sj
table and values of the total_price column in the sale_detail table is less than 500.
Sample statement:select /*+ mapjoin(a) */
a.shop_name,
a.total_price,
b.total_price
from sale_detail_sj a join sale_detail b
on a.total_price < b.total_price or a.total_price + b.total_price < 500;
The following result is returned:+-----------+-------------+--------------+
| shop_name | total_price | total_price2 |
+-----------+-------------+--------------+
| s1 | 100.1 | 100.1 |
| s2 | 100.2 | 100.1 |
| s5 | 100.2 | 100.1 |
| s2 | 100.2 | 100.1 |
| s1 | 100.1 | 100.2 |
| s2 | 100.2 | 100.2 |
| s5 | 100.2 | 100.2 |
| s2 | 100.2 | 100.2 |
| s1 | 100.1 | 100.3 |
| s2 | 100.2 | 100.3 |
| s5 | 100.2 | 100.3 |
| s2 | 100.2 | 100.3 |
+-----------+-------------+--------------+