If you want to join a large table and multiple small tables, you can explicitly specify the MAPJOIN hint in a SELECT statement to improve query performance. This topic describes how to use the MAPJOIN hint to join tables.
Description
A JOIN operation involves three stages: map, shuffle, and reduce. In most cases, tables are joined in the reduce stage.
MAPJOIN joins tables in the map stage instead of the reduce stage. This accelerates data transmission, reduces system resource consumption, and optimizes the performance of jobs.
In the map stage, MAPJOIN loads all data in the specified small tables into the memory of the program that performs the JOIN operation. This improves the efficiency of the JOIN operation.
In MaxCompute SQL, you cannot use non-equi joins or the OR logic in the ON condition. However, you can do this in MAPJOIN.
Limits
Limits on the use of MAPJOIN:
In the map stage,
MAPJOINloads all data in the specified tables into the memory of the program that performs the JOIN operation. The tables specified for MAPJOIN must be small tables, and the total memory occupied by the table data cannot exceed 512 MB. MaxCompute compresses your data before storage. After the small tables are loaded into the memory, the data volume of the tables sharply increases. 512 MB indicates the maximum data amount after small tables are loaded into the memory.NoteThe 512 MB here refers to the maximum memory size that a small table occupies after it is loaded into memory and decompressed, not the file size of the small table after it is compressed and stored in MaxCompute.
Limits on
JOINoperations inMAPJOIN:The left table in a
LEFT OUTER JOINoperation must be a large table.The right table in a
RIGHT OUTER JOINoperation must be a large table.MAPJOIN cannot be used in a
FULL OUTER JOINoperation.The left or right table in an
INNER JOINoperation can be a large table.
MaxCompute allows you to specify a maximum of 128 small tables for
MAPJOIN. If you specify more than 128 small tables, a syntax error is returned.
Usage notes
You can execute MAPJOIN only after you add the /*+ mapjoin(<table_name>) */ hint to a SELECT statement. Take note of the following items:
When you reference a small table or a subquery, you must reference the alias of the table or subquery.
In MAPJOIN, you can use subqueries as small tables.In
MAPJOIN, you can use non-equi joins or combine conditions by usingOR. You can calculate the Cartesian product by usingMAPJOIN ON 1 = 1without specifying theONcondition, such asselect /*+ mapjoin(a) */ a.id from shop a join table_name b on 1=1;. However, this calculation method may increase the data volume.Separate multiple small tables in
MAPJOINwith commas (,), such as/*+ mapjoin(a,b,c)*/.
Some subqueries such as SCALAR, IN, NOT IN, EXISTS, and NOT EXISTS can be converted into JOIN operations during execution. MAPJOIN is an efficient JOIN algorithm. If the result of the SUBQUERY is a small table, you can use a HINT in the subquery statement to explicitly specify the MAPJOIN algorithm.
Sample data
Sample source data is provided for you to better understand the examples in this topic. In this example, the sale_detail and sale_detail_sj tables are created, and data is inserted into the tables.
-- Create a partitioned table named sale_detail.
create table if not exists sale_detail
(
shop_name string,
customer_id string,
total_price double
)
partitioned by (sale_date string, region string);
create table if not exists sale_detail_sj
(
shop_name string,
customer_id string,
total_price double
)
partitioned by (sale_date string, region string);
-- Add partitions to the sale_detail and sale_detail_sj tables.
alter table sale_detail add partition (sale_date='2013', region='china');
alter table sale_detail_sj add partition (sale_date='2013', region='china');
-- Insert data into the sale_detail and sale_detail_sj tables.
insert into sale_detail partition (sale_date='2013', region='china') values ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);
insert into sale_detail_sj partition (sale_date='2013', region='china') values ('s1','c1',100.1),('s2','c2',100.2),('s5','c2',100.2),('s2','c2',100.2);Example
Perform JOIN operations on the sale_detail and the sale_detail_sj tables. The operations must meet one of the following conditions: 1. The total values of the total_price column in the sale_detail_sj table are less than the total values of the total_price column in the sale_detail table. 2. The sum of values of the total_price column in the sale_detail_sj table and values of the total_price column in the sale_detail table is less than 500. Sample statement:
-- Allow a full scan on a partitioned table.
SET odps.sql.allow.fullscan=true;
-- Use MAPJOIN to perform a query.
select /*+ mapjoin(a) */
a.shop_name,
a.total_price,
b.total_price
from sale_detail_sj a join sale_detail b
on a.total_price < b.total_price or a.total_price + b.total_price < 500;The following result is returned:
+-----------+-------------+--------------+
| shop_name | total_price | total_price2 |
+-----------+-------------+--------------+
| s1 | 100.1 | 100.1 |
| s2 | 100.2 | 100.1 |
| s5 | 100.2 | 100.1 |
| s2 | 100.2 | 100.1 |
| s1 | 100.1 | 100.2 |
| s2 | 100.2 | 100.2 |
| s5 | 100.2 | 100.2 |
| s2 | 100.2 | 100.2 |
| s1 | 100.1 | 100.3 |
| s2 | 100.2 | 100.3 |
| s5 | 100.2 | 100.3 |
| s2 | 100.2 | 100.3 |
+-----------+-------------+--------------+