MaxCompute - MaxCompute SQL Allows You to Use Skew Join Hints to Handle Long Tails During Join Operations
Feb 08 2021
MaxComputeContent
Target customers: developers. Features released: If a hotspot exists during the join of two tables and a long tail occurs, you can extract the hotspot key to divide data into two parts, hotspot data and non-hotspot data, process the parts, and then merge the parts. This makes the join more efficient. Skew join hints can be used to manually or automatically obtain the hotspot keys of two tables and separately join hotspot data and non-hotspot data in an optimized manner. Then, skew join hints are used to automatically merge the results. Skew join hints help you handle long tails during join operations. If you join the T0 table with the T1 table without skew join hints and both the two tables have large amounts of data, only MergeJoin can be performed. As a result, the same values of hotspot data are shuffled to a node, which causes data skew. After skew join hints are specified, the optimizer runs an aggregate task to dynamically obtain the top 20 values of hotspot data in duplicate rows. The value 20 is configurable. Then, the optimizer separates the values that belong to the values of hotspot data (Data A) from the values that do not belong to the values of hotspot data (Data B) in the T0 table. The optimizer separates the values that can be joined with the values of hotspot data in the T0 table (Data C) from the values that cannot be joined with the values of hotspot data in the T0 table (Data D) in the T1 table. Then, MapJoin is performed on Data A and Data C because the amount of Data C is small, and MergeJoin is performed on Data B and Data D. Then, a UNION operation is performed on the results of MapJoin and MergeJoin to obtain the final results. The values of hotspot data indicate the values of the keys that are used to collect hotspot data.