Parameter optimization of MaxCompute Cartesian product logic & optimization of complex JOIN logic

Introduction: This article mainly talks about two optimization points reflected in a SQL optimization. They are: 1. Parameter optimization of Cartesian product logic. Second, a complex JOIN logic optimization idea.

1. Optimization overview

Recently assisted a project to do the work of optimization task. Because the main data is the report, the logarithmic pair does not dare to adjust the SQL logic casually, so I just want to do some parameter adjustment, but I am forced to change the SQL later.

This article mainly talks about two optimization points reflected in a SQL optimization. They are:

1. Parameter optimization of Cartesian product logic.

Second, a complex JOIN logic optimization idea.

2. Parameter optimization of Cartesian product logic
The earlier version of MaxCompute (ODPS) did not seem to support Cartesian products friendly, and the logic detection of Cartesian products was very strict, but the current version is relatively normal. In the project recently, two performance problems have been encountered. caused by this Cartesian product.

The Cartesian product will cause the expansion of the data volume. If it is some association of some small tables with tens of thousands of data volume, it seems that everyone will not experience any performance problems, but if a certain table is tens of millions, hundreds of millions of this The scale may cause the WORKER of a single data processing to exceed its estimate, causing the SQL to run longer than expected.

2.1. Identifying problems
Generally, for special tuning scenarios, it is not easy to look at the amount of input and output data, and it is not easy to find the logic of Cartesian product by looking at the logic. Therefore, generally start with the issue of execution logs.


As shown in the figure above, it is obvious that there is a data skew in the JOIN stage. In the J4_1_3_6_job_0 stage, only 2 backups were executed in 10 minutes (I didn't take a screenshot, but it actually took longer).

2.2. Coping with skewed brute force parameter methods
The core logic of the inclination is that the data processed by individual WORKERs is many times more than other WORKERs, and the processing capacity of more than one WORKER is too much. Generally, the time for our WORKER to process data is in the second level, but the inclination to a few minutes is generally acceptable. After all, data that is not skewed is difficult to encounter in real situations, but the scenario with backups means that the processing capacity of this WORKER is beyond the processing capacity of the WORKER, and the program actually runs and hangs. Then start another WORKER and continue to run.

In this case, no matter how skewed the data is and why, the simplest and most violent method is:

1- Reduce the amount of data processed by a single WORKER;

-- Increase the number of workers in the map stage [default 256, value range 1-maximum]

SET odps.sql.mapper.split.size = 16;

-- Increase the number of workers in the reduce phase [default -1, dynamic; maximum 1111, manual maximum 9999]

SET odps.sql.reducer.instances = 99;

-- Increase the number of workers in the join phase [default -1, dynamic; maximum 1111, manual maximum 9999]

SET odps.sql.joiner.instances = 200;

Note: A worker occupies at least 1 core of CPU. If the cluster is not large enough, or the resources allocated to the project are limited, the executed WORKER will be queued for execution. In addition, if the data that can be executed quickly by one WORKER is divided into many parts, it will cause more processing steps, resulting in slower processing time. When online task resources are tight, it will be difficult to obtain many WORKERs for execution, resulting in long-term waiting for enough resources to be released. Therefore, it is not the bigger the better, it is best to have a small experiment and stop at it.

2-Let a single WORKER have more computing resources;

-- Increase the CPU in the map stage [default 100, value range 50-800, the official thinks there is no need to adjust]

SET odps.sql.mapper.cpu = 800;

-- Increase the memory in the map stage [default 1024, value range 256-12288, adjustable for a few scenes]

SET odps.sql.mapper.memory = 12288;

-- Increase the CPU in the reduce stage [default 100, value range 50-800, the official thinks no need to adjust]

SET odps.sql.reducer.cpu = 800;

-- Increase the memory in the reduce stage [default 1024, value range 256-12288, adjustable for a few scenarios]

SET odps.sql.reducer.memory = 12288;

-- Increase the number of CPUs in the join phase [default 100, value range 50-800, the official thinks there is no need to adjust]

SET odps.sql.joiner.cpu = 800;

-- Increase the memory in the join phase [default 1024, value range 256-12288, adjustable for a few scenarios]

SET odps.sql.joiner.memory = 12288;

Note: A worker occupies at least 1core of CPU and 1G of memory. You can increase the CPU and memory, so that a single worker has more resources. On the one hand, it will not hang due to insufficient resources, and on the other hand, it can also improve the operation speed. But in fact, the general CPU resources are more tense in the cluster, and the memory is relatively abundant. Starting a worker requires 8 core resources, and starting 10 workers requires 80 resources. When resources are insufficient, a worker that only needs 1 core may obtain resources and start running. This requires more resources and needs to wait. Therefore, it is not the more Bigger is better, but more is not always better. Do not use these parameters as much as possible, and only use them when you find specific problems to solve.

In some scenarios, this brute force method is still very useful. If it can solve the problem without affecting the operation of other tasks, it is also good to use. Especially when we need to solve the problem immediately, simple and rough is correct.

3. An optimization idea for complex JOIN logic
Generally, the JOIN logic is very clear, but sometimes the developers will not be sure, so the best way is to verify it.

Today, I encountered a special association logic: unequal association.

3.1. Understand the original logic
An example of the original script content is as follows:

with ta as(

select a,b from values ​​('12', 2), ('22', 2), ('23', 4), ('34', 7), ('35', 2), ('46' , 4), ('47', 7) t(a, b))

select x.a,sum(t.b) as b

FROM (select distinct a from ta) x

left outer join ta t

ON SUBSTR(x.a, 1, 1) = SUBSTR(t.a, 1, 1)

AND x.a >= t.a

group by x.a

;

a b

1 12 2

2 22 2

3 23 6

4 34 7

5 35 9

6 46 4

7 47 11

The above script actually only uses one table. Field a is the statistical date of the monthly report, intercepts a "year" association, and the association restricts the associated statistical month to not be greater than the current month.

In fact, when I saw this way of writing, what came to my mind was to calculate the cumulative value. I thought maxcompute had this cumulative function, but I didn't find it, but the window function can also be calculated. The logic is as follows:

with ta as(

select a,b from values ​​('12', 2), ('22', 2), ('23', 4), ('34', 7), ('35', 2), ('46' , 4), ('47', 7) t(a, b))

select t.a,sum(t.b) over (partition by SUBSTR(t.a, 1, 1) order by t.a) as b

FROM ta t


Through the above piece of SQL, we found that using the window to calculate the cumulative value can replace the original JOIN logic. Because we looked at the execution plan above, we found that the SQL performance bottleneck occurred in the JOIN stage. If there is no JOIN, this problem will naturally not exist.

However, after carefully checking the output of the original script, it is found that the amount of data is inflated, that is to say, the number of records after JOIN is more than the original table. This is mainly because the script retains the unequal association association fields of the two tables of the JOIN, forming [February 2022 associated with January 2022], [February 2022 associated with February 2022], the original February is a row now into two lines. At this time, the window function cannot implement the logic of increasing the amount of data, and can only use JOIN honestly.

3.2. Further analysis of the original logic
After consulting and analysis, it is found that the logic is as follows (the aggregation indication is removed):

with ta as(

select a,b from values ​​('12', 2), ('22', 2), ('23', 4), ('34', 7), ('35', 2), ('46' , 4), ('47', 7) t(a, b))

select t.*,x.*

FROM (select distinct a from ta) x

left outer join ta t

ON SUBSTR(x.a, 1, 1) = SUBSTR(t.a, 1, 1)

AND x.a >= t.a;

From the above logic, we can see that the two tables of JOIN are actually output from one table, and the associated field is obtained from the original detailed table group by date (field a). Therefore, it is not necessary to write the two tables as a left join, and an inner join is also possible. Because if there is a left join, there is no way to skip the JOIN step with the method of mapjoin small table, but the inner join is fine. The logic is as follows:

with ta as(

select a,b from values ​​('12', 2), ('22', 2), ('23', 4), ('34', 7), ('35', 2), ('46' , 4), ('47', 7) t(a, b))

FROM ta t

join (select distinct a from ta) x

ON SUBSTR(x.a, 1, 1) = SUBSTR(t.a, 1, 1)

AND x.a >= t.a;

As we can see above, the data results are the same, so these two JOIN logics can be replaced. Considering the complexity of the original script, I still modified the original script to do a data comparison test, and the result is that the field values ​​of all records are the same. (The prototype is only one direction, and the specific scenario must be fully verified.) Because the original script is still relatively complicated, I will not show it. I will write the SQL that compares the two writing methods for the whole field, which is convenient for everyone to test and use.

select count(*) from (

select 1

from ta a

join tb b

on 1=1

and coalesce(a.col1,'')= coalesce(b.col1,'')

and coalesce(a.col2,'')= coalesce(b.col2,'')

......

and coalesce(a.coln,0)= coalesce(b.coln,0)

)t

;

Note: For full field comparison and association, the number of records returned should be the same as the number of records in a single table.

4. Other points to note when optimizing long scripts
4.1. Handling multiple references at once
In fact, this original script is very long, about 1000 lines, and the running time of each piece of SQL is not long. Even the Cartesian product scenario mentioned above ran in only ten minutes, but the overall running time exceeded 1 hour and 40 minutes. The SQL inside has written more than ten paragraphs, which is simply manual serial execution of tasks.

But soon I found a problem. The script used insert into many times, indicating that multiple pieces of SQL can actually be written at one time using union all. In particular, I see that many pieces of SQL use the same table and table association logic, just like the piece of SQL I optimized above, and several pieces of SQL are connected with the same table and association logic. At this time, if it is written as union all, the map stage and the join stage actually only need to be done once, and it needs to be done four times if it is written four times. For a script that needs to run for more than ten minutes, the logic running time that originally only takes ten minutes to run is quadrupled. In this script, there are 2 large paragraphs in such places, involving many small paragraphs.

--Create a test table

create table tmp_mj_ta as

select a,b from values ​​('12', 2), ('22', 2), ('23', 4), ('34', 7), ('35', 2), ('46' , 4), ('47', 7) t(a, b);

-- create write table

drop table if exists tmp_mj_01;

create table tmp_mj_01 (xa string,ta string,tb bigint,tc string);

-- 1- Multi-segment insert into writing, the association logic is the same, each segment has a map stage and a join stage

insert overwrite table tmp_mj_01

FROM tmp_mj_ta t

join (select distinct a from tmp_mj_ta) x

ON SUBSTR(x.a, 1, 1) = SUBSTR(t.a, 1, 1)

AND x.a >= t.a;

insert into table tmp_mj_01

FROM tmp_mj_ta t

join (select distinct a from tmp_mj_ta) x

ON SUBSTR(x.a, 1, 1) = SUBSTR(t.a, 1, 1)

AND x.a >= t.a;

insert into table tmp_mj_01

FROM tmp_mj_ta t

join (select distinct a from tmp_mj_ta) x

ON SUBSTR(x.a, 1, 1) = SUBSTR(t.a, 1, 1)

AND x.a >= t.a;

-- 2- Change to union all writing, the association logic is the same, there is only one map stage and join stage

insert overwrite table tmp_mj_01

FROM tmp_mj_ta t

join (select distinct a from tmp_mj_ta) x

ON SUBSTR(x.a, 1, 1) = SUBSTR(t.a, 1, 1)

AND x.a >= t.a

union all

FROM tmp_mj_ta t

join (select distinct a from tmp_mj_ta) x

ON SUBSTR(x.a, 1, 1) = SUBSTR(t.a, 1, 1)

AND x.a >= t.a

union all

FROM tmp_mj_ta t

join (select distinct a from tmp_mj_ta) x

ON SUBSTR(x.a, 1, 1) = SUBSTR(t.a, 1, 1)

AND x.a >= t.a;

There is also the (select distinct a from ta) logic in the above processing logic, which appears many times. At this time, you can use the with expression to write it at the top, and multiple union steps can be referenced, which is not only clean and readable, but also less error-prone.

Related Articles

Explore More Special Offers

  1. Short Message Service(SMS) & Mail Service

    50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00

phone Contact Us