Dave
Assistant Engineer
Assistant Engineer
  • UID627
  • Fans3
  • Follows0
  • Posts55
Reads:2459Replies:0

[SQLServer Development]SQL optimization

Created#
More Posted time:Aug 3, 2016 16:20 PM
SQL optimization
The DRDS is an efficient and stable distributed relation database system. However, as distributed relation query is processed by DRDS, its query optimization for SQL is different with that of the traditional single database (e.g. mysql, oracle). When in query optimization, the traditional single database mainly considers the expense of disk IO, but DRDS, during optimization, will also consider another more important IO expense, i.e. network. For optimizing SQL execution of DRDS, the core optimization idea is to reduce network IO. For this purpose, the DRDS will possibly distribute the work originally belonging to DRDS to each sub-database (e.g. RDS) at the bottommost layer, which can transfer IO expense originally needing network into stand-alone disk IO expense, thus increasing the execution efficiency of query. Therefore, if we encounter slow SQL when using DRDS, we shall properly rewrite SQL as per the characteristics of DRDS.
Condition optimization of SQL
As data of DRDS are horizontally sharded as per the split key, use of split key in query will be very meaningful to reduce execution time of SQL in DRDS. The query condition shall be possibly provided with a sub-database key, which can enable DRDS to directly route query to a specific sub-database according to the value of the sub-database key, which is useful to avoid whole database scan by DRDS. The higher the selectivity of the condition including the shard key is (or the higher the discrimination is), the easier it will be to increase DRDS’s query speed. For example, equality query will be executed more quickly than range query.
JOIN optimization of SQL
In SQL, Join operation will often be the most time-consuming. Join algorithm used by DRDS in most cases is Nested Loop and its derived algorithm (if Join has a sequencing request, Sort Merge algorithm is used). DRDS’s Join process based on Nested Loop algorithm is as follows: for left and right tables of Join, DRDS will fetch data from the left table (also called as driving table) of Join, and then put the value in the Column Join in data fetched into the right table for IN query so as to finish Join process. Thus, the less the data quantity of Join's left table is, the less the times of IN query made by DRDS for the right table is. The less the data volume of right table is or index is created, the quicker Join will be. Therefore, in DRDS, selection of Join’s driving table is very important for optimization of Join.
Short table as driving table of Join
The so-called short table does not say that the table is the record number of the table in database, but the number of records returned after the table is subject to condition filtering in query. Thus, the most simple method for determining the actual data volume of a table is to attach where conditions and join on conditions related to the table, and put them in DRDS for a count (*) query independently to view data volume. For example, assuming that there is SQL shown as follows:

select t.title, t.price
from t_order o,
    ( select * from t_item i where i.id=242002396687 ) t
where  t.source_id=o.source_item_id and o.sellerId<1733635660;
Its query speed is very slow, shown as follows:



About 24 seconds are required. Seeing the SQL, it is an inner JOIN. We do not know the actual data volume of table o and table t in JOIN process, but we can conduct count () query on the table o and table t respectively to obtain the group of data. For table o, we observe that o.sellerId<173363560 in where conditions is only related to table o, we will extract it out, and attach into the count () query of table o, thus obtaining the following query results;And then we may know that table o has 50W records. Similarly, for table t, as this is a sub-query, table t is extracted directly for count (*) query, then:



We can know the data volume of table t is only one. Therefore, we can determine that table o is a long table and table t is a short table. Based on the principle of possibly using short table as Join driving table, we will adjust SQL into:

select t.title, t.price
from
( select * from t_item i where i.id=242002396687 ) t,
 t_order o
where  t.source_id=o.source_item_id and o.sellerId<1733635660
The query results are as follows:



The query time is reduced to 0.15 seconds from 24 seconds, with large increase. A broadcast table is used as the driving table of Join


broadcast table as driving table of Join


As the broadcast table of DRDS will be stored in every sub-database, the broadcast table, when being used as the driving table of Join, will be converted into stand-alone Join together with Join of other tables, thus increasing query performance. For example, assuming that there are SQL (where table t_area is broadcast table) shown as follows:

select t_area.name
from  t_item i join t_buyer b on i.sellerId=b.sellerId join t_area a on b.province=a.id
where  a.id < 110107
limit 0, 10

The three tables are used as JOIN, with query results as follows:




The execution time is fairly long, about 15 seconds. Now, we will adjust the sequence of join, and put the broadcast table at the leftmost side as the driving table of join, that is:

select t_area.name
    from t_area a join t_buyer b on b.province=a.id join t_item i on i.sellerId=b.sellerId
    where  a.id < 110107
limit 0, 10
Then, the whole join will be pushed down to be stand-alone join in DRDS. We will observe the execution results SQL adjusted again:




Limit optimization of SQL

When DRDS is executing limit offset, count sentence, it actually reads the previous records of offset in order and discards them directly, which will result in very slow query when offset is very big, even though count is very small. Taking SQL as an example:
SELECT *
FROM t_order
ORDER BY t_order.id
LIMIT 10000,2
Although SQL only fetches 2 records (i.e.10000 and 10001), but its execution time is about 12 seconds, which is because that the number of record actually read by DRDS is 10002, as shown in figure below:


In light of the above conditions, the optimization direction of SQL is that: the ID set of SQL is checked firstly, and then the real record contents are queried through in query, with SQL rewritten shown as follows:

SELECT *
FROM t_order o
WHERE o.id IN (
        SELECT id
        FROM t_order
        ORDER BY id  
LIMIT 10000,2  )
The purpose of rewriting is to buffer ID (ID is not too much) with internal memory, so that disk IO will be reduced. If the sub-database key of table t_order is id, DRDS can route the in query to different sub-databases through rule calculation for query, which avoids whole database scan. We will observe the query results of SQL rewritten again




The execution time is changed into 1.08 seconds from the original 12 seconds, thus reducing an order of magnitude.


ORDER BY optimization of SQL


In DRDS, by default, ensure that column name behind Distinct, Group and Order By sentences are identical, and the final SQL only returns a small number of data. This is because that we, in this case, can minimize the network band width consumed in distributed query without fetching a large number of data for sequencing in table, and enable system performance to implement the optimal status.


For example, for SQL shown below:
select buyer_id,
        count(*) as maxSize
from t_trade
group by buyer_id
order by maxSize desc
limit 1

During execution, DRDS shall sequence data by buyer_id for aggregation, and then sequence the aggregation results by maxSize. Because of existence of two different sequencing requirements, DRDS cannot finish the SQL at one time if middle results are not preserved with the temporary table, thus during actual execution of the SQL, the following errors will be reported:




Error reporting contents prompt that the temporary is not allowed to use. DRDS supports use of temporary table, but does not recommend by default. As use of temporary table generally means that the system has a performance bottleneck, avoid use of temporary table by making column names of Distimct Group by and Order by identical possibly. If the SQL is executed successfully, DRDS can be told, by adding HINT in SQL, that use of temporary table is allowed, with SQL modified as follows (green part):


/*+TDDL({'extra':{'ALLOW_TEMPORARY_TABLE':'TRUE'}})*/
select buyer_id,
       count(*) as maxSize
from t_trade
group by buyer_id
order by maxSize desc
limit 1

In this way, the DRDS can execute the SQL successfully. However, as the HINT may be ignored as a note in some mysql clients, hint is generally added when SQL is sent to DRDS through the mysql connector. Note: Confirm that the temporary table can be used only when there is a small number of data in the temporary table. Otherwise, the system will have a serious performance problem ####.
Guest