Come and discuss ways to speed up SQL execution - Alibaba Cloud Developer Forums

Kenan
Assistant Engineer

[Others] Come and discuss ways to speed up SQL execution

Posted time: Oct 17, 2016 15:44
Locate issues
• Degraded service performance of the system
• Overloaded CPU on the MySQL host
• Monitoring of RDS and other cloud databases
• Add a slow-SQL logging layer, for example a slow-query warning implemented with Spring AOP by wrapping the data source.
• With the EXPLAIN command, we can find that the WHERE condition hits no index, or that far too many rows are examined for the rows actually returned. The type column shows the access type used for the query: ALL stands for a full-table scan and ref stands for an index lookup; there are also range scans, unique index lookups and so on. It is better to reach at least the ref level (a minimal example follows this list).
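As a reference, a minimal EXPLAIN check might look like the following; the table and column names reuse the fm_news_news examples later in this post and are only illustrative.

-- Run EXPLAIN on a suspected slow statement.
EXPLAIN SELECT * FROM fm_news_news WHERE date_publish > '2016-08-30 23:59:59';
-- Watch the "type" column (ALL = full-table scan, ref/range = index used)
-- and the "rows" column (estimated rows examined): a large "rows" value
-- for a small result set usually means a missing or unused index.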
MySQL overall structure
Before looking into the slow SQL query issue, let's first take a look at the overall structure of MySQL and the execution process of an SQL statement.


• Connection pool. You can raise the connection limit and improve concurrency by setting max_connections=200, but more connections also mean more CPU consumption and lock overhead.
• Parser. The parser turns an SQL statement into a syntax tree, validates the syntax and caches the tree.
• The optimizer decides the join order, selects the indexes and caches the result.
• Cache & Buffer. This can be divided into two parts: the query cache and the index/data cache (buffer pool). The latter is sized with innodb_buffer_pool_size, and broadly speaking, the larger it is, the better (a short sketch of these settings follows this list).
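As a hedged illustration of those knobs (the values are examples, not recommendations):

-- Check and raise the connection limit (effective immediately, not persisted across restarts).
SHOW VARIABLES LIKE 'max_connections';
SET GLOBAL max_connections = 200;
-- Check the InnoDB buffer pool size (in bytes); changing it usually means editing
-- innodb_buffer_pool_size in my.cnf and restarting, although MySQL 5.7+ can resize it online.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';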
Is it true that the query cache can speed up queries?
Three important points:
• You can cache the execution plan to save the parsing and execution plan generation processes.
• You can cache the data, and only when the SQL statement is fully matched will it hit the cache.
• The cache is invalidated immediately after the related table changes.
Okay. Next we can evaluate whether it helps.
• Caching the execution plan is helpful. This is also why views are a little quicker: a view's execution plan, as constructed by the parser and the optimizer, is cached.
• Matching is literal. A statement that differs even in whitespace is not an exact match, and a query containing an uncertain factor such as now() can never be matched.
• Any insert or update on the related tables immediately invalidates the cache.
• The cache is protected by an exclusive lock, which becomes its bottleneck (while the cache is being updated, queries are blocked).
• The only meaningful metric for the query cache is the hit rate, which in practice is very low.
Conclusion: Basically, no one will depend on the cache.
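If you want to verify this on your own instance, the query-cache counters can be read directly (MySQL 5.x only; the query cache was removed in MySQL 8.0):

-- Qcache_hits compared with Com_select gives a rough hit rate;
-- Qcache_lowmem_prunes shows how often cached entries are evicted.
SHOW GLOBAL STATUS LIKE 'Qcache%';
SHOW VARIABLES LIKE 'query_cache_type';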
So is the buffer helpful? Yes and it is quite helpful. I will explain it later.
Some basic principles and misunderstandings in writing SQL statements
The one guiding principle for writing good SQL is to simulate in your mind how the database will filter the data, which of course requires knowing quite a lot about how it works. So let's first look at some basic principles and misunderstandings in writing SQL statements.
Should * be replaced by a specific field?
Let the test talk.
SELECT * FROM fm_news_news WHERE id=95950;
SELECT id FROM fm_news_news WHERE id=95950;


Besides the reduced I/O, selecting specific fields also avoids the extra data-dictionary lookup that * needs to expand the column list.
Should the more selective condition be placed on the right?
SELECT * FROM fm_news_news WHERE create_org = 3 AND date_publish > '2016-08-30 23:59:59';

After we switch the order of the two conditions and run EXPLAIN, we find that the plans are exactly the same. But why?
Because it is a full-table scan: every row is checked against both conditions, so the order in which they are written makes no difference.
Then we change to an indexed SQL statement:
EXPLAIN SELECT * FROM fm_news_news WHERE id>95950 AND date_update > '2016-08-30 23:59:59';

After we switch their positions, the result again remains unchanged. This is because the optimizer does its job: it estimates the number of rows each condition would filter and then selects the index accordingly.
Unique index > regular index. Beyond that, which index is used is decided by the optimizer, not by the position of the condition in the SQL statement.
Can OR statements replace IN statements?
In fact, the optimizer automatically rewrites the IN form into the OR form, so rewriting it by hand is unnecessary.
The N+1 issue in Hibernate.
When lazy loading is configured in Hibernate, the application may end up issuing N+1 queries to the database. You can force a LEFT JOIN (eager fetch) instead, as sketched below.
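In SQL terms, the N+1 pattern looks roughly like the following; these are hypothetical queries an ORM might issue, shown only to illustrate the shape of the problem.

-- 1 query for the parent rows...
SELECT * FROM fm_news_news WHERE create_org = 3;
-- ...then N lazily loaded queries, one per parent row:
SELECT * FROM fm_news_channelnews WHERE news_id = 95950;
SELECT * FROM fm_news_channelnews WHERE news_id = 95951;
-- A single LEFT JOIN (a "join fetch" in HQL) collapses this into one round trip:
SELECT n.*, cn.* FROM fm_news_news n LEFT JOIN fm_news_channelnews cn ON n.id = cn.news_id WHERE n.create_org = 3;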
Can EXISTS statements replace IN statements?
First, the application scenarios are different: EXISTS only works with a subquery.
SELECT * FROM fm_news_news n WHERE EXISTS (SELECT 1 FROM fm_news_channelnews cn WHERE n.id = cn.news_id AND cn.channel_id=120 AND cn.channel_type = 5)
SELECT * FROM fm_news_news WHERE id IN (SELECT news_id FROM fm_news_channelnews WHERE channel_id=120 AND channel_type = 5);
SELECT * FROM fm_news_news n LEFT JOIN fm_news_channelnews cn ON n.id=cn.news_id AND cn.channel_id=120 AND cn.channel_type = 5;
// The query above first materializes a temporary table from fm_news_channelnews, and fm_news_news is then LEFT JOINed with that temporary table, so it is normal that it is slow.
SELECT * FROM fm_news_news n LEFT JOIN fm_news_channelnews cn ON n.id=cn.news_id WHERE  cn.channel_id=120 AND cn.channel_type = 5;


Query 2 and Query 4 enjoy the best performance.
Both hit the index. Some advice on the internet implies that IN subqueries won't hit the index, which is wrong.
Why is Query 2 quicker than Query 1?
First, let's take a look at how an IN statement performs the query. The subquery in the parentheses is executed only once and returns a result set, say int[]; the engine then traverses it (for(int i : int[])) to find the matching rows in the primary table.
For an EXISTS statement, the outer table A is scanned in full to get a result set News[] ns; ns is then traversed and the EXISTS condition is evaluated for each row to see whether it returns true. So in essence, EXISTS implies a full scan of the outer table.
Select the right driving table: for IN, the subquery is the driving table; for EXISTS, the outer table is the driving table.
IN statements do not process null values.
SELECT * FROM fm_news_news WHERE brief IN (NULL) LIMIT 10;
Such a query returns an empty result; only IS NULL can match NULL values.
But rows whose indexed column is NULL are not reachable through the index, so IS NULL and IS NOT NULL conditions do not use the index. As a result, indexed fields should not be left NULL when inserting data.
In fact, major factors that decide the query speed include the following:
1. Data size of the two tables, as well as the data distribution;
2. Whether the table is analyzed;
3. Whether the subqueries contain NULL values (very important);
4. Whether an index exists;
5. Database version: optimizers in different database versions behave differently; in the latest versions, IN subqueries are optimized by the planner.
Can NOT EXISTS statements replace NOT IN statements?
NOT EXISTS still uses the left (outer) table as the driving table, scanning it in full and checking whether the condition holds for each row.
SELECT * FROM fm_user_devicechannel dc WHERE dc.channel_id NOT IN (NULL);
 SELECT * FROM fm_news_channelnews cn WHERE NOT EXISTS (SELECT 1 FROM fm_user_defaultchannel df WHERE cn.channel_id = df.channel_id );
SELECT * FROM fm_user_devicechannel dc WHERE dc.channel_id NOT IN (SELECT channel_id FROM fm_user_defaultchannel df);


NOT IN against a set containing NULL returns an empty result, so it is only meaningful when the subquery's result contains no NULL values.
Otherwise, NOT EXISTS can use indexes, while NOT IN degenerates into full-table scans of both tables and is therefore the slowest; in theory it should never be used.
Filter first with WHERE and then GROUP BY, instead of grouping everything and filtering afterwards with HAVING.
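As an illustration, reusing the fm_news_channelnews table from the examples above:

-- Slower: group all rows, then throw most of the groups away.
SELECT channel_id, COUNT(*) FROM fm_news_channelnews GROUP BY channel_id HAVING channel_id = 120;
-- Faster: filter first, then group only the matching rows.
SELECT channel_id, COUNT(*) FROM fm_news_channelnews WHERE channel_id = 120 GROUP BY channel_id;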
Should a single complicated statement be preferred over multiple round trips to the database?
This is increasingly untrue, especially with the prevalence of SOA and microservices. Splitting statements actually has the following advantages:
• It increases cache efficiency: if the first query has been executed before, it can be skipped.
• It reduces lock contention.
• By doing the association at the application layer, we can split the database and gain better scalability.
• Use IN () to replace the join and gain higher efficiency (a rough sketch follows this list).
• The result of the first query can be reused at the application layer, for example as a cache.
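A rough sketch of the split, reusing the tables from the EXISTS/IN examples above; the literal ids stand in for whatever the first query returns.

-- Instead of one join:
SELECT * FROM fm_news_news n LEFT JOIN fm_news_channelnews cn ON n.id = cn.news_id WHERE cn.channel_id = 120;
-- ...issue two simple queries and associate the results in application code:
SELECT news_id FROM fm_news_channelnews WHERE channel_id = 120;
SELECT * FROM fm_news_news WHERE id IN (95950, 95951, 95952);  -- ids taken from the first result set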
Next, we will talk about the index separately.
Index
Why are indexed queries faster?
This is decided by the data structure of the index. An index is a search tree, and traversing it is essentially a binary search; the cost is proportional to the height of the tree, roughly O(log n) in the number of entries.
Because efficiency depends on the tree height, a typical index is a balanced tree, meaning the two sides are kept as equal in height as possible. Left and right rotations add some cost on inserts, but query efficiency is very high.
Evolving further, the binary tree becomes a B+ tree, whose structure is shown in the figure below:


We can see that each node can hold multiple keys. This exploits how disks are read: sequential reads are far faster than random reads. A B+ tree node is stored in a contiguous area, and through the operating system's page-loading mechanism many entries are read in a single I/O, greatly reducing the number of disk reads and improving performance.
This is what makes indexed queries fast.
Are indexed queries fast for sure?
• Not always. First, if the indexed value is not unique, the query is not necessarily fast.
• In addition, the index pages are read first and only then the data pages. Generally speaking, once an indexed query has to touch more than about 70% of the data, it becomes slower than a full-table scan (see the comparison after this list).
• If the amount of data involved is very small, a non-indexed query can also be faster, much as a simple sequential scan beats a binary search when the data set is tiny.
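One way to see this yourself is to compare the plans with and without the index; idx_date_update below is a hypothetical index name, and the date is chosen so that the range matches almost the whole table.

EXPLAIN SELECT * FROM fm_news_news FORCE INDEX (idx_date_update) WHERE date_update > '2010-01-01 00:00:00';
EXPLAIN SELECT * FROM fm_news_news IGNORE INDEX (idx_date_update) WHERE date_update > '2010-01-01 00:00:00';
-- When most rows match, the plan that ignores the index is often cheaper,
-- because it avoids jumping back and forth between index pages and data pages.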
Clustered indexes, non-clustered indexes and covered indexes
Clustered index:
Similar to a dictionary without a separate table of contents: the table data is stored in index order, i.e. the leaf nodes are the table data itself. A table can therefore have at most one clustered index, and update performance is lower.
But because the rows are physically adjacent, range scans are faster.
Another issue is the insertion hotspot at the tail, which can hurt performance.
Non-clustered index:
The leaf node holds a pointer to the actual physical row plus the value of the indexed field, like a table of contents.
Insert performance is good, but range queries are less advantageous.
Covering indexes: all the queried fields are index fields, so the result can be produced by scanning the index alone in memory (a small example follows).
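A hedged example of a covering index, assuming a composite index on (channel_id, news_id) exists (a hypothetical idx_channel_news):

-- Both the filter column and the selected column live in the index,
-- so the query can be answered from the index alone
-- (EXPLAIN shows "Using index") without touching the table rows.
EXPLAIN SELECT news_id FROM fm_news_channelnews WHERE channel_id = 120;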
Index hit issues
• Concatenating columns in the condition (e.g. a||''||b) won't use the index.
• LIKE with a trailing wildcard such as 'a%' can hit the index, but a leading wildcard such as '%b' won't.
• A composite index such as (a, b, c) can serve queries on a AND b, or on a alone, but not on b AND c; this is the leftmost-prefix rule (a sketch of these rules follows this list).
• != won't use the index; rewrite it as a range query where possible.
• Columns that are rarely updated, highly selective and short are the best candidates for indexing.
• Applying a function to a column won't use the index, such as md5(column1).
• Function calculation works well with constants, such as date_publish > DATE_FORMAT(now(), '')
• NULL fields won't hit the index, so IS NULL and IS NOT NULL conditions should be avoided.
• If all the queried fields are index fields, the index becomes a covering index, which is more efficient.
• An important note: the ON columns of a LEFT JOIN must be indexed, otherwise it degenerates into a full-table scan.
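A few of these rules, sketched against a hypothetical table t with a composite index idx_abc on (a, b, c):

SELECT * FROM t WHERE a = 1 AND b = 2;        -- can use idx_abc (leftmost prefix)
SELECT * FROM t WHERE b = 2 AND c = 3;        -- cannot use idx_abc (leading column a missing)
SELECT * FROM t WHERE MD5(a) = 'some-hash';   -- function on the column disables the index
SELECT * FROM t WHERE a > DATE_FORMAT(NOW(), '%Y-%m-%d');  -- function on constants only, index still usable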
Continuing from the buffer mentioned above: the buffer pool caches index (and data) pages. You size it with innodb_buffer_pool_size, and it is usually allocated 70% to 80% of the machine's memory.
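To gauge whether the buffer pool is large enough, compare logical reads against the reads that actually had to go to disk:

-- Innodb_buffer_pool_reads = reads that missed the pool and hit disk;
-- Innodb_buffer_pool_read_requests = total logical reads.
-- A miss ratio well above roughly 1% suggests the pool may be too small.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';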
Query optimization in sub-databases/sub-tables
• Global table.
Inserts and updates are synchronized to all the shards.
Queries can be served from any single shard.
• ER Join
The idea of table grouping: child-table rows are stored on the same shard as their parent-table rows, which solves the join problem locally.
• Share Join
Supports two tables: the SQL statement is parsed into separate per-shard queries and the results are joined afterwards.