Classic Cases of Slow SQL Governance

One: Full table scan

1 Case

SELECT count(*) AS tmp_count FROM (
    SELECT * FROM `XXX_rules` WHERE 1 = 1 ORDER BY gmt_create DESC
) a

2 Tracing the source

In the article on paging query governance, we introduced our system's legacy paging logic. The SQL above is clearly the count query of a paginated request. Tracing the paging interface of the XXX_rules table back to its caller, we found the source: a merchant-onboarding page in our admin console. As soon as that page opens, it calls the paging interface directly, passing no query parameters other than the page number, so the query scans the entire table.

3 Analysis

The soul-searching question: why scan the whole table at all? The page displays the entire table's data; is that flood of data actually useful to anyone?

Research: after chatting with the operators who use this page regularly, we learned that the full-table data loaded when the page opens is useless to them; they never look at it. Their working habit is to take a merchant ID, enter it in the page's query box, and operate on the merchant record it finds.

4 Solution

The optimization is therefore clear: do not query the full data set when the page opens; query only after the operator enters a merchant ID, passing that ID as a parameter. With the common query condition, merchant ID, indexed in the XXX_rules table, and combined with the paging-query optimization, the full-table-scan slow SQL is resolved.
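As a rough sketch of the effect (using SQLite for convenience; the schema, data, and index name here are invented for illustration, not the production definitions), an index on the merchant-id column turns the count into an index search instead of a full scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE xxx_rules (
    id INTEGER PRIMARY KEY,
    rule_value TEXT,     -- merchant id, the common query condition
    gmt_create TEXT)""")
conn.executemany(
    "INSERT INTO xxx_rules (rule_value, gmt_create) VALUES (?, ?)",
    [(f"merchant_{i % 1000}", "2021-01-01") for i in range(5000)])

# Hypothetical index on the merchant-id column
conn.execute("CREATE INDEX idx_rule_value ON xxx_rules (rule_value)")

# The paging count query, now with a required merchant-id condition
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT count(*) FROM xxx_rules WHERE rule_value = ?",
    ("merchant_42",)).fetchall()
print(plan[0][-1])   # SQLite reports an index search, not a table scan
```

The same idea carries over to MySQL's EXPLAIN, where the `type` and `rows` columns show the difference between `ALL` (full scan) and an index lookup.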

The optimized admin console page is as follows:

No data is queried when the page opens, and the merchant account is a required query condition.

The optimized SQL is:

SELECT count(*) AS tmp_count FROM (
    SELECT * FROM `xxx_rules` WHERE 1 = 1 AND `rule_value` = '2928597xxx'
) a

Running EXPLAIN gives the following result:

As the EXPLAIN output shows, the query hits the index and scans only 3 rows; query speed improves markedly.

5 Reflections

Full-table-scan governance is simply this: add query conditions, hit an index, and eliminate the full-table query. Crude as that sounds, it is reasonable: real business scenarios rarely need to fetch an entire table. Requiring upstream callers to pass query conditions, and ensuring those conditions hit an index, avoids slow SQL in most cases.

A broader aside: XXX_rules was originally designed as an onboarding table, recording onboarding status in the financial-cargo-owner dimension, with a few thousand rows at most. But many colleagues took it for a general rules table and wrote all sorts of business rules into it, bloating it past a million rows; the table is no longer clean. This comes down to data-table design and usage: spell out a table's usage conventions and do not write data into it indiscriminately, and later maintenance becomes far easier.

Two: Index clutter

1 Case

2 Analysis

Besides the timestamp and operator fields, the XXX_rules table has four fields: rule_name, rule_value, status, and product_code. The table's indexes are assorted permutations and combinations of these four fields, which causes the following problems:

1. rule_name has low selectivity, so it should not be the leading column of an index;

2. The first three indexes overlap heavily;

Clearly, the index-matching rules were not well understood here. Many business jobs write to and delete from XXX_rules on schedules, and so many overlapping indexes have a significant performance impact.

What does a high-performance index look like? A quick review:

1. Independent columns: an indexed column must not be part of an expression;

2. Choose columns with high selectivity as indexes;

3. Choose a suitable index column order: put highly selective columns first;

4. Covering indexes: when every queried column is in the index, no lookup into the clustered index is needed;

5. Use index scans for sorting;

6. Following the leftmost-prefix principle, prefer extending an existing index to creating a new one.

Remembering just rules 3 and 6 would be enough to avoid building indexes like these.
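Rules 3 and 6 can be seen in miniature with SQLite (the composite index below is hypothetical, though the column names follow this article): a filter on the leading column uses the index, while a filter that skips the prefix falls back to a scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE xxx_rules (
    product_code TEXT, rule_name TEXT, status TEXT, rule_value TEXT)""")
# One composite index, most selective column first (rule 3)
conn.execute("CREATE INDEX idx_product_rule_status "
             "ON xxx_rules (product_code, rule_name, status)")

def plan(sql):
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

# Leading-column filter: the composite index is usable
p1 = plan("SELECT * FROM xxx_rules WHERE product_code = 'ADVANCE'")
# Filter on a non-leading column only: leftmost prefix violated, full scan
p2 = plan("SELECT * FROM xxx_rules WHERE status = '00'")
print(p1)  # SEARCH ... USING INDEX idx_product_rule_status ...
print(p2)  # SCAN ...
```

MySQL applies the same leftmost-prefix matching to its composite (multiple-column) indexes.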

3 Governance

The indexes were consolidated as follows:

Many tasks in the system pull all the onboarding records under a product and then process them, so the highly selective product_code goes at the front of the index, with rule_name and status added after it to filter further and reduce the number of scanned rows, avoiding slow SQL. The common rule_value query condition already hits the unique key (UK), so it needs no separate index.

Three: Unnecessary sorting

1 Problem description

A lot of business logic pulls the list of records matching some condition, and the query SQL carries an ORDER BY. When there are many records, the sort is often expensive, yet whether the result is ordered frequently has no effect on the business logic at all. For example, the count statement discussed in paging governance only needs the number of rows, and ORDER BY has no effect on a count; likewise, when code fetches a list and processes it without depending on record order, the ORDER BY is superfluous.

2 Solution

If the query SQL has no LIMIT clause and the business logic does not depend on the order of the list returned by ORDER BY, remove the ORDER BY from the query.
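A minimal illustration that an ORDER BY inside a counted subquery changes the cost but not the result (SQLite here; the table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xxx_rules (id INTEGER PRIMARY KEY, gmt_create TEXT)")
conn.executemany("INSERT INTO xxx_rules (gmt_create) VALUES (?)",
                 [(f"2021-01-{i:02d}",) for i in range(1, 29)])

# Counting a sorted subquery vs. an unsorted one: same answer
with_sort = conn.execute(
    "SELECT count(*) FROM (SELECT * FROM xxx_rules ORDER BY gmt_create DESC) a"
).fetchone()[0]
without_sort = conn.execute(
    "SELECT count(*) FROM (SELECT * FROM xxx_rules) a").fetchone()[0]

print(with_sort, without_sort)  # identical counts; the sort adds cost, not meaning
```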

Four: Coarse-grained queries

1 Problem description

The business has many scheduled tasks that scan all the data under a certain product in a table and then process it, for example:

WHERE rule_name = 'apf_distributors'
AND status = '00'
AND product_code = 'ADVANCE'

All three query conditions are low-selectivity columns, and the query returns 270,000 rows, so adding an index is of little use.
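One way to quantify "degree of discrimination" (selectivity) is the ratio COUNT(DISTINCT col) / COUNT(*); values near zero mean an index on that column filters out very little. A sketch with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xxx_rules (rule_name TEXT, status TEXT, product_code TEXT)")
# Invented distribution: two distinct values per column across 1000 rows
rows = [("apf_distributors", "00", "ADVANCE")] * 900 + \
       [("other_rule", "01", "OTHER")] * 100
conn.executemany("INSERT INTO xxx_rules VALUES (?, ?, ?)", rows)

def selectivity(col):
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {col}), COUNT(*) FROM xxx_rules").fetchone()
    return distinct / total

for col in ("rule_name", "status", "product_code"):
    print(col, selectivity(col))   # 2/1000 = 0.002 for each column here
```

With selectivity this low, an index on any one of these columns still leaves hundreds of thousands of rows to examine.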

2 Analysis

The actual business volume is nowhere near that large, a few thousand rows at most, and the table's data is synchronized from upstream. The best fix would be for upstream to trim the data, but the business is so old that finding someone upstream to maintain it proved impractical, so another approach was needed.

The purpose of this scheduled task is to pull the data under certain products from XXX_rules, compare it with the data in another table, and update the rows that differ. It runs in the early morning and has no strict freshness requirement. So: can the processing be moved elsewhere, rather than crunching that many rows in real time on the application servers?

3 Solution

The data is synchronized by an offline ODPS task, so the DataWorks data processing platform is the first thing that comes to mind.

Create a data-comparison task: move the comparison logic of the scheduled task onto DataWorks and implement it in SQL. There are at most a few hundred rows of difference data per day, and the result set includes high-selectivity columns. Write the difference data into an ODPS table, then flow it back to idb.

Create a new scheduled task that queries and updates XXX_rules using the high-selectivity columns of the returned difference data as query conditions, resolving the slow SQL.

The premise of this approach is that data freshness requirements are low and the offline result set is small.
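The comparison step itself is simple set logic; as a sketch (keys and values invented, not the real ODPS schema), the diff reduces the data to the handful of rows that actually changed, keyed by a selective column:

```python
# Sketch of the offline diff: compare the rules-table snapshot against the
# reference table and keep only rows whose values differ. Keys/values invented.
rules_snapshot = {"m1001": "A", "m1002": "B", "m1003": "C"}
reference      = {"m1001": "A", "m1002": "X", "m1004": "D"}

diff = {k: v for k, v in reference.items() if rules_snapshot.get(k) != v}
print(diff)  # only the differing rows flow back; the online job then updates
             # XXX_rules by these selective keys instead of scanning by product
```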

Five: OR making the index fail

1 Case

SELECT count(*)
FROM XXX_level_report
WHERE 1 = 1
  AND EXISTS (
    SELECT 1
    FROM XXX_white_list t
    WHERE (t.biz_id = customer_id
           OR customer_id LIKE CONCAT(t.biz_id, '@%'))
      AND t.status = 1
      AND (t.start_time <= CURRENT_TIME
           OR t.start_time IS NULL)
      AND (t.end_time >= CURRENT_TIME
           OR t.end_time IS NULL)
  )

2 Analysis

Running EXPLAIN on the query above gives the following result:

The XXX_white_list table has an index on biz_id, and biz_id is passed in as a query condition here. Why, then, does the EXPLAIN output show type = ALL, a full table scan? Has the index failed? Under what circumstances does an index fail?

Scenarios in which an index fails:

1. One side of the OR condition lacks an index;

2. A composite index is used without satisfying the leftmost-prefix principle;

3. LIKE patterns beginning with %;

4. Implicit type conversion is required;

5. Arithmetic is performed on the indexed column in the WHERE clause;

6. A function is applied to the indexed column in the WHERE clause;

7. MySQL estimates that a full table scan is faster (e.g., when there is little data).
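Scenarios 5 and 6 are easy to reproduce. In this SQLite sketch (hypothetical table and index), wrapping the indexed column in a function forces a full scan, while the bare column seeks through the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xxx_white_list (biz_id TEXT, status INTEGER)")
conn.execute("CREATE INDEX idx_biz_id ON xxx_white_list (biz_id)")

def plan(sql):
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

direct  = plan("SELECT * FROM xxx_white_list WHERE biz_id = '2928597'")
wrapped = plan("SELECT * FROM xxx_white_list WHERE upper(biz_id) = '2928597'")
print(direct)   # SEARCH ... USING INDEX idx_biz_id
print(wrapped)  # SCAN ... : the function on the column defeats the index
```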

In the clause `OR customer_id LIKE CONCAT(t.biz_id, '@%')`, customer_id is a column of the XXX_level_report table, not of XXX_white_list, so that side of the OR cannot use XXX_white_list's index; per scenario 1, one unindexed side of an OR invalidates the index for the whole predicate.

3 Solution

This statement was stitched together from conditions, enumerations, and joins in fancy ways; changing it is troublesome, and the clause `OR customer_id LIKE CONCAT(t.biz_id, '@%')` cannot simply be deleted. In the end, this part of the query was refactored to remove the OR, resolving the slow SQL.
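One common way to eliminate such an OR (not necessarily the exact refactor used here) is to split it into a UNION so each branch can use an index where one exists. Sketched in SQLite on a simplified single-table version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xxx_white_list (biz_id TEXT, status INTEGER)")
conn.execute("CREATE INDEX idx_biz_id ON xxx_white_list (biz_id)")

def plans(sql):
    return [r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# OR with one unindexed side: the whole predicate falls back to a scan
or_query = plans("SELECT * FROM xxx_white_list "
                 "WHERE biz_id = 'b1' OR status = 1")
# UNION rewrite: the indexed branch can seek; only the other branch scans
union_query = plans("SELECT * FROM xxx_white_list WHERE biz_id = 'b1' "
                    "UNION SELECT * FROM xxx_white_list WHERE status = 1")
print(or_query)     # a single SCAN step
print(union_query)  # one branch uses idx_biz_id
```

The UNION pays a deduplication cost, so it helps when each branch is selective; profile before and after, as with any rewrite.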
