By Chengcheng
Recently, while reworking the application for multi-tenancy, we found some problems with it. The application's HSF consumer success rate was not always 100%; it occasionally dropped to 99.99%. After reviewing the application log, we found that a slow SQL statement was causing service calls to time out and fail.
We located the specific statement:
(
select
`iii_sss_msg`.`id`,
`iii_sss_msg`.`message_id`,
`iii_sss_msg`.`title`,
`iii_sss_msg`.`content`,
`iii_sss_msg`.`id_seller`,
`iii_sss_msg`.`id_user`,
`iii_sss_msg`.`gmt_create`,
`iii_sss_msg`.`gmt_modified`,
`iii_sss_msg`.`is_read`,
`iii_sss_msg`.`category`,
`iii_sss_msg`.`sub_category`,
`iii_sss_msg`.`description`,
`iii_sss_msg`.`need_side_notice`,
`iii_sss_msg`.`link_url`,
`iii_sss_msg`.`btn_name`,
`iii_sss_msg`.`gmt_create_l`,
`iii_sss_msg`.`mobile_content`,
`iii_sss_msg`.`tier`,
`iii_sss_msg`.`requirement_id`,
`iii_sss_msg`.`fk_template_id`,
`iii_sss_msg`.`business_part`,
`iii_sss_msg`.`business_id`
from
`iii_sss_msg_29` `iii_sss_msg`
WHERE
(
(
(
`iii_sss_msg`.`gmt_create` >= '2023-07-24 00:00:00'
)
AND (
`iii_sss_msg`.`gmt_create` < '2023-07-31 15:46:45.684'
)
AND (`iii_sss_msg`.`id_user` = 500173482096)
AND (`iii_sss_msg`.`tier` IN ('S', 'A'))
AND (
`iii_sss_msg`.`sub_category` IN (1000305, 1000306, 1000501, 1000502)
)
)
OR (
(`iii_sss_msg`.`category` IN (10003, 10005))
AND (
`iii_sss_msg`.`gmt_create` >= '2023-07-24 00:00:00'
)
AND (
`iii_sss_msg`.`gmt_create` < '2023-07-31 15:46:45.684'
)
AND (`iii_sss_msg`.`id_user` = ***)
AND (
`iii_sss_msg`.`sub_category` IN (1000305, 1000306, 1000501, 1000502)
)
)
)
order by
`iii_sss_msg`.`gmt_create` desc
limit
0, 5
)union all ...
The statement is relatively complicated. In short, it finds the five most recent messages of a given merchant from the last seven days that match certain sub-categories and either have tier 'S' or 'A' or belong to certain categories. The iii_sss_msg table is split into 31 tables based on the creation time. The routing rules are as follows:
<property name="tbRuleArray">
<value>"iii_sss_msg_" + getCalendar(#gmt_create,1_date,31#).get(Calendar.DAY_OF_MONTH)</value>
</property>
Therefore, the query spans seven tables (iii_sss_msg_29, iii_sss_msg_30, ...), and the per-table results are combined with UNION ALL. (This table partitioning rule is quite unreasonable; the analysis comes later in this article.)
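To make the routing rule concrete, the following sketch reproduces its effect in plain MySQL. It assumes the rule amounts to appending the day of the month of gmt_create to the table name; the real routing happens inside TDDL, not in SQL.

```sql
-- Illustration only: mimic the rule "iii_sss_msg_" + DAY_OF_MONTH(gmt_create).
SELECT CONCAT('iii_sss_msg_', DAYOFMONTH('2023-07-29 10:00:00')) AS target_table;
-- iii_sss_msg_29

-- A row created one month later is routed to the same physical table.
SELECT CONCAT('iii_sss_msg_', DAYOFMONTH('2023-08-29 10:00:00')) AS target_table;
-- iii_sss_msg_29
```

Because the day of the month is the only routing input, a query over a range of several days has to visit one table per day in that range.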
Let's look at the XML file corresponding to MyBatis.
<select id="selectByQuery" resultMap="webMsgResultMap" parameterType="map" >
select
<include refid="Base_Column_List" />
from iii_sss_msg
where
id_user = #{userId}
<if test="startTime != null">
and gmt_create <![CDATA[>=]]> #{startTime}
</if>
<if test="endTime != null">
and gmt_create <![CDATA[<]]> #{endTime}
</if>
...
Notice that the order of the WHERE conditions in the MyBatis statement differs from the order in the SQL statement that was finally executed. What changed the statement?
Since our application uses TDDL, the SQL generated by MyBatis is rewritten by TDDL, which parses and optimizes the SQL at the Matrix layer. Its workflow is as follows.
So the answer to the question above is that TDDL changed the statement in order to optimize the query. The next question is: did it achieve that purpose?
Let's continue the analysis. The table index information is as follows:
The table has an index named idx_user, a composite index on the four fields gmt_create, id_user, category, and sub_category. All of these fields appear in the WHERE condition of the slow SQL above. It seems that TDDL reordered the conditions so that the statement could use that index.
We use the execution plan to analyze the statement. The result is as follows:
The index used is not idx_user(gmt_create,id_user,category,sub_category) but idx_uer_query(id_user, category). It seems that TDDL's rewrite achieved nothing. Why does MySQL choose the index idx_uer_query(id_user, category)?
To investigate, we use FORCE INDEX to make MySQL use idx_user (gmt_create,id_user,category,sub_category).
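As a rough sketch of that comparison, the two statements below ask MySQL for the execution plan of a simplified version of the query, first with the optimizer's own choice and then with idx_user forced. The column list and the WHERE clause are abbreviated from the statement above; both simplifications are assumptions made for illustration, so the plans may differ from those of the full statement.

```sql
-- Plan chosen by the optimizer on its own.
EXPLAIN
SELECT `id`, `title`, `content`
FROM `iii_sss_msg_29`
WHERE `gmt_create` >= '2023-07-24 00:00:00'
  AND `gmt_create` <  '2023-07-31 15:46:45.684'
  AND `id_user` = 500173482096
  AND `sub_category` IN (1000305, 1000306, 1000501, 1000502)
ORDER BY `gmt_create` DESC
LIMIT 0, 5;

-- Same query, restricted to idx_user.
-- FORCE INDEX takes only the index name, not its column list.
EXPLAIN
SELECT `id`, `title`, `content`
FROM `iii_sss_msg_29` FORCE INDEX (`idx_user`)
WHERE `gmt_create` >= '2023-07-24 00:00:00'
  AND `gmt_create` <  '2023-07-31 15:46:45.684'
  AND `id_user` = 500173482096
  AND `sub_category` IN (1000305, 1000306, 1000501, 1000502)
ORDER BY `gmt_create` DESC
LIMIT 0, 5;
```

In the EXPLAIN output, the `key` column shows which index is used and the `rows` column shows the estimated number of rows to be examined.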
Comparing the two execution plans, we find that the number of rows scanned with idx_uer_query (id_user, category) is 13,948, while the number scanned with idx_user (gmt_create,id_user,category,sub_category) is 1,552,218, more than 100 times as many. How does MySQL know that idx_uer_query is the better choice?
Selecting an index is the job of the optimizer, which aims to find the optimal execution plan and execute the statement at the lowest cost. The optimizer considers the following conditions:
According to point 1 and point 5, the idx_user (gmt_create,id_user,category,sub_category) is better, but what about point 2?
We use:
SHOW INDEX FROM `iii_sss_msg_29`
to analyze the cardinality of two indexes:
We find that the cardinality of the column gmt_create of idx_user (gmt_create,id_user,category,sub_category) is 99,933, while the cardinality of id_user of idx_uer_query (id_user, category) is 286,528. A higher cardinality means the column is more selective, so the latter is better.
Therefore, the optimizer uses the idx_uer_query (id_user, category) index.
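The same figures can also be read from `information_schema.STATISTICS`, which makes it easier to put the two indexes side by side. This is a sketch that assumes the current database is the one holding the table. Keep in mind that InnoDB estimates cardinality by sampling, so the values are approximate.

```sql
-- Optionally refresh the sampled statistics first if they look stale.
ANALYZE TABLE `iii_sss_msg_29`;

-- One row per indexed column; CARDINALITY is an estimate.
SELECT INDEX_NAME, SEQ_IN_INDEX, COLUMN_NAME, CARDINALITY
FROM information_schema.STATISTICS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'iii_sss_msg_29'
  AND INDEX_NAME IN ('idx_user', 'idx_uer_query')
ORDER BY INDEX_NAME, SEQ_IN_INDEX;
```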
The cardinalities of the two indexes are not very far apart, so the optimizer may also get this wrong. Moreover, idx_user(gmt_create,id_user,category,sub_category) matches more of the conditions in the query, and because the index is already ordered by gmt_create, using it avoids the sort required by ORDER BY: the rows only need to be traversed in index order. The optimizer may therefore judge its cost to be lower even though more rows are scanned. Does this actually happen?
Let's execute the statement several times and look at the execution plans on the other table partitions. This time we find that the optimizer does use idx_user. The combination of TDDL's rewrite and idx_user(gmt_create,id_user,category,sub_category) leads the optimizer to pick the wrong index: the execution plan that scans more than one million rows is selected.
With the index idx_user(gmt_create,id_user,category,sub_category), execution takes 995 milliseconds:
With the index idx_uer_query(id_user,category), execution takes 95 milliseconds:
This is probably the cause of the slow SQL: with the statement rewritten by TDDL, the optimizer sometimes chooses the wrong index.
Let's look at the following index:
Why was the index idx_user(gmt_create,id_user,category,sub_category) created? Probably because many statements in the business carry the conditions above. But is the index reasonable?
The preceding figure shows the B+ tree structure of the index idx_user(gmt_create,id_user,category,sub_category). The B+ tree is ordered first by the leading column gmt_create; the other fields are ordered only among entries with the same gmt_create value, so they are not ordered across the index as a whole. According to the table partitioning rules, the data in iii_sss_msg_29 was created between 2023-07-29 00:00:00 and 2023-07-29 23:59:59.
Now consider the WHERE conditions of the previous SQL statement:
WHERE
(
(
(
`iii_sss_msg`.`gmt_create` >= '2023-07-24 00:00:00'
)
AND (
`iii_sss_msg`.`gmt_create` < '2023-07-31 15:46:45.684'
)
If this index is used, every row in iii_sss_msg_29 satisfies the gmt_create range, so practically all leaf nodes of idx_user have to be scanned. Indeed, when we forced this index earlier, the number of scanned rows was more than 1 million, while the cardinality of the primary key index on id is more than 3 million, which is the same order of magnitude.
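One way to picture what the index stores, as a rough sketch rather than a look at the actual B+ tree: the entries of idx_user (gmt_create,id_user,category,sub_category) are kept in the same order as the result of the query below.

```sql
-- Entries are ordered by gmt_create first, then by the remaining columns.
SELECT `gmt_create`, `id_user`, `category`, `sub_category`
FROM `iii_sss_msg_29`
ORDER BY `gmt_create`, `id_user`, `category`, `sub_category`
LIMIT 20;
```

In this ordering, the rows of a single id_user are scattered across the whole gmt_create range, so a gmt_create range that covers the entire table cannot narrow the scan.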
The index idx_user(gmt_create,id_user,category,sub_category) is not reasonable for the SQL scenario above, especially in combination with table partitioning rules based on the creation time. I changed it to idx_user(id_user,category,sub_category,gmt_create,is_read); is_read is added because the read status is also a common condition in business queries. After the change, let's check the execution plan of the SQL statement again. This time the optimizer consistently selects idx_user(id_user,category,sub_category,gmt_create,is_read), and only a little over 10,000 rows are scanned. Compared with idx_uer_query(id_user,category), this index also reduces lookups back to the base table.
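A change like this could be applied with a statement along the following lines. This is a sketch under assumptions: it shows a single physical table, while the same change would be needed on each of the 31 tables, and how to roll it out safely on a large production table depends on the environment.

```sql
-- Replace the old column order with the new one in a single ALTER.
ALTER TABLE `iii_sss_msg_29`
  DROP INDEX `idx_user`,
  ADD INDEX `idx_user` (`id_user`, `category`, `sub_category`, `gmt_create`, `is_read`);
```

Running EXPLAIN on the query again afterwards shows whether the optimizer now picks the rebuilt index.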
Indexes cannot all be designed once and for all when a table is created. An index speeds up queries and is closely tied to business scenarios, so indexes should be adjusted as the business queries change; unreasonable indexes hinder queries and mislead the optimizer.
Is gmt_create a reasonable shard key for splitting the iii_sss_msg table?
The choice of shard keys should focus on:
The most common scenario in our application is to look up a given merchant's messages of certain categories from the last seven days. However, the iii_sss_msg table is split by gmt_create, so these lookups routinely turn into multi-table queries: each query retrieves data from seven tables. If we want to show messages from the last month in the future, we will have to retrieve data from thirty tables. This shard key is clearly unreasonable. For this business scenario, id_user is a better choice as the shard key.
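As a hypothetical illustration of that alternative, suppose the table were routed by id_user modulo the number of tables. The modulo rule and the table count of 31 are assumptions made for this sketch; the article does not specify how an id_user-based rule would be defined.

```sql
-- Hypothetical routing: all messages of one merchant land in one table.
SELECT CONCAT('iii_sss_msg_', MOD(500173482096, 31)) AS target_table;
```

With a rule of this kind, a merchant's seven-day or one-month query would touch a single table regardless of the time range.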
This article starts with the problems encountered and analyzes the TDDL optimizer, the MySQL index, and the selection of shard keys for table partitioning. The following is a summary:
Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.