PolarDB-X Optimizer Tips | Incremental Prediction of Time Columns in Statistics

This article introduces PolarDB-X's incremental prediction mechanism for time column statistics to optimize query execution plans and prevent slow queries.

By Fangwu

Slow Queries Related to Time Conditions

In most business scenarios, a detailed table contains a time column. This time column might be the creation time or the modification time.

Many ad-hoc queries or analytical queries include time information.

During ticket processing, we found that analytical queries that target the latest data often result in slow queries. These slow queries usually have the following characteristics:

• The data that needs to be accessed is a large amount of newly written data. The data is filtered by specifying that the time column is greater than a certain value.

• The SQL statement itself is a relatively complex analytical query.

• The optimizer chooses a more aggressive execution plan targeted at small data volumes.

In other words, although the business request is clearly an analytical query targeting a large data volume, the optimizer considers that the query involves a small data volume. Therefore, the optimizer chooses an aggressive execution plan (this plan is significantly better than an analytical plan when the involved data volume is small, but significantly worse than an analytical plan when the involved data volume is large).

Therefore, you can easily deduce that the optimizer is misled by statistics and mistakenly believes that the data volume that is involved in the query is small. Why is this problem prone to occur on detailed tables that have time columns?

Problem Analysis

For non-time numeric types, newly written data generally does not cause frequent adjustments to data boundaries. In the long run, the overall distribution of data tends to be stable. For example, salary adjustments of personnel generally maintain a specific distribution pattern.

However, for time column types, newly written data usually continuously breaks the boundaries. For example, the time of a new order continuously increases based on the current time. New data is always written to the head of the time column. Other types of data rarely have this characteristic. It would be terrible if the salaries of newly joined employees are always higher than those of senior employees.

This situation where the data distribution continuously extends forward will cause the optimizer to misjudge the data volume because the statistics are not updated in a timely manner.

Problem Definition and Derivation

First, you must consider two questions:

• The essence of this problem is that the part of the statistics that is not updated in time requires a certain compensation scheme. How do you define the range of the missing data in the statistics?

Assume that the time when the statistics of the time column are updated is U, and the current time is N. The data that needs to be simulated and compensated is the data newly written by the customer during the N-U period.

• How do you compensate?

The newly written data of the customer naturally increases over time in the time column (otherwise, prediction would not be needed). Assume that the current upper limit of the time column in the statistics is T.

Then the time range that needs to be simulated and compensated is T~T+N-U. Because it is assumed that the data in the time column naturally increases over time, the data in the compensation range is all newly written data, and the reference curve for the simulated compensation is T-(N-U)~T.

Start date S

End date E

Collection update time U

Current time point N

Statistics expiration time point R

Statistics range Frequency result (SE is the range) F<S,E>

Statistics estimated value C<S,E>

Statistics compensation value O<S,E>

Current upper limit of the time column in the statistics T

# Calculate the upper limit time point X of the simulated interval
X = T + (N-U)

# Take the intersection of the condition range and the simulated interval
<S, E> intersect <T, X> => <S', E'>

F<S,E> = C<S,E> + O<S',E'>

# Convert O<S',E'>
O<S',E'> = C<T-(X-S'), T-(X-E')> 
         = C<T+S'-X), T+E'-X)> 
# Substitute O
F<S,E> = C<S,E> + C<T+S'-X), T+E'-X)> 

# Substitute X
F<S,E> = C<S,E> + C<T+S'-(T + (N-U)), T+E'-(T + (N-U))>
       = C<S,E> + C<T+S'-T-N+U), T+E'-T-N+U>
       = C<S,E> + C<S'-N+U), E'-N+U>
# Substitute S' E'
F<S,E> = C<S,E> + C<S'-N+U), E'-N+U>
       = C<S,E> + C<Max(S,T)-N+U), Min(E,X)-N+U>

Practice and Verification

First, you can prepare a PolarDB-X instance, and initialize the environment by using the following statements:

You can initialize the environment by using the following statements:

create database test mode=auto;
use test
CREATE TABLE `order` (
    `id` bigint(20) NOT NULL AUTO_INCREMENT,
    `create_time` datetime NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE = InnoDB AUTO_INCREMENT = 11 DEFAULT CHARSET = utf8mb4
PARTITION BY KEY(`id`)
PARTITIONS 8;

insert into order (create_time) values(
('2024-10-01 01:01:01'),
('2024-10-01 01:01:02'),
('2024-10-01 01:01:03'),
('2024-10-01 01:01:04'),
('2024-10-01 01:01:05'),
('2024-10-01 01:01:06'),
('2024-10-01 01:01:07'),
('2024-10-01 01:01:08'),
('2024-10-01 01:01:09'),
('2024-10-01 01:01:10')
);
analyze table order;

You can change the date 2024-10-01 to any date, as long as you ensure that it is at the same time.

We observe the working status of the statistics through the unique EXPLAIN COST_TRACE command of PolarDB-X:

explain cost_trace /*TDDL:a()*/ select * from order where create_time>''

You can pay attention to the information in the red box:

• Catalog:test,order,create_time,2024-10-01 01:01:06_null indicates accessing the statistics interface to query the data volume in the order table of the test database where the value of the create_time field is greater than 2024-10-01 01:01:06.

• Action:datetimeTypeCompensation indicates that the statistics estimation data comes from the time column prediction interface.

• StatisticValue:12 indicates that the result returned by the interface is 12.

• normal val and compensation value respectively indicate what the normal value and the time-based predicted value are. In the current case, they are 4 and 10, respectively.

Note 1: The compensation range is limited to the window period between the statistics update time and the current time. The maximum value of the compensation is the quantity in the existing statistics. In the current case, you can observe that after the statistics are updated, the estimated value in cost_trace will continuously increase as time passes, until the estimated value reaches the maximum value of 10 and no longer increases.

Note 2: Range conditions can trigger time prediction, such as between, greater than signs, and less than signs.

Conclusion

In MySQL, newly written data is synchronized in the index structure. Therefore, you can use index dive to retrieve real-time data distribution characteristics. However, in the compute nodes of PolarDB-X instances, directly accessing the index in storage requires going through the network, and the overhead is very large. Judging by periodic statistics collection will have delays. Moreover, newly written data in the time column is more easily hit by time range conditions, and the impact on the plan is very large. Therefore, to solve the delay problem of periodic statistics collection, the statistics of PolarDB-X are based on the past distribution characteristics of the time column, and automatic simulation is used to fill in the statistics for the uncollected time periods. This partially solves and optimizes the problem of incremental data estimation for time columns.

Community

PolarDB-X Optimizer Tips | Incremental Prediction of Time Columns in Statistics

Slow Queries Related to Time Conditions

Problem Analysis

Problem Definition and Derivation

Practice and Verification

Conclusion

Read previous post:

Read next post:

ApsaraDB

You may also like

Comments

ApsaraDB

Related Products

PolarDB for Xscale

PolarDB for PostgreSQL

PolarDB for MySQL