ApsaraDB for SelectDB allows you to manually trigger the collection of statistics or enable automatic collection of statistics. You can optimize query performance based on the collected statistics.
Overview
ApsaraDB for SelectDB supports manual and automatic collection of statistics. The collected statistics helps the optimizer understand data distribution characteristics. During cost-based optimization (CBO), the optimizer calculates the selectivity of predicates and estimates the cost of each execution plan based on the collected statistics. This way, the optimizer can select optimal query plans to greatly improve query efficiency.
Statistics to be collected
In most cases, the following information is collected for each column.
Information | Description |
row_count | The total number of rows. |
data_size | The total amount of data. |
avg_size_byte | The average length of values. |
ndv | The number of different values. |
min | The minimum value. |
max | The maximum value. |
null_count | The number of null values. |
Statistics collection
ApsaraDB for SelectDB allows you to execute the ANALYZE statement to manually collect and update statistics.
Syntax
ANALYZE < TABLE | DATABASE table_name | db_name >
[ (column_name [, ...]) ]
[ [ WITH SYNC ] [ WITH SAMPLE PERCENT | ROWS ] ];Parameter description
Parameter | Description |
table_name | The name of the table whose statistics you want to collect. The value can be specified in the |
column_name | The name of the column whose statistics you want to collect. The column must be in the table specified by the |
sync | If this parameter is specified, the job to collect statistics is executed in synchronous mode and the execution result is returned after the job is completed. If this parameter is not specified, the job to collect statistics is executed in asynchronous mode and the job ID is returned. |
sample percent | rows | You can specify this parameter to sample the information for statistics collection. You can specify the sampling ratio or the number of rows to be sampled. |
Examples
Collect statistics for a table at a sampling ratio of 10%:
ANALYZE TABLE lineitem WITH SAMPLE PERCENT 10;Collect statistics for a table by sampling 100,000 rows of data:
ANALYZE TABLE lineitem WITH SAMPLE ROWS 100000;
Automatic collection
By default, the automatic collection feature is enabled.
Execution logic
After an import transaction is committed, ApsaraDB for SelectDB records the number of table rows updated by the import transaction to estimate the statistics health of the table. For tables whose statistics is not collected, their statistics health is 0. You can specify the table_stats_health_threshold parameter to set the statistic health threshold. In this example, 60 is used. If the statistics health of a table is lower than 60, ApsaraDB for SelectDB considers that the statistics for the table is outdated and triggers a statistics collection job for the table in subsequent operations. For a table whose statistics health is higher than 60, no repeated collection is performed.
Parameter configuration
The statistics collection jobs consumes a certain amount of system resources. To minimize the overhead, ApsaraDB for SelectDB automatically uses sampling to collect statistics for tables whose size is larger than the specified threshold. By default, the threshold is set to 5 GiB. You can specify the huge_table_lower_bound_size_in_bytes frontend (FE) parameter to adjust the threshold. By default, ApsaraDB for SelectDB samples 4,194,304 (2^22) rows of data during automatic collection to reduce the workloads on the system and complete the statistics collection job at the earliest opportunity.
If you want to sample more rows to obtain more accurate data distribution information, you can specify the huge_table_default_sample_rows parameter to increase the number of rows to be sampled. In addition, ApsaraDB for SelectDB ensures that the collection interval is not less than a specific period of time for tables whose size is larger than the threshold specified for the huge_table_lower_bound_size_in_bytes parameter. By default, the minimum interval is 12 hours. You can specify the huge_table_auto_analyze_interval_in_millis parameter to adjust the minimum interval.
To prevent your business from being affected by the execution of automatic collection jobs, you can specify the auto_analyze_start_time and auto_analyze_end_time parameters based on your business requirements to specify that the automatic collection jobs are executed during off-peak hours. You can also set the enable_auto_analyze parameter to false to disable the automatic collection feature.
By default, the automatic collection feature is disabled for the external catalog. If the automatic collection feature is enabled for the external catalog, excessive resources may be consumed because the external catalog contains large amounts of historical data. You can configure the properties of the external catalog to enable or disable the automatic collection feature for the external catalog.
ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='true'); // Enable the automatic collection feature for the external catalog.
ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='false'); // Disable the automatic collection feature for the external catalog.Job management
View statistics collection jobs
You can execute the SHOW ANALYZE statement to view the information about statistics collection jobs.
Syntax
SHOW [AUTO] ANALYZE < table_name | job_id >
[ WHERE [ STATE = [ "PENDING" | "RUNNING" | "FINISHED" | "FAILED" ] ] ];Parameter description
Parameter | Description |
AUTO | The information about historical automatic collection jobs. Note By default, the status of only the latest 20,000 automatic collection jobs that are complete is retained. |
table_name | The name of the table whose statistics collection jobs you want to view. The value can be specified in the |
job_id | The ID of the statistics collection job that you want to view. You can obtain the job ID from the output of the |
Example
SHOW ANALYZE 245073\G;
*************************** 1. row ***************************
job_id: 245073
catalog_name: internal
db_name: default_cluster:tpch
tbl_name: lineitem
col_name: [l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct]
job_type: MANUAL
analysis_type: FUNDAMENTALS
message:
last_exec_time_in_ms: 2023-11-07 11:00:52
state: FINISHED
progress: 16 Finished | 0 Failed | 0 In Progress | 16 Total
schedule_type: ONCEThe following table describes the parameters in the output.
Parameter | Description |
job_id | The ID of the statistics collection job. |
catalog_name | The catalog name. |
db_name | The database name. |
tbl_name | The table name. |
col_name | The column name. |
job_type | The type of the job. |
analysis_type | The type of the statistics. |
message | The information about the statistics collection job. |
last_exec_time_in_ms | The time when the statistics collection job was last executed. |
state | The status of the statistics collection job. |
schedule_type | The scheduling method of the statistics collection job. |
View the information about the statistics collection of a table
You can execute the SHOW TABLE STATS statement to view the information about the statistics collection of a table.
Syntax
SHOW TABLE STATS <table_name>;Parameter description
Parameter | Description |
table_name | The name of the table whose statistics information you want to view. The value can be specified in the |
Example
The following statement shows an example on how to view the information about the collected statistics of the lineitem table.
SHOW TABLE STATS lineitem\G
*************************** 1. row ***************************
updated_rows: 0
query_times: 0
row_count: 6001215
updated_time: 2023-11-07
columns: [l_returnflag, l_receiptdate, l_tax, l_shipmode, l_suppkey, l_shipdate, l_commitdate, l_partkey, l_orderkey, l_quantity, l_linestatus, l_comment, l_extendedprice, l_linenumber, l_discount, l_shipinstruct]
trigger: MANUALThe following table describes the parameters in the output.
Parameter | Description |
updated_rows | The number of table rows that are updated by the last ANALYZE statement. |
query_times | The reserved column. In later versions, this parameter indicates the number of queries performed on the table. |
row_count | The number of table rows. The value of this parameter does not indicate the exact number of rows during execution. |
updated_time | The time when the collected statistics were last updated. |
columns | The columns from which statistics are collected. |
trigger | The trigger method of the statistics collection job. |
View the information about the statistics collection of a column
Each statistics collection job consists of one or more tasks, and each task corresponds to the collection of a column. You can execute the following statement to view the information about the statistics collection of a specific column.
Syntax
SHOW ANALYZE TASK STATUS [job_id]Parameter description
Parameter | Description |
job_id | The ID of the statistics collection job. |
Example
The following statement shows an example on how to view the information about the statistics collection task whose ID is 20038.
SHOW ANALYZE TASK STATUS 20038;
+---------+----------+---------+----------------------+----------+
| task_id | col_name | message | last_exec_time_in_ms | state |
+---------+----------+---------+----------------------+----------+
| 20039 | col4 | | 2023-06-01 17:22:15 | FINISHED |
| 20040 | col2 | | 2023-06-01 17:22:15 | FINISHED |
| 20041 | col3 | | 2023-06-01 17:22:15 | FINISHED |
| 20042 | col1 | | 2023-06-01 17:22:15 | FINISHED |
+---------+----------+---------+----------------------+----------+View the statistics of a column
You can execute the SHOW COLUMN STATS statement to view the statistics of a column.
Syntax
SHOW COLUMN [cached] STATS table_name [ (column_name [, ...]) ];Parameter description
Parameter | Description |
cached | The statistics in the FE memory cache. |
table_name | The name of the table whose column statistics you want to view. The value can be specified in the |
column_name | The name of the column whose statistics you want to view. The column must be in the table specified by the |
Example
The following statement shows an example on how to view the statistics of the l_tax column.
SHOW COLUMN STATS lineitem(l_tax)\G
*************************** 1. row ***************************
column_name: l_tax
count: 6001215.0
ndv: 9.0
num_null: 0.0
data_size: 4.800972E7
avg_size_byte: 8.0
min: 0.00
max: 0.08
method: FULL
type: FUNDAMENTALS
trigger: MANUAL
query_times: 0
updated_time: 2023-11-07 11:00:46Stop a statistics collection job
You can execute the KILL ANALYZE statement to stop a statistics collection job.
Syntax
KILL ANALYZE job_id;Parameter description
Parameter | Description |
job_id | The ID of the statistics collection job. You can obtain the job ID from the output of the |
Example
The following statement shows an example on how to stop the statistics collection job whose ID is 52357.
KILL ANALYZE 52357;Session variables and FE configuration items
Session variables
Session variable | Default value | Description |
auto_analyze_start_time | 00:00:00 | The start time of the automatic collection job. |
auto_analyze_end_time | 23:59:59 | The end time of the automatic collection job. |
enable_auto_analyze | true | Specifies whether to enable the automatic collection feature. |
huge_table_default_sample_rows | 4194304 | The number of rows to be sampled for large tables. |
huge_table_lower_bound_size_in_bytes | 5368709120 | The threshold for sampling collection. If the size of a table exceeds this value, sampling collection is automatically used during automatic collection. |
huge_table_auto_analyze_interval_in_millis | 43200000 | The minimum interval at which automatic collection is performed for large tables. During the specified interval, automatic collection is performed only once for tables whose size is larger than the threshold specified for the |
table_stats_health_threshold | 60 | The statistics health threshold. Valid values: 0 to 100. If the ratio of updated data reaches the value of |
analyze_timeout | 43200 | The timeout period of the statistics collection job. Unit: seconds. |
auto_analyze_table_width_threshold | 70 | The maximum number of columns that a table used for automatic collection can contain. If the number of columns in a table exceeds the value, the table does not participate in automatic collection. |
FE configuration items
The following FE configuration items are related to the statistics to be collected. In most cases, you do not need to focus on these configuration items.
FE configuration item | Default value | Description |
analyze_record_limit | 20000 | The maximum number of persistent rows that can be saved to record the execution of statistics collection jobs. |
stats_cache_size | 500000 | The maximum number of statistics rows that can be cached on the FE side. |
statistics_simultaneously_running_task_num | 3 | The maximum number of asynchronous statistics collection jobs that can be concurrently executed. |
statistics_sql_mem_limit_in_bytes | 2,147,483,648 bytes (2 GiB) | The maximum backend (BE) memory that can be occupied by each SQL statement for statistic collection. |
FAQ
Q: What do I do if the "Stats table not available..." error is reported after I submit the ANALYZE statement?
A: Check the status of the BE.
You can execute the
SHOW BACKENDSstatement to check whether the status of the BE is normal. If the status of the BE is normal, you can execute theADMIN SHOW REPLICA STATUS FROM __internal_schema.[tbl_in_this_db]statement to check the status of tablets in the database. Make sure that the status of all tablets is normal. During the execution of the ANALYZE statement, the collected statistics are written to the internal table named__internal_schema.column_statistics. Before the execution of the ANALYZE statement, the FE checks the tablet status of the table. If an unavailable tablet exists, the request to collect statistics is rejected.Q: What do I do if the statistics fail to be collected from a large table?
A: We recommend that you execute the
ANALYZE ... WITH SAMPLE...statement to collect statistics from the large table. The resources that can be consumed by the ANALYZE statement are strictly limited. In this case, if you execute the ANALYZE statement on large tables, the execution may time out or be limited by the available BE memory.