All Products
Search
Document Center

ApsaraDB for SelectDB:Statistics

Last Updated:May 09, 2024

ApsaraDB for SelectDB allows you to manually trigger the collection of statistics or enable automatic collection of statistics. You can optimize query performance based on the collected statistics.

Overview

ApsaraDB for SelectDB supports manual and automatic collection of statistics. The collected statistics helps the optimizer understand data distribution characteristics. During cost-based optimization (CBO), the optimizer calculates the selectivity of predicates and estimates the cost of each execution plan based on the collected statistics. This way, the optimizer can select optimal query plans to greatly improve query efficiency.

Statistics to be collected

In most cases, the following information is collected for each column.

Information

Description

row_count

The total number of rows.

data_size

The total amount of data.

avg_size_byte

The average length of values.

ndv

The number of different values.

min

The minimum value.

max

The maximum value.

null_count

The number of null values.

Statistics collection

ApsaraDB for SelectDB allows you to execute the ANALYZE statement to manually collect and update statistics.

Syntax

ANALYZE < TABLE | DATABASE table_name | db_name > 
    [ (column_name [, ...]) ]
    [ [ WITH SYNC ] [ WITH SAMPLE PERCENT | ROWS ] ];

Parameter description

Parameter

Description

table_name

The name of the table whose statistics you want to collect. The value can be specified in the database_name.table_name format.

column_name

The name of the column whose statistics you want to collect. The column must be in the table specified by the table_name parameter. You can specify multiple column names. Separate multiple column names with commas (,).

sync

If this parameter is specified, the job to collect statistics is executed in synchronous mode and the execution result is returned after the job is completed. If this parameter is not specified, the job to collect statistics is executed in asynchronous mode and the job ID is returned.

sample percent | rows

You can specify this parameter to sample the information for statistics collection. You can specify the sampling ratio or the number of rows to be sampled.

Examples

  • Collect statistics for a table at a sampling ratio of 10%:

    ANALYZE TABLE lineitem WITH SAMPLE PERCENT 10;
  • Collect statistics for a table by sampling 100,000 rows of data:

    ANALYZE TABLE lineitem WITH SAMPLE ROWS 100000;

Automatic collection

By default, the automatic collection feature is enabled.

Execution logic

After an import transaction is committed, ApsaraDB for SelectDB records the number of table rows updated by the import transaction to estimate the statistics health of the table. For tables whose statistics is not collected, their statistics health is 0. You can specify the table_stats_health_threshold parameter to set the statistic health threshold. In this example, 60 is used. If the statistics health of a table is lower than 60, ApsaraDB for SelectDB considers that the statistics for the table is outdated and triggers a statistics collection job for the table in subsequent operations. For a table whose statistics health is higher than 60, no repeated collection is performed.

Parameter configuration

The statistics collection jobs consumes a certain amount of system resources. To minimize the overhead, ApsaraDB for SelectDB automatically uses sampling to collect statistics for tables whose size is larger than the specified threshold. By default, the threshold is set to 5 GiB. You can specify the huge_table_lower_bound_size_in_bytes frontend (FE) parameter to adjust the threshold. By default, ApsaraDB for SelectDB samples 4,194,304 (2^22) rows of data during automatic collection to reduce the workloads on the system and complete the statistics collection job at the earliest opportunity.

If you want to sample more rows to obtain more accurate data distribution information, you can specify the huge_table_default_sample_rows parameter to increase the number of rows to be sampled. In addition, ApsaraDB for SelectDB ensures that the collection interval is not less than a specific period of time for tables whose size is larger than the threshold specified for the huge_table_lower_bound_size_in_bytes parameter. By default, the minimum interval is 12 hours. You can specify the huge_table_auto_analyze_interval_in_millis parameter to adjust the minimum interval.

To prevent your business from being affected by the execution of automatic collection jobs, you can specify the auto_analyze_start_time and auto_analyze_end_time parameters based on your business requirements to specify that the automatic collection jobs are executed during off-peak hours. You can also set the enable_auto_analyze parameter to false to disable the automatic collection feature.

By default, the automatic collection feature is disabled for the external catalog. If the automatic collection feature is enabled for the external catalog, excessive resources may be consumed because the external catalog contains large amounts of historical data. You can configure the properties of the external catalog to enable or disable the automatic collection feature for the external catalog.

ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='true'); // Enable the automatic collection feature for the external catalog.
ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='false'); // Disable the automatic collection feature for the external catalog.

Job management

View statistics collection jobs

You can execute the SHOW ANALYZE statement to view the information about statistics collection jobs.

Syntax

SHOW [AUTO] ANALYZE < table_name | job_id >
    [ WHERE [ STATE = [ "PENDING" | "RUNNING" | "FINISHED" | "FAILED" ] ] ];

Parameter description

Parameter

Description

AUTO

The information about historical automatic collection jobs.

Note

By default, the status of only the latest 20,000 automatic collection jobs that are complete is retained.

table_name

The name of the table whose statistics collection jobs you want to view. The value can be specified in the database_name.table_name format. If you specify this parameter, the information about the statistics collection jobs that are executed on the specified table is returned. If you do not specify this parameter, the information about all statistics collection jobs is returned.

job_id

The ID of the statistics collection job that you want to view. You can obtain the job ID from the output of the ANALYZE statement executed to collect statistics in asynchronous mode. If you do not specify this parameter, the information about all statistics collection jobs is returned.

Example

SHOW ANALYZE 245073\G;
*************************** 1. row ***************************
              job_id: 245073
        catalog_name: internal
             db_name: default_cluster:tpch
            tbl_name: lineitem
            col_name: [l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct]
            job_type: MANUAL
       analysis_type: FUNDAMENTALS
             message: 
last_exec_time_in_ms: 2023-11-07 11:00:52
               state: FINISHED
            progress: 16 Finished  |  0 Failed  |  0 In Progress  |  16 Total
       schedule_type: ONCE

The following table describes the parameters in the output.

Parameter

Description

job_id

The ID of the statistics collection job.

catalog_name

The catalog name.

db_name

The database name.

tbl_name

The table name.

col_name

The column name.

job_type

The type of the job.

analysis_type

The type of the statistics.

message

The information about the statistics collection job.

last_exec_time_in_ms

The time when the statistics collection job was last executed.

state

The status of the statistics collection job.

schedule_type

The scheduling method of the statistics collection job.

View the information about the statistics collection of a table

You can execute the SHOW TABLE STATS statement to view the information about the statistics collection of a table.

Syntax

SHOW TABLE STATS <table_name>;

Parameter description

Parameter

Description

table_name

The name of the table whose statistics information you want to view. The value can be specified in the database_name.table_name format.

Example

The following statement shows an example on how to view the information about the collected statistics of the lineitem table.

SHOW TABLE STATS lineitem\G
*************************** 1. row ***************************
updated_rows: 0
 query_times: 0
   row_count: 6001215
updated_time: 2023-11-07
     columns: [l_returnflag, l_receiptdate, l_tax, l_shipmode, l_suppkey, l_shipdate, l_commitdate, l_partkey, l_orderkey, l_quantity, l_linestatus, l_comment, l_extendedprice, l_linenumber, l_discount, l_shipinstruct]
     trigger: MANUAL

The following table describes the parameters in the output.

Parameter

Description

updated_rows

The number of table rows that are updated by the last ANALYZE statement.

query_times

The reserved column. In later versions, this parameter indicates the number of queries performed on the table.

row_count

The number of table rows. The value of this parameter does not indicate the exact number of rows during execution.

updated_time

The time when the collected statistics were last updated.

columns

The columns from which statistics are collected.

trigger

The trigger method of the statistics collection job.

View the information about the statistics collection of a column

Each statistics collection job consists of one or more tasks, and each task corresponds to the collection of a column. You can execute the following statement to view the information about the statistics collection of a specific column.

Syntax

SHOW ANALYZE TASK STATUS [job_id]

Parameter description

Parameter

Description

job_id

The ID of the statistics collection job.

Example

The following statement shows an example on how to view the information about the statistics collection task whose ID is 20038.

SHOW ANALYZE TASK STATUS 20038;
+---------+----------+---------+----------------------+----------+
| task_id | col_name | message | last_exec_time_in_ms | state    |
+---------+----------+---------+----------------------+----------+
| 20039   | col4     |         | 2023-06-01 17:22:15  | FINISHED |
| 20040   | col2     |         | 2023-06-01 17:22:15  | FINISHED |
| 20041   | col3     |         | 2023-06-01 17:22:15  | FINISHED |
| 20042   | col1     |         | 2023-06-01 17:22:15  | FINISHED |
+---------+----------+---------+----------------------+----------+

View the statistics of a column

You can execute the SHOW COLUMN STATS statement to view the statistics of a column.

Syntax

SHOW COLUMN [cached] STATS table_name [ (column_name [, ...]) ];

Parameter description

Parameter

Description

cached

The statistics in the FE memory cache.

table_name

The name of the table whose column statistics you want to view. The value can be specified in the database_name.table_name format.

column_name

The name of the column whose statistics you want to view. The column must be in the table specified by the table_name parameter. You can specify multiple column names. Separate multiple column names with commas (,).

Example

The following statement shows an example on how to view the statistics of the l_tax column.

SHOW COLUMN STATS lineitem(l_tax)\G
*************************** 1. row ***************************
  column_name: l_tax
        count: 6001215.0
          ndv: 9.0
     num_null: 0.0
    data_size: 4.800972E7
avg_size_byte: 8.0
          min: 0.00
          max: 0.08
       method: FULL
         type: FUNDAMENTALS
      trigger: MANUAL
  query_times: 0
 updated_time: 2023-11-07 11:00:46

Stop a statistics collection job

You can execute the KILL ANALYZE statement to stop a statistics collection job.

Syntax

KILL ANALYZE job_id;

Parameter description

Parameter

Description

job_id

The ID of the statistics collection job. You can obtain the job ID from the output of the ANALYZE statement executed to collect statistics in asynchronous mode. You can also obtain the job ID by executing the SHOW ANALYZE statement.

Example

The following statement shows an example on how to stop the statistics collection job whose ID is 52357.

KILL ANALYZE 52357;

Session variables and FE configuration items

Session variables

Session variable

Default value

Description

auto_analyze_start_time

00:00:00

The start time of the automatic collection job.

auto_analyze_end_time

23:59:59

The end time of the automatic collection job.

enable_auto_analyze

true

Specifies whether to enable the automatic collection feature.

huge_table_default_sample_rows

4194304

The number of rows to be sampled for large tables.

huge_table_lower_bound_size_in_bytes

5368709120

The threshold for sampling collection. If the size of a table exceeds this value, sampling collection is automatically used during automatic collection.

huge_table_auto_analyze_interval_in_millis

43200000

The minimum interval at which automatic collection is performed for large tables. During the specified interval, automatic collection is performed only once for tables whose size is larger than the threshold specified for the huge_table_lower_bound_size_in_bytes parameter.

table_stats_health_threshold

60

The statistics health threshold. Valid values: 0 to 100. If the ratio of updated data reaches the value of (100 - the value of the table_stats_health_threshold parameter) × %, the statistics are considered to be outdated.

analyze_timeout

43200

The timeout period of the statistics collection job. Unit: seconds.

auto_analyze_table_width_threshold

70

The maximum number of columns that a table used for automatic collection can contain. If the number of columns in a table exceeds the value, the table does not participate in automatic collection.

FE configuration items

The following FE configuration items are related to the statistics to be collected. In most cases, you do not need to focus on these configuration items.

FE configuration item

Default value

Description

analyze_record_limit

20000

The maximum number of persistent rows that can be saved to record the execution of statistics collection jobs.

stats_cache_size

500000

The maximum number of statistics rows that can be cached on the FE side.

statistics_simultaneously_running_task_num

3

The maximum number of asynchronous statistics collection jobs that can be concurrently executed.

statistics_sql_mem_limit_in_bytes

2,147,483,648 bytes (2 GiB)

The maximum backend (BE) memory that can be occupied by each SQL statement for statistic collection.

FAQ

  • Q: What do I do if the "Stats table not available..." error is reported after I submit the ANALYZE statement?

    A: Check the status of the BE.

    You can execute the SHOW BACKENDS statement to check whether the status of the BE is normal. If the status of the BE is normal, you can execute the ADMIN SHOW REPLICA STATUS FROM __internal_schema.[tbl_in_this_db] statement to check the status of tablets in the database. Make sure that the status of all tablets is normal. During the execution of the ANALYZE statement, the collected statistics are written to the internal table named __internal_schema.column_statistics. Before the execution of the ANALYZE statement, the FE checks the tablet status of the table. If an unavailable tablet exists, the request to collect statistics is rejected.

  • Q: What do I do if the statistics fail to be collected from a large table?

    A: We recommend that you execute the ANALYZE ... WITH SAMPLE... statement to collect statistics from the large table. The resources that can be consumed by the ANALYZE statement are strictly limited. In this case, if you execute the ANALYZE statement on large tables, the execution may time out or be limited by the available BE memory.