
ApsaraDB for SelectDB: Statistics

Last Updated: Mar 28, 2026

ApsaraDB for SelectDB collects per-column statistics to help the cost-based optimizer (CBO) choose efficient query plans. You can trigger collection manually or let the system collect statistics automatically.

How it works

During query optimization, the CBO uses statistics to estimate predicate selectivity and compare execution plan costs. Accurate statistics lead to better plan selection and faster queries.

All collected statistics are written to the internal table __internal_schema.column_statistics. Before running any collection job, the frontend (FE) checks that all tablets in this table are available. If an unavailable tablet is detected, the job is rejected.

Statistics collected per column

SelectDB collects the following statistics for each column:

Field          Description
row_count      Total number of rows
data_size      Total data size
avg_size_byte  Average value length in bytes
ndv            Number of distinct values
min            Minimum value
max            Maximum value
null_count     Number of null values

Choose a collection method

Two collection methods are available. Choose based on table size and the accuracy requirements of your workload.

Method              How it works                                         Tradeoff
Full collection     Scans the entire table                               Most accurate; higher resource cost and slower
Sampled collection  Scans a subset of rows or a percentage of the table  Faster and lighter on resources; slightly less accurate

For tables larger than 5 GiB, use sampled collection to avoid timeouts and excessive backend (BE) memory usage.

Collect statistics manually

Run the ANALYZE statement to collect or refresh statistics on demand.

Syntax

ANALYZE < TABLE table_name | DATABASE db_name >
    [ (column_name [, ...]) ]
    [ [ WITH SYNC ] [ WITH SAMPLE { PERCENT percent | ROWS rows } ] ];

Parameters

Parameter                   Description
table_name                  The table to analyze. Use the database_name.table_name format to specify the database.
column_name                 One or more columns to analyze. Separate multiple column names with commas.
WITH SYNC                   Run the job synchronously and return the result after completion. Without this option, the job runs asynchronously and returns a job ID.
WITH SAMPLE PERCENT | ROWS  Use sampled collection. Specify either a sampling ratio (percent) or a fixed number of rows.

Examples

Collect statistics by sampling 10% of rows:

ANALYZE TABLE lineitem WITH SAMPLE PERCENT 10;

Collect statistics by sampling 100,000 rows:

ANALYZE TABLE lineitem WITH SAMPLE ROWS 100000;
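For a small table, a full synchronous collection is often simplest. A minimal sketch that combines the default full scan with the WITH SYNC option described above, so the statement blocks until statistics are refreshed:

```sql
-- Full (unsampled) collection of all columns in lineitem.
-- WITH SYNC makes the statement return only after the job
-- completes, so no job ID is issued.
ANALYZE TABLE lineitem WITH SYNC;
```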

Configure automatic collection

Automatic collection is enabled by default. After each import transaction commits, SelectDB recalculates the statistics health of the affected tables and triggers collection jobs as needed.

How health score works

Statistics health is a value from 0 to 100. Automatic collection is triggered when all of the following conditions are met:

  • The table has been updated since the last collection.

  • The health score has dropped below the configured threshold.

  • For large tables: the minimum collection interval has elapsed since the last collection.

Health score details:

  • A table with no statistics has a health score of 0.

  • After each import transaction, SelectDB estimates the new health based on the proportion of rows updated.

  • If health falls below table_stats_health_threshold (default: 60), the table is considered stale and scheduled for collection.
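If the default threshold triggers recollection too late for a fast-changing table, you can raise it. A sketch using the table_stats_health_threshold session variable listed later in this topic:

```sql
-- Consider statistics stale once health drops below 80
-- rather than the default 60, so collection triggers earlier.
SET table_stats_health_threshold = 80;
```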

Large table strategy

For tables larger than huge_table_lower_bound_size_in_bytes (default: 5 GiB):

  • SelectDB uses sampled collection automatically, sampling huge_table_default_sample_rows rows (default: 4,194,304).

  • Collection runs at most once per huge_table_auto_analyze_interval_in_millis (default: 12 hours), regardless of health score changes within that window.

To increase sampling depth for more accurate statistics, raise huge_table_default_sample_rows.
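For example, to double the sample size (a sketch; 8388608 is simply twice the default 4,194,304, chosen here for illustration):

```sql
-- Sample twice as many rows from large tables for tighter
-- ndv/min/max estimates, at roughly double the collection cost.
SET huge_table_default_sample_rows = 8388608;
```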

Limit collection to off-peak hours

To avoid impacting production workloads, set a time window for automatic collection:

SET auto_analyze_start_time = '02:00:00';
SET auto_analyze_end_time = '06:00:00';

Disable automatic collection

SET enable_auto_analyze = false;

External catalog

Automatic collection is disabled by default for external catalogs to prevent excessive resource consumption from large historical datasets. Enable or disable it per catalog:

-- Enable automatic collection for an external catalog
ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='true');

-- Disable automatic collection for an external catalog
ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='false');

Manage collection jobs

View collection jobs

SHOW [AUTO] ANALYZE [ table_name | job_id ]
    [ WHERE STATE = { "PENDING" | "RUNNING" | "FINISHED" | "FAILED" } ];

Parameter   Description
AUTO        Show historical automatic collection jobs. By default, only the latest 20,000 completed automatic jobs are retained.
table_name  Filter by table. Use the database_name.table_name format. Returns all jobs if omitted.
job_id      Filter by job ID. The job ID is returned when you run ANALYZE in asynchronous mode. Returns all jobs if omitted.

Example:

SHOW ANALYZE 245073\G
*************************** 1. row ***************************
              job_id: 245073
        catalog_name: internal
             db_name: default_cluster:tpch
            tbl_name: lineitem
            col_name: [l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,
                        l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,
                        l_extendedprice,l_linenumber,l_discount,l_shipinstruct]
            job_type: MANUAL
       analysis_type: FUNDAMENTALS
             message:
last_exec_time_in_ms: 2023-11-07 11:00:52
               state: FINISHED
            progress: 16 Finished  |  0 Failed  |  0 In Progress  |  16 Total
       schedule_type: ONCE

Output fields:

Field                 Description
job_id                Job ID
catalog_name          Catalog name
db_name               Database name
tbl_name              Table name
col_name              Columns analyzed
job_type              Job type: MANUAL or AUTO
analysis_type         Statistics type
message               Information about the collection job, such as failure details
last_exec_time_in_ms  Last execution timestamp
state                 Job state: PENDING, RUNNING, FINISHED, or FAILED
progress              Task completion breakdown
schedule_type         Scheduling method: ONCE for one-time jobs
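The WHERE clause in the syntax above can narrow the listing to problem jobs. For example, using the tpch.lineitem sample table from elsewhere in this topic:

```sql
-- List only failed collection jobs for the lineitem table.
SHOW ANALYZE tpch.lineitem WHERE STATE = "FAILED";
```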

View table-level statistics

SHOW TABLE STATS <table_name>;

Example:

SHOW TABLE STATS lineitem\G
*************************** 1. row ***************************
updated_rows: 0
 query_times: 0
   row_count: 6001215
updated_time: 2023-11-07
     columns: [l_returnflag, l_receiptdate, l_tax, l_shipmode, l_suppkey, l_shipdate,
               l_commitdate, l_partkey, l_orderkey, l_quantity, l_linestatus, l_comment,
               l_extendedprice, l_linenumber, l_discount, l_shipinstruct]
     trigger: MANUAL

Output fields:

Field         Description
updated_rows  Number of rows updated since the last ANALYZE statement
query_times   Reserved; will indicate query count in a later version
row_count     Number of rows in the table. The value may not reflect the exact row count at execution time.
updated_time  When the statistics were last updated
columns       Columns with collected statistics
trigger       How the last collection was triggered

View column-level task status

Each collection job spawns one task per column. To view task status for a specific job:

SHOW ANALYZE TASK STATUS <job_id>;

Example:

SHOW ANALYZE TASK STATUS 20038;
+---------+----------+---------+----------------------+----------+
| task_id | col_name | message | last_exec_time_in_ms | state    |
+---------+----------+---------+----------------------+----------+
| 20039   | col4     |         | 2023-06-01 17:22:15  | FINISHED |
| 20040   | col2     |         | 2023-06-01 17:22:15  | FINISHED |
| 20041   | col3     |         | 2023-06-01 17:22:15  | FINISHED |
| 20042   | col1     |         | 2023-06-01 17:22:15  | FINISHED |
+---------+----------+---------+----------------------+----------+

View column statistics

SHOW COLUMN [CACHED] STATS table_name [ (column_name [, ...]) ];

Parameter    Description
CACHED       Show statistics currently cached in FE memory
table_name   The table to inspect. Use the database_name.table_name format.
column_name  One or more columns. Separate multiple column names with commas.

Example:

SHOW COLUMN STATS lineitem(l_tax)\G
*************************** 1. row ***************************
  column_name: l_tax
        count: 6001215.0
          ndv: 9.0
     num_null: 0.0
    data_size: 4.800972E7
avg_size_byte: 8.0
          min: 0.00
          max: 0.08
       method: FULL
         type: FUNDAMENTALS
      trigger: MANUAL
  query_times: 0
 updated_time: 2023-11-07 11:00:46
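To inspect what the FE currently holds in memory rather than the persisted values, add the CACHED keyword from the syntax above. The cached view may be empty if the statistics have not yet been loaded into the cache:

```sql
-- Show the FE-cached statistics for l_tax; the output
-- fields mirror those of SHOW COLUMN STATS.
SHOW COLUMN CACHED STATS lineitem(l_tax);
```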

Stop a collection job

KILL ANALYZE job_id;

The job ID is returned by ANALYZE in asynchronous mode, or retrieved from SHOW ANALYZE.

Example:

KILL ANALYZE 52357;

Session variables and FE configuration items

Session variables

Variable                                    Default     Description
auto_analyze_start_time                     00:00:00    Start time of the automatic collection window
auto_analyze_end_time                       23:59:59    End time of the automatic collection window
enable_auto_analyze                         true        Enable or disable automatic collection
huge_table_default_sample_rows              4194304     Number of rows sampled for large tables
huge_table_lower_bound_size_in_bytes        5368709120  Size threshold above which a table is treated as large (default: 5 GiB)
huge_table_auto_analyze_interval_in_millis  43200000    Minimum interval between automatic collections for large tables (default: 12 hours)
table_stats_health_threshold                60          Health score threshold (0–100). A table is considered stale when its health falls below this value.
analyze_timeout                             43200       Timeout for a collection job, in seconds
auto_analyze_table_width_threshold          70          Maximum number of columns a table can have to participate in automatic collection

FE configuration items

These items control background behavior. In most cases, the defaults are sufficient.

FE configuration item                       Default     Description
analyze_record_limit                        20000       Maximum number of job execution records stored persistently
stats_cache_size                            500000      Maximum number of statistics rows cached on the FE side
statistics_simultaneously_running_task_num  3           Maximum number of concurrent asynchronous collection jobs
statistics_sql_mem_limit_in_bytes           2147483648  Maximum BE memory per SQL statement used for collection (default: 2 GiB)

FAQ

The "Stats table not available..." error appears after running ANALYZE

Run SHOW BACKENDS to confirm that all BE nodes are in a normal state. If the BEs look healthy, check the tablet status of the internal statistics table:

ADMIN SHOW REPLICA STATUS FROM __internal_schema.[tbl_in_this_db];

Make sure all tablets show a normal status. The FE checks tablet availability before accepting any ANALYZE request and rejects it if any tablet in __internal_schema.column_statistics is unavailable.

Statistics collection fails on a large table

Use sampled collection instead of a full scan. The resources available to ANALYZE are strictly limited, and a full scan of a large table can time out or exhaust BE memory:

ANALYZE TABLE <table_name> WITH SAMPLE PERCENT 10;

Adjust the sampling ratio based on table size and the accuracy you need.