ApsaraDB for SelectDB collects per-column statistics to help the cost-based optimizer (CBO) choose efficient query plans. You can trigger collection manually or let the system collect statistics automatically.
How it works
During query optimization, the CBO uses statistics to estimate predicate selectivity and compare execution plan costs. Accurate statistics lead to better plan selection and faster queries.
All collected statistics are written to the internal table __internal_schema.column_statistics. Before running any collection job, the frontend (FE) checks that all tablets in this table are available. If an unavailable tablet is detected, the job is rejected.
Statistics collected per column
SelectDB collects the following statistics for each column:
| Field | Description |
|---|---|
| row_count | Total number of rows |
| data_size | Total data size |
| avg_size_byte | Average value length in bytes |
| ndv | Number of distinct values |
| min | Minimum value |
| max | Maximum value |
| null_count | Number of null values |
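To make the role of these fields concrete, here is a simplified, hypothetical sketch of how a cost-based optimizer can turn them into row-count estimates for simple predicates. The formulas (uniform-distribution estimate for equality, linear interpolation over [min, max] for ranges) are textbook CBO heuristics, not SelectDB's actual costing code.

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    """Subset of the per-column statistics described above."""
    row_count: int
    ndv: int          # number of distinct values
    null_count: int
    min_val: float
    max_val: float

def estimate_equality_rows(stats: ColumnStats) -> float:
    """Estimate rows matching `col = constant`, assuming non-null
    rows are spread evenly across the ndv distinct values."""
    non_null = stats.row_count - stats.null_count
    return non_null / stats.ndv if stats.ndv else 0.0

def estimate_range_rows(stats: ColumnStats, lo: float, hi: float) -> float:
    """Estimate rows matching `col BETWEEN lo AND hi` by linear
    interpolation over the observed [min, max] domain."""
    non_null = stats.row_count - stats.null_count
    span = stats.max_val - stats.min_val
    if span <= 0:
        return float(non_null)
    overlap = max(0.0, min(hi, stats.max_val) - max(lo, stats.min_val))
    return non_null * overlap / span

# Example: the l_tax statistics shown later in this document
l_tax = ColumnStats(row_count=6001215, ndv=9, null_count=0,
                    min_val=0.00, max_val=0.08)
print(round(estimate_equality_rows(l_tax)))          # roughly 1/9 of the table
print(round(estimate_range_rows(l_tax, 0.00, 0.04))) # roughly half the table
```

With a stale ndv or min/max, both estimates skew, which is why the health-based refresh described below matters for plan quality.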
Choose a collection method
Two collection methods are available. Choose based on table size and the accuracy requirements of your workload.
| Method | How it works | Tradeoff |
|---|---|---|
| Full collection | Scans the entire table | Most accurate; higher resource cost and slower |
| Sampled collection | Scans a subset of rows or a percentage of the table | Faster and lighter on resources; slightly less accurate |
For tables larger than 5 GiB, use sampled collection to avoid timeouts and excessive backend (BE) memory usage.
Collect statistics manually
Run the ANALYZE statement to collect or refresh statistics on demand.
Syntax
ANALYZE < TABLE table_name | DATABASE db_name >
[ (column_name [, ...]) ]
[ [ WITH SYNC ] [ WITH SAMPLE PERCENT | ROWS ] ];
Parameters
| Parameter | Description |
|---|---|
| table_name | The table to analyze. Use the database_name.table_name format to specify the database. |
| column_name | One or more columns to analyze. Separate multiple column names with commas. |
| WITH SYNC | Run the job synchronously and return the result after completion. Without this option, the job runs asynchronously and returns a job ID. |
| WITH SAMPLE PERCENT \| ROWS | Use sampled collection. Specify either a sampling ratio (percent) or a fixed number of rows. |
Examples
Collect statistics by sampling 10% of rows:
ANALYZE TABLE lineitem WITH SAMPLE PERCENT 10;
Collect statistics by sampling 100,000 rows:
ANALYZE TABLE lineitem WITH SAMPLE ROWS 100000;
Configure automatic collection
Automatic collection is enabled by default. After each import transaction commits, SelectDB recalculates the statistics health of the affected tables and triggers collection jobs as needed.
How health score works
Statistics health is a value from 0 to 100. Automatic collection is triggered when all of the following conditions are met:
The table has been updated since the last collection.
The health score has dropped below the configured threshold.
For large tables: the minimum collection interval has elapsed since the last collection.
Health score details:
A table with no statistics has a health score of 0.
After each import transaction, SelectDB estimates the new health based on the proportion of rows updated.
If health falls below table_stats_health_threshold (default: 60), the table is considered stale and scheduled for collection.
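The exact health formula is internal to SelectDB; a minimal sketch, assuming health is simply the percentage of rows left unmodified since the last collection, looks like this:

```python
TABLE_STATS_HEALTH_THRESHOLD = 60  # default threshold from this document

def stats_health(row_count: int, updated_rows: int) -> int:
    """Hypothetical health score: 100 when no rows have changed since
    the last collection, 0 when every row (or more) has changed.
    A table with no statistics at all is treated as health 0."""
    if row_count <= 0:
        return 0
    unchanged_fraction = max(0.0, 1.0 - updated_rows / row_count)
    return round(unchanged_fraction * 100)

def is_stale(row_count: int, updated_rows: int) -> bool:
    """True when the table should be scheduled for collection."""
    return stats_health(row_count, updated_rows) < TABLE_STATS_HEALTH_THRESHOLD

print(stats_health(1000, 100))  # 10% of rows changed -> health 90, not stale
print(is_stale(1000, 500))      # 50% changed -> health 50 < 60, stale
```

Under this model, the default threshold of 60 means a table is re-analyzed once roughly 40% of its rows have changed since the last collection.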
Large table strategy
For tables larger than huge_table_lower_bound_size_in_bytes (default: 5 GiB):
SelectDB uses sampled collection automatically, sampling huge_table_default_sample_rows rows (default: 4,194,304).
Collection runs at most once per huge_table_auto_analyze_interval_in_millis (default: 12 hours), regardless of health score changes within that window.
To increase sampling depth for more accurate statistics, raise huge_table_default_sample_rows.
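Putting the staleness rule and the large-table throttle together, the automatic trigger decision can be sketched as follows. Variable names mirror the configuration items described in this document; the actual scheduler logic is internal to SelectDB.

```python
# Defaults taken from the configuration items documented below
HUGE_TABLE_LOWER_BOUND_SIZE_IN_BYTES = 5 * 1024**3            # 5 GiB
HUGE_TABLE_AUTO_ANALYZE_INTERVAL_IN_MILLIS = 12 * 3600 * 1000  # 12 hours

def should_auto_analyze(table_size_bytes: int, is_stale: bool,
                        millis_since_last_collection: int) -> bool:
    """Sketch of the trigger rule: any stale small table is collected;
    a stale large table must also wait out the minimum interval."""
    if not is_stale:
        return False
    if table_size_bytes <= HUGE_TABLE_LOWER_BOUND_SIZE_IN_BYTES:
        return True
    return millis_since_last_collection >= HUGE_TABLE_AUTO_ANALYZE_INTERVAL_IN_MILLIS

# A stale 10 GiB table analyzed 1 hour ago is skipped until the window elapses
print(should_auto_analyze(10 * 1024**3, True, 1 * 3600 * 1000))   # False
print(should_auto_analyze(10 * 1024**3, True, 13 * 3600 * 1000))  # True
```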
Limit collection to off-peak hours
To avoid impacting production workloads, set a time window for automatic collection:
SET auto_analyze_start_time = '02:00:00';
SET auto_analyze_end_time = '06:00:00';
Disable automatic collection
SET enable_auto_analyze = false;
External catalog
Automatic collection is disabled by default for external catalogs to prevent excessive resource consumption from large historical datasets. Enable or disable it per catalog:
-- Enable automatic collection for an external catalog
ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='true');
-- Disable automatic collection for an external catalog
ALTER CATALOG external_catalog SET PROPERTIES ('enable.auto.analyze'='false');
Manage collection jobs
View collection jobs
SHOW [AUTO] ANALYZE < table_name | job_id >
[ WHERE [ STATE = [ "PENDING" | "RUNNING" | "FINISHED" | "FAILED" ] ] ];
| Parameter | Description |
|---|---|
| AUTO | Show historical automatic collection jobs. By default, only the latest 20,000 completed automatic jobs are retained. |
| table_name | Filter by table. Use the database_name.table_name format. Returns all jobs if omitted. |
| job_id | Filter by job ID. The job ID is returned when you run ANALYZE in asynchronous mode. Returns all jobs if omitted. |
Example:
SHOW ANALYZE 245073\G
*************************** 1. row ***************************
job_id: 245073
catalog_name: internal
db_name: default_cluster:tpch
tbl_name: lineitem
col_name: [l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,
l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,
l_extendedprice,l_linenumber,l_discount,l_shipinstruct]
job_type: MANUAL
analysis_type: FUNDAMENTALS
message:
last_exec_time_in_ms: 2023-11-07 11:00:52
state: FINISHED
progress: 16 Finished | 0 Failed | 0 In Progress | 16 Total
schedule_type: ONCE
Output fields:
| Field | Description |
|---|---|
| job_id | Job ID |
| catalog_name | Catalog name |
| db_name | Database name |
| tbl_name | Table name |
| col_name | Columns analyzed |
| job_type | Job type: MANUAL or AUTO |
| analysis_type | Statistics type |
| message | Job message; contains error details if the job failed |
| last_exec_time_in_ms | Last execution timestamp |
| state | Job state: PENDING, RUNNING, FINISHED, or FAILED |
| progress | Task completion breakdown |
| schedule_type | Scheduling method: ONCE for one-time jobs |
View table-level statistics
SHOW TABLE STATS <table_name>;
Example:
SHOW TABLE STATS lineitem\G
*************************** 1. row ***************************
updated_rows: 0
query_times: 0
row_count: 6001215
updated_time: 2023-11-07
columns: [l_returnflag, l_receiptdate, l_tax, l_shipmode, l_suppkey, l_shipdate,
l_commitdate, l_partkey, l_orderkey, l_quantity, l_linestatus, l_comment,
l_extendedprice, l_linenumber, l_discount, l_shipinstruct]
trigger: MANUAL
Output fields:
| Field | Description |
|---|---|
| updated_rows | The number of rows updated since the last ANALYZE statement |
| query_times | Reserved; will indicate query count in a later version |
| row_count | The number of table rows. This value may not reflect the exact row count at execution time. |
| updated_time | When the statistics were last updated |
| columns | Columns with collected statistics |
| trigger | How the last collection was triggered |
View column-level task status
Each collection job spawns one task per column. To view task status for a specific job:
SHOW ANALYZE TASK STATUS [job_id];
Example:
SHOW ANALYZE TASK STATUS 20038;
+---------+----------+---------+----------------------+----------+
| task_id | col_name | message | last_exec_time_in_ms | state |
+---------+----------+---------+----------------------+----------+
| 20039 | col4 | | 2023-06-01 17:22:15 | FINISHED |
| 20040 | col2 | | 2023-06-01 17:22:15 | FINISHED |
| 20041 | col3 | | 2023-06-01 17:22:15 | FINISHED |
| 20042 | col1 | | 2023-06-01 17:22:15 | FINISHED |
+---------+----------+---------+----------------------+----------+
View column statistics
SHOW COLUMN [cached] STATS table_name [ (column_name [, ...]) ];
| Parameter | Description |
|---|---|
| cached | Show statistics currently cached in FE memory |
| table_name | The table to inspect. Use the database_name.table_name format. |
| column_name | One or more columns. Separate multiple column names with commas. |
Example:
SHOW COLUMN STATS lineitem(l_tax)\G
*************************** 1. row ***************************
column_name: l_tax
count: 6001215.0
ndv: 9.0
num_null: 0.0
data_size: 4.800972E7
avg_size_byte: 8.0
min: 0.00
max: 0.08
method: FULL
type: FUNDAMENTALS
trigger: MANUAL
query_times: 0
updated_time: 2023-11-07 11:00:46
Stop a collection job
KILL ANALYZE job_id;
The job ID is returned by ANALYZE in asynchronous mode, or retrieved from SHOW ANALYZE.
Example:
KILL ANALYZE 52357;
Session variables and FE configuration items
Session variables
| Variable | Default | Description |
|---|---|---|
| auto_analyze_start_time | 00:00:00 | Start time of the automatic collection window |
| auto_analyze_end_time | 23:59:59 | End time of the automatic collection window |
| enable_auto_analyze | true | Enable or disable automatic collection |
| huge_table_default_sample_rows | 4194304 | Number of rows sampled for large tables |
| huge_table_lower_bound_size_in_bytes | 5368709120 | Size threshold (5 GiB) above which a table is treated as large |
| huge_table_auto_analyze_interval_in_millis | 43200000 | Minimum interval (12 hours) between automatic collections for large tables |
| table_stats_health_threshold | 60 | Health score threshold (0–100). A table is considered stale when its health falls below this value. |
| analyze_timeout | 43200 | Timeout for a collection job, in seconds |
| auto_analyze_table_width_threshold | 70 | Maximum number of columns a table can have to participate in automatic collection |
FE configuration items
These items control background behavior. In most cases, the defaults are sufficient.
| FE configuration item | Default | Description |
|---|---|---|
| analyze_record_limit | 20000 | Maximum number of job execution records stored persistently |
| stats_cache_size | 500000 | Maximum number of statistics rows cached on the FE side |
| statistics_simultaneously_running_task_num | 3 | Maximum number of concurrent asynchronous collection jobs |
| statistics_sql_mem_limit_in_bytes | 2147483648 | Maximum BE memory (2 GiB) per SQL statement used for collection |
FAQ
The "Stats table not available..." error appears after running ANALYZE
Run SHOW BACKENDS to confirm that all BE nodes are in a normal state. If the BEs look healthy, check the tablet status of the internal statistics table:
ADMIN SHOW REPLICA STATUS FROM __internal_schema.[tbl_in_this_db];
Make sure all tablets show a normal status. The FE checks tablet availability before accepting any ANALYZE request and rejects it if any tablet in __internal_schema.column_statistics is unavailable.
Statistics collection fails on a large table
Use sampled collection instead of a full scan. The resources available to ANALYZE are strictly limited, and a full scan of a large table can time out or exhaust BE memory:
ANALYZE TABLE <table_name> WITH SAMPLE PERCENT 10;
Adjust the sampling ratio based on table size and the accuracy you need.