The ANALYZE TABLE statement collects column statistics for a table. The query optimizer uses these statistics to generate efficient execution plans.
Syntax
ANALYZE TABLE <table_name> [PARTITION (<pt_spec>)]
COMPUTE STATISTICS FOR COLUMNS [(<column_name> [, <column_name> ...])];
Omit PARTITION for non-partitioned tables. Omit the column list to collect statistics for all columns.
Delete statistics
To delete statistics instead of collecting them:
-- Delete statistics for all columns
ANALYZE TABLE <table_name> DELETE STATISTICS FOR COLUMNS;
-- Delete statistics for specific columns
ANALYZE TABLE <table_name> DELETE STATISTICS FOR COLUMNS (<column_name> [, <column_name> ...]);
Parameters
| Parameter | Description |
|---|---|
table_name |
The name of the table. |
column_name |
The name of a column to analyze. Omit to analyze all columns. |
pt_spec |
The partition to analyze. Format: <partition_key>='<value>'[, <partition_key>='<value>' ...]. For example, ds='20201221' or ds='20201221', hr='11'. |
Statistics output fields
After running ANALYZE TABLE, use SHOW STATISTIC to view the results. The output includes the following fields:
| Field | Applies to | Description |
|---|---|---|
MaxValue |
Numeric types | Maximum value in the column. |
MinValue |
Numeric types | Minimum value in the column. |
DistinctNum |
All types | Number of distinct values (NDV). |
NullNum |
All types | Number of null values. |
TopK |
All types | Up to 20 values with the highest occurrence frequency. For example, 10=1.0 means the value 10 appears once. |
MaxLength |
STRING, VARCHAR | Maximum column value length. |
AvgLength |
STRING, VARCHAR | Average column value length. |
Examples
Collect statistics on a non-partitioned table
The following minimal example shows the basic syntax:
ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS (col1, col2);
The following complete example creates a table, inserts data, collects statistics, and verifies the results.
-- Create a non-partitioned table
CREATE TABLE IF NOT EXISTS analyze2_test (
tinyint1 TINYINT,
smallint1 SMALLINT,
int1 INT,
bigint1 BIGINT,
double1 DOUBLE,
decimal1 DECIMAL,
decimal2 DECIMAL(20, 10),
string1 STRING,
varchar1 VARCHAR(10),
boolean1 BOOLEAN,
timestamp1 TIMESTAMP,
datetime1 DATETIME
) LIFECYCLE 30;
-- Insert sample data
INSERT OVERWRITE TABLE analyze2_test
SELECT * FROM VALUES
(1Y, 20S, 4, 8L, 123452.3, 12.4, 52.5, 'str1', 'str21', false, TIMESTAMP '2018-09-17 00:00:00', DATETIME '2018-09-17 00:59:59'),
(10Y, 2S, 7, 11111118L, 67892.3, 22.4, 42.5, 'str12', 'str200', true, TIMESTAMP '2018-09-17 00:00:00', DATETIME '2018-09-16 00:59:59'),
(20Y, 7S, 4, 2222228L, 12.3, 2.4, 2.57, 'str123', 'str2', false, TIMESTAMP '2018-09-18 00:00:00', DATETIME '2018-09-17 00:59:59'),
(null, null, null, null, null, null, null, null, null, null, null, null)
AS t(tinyint1, smallint1, int1, bigint1, double1, decimal1, decimal2, string1, varchar1, boolean1, timestamp1, datetime1);
-- Collect statistics for a single column
ANALYZE TABLE analyze2_test COMPUTE STATISTICS FOR COLUMNS (tinyint1);
-- Collect statistics for multiple columns
ANALYZE TABLE analyze2_test COMPUTE STATISTICS FOR COLUMNS (smallint1, string1, boolean1, timestamp1);
-- Collect statistics for all columns
ANALYZE TABLE analyze2_test COMPUTE STATISTICS FOR COLUMNS;
Verify the results with SHOW STATISTIC:
-- Verify single-column statistics
SHOW STATISTIC analyze2_test COLUMNS (tinyint1);
-- Verify multi-column statistics
SHOW STATISTIC analyze2_test COLUMNS (smallint1, string1, boolean1, timestamp1);
-- Verify all-column statistics
SHOW STATISTIC analyze2_test COLUMNS;
The following result is returned for the tinyint1 column:
ID = 20201126085225150gnqo****
tinyint1:MaxValue: 20
tinyint1:DistinctNum: 4.0
tinyint1:MinValue: 1
tinyint1:NullNum: 1.0
tinyint1:TopK: {1=1.0, 10=1.0, 20=1.0}
Collect statistics on a partitioned table
The following complete example creates a partitioned table, inserts data across multiple partitions, collects statistics for a specific partition, and verifies the results.
-- Create a partitioned table
CREATE TABLE IF NOT EXISTS srcpart_test (
key STRING,
value STRING
)
PARTITIONED BY (ds STRING, hr STRING)
LIFECYCLE 30;
-- Insert data into multiple partitions
INSERT INTO TABLE srcpart_test PARTITION(ds='20201220', hr='11') VALUES
('123', 'val_123'), ('76', 'val_76'), ('447', 'val_447'), ('1234', 'val_1234');
INSERT INTO TABLE srcpart_test PARTITION(ds='20201220', hr='12') VALUES
('3', 'val_3'), ('12331', 'val_12331'), ('42', 'val_42'), ('12', 'val_12');
INSERT INTO TABLE srcpart_test PARTITION(ds='20201221', hr='11') VALUES
('543', 'val_543'), ('2', 'val_2'), ('4', 'val_4'), ('9', 'val_9');
INSERT INTO TABLE srcpart_test PARTITION(ds='20201221', hr='12') VALUES
('23', 'val_23'), ('56', 'val_56'), ('4111', 'val_4111'), ('12333', 'val_12333');
-- Collect statistics for the ds='20201221' partition
ANALYZE TABLE srcpart_test PARTITION(ds='20201221') COMPUTE STATISTICS FOR COLUMNS (key, value);
Verify the results:
SHOW STATISTIC srcpart_test PARTITION (ds='20201221') COLUMNS (key, value);
The following result is returned, with statistics broken out by sub-partition:
ID = 20210105121800689g28p****
(ds=20201221,hr=11) key:MaxLength 3.0
(ds=20201221,hr=11) key:AvgLength: 1.0
(ds=20201221,hr=11) key:DistinctNum: 4.0
(ds=20201221,hr=11) key:NullNum: 0.0
(ds=20201221,hr=11) key:TopK: {2=1.0, 4=1.0, 543=1.0, 9=1.0}
(ds=20201221,hr=11) value:MaxLength 7.0
(ds=20201221,hr=11) value:AvgLength: 5.0
(ds=20201221,hr=11) value:DistinctNum: 4.0
(ds=20201221,hr=11) value:NullNum: 0.0
(ds=20201221,hr=11) value:TopK: {val_2=1.0, val_4=1.0, val_543=1.0, val_9=1.0}
(ds=20201221,hr=12) key:MaxLength 5.0
(ds=20201221,hr=12) key:AvgLength: 3.0
(ds=20201221,hr=12) key:DistinctNum: 4.0
(ds=20201221,hr=12) key:NullNum: 0.0
(ds=20201221,hr=12) key:TopK: {12333=1.0, 23=1.0, 4111=1.0, 56=1.0}
(ds=20201221,hr=12) value:MaxLength 9.0
(ds=20201221,hr=12) value:AvgLength: 7.0
(ds=20201221,hr=12) value:DistinctNum: 4.0
(ds=20201221,hr=12) value:NullNum: 0.0
(ds=20201221,hr=12) value:TopK: {val_12333=1.0, val_23=1.0, val_4111=1.0, val_56=1.0}
When you analyze a partition, MaxCompute returns separate statistics for each sub-partition within the specified range.