The MaxCompute optimizer supports histograms for columns in tables. A histogram is used to describe the distribution of column values in different value ranges. Histograms provide finer-grained statistics and help optimize query performance.
Histogram introduction
A histogram is a tool that describes the distribution of column values in different value ranges. Histograms enable finer-grained statistics than other column-level statistical methods.
A histogram comprises a number of non-overlapping buckets. A bucket corresponds to the statistical metrics within a value range. The statistical metrics include the minimum value, maximum value, number of distinct values (NDV), and number of records.
Precautions
You can collect histograms only for columns of integer and floating-point data types.
You can use only the ANALYZE statement to collect histograms. Automatic histogram collection is not supported.
Histograms of a maximum of 30 columns can be collected at a time.
You cannot use the ANALYZE statement to collect statistics, including histogram-based statistics, on empty partitions or tables.
If new data is written after a histogram is collected, the histogram becomes invalid.
ImportantHistogram collection generates computing costs. If a histogram becomes invalid, you can determine whether to collect a histogram again based on your business requirements.
You can perform cardinality estimation based on histograms only for a single partition or table. Buckets cannot be combined across partitions.
Collect histograms
Syntax
ANALYZE TABLE <tablename> compute statistics FOR columns [(...)] [[WITH histogram [256 buckets]] [columns (...)]];tablename: the name of the table for which you want to collect histograms.
Scenarios
Collect histograms for all columns. Sample statement:
ANALYZE TABLE <tablename> compute statistics FOR columns WITH histogram;The collection costs of histograms are greater than those of other statistical methods. To prevent the misuse of computing resources, the system returns the following error message if the number of columns for which histograms can be collected exceeds the default limit 10:
Analyze histogram column number exceeds auto columns limit, please specify column names for analyze(xxx) or set unlimited auto columns(xxx)You can set the following flag parameter to remove the upper limit on the number of columns for which histograms can be collected. After the upper limit is removed, the system collects histograms for all columns for which histograms can be collected in a table. However, histograms of a maximum of 30 columns can be collected at a time.
set odps.sql.analyze.histogram.auto.column.num = -1;The system collects histograms only for columns of integer and floating-point data types and ignores other columns.
Collect histograms for all columns and specify the number of buckets. Sample statement:
ANALYZE TABLE <tablename> compute statistics FOR columns WITH histogram 256 buckets;The default number of buckets is 256. The maximum number is 1,024.
Specify columns for which you want to collect their values and collect histograms for the specified columns. Sample statement:
ANALYZE TABLE <tablename> compute statistics FOR columns(col1, col2) WITH histogram;Specify columns for which you want to collect their values and columns for which you want to collect histograms. Sample statement:
ANALYZE TABLE <tablename> compute statistics FOR columns(col1, col2) WITH histogram columns (col1);ImportantWhen you specify a column to collect its histogram, the column must be a subset of the
columnslist. For example, in the previous statement,col1inwith histogram columns (col1)is a subset of (col1, col2).Specify columns for which you want to collect their values, columns for which you want to collect histograms, and the number of buckets. Sample statement:
ANALYZE TABLE <tablename> compute statistics FOR columns(col1, col2) WITH histogram 256 buckets columns (col1);
Use histograms
Display the statistics of a histogram.
show statistic <tablename> columns;Run the following command to enable the histogram feature for query optimization. After you enable the histogram feature, the optimizer performs cardinality estimation based on histograms:
set odps.sql.optimizer.histogram.enable=true;