AnalyticDB for PostgreSQL V7.0 provides the column group statistics feature that allows you to collect statistics on multiple columns. This feature breaks the assumption of independent distribution for cost estimation. Therefore, it can increase the accuracy of cost estimation and improve query performance.

Precautions

The column group statistics feature supports only the planner optimizer (formerly known as legacy optimizer). For more information about how to configure an optimizer, see Configure parameters.

Syntax

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    [ ( statistics_kind [, ... ] ) ]
    ON column_name, column_name [, ...]
    FROM table_name

Parameters

Parameter Description
statistics_name The name of the column group statistical item.
statistics_kind The type of statistics to be collected. Valid values:
  • ndistinct: the number of distinct values among multiple columns.
  • dependencies: the functional dependency among multiple columns.
  • mcv: the value that most frequently occurs among multiple columns and the occurrence frequency of the value.
If this parameter is not specified, all the preceding types of statistics are collected.
column_name The name of the column involved in column group statistics. At least two columns must be specified.
table_name The name of the table that contains the preceding columns.

Examples

  1. Create a table named ndistinct and insert data into the table.
    CREATE TABLE ndistinct(a int, b int, c int, d int) DISTRIBUTED BY (d);
    INSERT INTO ndistinct(a, b) SELECT i/100, i/200, i/100, i FROM generate_series(1, 10000000) i;

    The values of columns a and b are strongly dependent, which violates the assumption of independent distribution.

  2. Create a function to calculate the estimated and actual numbers of rows.
    CREATE FUNCTION check_estimated_rows(text) returns table (estimated int, actual int)
    LANGUAGE plpgsql AS
    $$
    DECLARE
        ln text;
        tmp text[];
        first_row bool := true;
    BEGIN
        FOR ln in
            execute format('explain analyze %s', $1)
        LOOP
            IF first_row then
                first_row := false;
                tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
                return query select tmp[1]::int, tmp[2]::int;
            END IF;
        END LOOP;
    END;
    $$;
  3. Query the estimated and actual numbers of rows.
    SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, b');

    The following information is returned:

     estimated | actual
    -----------+--------
       1000000 | 100001
    (1 row)

    In this example, the estimated number of rows is 1000000, and the actual number of rows is 100001. A large difference exists.

  4. Query the execution plan.
    EXPLAIN ANALYZE SELECT count(*) FROM ndistinct GROUP BY a,b,c;

    The following execution plan is returned. It can be used for comparison with the execution plan after a column group statistical item is created.

                                                                            QUERY PLAN
    -------------------------------------------------------------------------------------------------------------------------------------------------------------
     Gather Motion 3:1  (slice1; segments: 3)  (cost=137814.33..154481.00 rows=1000000 width=20) (actual time=3986.143..4014.601 rows=100001 loops=1)
       ->  HashAggregate  (cost=137814.33..141147.67 rows=333333 width=20) (actual time=3977.037..4002.222 rows=33661 loops=1)
             Group Key: a, b, c
             Peak Memory Usage: 0 kB
             ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..104481.00 rows=3333333 width=12) (actual time=0.079..1492.983 rows=3366100 loops=1)
                   Hash Key: a, b, c
                   ->  Seq Scan on ndistinct  (cost=0.00..37814.33 rows=3333333 width=12) (actual time=0.050..1101.632 rows=3334839 loops=1)
     Planning Time: 0.161 ms
       (slice0)    Executor memory: 12336K bytes.
       (slice1)    Executor memory: 13884K bytes avg x 3 workers, 13899K bytes max (seg0).  Work_mem: 16401K bytes max.
       (slice2)    Executor memory: 37K bytes avg x 3 workers, 37K bytes max (seg0).
     Memory used:  128000kB
     Optimizer: Postgres query optimizer
     Execution Time: 4041.613 ms
    (14 rows)
  5. Create a column group statistical item and execute the ANALYZE statement to collect statistics.
    CREATE STATISTICS s_a_b(ndistinct) ON a,b FROM ndistinct;
    ANALYZE ndistinct;
  6. Query the estimated and actual numbers of rows again.
    SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, b');

    The following information is returned:

     estimated | actual
    -----------+--------
         99431 | 100001
    (1 row)

    In this example, the estimated number of rows is 99431, and the actual number of rows is 100001. The estimated number of rows is more accurate after the column group statistical item is created.

  7. Query the execution plan.
    EXPLAIN ANALYZE SELECT count(*) FROM ndistinct GROUP BY a,b,c;

    The following execution plan is returned. The execution plan shows that Partial HashAggregate is generated to speed up query execution because the estimation is more accurate.

                                                                               QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------
     Gather Motion 3:1  (slice1; segments: 3)  (cost=75124.91..76782.09 rows=99431 width=20) (actual time=2854.765..2879.734 rows=100001 loops=1)
       ->  Finalize HashAggregate  (cost=75124.91..75456.34 rows=33144 width=20) (actual time=2853.610..2868.194 rows=33661 loops=1)
             Group Key: a, b, c
             Peak Memory Usage: 0 kB
             ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=71147.67..74130.60 rows=99431 width=20) (actual time=2269.435..2759.413 rows=100983 loops=1)
                   Hash Key: a, b, c
                   ->  Partial HashAggregate  (cost=71147.67..72141.98 rows=99431 width=20) (actual time=2744.039..2794.808 rows=100001 loops=1)
                         Group Key: a, b, c
                         Peak Memory Usage: 0 kB
                         ->  Seq Scan on ndistinct  (cost=0.00..37814.33 rows=3333333 width=12) (actual time=0.028..454.030 rows=3334839 loops=1)
     Planning Time: 0.173 ms
       (slice0)    Executor memory: 4670K bytes.
       (slice1)    Executor memory: 3134K bytes avg x 3 workers, 3149K bytes max (seg0).  Work_mem: 5649K bytes max.
       (slice2)    Executor memory: 13848K bytes avg x 3 workers, 13848K bytes max (seg0).  Work_mem: 14353K bytes max.
     Memory used:  128000kB
     Optimizer: Postgres query optimizer
     Execution Time: 2893.470 ms
    (17 rows)