All Products
Search
Document Center

PolarDB:Statistical analysis of specified columns in wide table mode

Last Updated:Jan 15, 2025

In wide table mode, a table contains tens of columns, but only some columns need to be counted or analyzed in queries. In this case, you can use the vectorized engine to accelerate queries.

Background information

In many SaaS business systems, a table may contain dozens or even hundreds of columns, which brings technical challenges for queries.

  • Only a few columns need to be analyzed in some queries, but many irrelevant columns are read when the row-oriented storage engine is used. This increases the I/O burden on the system.

  • The query mode is not fixed. Multiple filter conditions can be used when you query hundreds of columns. When you build a combined index, you must match all query scenarios in advance. If the query conditions change, the combined index becomes invalid.

The vectorized engine and IMCI can be used to accelerate queries in the preceding scenarios. The internal structure of an IMCI is based on columns. Therefore, reading one column does not affect other columns, and there is no sequential relationship between the columns. The order of multiple query conditions does not affect how the IMCI works.

Use cases

Results

If the table contains 100 million rows and the degree of parallelism is 4, the performance when the vectorized engine feature and IMCI are used is 30 times higher than that of the parallel execution in the native PostgreSQL.

Query

Parallel execution in the native PostgreSQL

Vectorized engine + IMCI

Q1

243 s

7.9 s

Procedure

The following case uses the widecolumntable table that contains 24 columns and includes data types such as BIGINT, DECIMAL, TEXT, JSONB, and TEXT[]. Five columns of id_1, domain, consumption, start_time, and end_time are analyzed to obtain the sum spent by each customer in multiple domains in the past year.

Note

Install the polar_csi extension in the destination database. For more information, see Enable and use the vectorized engine feature.

  1. Create the widecolumntable table that uses the following structure, and then insert test data.

    CREATE TABLE widecolumntable (
      id_1 BIGINT NOT NULL PRIMARY KEY,
      id_2 BIGINT,
      id_3 BIGINT,
      id_4 BIGINT,
      id_5 BIGINT,
      id_6 BIGINT,
      version INT,
      domain TEXT,
      consumption DECIMAL(18,3),
      c_level CHARACTER varying(1) NOT NULL,
      priority BIGINT,
      operator TEXT,
      notify_policy TEXT,
      call_id UUID NOT NULL,
      provider_id BIGINT NOT NULL,
      name_1 TEXT NOT NULL,
      name_2 TEXT NOT NULL,
      name_3 TEXT,
      start_time TIMESTAMP WITH TIME ZONE NOT NULL,
      end_time TIMESTAMP WITH TIME ZONE NOT NULL,
      comment JSONB NOT NULL,
      description TEXT[] NOT NULL,
      created_at TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
      updated_at TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL
    );
  2. Add the id_1, domain, consumption_time, start_time, and end_time columns to the IMCI.

    CREATE INDEX idx_wide_csi ON widecolumntable USING CSI(id_1, domain, consumption,  start_time, end_time);
  3. Query the sum spent by each customer in different domains in the past year and sort them by value.

    • Use the vectorized engine.

      --- Enable the vectorized engine feature and set the degree of parallelism to 4.
      SET polar_csi.enable_query to on;
      SET polar_csi.exec_parallel to 4;
      
      ---Q1
      EXPLAIN ANALYZE SELECT id_1, domain,SUM(consumption) FROM widecolumntable WHERE start_time > '20230101' and end_time < '20240101' GROUP BY id_1, domain Order By SUM(consumption);
    • Disable the vectorized engine and use the row-oriented storage engine.

      --- Disable the vectorized engine, use the row-oriented storage engine, and set the degree of parallelism to 4.
      SET polar_csi.enable_query to off;
      SET max_parallel_workers_per_gather to 4;
      
      ---Q1
      EXPLAIN ANALYZE SELECT id_1, domain,SUM(consumption) FROM widecolumntable WHERE start_time > '20230101' and end_time < '20240101' GROUP BY id_1, domain Order By SUM(consumption);