
Hologres: Query and analyze table statistics

Last Updated: Feb 21, 2024

Hologres V1.3 and later collect statistics on tables in your Hologres instance on a daily basis and store the statistics in the system table named hologres.hg_table_info. This way, you can query and analyze the statistics on your tables and take optimization measures based on the results of statistical analysis. This topic describes how to query and analyze table statistics in Hologres.

Limits

  • Hologres V1.3 and later allow you to query table statistics. If the version of your Hologres instance is earlier than V1.3, manually upgrade your Hologres instance in the Hologres console or join a Hologres DingTalk group to apply for an instance upgrade. For more information about how to manually upgrade a Hologres instance, see Instance upgrades. For more information about how to join a Hologres DingTalk group, see Obtain online support for Hologres.

  • The hologres.hg_table_info table is updated with a delay of one day: data for a given day may not be written until 05:00 on the next day. On the day when your Hologres instance is upgraded from V1.1 to V1.3, no table statistics are generated. If you query table statistics on that day, the following error message is reported: meta warehouse store currently not available. In this case, query the table statistics on the day after the upgrade is complete.

Usage notes

  • By default, the table statistics of the previous 30 days are retained.

  • For a non-partitioned internal table in Hologres, you can query the detailed statistics on the table, including the storage space, the number of files, the accumulated number of times that the table was accessed, and the number of rows.

  • For other objects in Hologres, such as views, materialized views, foreign tables, and parent tables, you can query only the basic information. For example, you can query the number of partitions of a parent table, the name of the table to which a foreign table maps, and the definition of a view or materialized view.

  • The hologres.hg_table_info table belongs to the meta warehouse system of Hologres. Failure in querying the hologres.hg_table_info table does not affect queries in instances. Therefore, the stability of the hologres.hg_table_info table is not specified in the service level agreement (SLA) of Hologres.

hologres.hg_table_info table

Table statistics are stored in the system table named hologres.hg_table_info. If the version of your Hologres instance is V1.3 or later, table statistics are collected on a daily basis by default. The following information describes the fields that are contained in the hologres.hg_table_info table.
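
For a quick first look at the data, a minimal query such as the following sketch lists the most recently collected statistics. The field names used here are described in the table below.

```sql
-- A minimal sketch: show the latest collected statistics for up to
-- ten tables in the current instance.
SELECT
  db_name,
  schema_name,
  table_name,
  type,
  row_count,
  collect_time
FROM
  hologres.hg_table_info
ORDER BY
  collect_time DESC
LIMIT 10;
```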

       Field          |           Data type           |   Description
--------------------+--------------------------+-----------------------------------------------------------------------
 db_name            | text                     | The name of the database to which the table belongs. 
 schema_name        | text                     | The name of the schema to which the table belongs. 
 table_name         | text                     | The name of the table. 
 table_id           | text                     | The unique ID of the table. If the table is a foreign table, the unique ID is in the format of Database name.Schema name.Table name. 
 type               | text                     | The type of the table, which can be TABLE, FOREIGN TABLE, PARTITION, VIEW, or MATERIALIZED VIEW. 
 partition_spec     | text                     | The partitioning condition. The field is valid only if the current table is a child table. 
 is_partition       | boolean                  | Indicates whether the table is a child table. 
 owner_name         | text                     | The username of the user who owns the table. This field corresponds to the username field of the hologres.hg_query_log table. You can join the two tables based on the owner_name and usename fields. 
 create_time        | timestamp with time zone | The time when the table was created. 
 last_ddl_time      | timestamp with time zone | The time when the last DDL operation was performed on the table. 
 last_modify_time   | timestamp with time zone | The time when the table data was last modified. 
 last_access_time   | timestamp with time zone | The time when the table was last accessed. 
 view_def           | text                     | The definition of the view. This field is valid only if the current table is a view. 
 comment            | text                     | The description of the table or the view. 
 hot_storage_size   | bigint                   | The storage space that is occupied by the hot data of the table, in bytes. 
 cold_storage_size  | bigint                   | The storage space that is occupied by the cold data of the table, in bytes. 
 hot_file_count     | bigint                   | The number of hot data files of the table. 
 cold_file_count    | bigint                   | The number of cold data files of the table. 
 table_meta         | jsonb                    | The metadata of the table. The value is in the JSON format. 
 row_count          | bigint                   | The number of rows in the table or the partition. 
 collect_time       | timestamp with time zone | The time when the current statistics were collected. 
 partition_count    | bigint                   | The number of child tables. This field is valid only if the current table is a parent table. 
 parent_schema_name | text                     | The name of the schema to which the parent table belongs. This field is valid only if the current table is a child table. 
 parent_table_name  | text                     | The name of the parent table. This field is valid only if the current table is a child table. 
 total_read_count   | bigint                   | The accumulated number of times that data was read from the table. The number can be affected by SELECT, INSERT, UPDATE, and DELETE operations and does not indicate an accurate value. 
 total_write_count  | bigint                   | The accumulated number of times that data was written to the table. The number can be affected by INSERT, UPDATE, and DELETE operations and does not indicate an accurate value.

  • Some fields may be empty. For tables that were created before the Hologres instance was upgraded to V1.3 or later, some statistics about the creation of the tables could not be collected. These statistics are collected as expected for tables that are created after the upgrade.

  • If the value of the type field is VIEW, the create_time field and the last_ddl_time field are empty.

  • If the value of the type field is VIEW, FOREIGN TABLE, or PARTITION, the following fields are empty: last_modify_time, last_access_time, hot_file_count, cold_file_count, total_read_count, and total_write_count.

  • You can obtain the storage size of a table from the hologres.hg_table_info system table or by using the pg_relation_size function. The results of the two methods may differ, which is expected: statistics in the hologres.hg_table_info system table are collected only once a day, and the result of the pg_relation_size function does not include the storage size of binary logs.
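
As noted in the owner_name field description, hologres.hg_table_info can be joined with the query log table hologres.hg_query_log on owner_name = usename. The following is a sketch, assuming hg_query_log exposes usename and query_start fields:

```sql
-- Sketch: for each table recorded on the previous day, count the queries
-- that the table owner ran in the same period.
-- Assumes hologres.hg_query_log contains the usename and query_start fields.
SELECT
  i.db_name,
  i.schema_name,
  i.table_name,
  i.owner_name,
  count(q.query_start) AS owner_query_count
FROM
  hologres.hg_table_info i
  LEFT JOIN hologres.hg_query_log q
    ON i.owner_name = q.usename
    AND q.query_start > (current_date - interval '1 day')::timestamptz
WHERE
  i.collect_time > (current_date - interval '1 day')::timestamptz
GROUP BY
  i.db_name, i.schema_name, i.table_name, i.owner_name;
```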

Grant query permissions to a user

Only an authorized user can query table statistics. The following section describes how to grant the relevant query permissions to a user in different scenarios.

  • Query the table statistics of all databases in a Hologres instance.

    • Assign the superuser role to a user.

      After you assign the superuser role to a user, the user can view the table statistics of all databases in the Hologres instance.

      -- Replace Account ID with the Alibaba Cloud account ID of the user. For a RAM user, prefix the account ID with p4_. 
      ALTER USER "Account ID" SUPERUSER;
    • Add users to the pg_stat_scan_tables or pg_read_all_stats user group.

      In addition to superusers, users in the pg_stat_scan_tables (for versions earlier than Hologres V1.3.44) or pg_read_all_stats (for Hologres V1.3.44 and later) user group also have permissions to query the table statistics of all databases in the relevant Hologres instance. If you are a superuser, you can add regular users to the user group. Syntax:

      -- For versions earlier than Hologres V1.3.44
      GRANT pg_stat_scan_tables TO "Alibaba Cloud account ID"; -- Use the standard PostgreSQL authorization model to grant relevant permissions to the user.
      CALL spm_grant('pg_stat_scan_tables', 'Alibaba Cloud account ID'); -- Use the simple permission model (SPM) to grant relevant permissions to the user.
      CALL slpm_grant('pg_stat_scan_tables', 'Alibaba Cloud account ID'); -- Use the schema-level permission model (SLPM) to grant relevant permissions to the user.
      
      -- For Hologres V1.3.44 and later
      GRANT pg_read_all_stats TO "Account ID"; -- Use the standard PostgreSQL authorization model to grant relevant permissions to the user.
      CALL spm_grant('pg_read_all_stats', 'Account ID');  -- Use the simple permission model (SPM) to grant relevant permissions to the user.
      CALL slpm_grant('pg_read_all_stats', 'Account ID'); -- Use the schema-level permission model (SLPM) to grant related permissions to the user.
  • Query the table statistics of a database.

    Users in the db_admin user group have the permissions to query the table statistics of the relevant database. If you are a superuser, you can add regular users to this user group by using the SPM or SLPM.

    Note

    A regular user can query the statistics on only the tables that the user owns in the relevant database.

    CALL spm_grant('<db_name>_admin', 'Account ID');  -- Use the SPM to grant related permissions to the user.
    CALL slpm_grant('<db_name>.admin', 'Account ID'); -- Use the SLPM to grant related permissions to the user.
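
To verify a grant made under the standard PostgreSQL authorization model, you can inspect role membership through the standard PostgreSQL catalogs. A sketch, assuming you have permission to read pg_roles and pg_auth_members:

```sql
-- Sketch: list members of the statistics-related roles by using the
-- standard PostgreSQL catalogs.
SELECT
  r.rolname AS role_name,
  m.rolname AS member_name
FROM
  pg_auth_members am
  JOIN pg_roles r ON am.roleid = r.oid
  JOIN pg_roles m ON am.member = m.oid
WHERE
  r.rolname IN ('pg_stat_scan_tables', 'pg_read_all_stats');
```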

Execute SQL statements to query and analyze table statistics

Scenario 1: Query the trends of table statistics by day in a Hologres instance

-- Query the trends of statistics of all tables in an instance in the previous week. The statistics include the occupied storage space, file quantity, accumulated number of times that tables were accessed, and number of rows.
SELECT
  db_name,
  schema_name,
  table_name,
  collect_time :: date as collect_date,
  hot_storage_size,
  cold_storage_size,
  hot_file_count,
  cold_file_count,
  total_read_count,
  total_write_count,
  row_count
FROM
  hologres.hg_table_info
WHERE
  collect_time > (current_date-interval '1 week')::timestamptz -- Specify the previous week.
  AND type = 'TABLE'
;

-- Use an aggregate function to query the trends of statistics on a parent table.
SELECT
  db_name,
  parent_schema_name AS schema_name,
  parent_table_name AS table_name,
  collect_time :: date AS collect_date,
  sum(hot_file_count) AS hot_file_count,
  sum(cold_file_count) AS cold_file_count,
  sum(hot_storage_size) AS hot_storage_size,
  sum(cold_storage_size) AS cold_storage_size,
  sum(total_read_count) AS total_read_count,
  sum(total_write_count) AS total_write_count,
  sum(row_count) AS row_count,
  count(*) AS partition_count
FROM
  hologres.hg_table_info
WHERE
  is_partition
GROUP BY
  collect_date,
  db_name,
  parent_schema_name,
  parent_table_name
;    

Scenario 2: Query the access information of tables that occupy large disk space

-- Query the access information about the top 10 tables that occupy large disk space.
SELECT
  db_name,
  schema_name,
  table_name,
  hot_storage_size + cold_storage_size as total_storage_size,
  row_count,
  total_read_count,
  total_write_count
FROM
  hologres.hg_table_info
WHERE
  collect_time > (current_date-interval '1 week')::timestamptz -- Specify the previous week.
  AND type = 'TABLE'
  AND (
    cold_storage_size is NOT NULL
    OR hot_storage_size IS NOT NULL
  )
ORDER BY
  total_storage_size DESC
LIMIT
  10;

Scenario 3: Query the access trends of tables that occupy large disk space

-- Query the access trends over the previous week for the top 10 tables that occupied the most disk space on the previous day.
with tmp_table_info AS (
  SELECT
    db_name,
    schema_name,
    table_name,
    collect_time,
    hot_storage_size + cold_storage_size as total_storage_size,
    row_count,
    total_read_count,
    total_write_count
  FROM
    hologres.hg_table_info
  WHERE
    collect_time > (current_date-interval '1 week')::timestamptz -- Specify the previous week.
    AND type = 'TABLE'
    AND (
      hot_storage_size IS NOT NULL 
      OR cold_storage_size IS NOT NULL 
    )
)
SELECT
  d.*
FROM
  (
    SELECT
      db_name,
      schema_name,
      table_name,
      total_storage_size
    FROM
      tmp_table_info
    WHERE
      collect_time > (current_date-interval '1 day')::timestamptz -- Specify the previous day.
      and collect_time < current_date::timestamptz
    ORDER BY
      total_storage_size DESC
    LIMIT
      10
  ) t     -- The top 10 tables that occupy large disk space on the previous day.
  left join (
    SELECT
      db_name,
      schema_name,
      table_name,
      collect_time :: date as collect_date,
      total_read_count,
      total_write_count,
      total_storage_size,
      row_count
    FROM
      tmp_table_info 
  ) d      -- The data size changes by day in the previous seven days.
  ON t.db_name = d.db_name
    AND t.schema_name = d.schema_name
    AND t.table_name = d.table_name
ORDER BY
  t.total_storage_size desc,
  d.collect_date desc,
  d.db_name,
  d.schema_name,
  d.table_name;

Scenario 4: Query the access information and data size changes of tables that occupy large disk space by day

-- Query the access information and data size changes of the top 10 tables that occupy large disk space by day in the previous week.
with tmp_table_info AS (
  SELECT
    db_name,
    schema_name,
    table_name,
    collect_time,
    hot_storage_size + cold_storage_size as total_storage_size,
    row_count,
    total_read_count,
    total_write_count
  FROM
    hologres.hg_table_info
  WHERE
    collect_time > (current_date-interval '1 week')::timestamptz -- Specify the previous week.
    AND type = 'TABLE'
    AND (
      hot_storage_size IS NOT NULL 
      OR cold_storage_size IS NOT NULL 
    )
)
SELECT
  d.*
FROM
  (
    SELECT
      db_name,
      schema_name,
      table_name,
      total_storage_size
    FROM
      tmp_table_info
    WHERE
      collect_time > (current_date-interval '1 day')::timestamptz -- Specify the previous day.
      and collect_time < current_date::timestamptz
    ORDER BY
      total_storage_size DESC
    LIMIT
      10
  ) t     -- The top 10 tables that occupy large disk space on the previous day.
  left join (
    SELECT
      db_name,
      schema_name,
      table_name,
      collect_time :: date as collect_date,
      first_value(total_read_count) OVER w - last_value(total_read_count) OVER w AS day_read_count,
      first_value(total_write_count) OVER w - last_value(total_write_count) OVER w AS day_write_count,
      first_value(row_count) OVER w - last_value(row_count) OVER w AS day_modify_count
      FROM
        tmp_table_info 
      WINDOW w AS (
        PARTITION BY 
          db_name,
          schema_name,
          table_name
        ORDER BY
          collect_time DESC 
        ROWS BETWEEN 
          CURRENT ROW AND 1 FOLLOWING  -- The differences between two consecutive days.
      )
  ) d      -- The data size changes by day in the previous seven days.
  ON t.db_name = d.db_name
    AND t.schema_name = d.schema_name
    AND t.table_name = d.table_name
ORDER BY
  t.total_storage_size desc,
  d.collect_date desc,
  d.db_name,
  d.schema_name,
  d.table_name;

Scenario 5: Query the access information and data size changes of tables that occupy large disk space in the previous week

-- Query the access information and data size changes over the previous week for the top 10 tables that occupy large disk space.
with tmp_table_info AS (
  SELECT
    db_name,
    schema_name,
    table_name,
    collect_time,
    hot_storage_size + cold_storage_size as total_storage_size,
    row_count,
    total_read_count,
    total_write_count
  FROM
    hologres.hg_table_info
  WHERE
    collect_time > (current_date-interval '1 week')::timestamptz -- Specify the previous week.
    AND type = 'TABLE'
    AND (
      hot_storage_size IS NOT NULL 
      OR cold_storage_size IS NOT NULL 
    )
)
SELECT
  d.*
FROM
  (
    SELECT
      db_name,
      schema_name,
      table_name,
      total_storage_size
    FROM
      tmp_table_info
    WHERE
      collect_time > (current_date-interval '1 day')::timestamptz -- Specify the previous day.
      and collect_time < current_date::timestamptz
    ORDER BY
      total_storage_size DESC
    LIMIT
      10
  ) t     -- The top 10 tables that occupy large disk space on the previous day.
  left join (
    SELECT
      db_name,
      schema_name,
      table_name,
      collect_time,
      first_value(total_read_count) OVER w - last_value(total_read_count) OVER w AS week_read_count,
      first_value(total_write_count) OVER w - last_value(total_write_count) OVER w AS week_write_count,
      first_value(row_count) OVER w - last_value(row_count) OVER w AS week_modify_count
      FROM
        tmp_table_info 
      WINDOW w AS (
        PARTITION BY 
          db_name,
          schema_name,
          table_name
        ORDER BY
          collect_time DESC 
        ROWS BETWEEN 
          CURRENT ROW AND 6 FOLLOWING  -- The difference accumulated across the previous seven days.
      )
  ) d      -- The data size changes by day in the previous seven days.
  ON t.db_name = d.db_name
    AND t.schema_name = d.schema_name
    AND t.table_name = d.table_name
WHERE
    d.collect_time > (current_date-interval '1 day')::timestamptz -- Specify the previous day.
    and d.collect_time < current_date::timestamptz
ORDER BY
  t.total_storage_size desc,
  d.db_name,
  d.schema_name,
  d.table_name;

Scenario 6: Query the sizes of tables that were least accessed in the previous week

-- Query the size of the top 10 tables that were least accessed in the previous week.
SELECT
  *
FROM
  (
    SELECT
      collect_time,
      schema_name,
      table_name,
      hot_storage_size + cold_storage_size as total_storage_size,
      first_value(total_read_count) OVER w - last_value(total_read_count) OVER w AS recent_read_count
    FROM
      hologres.hg_table_info
    WHERE
      collect_time > (current_date-interval '1 week')::timestamptz -- Specify the previous week.
      AND type = 'TABLE'
      AND (
        hot_storage_size IS NOT NULL 
        OR cold_storage_size IS NOT NULL 
      )
    WINDOW w AS (
      PARTITION BY 
        db_name,
        schema_name,
        table_name
      ORDER BY
        collect_time DESC 
      ROWS BETWEEN 
        CURRENT ROW AND 6 FOLLOWING -- Specify the previous seven days.
    )
  ) tmp
WHERE
  collect_time > (current_date - interval '1 day')::timestamptz
  and collect_time < current_date::timestamptz -- Keep only the rows collected on the previous day, which carry the totals for the previous seven days.
ORDER BY
  recent_read_count
LIMIT
  10;

Scenario 7: Query the number of rows that change on the day when the table was last modified

-- Query the time when each table was last modified and the change in the number of rows that resulted from that modification.
-- If the instance contains a large number of tables, we recommend that you add filter conditions to the tmp_table_info table to prevent long query times caused by large data volumes.
with tmp_table_info AS (
  SELECT
    db_name,
    schema_name,
    table_name,
    row_count,
    collect_time,
    last_modify_time
  FROM
    hologres.hg_table_info
  WHERE
    last_modify_time IS NOT NULL
    AND type = 'TABLE'
    -- Add filter conditions to limit the size of the tmp_table_info table.
    -- Example 1: collect_time > (current_date - interval '14 day')::timestamptz
    -- Example 2: table_name like ''
    -- Example 3: type = 'PARTITION'
)
SELECT
  end_data.db_name AS db_name,
  end_data.schema_name AS schema_name,
  end_data.table_name AS table_name,
  (end_data.row_count - start_data.row_count) AS modify_row_count,
  end_data.row_count AS current_rows,
  end_data.last_modify_time AS last_modify_time
FROM
  (
    SELECT
      db_name,
      schema_name,
      table_name,
      row_count,
      last_modify_time
    FROM
      tmp_table_info
    WHERE
      collect_time > (current_date-interval '1 day')::timestamptz -- Query the last modification time of the table that was recorded in the previous day.
  ) end_data
  left join (
    SELECT
      db_name,
      schema_name,
      table_name,
      row_count,
      collect_time
    FROM
      tmp_table_info
  ) start_data ON (
    end_data.db_name = start_data.db_name
    AND end_data.schema_name = start_data.schema_name
    AND end_data.table_name = start_data.table_name
    AND end_data.last_modify_time::date = (start_data.collect_time + interval '1 day')::date
  )
;

Scenario 8: Query tables that occupy large disk space due to too many small files

-- Query the number of files for each table and the disk space occupied by the table and sort the tables based on the average file size.
-- The shard count is available only for table groups in the current database; the column is empty for tables in other databases.
SELECT
  db_name,
  schema_name,
  table_name,
  cold_storage_size + hot_storage_size as total_storage_size,
  cold_file_count + hot_file_count as total_file_count,
  (cold_storage_size + hot_storage_size) / (cold_file_count + hot_file_count) AS avg_file_size,
  tmp_table_info.table_meta ->> 'table_group' as table_group,
  tg_info.shard_count
FROM
  hologres.hg_table_info tmp_table_info
  left join (
    SELECT
      tablegroup_name,
      property_value as shard_count
    FROM
      hologres.hg_table_group_properties
    WHERE
      property_key = 'shard_count'
  ) tg_info on tmp_table_info.table_meta ->> 'table_group' = tg_info.tablegroup_name
WHERE
  collect_time > (current_date - interval '1 day')::timestamptz
  AND type = 'TABLE'
  AND (
    cold_storage_size IS NOT NULL
    OR hot_storage_size IS NOT NULL
  )
  AND (
    cold_file_count IS NOT NULL
    OR hot_file_count IS NOT NULL
  )
  AND cold_file_count + hot_file_count <> 0
ORDER BY
  avg_file_size
;