This topic describes the table size caching feature of PolarDB for PostgreSQL (Compatible with Oracle).
Background information
During the execution lifecycle of an SQL statement in a database, the involved system and user tables, along with their indexes, are accessed multiple times. During this process, PolarDB for PostgreSQL (Compatible with Oracle) frequently queries the number of blocks occupied by a table file by invoking the smgrnblocks function of the storage manager smgr. The smgrnblocks function invokes the lseek function of the file system to query the file size, which is then used to calculate the number of blocks based on the block size. However, the smgrnblocks function resides on the critical path of SQL processing. Frequent invocations to a function of the file system severely affect database performance. This issue is further compounded in PolarDB for PostgreSQL (Compatible with Oracle), which uses the distributed Polar File System (PolarFS). Compared to centralized file systems, PolarFS exhibits higher latency for invoking the lseek function. This prolongs the execution duration of SQL statements.
To minimize file system calls, PolarDB for PostgreSQL (Compatible with Oracle) implements a Relation Size Cache (RSC) within the smgr to store the number of blocks occupied by each table file in shared memory. When the number of blocks occupied by a table changes, the cached value in shared memory is updated accordingly. The RSC is preferentially used to query the number of blocks occupied by a table file. This reduces the number of file system calls and accelerates SQL execution.
Prerequisites
Your PolarDB for PostgreSQL (Compatible with Oracle) cluster runs PolarDB for PostgreSQL (Compatible with Oracle) 2.0 whose revision version is 2.0.14.12.23.1 or later.
View the revision version in the console or by executing the SHOW polardb_version; statement. Upgrade the revision version if needed.
Feature description
RSC query
The RSC consists of a one-dimensional array and a hash table in shared memory.
Each cache entry in the one-dimensional array stores the number of blocks occupied by a relation such as a table or an index.
The hash table maps relation identifiers, known as RelFileNode values, to their corresponding cache entries in the one-dimensional array. A RelFileNode value uniquely identifies a relation at the storage layer.
All processes attempt to query the RSC by using the following two levels of indexes in sequence:
Level 1 index: This index caches pointers to recently accessed RSC cache entries and their corresponding generation numbers. When a process requests information about a table, the level 1 index is checked first. If the requested table matches the one stored in the level 1 index and the generation number of the RSC cache entry corresponding to the pointer remains unchanged, the process can directly obtain the number of blocks occupied by the table by using the pointer. In scenarios with a high read-to-write ratio for specific tables, the level 1 index ensures a high hit ratio. If modifications occur to the mappings in the RSC hash table, the hash table must be updated, and the generation numbers of the affected RSC cache entries must be automatically incremented. This mechanism ensures that stale data in the level 1 index is invalidated.
Level 2 index: If the level 1 index does not contain the required information, which indicates a cache miss, the process checks the level 2 index. This index queries the hash table in shared memory to obtain the RSC cache entry corresponding to the required table. Once the cache entry is located, the process obtains the number of blocks occupied by the table, and updates the level 1 index.
If the table and its corresponding block count cannot be queried in the RSC by using the preceding two levels of indexes, the cache replacement mechanism is triggered. This mechanism uses the Segmented Least Recently Used (SLRU) algorithm to select and clear a recently unused RSC cache entry. Subsequently, the lseek function of the file system is invoked to determine the actual block count. This count is then updated to the previously cleared RSC cache entry, and the two levels of indexes are synchronously updated.
RSC updates on the primary and standby nodes
After the RSC feature is enabled, any modification to the size of a table by invoking a function of the smgr requires a synchronous update to the relevant RSC cache entry in shared memory. This ensures that the cached block count of the table is the same as the actual number of blocks occupied by the table in the file system.
When you invoke a function to extend a table, you must update the number of blocks occupied by the extended table to the RSC.
When you invoke a function to truncate a table, you must update the number of blocks occupied by the truncated table to the RSC.
The standby nodes use physical replication to stay synchronized with the primary node. During the replay of Write Ahead Log (WAL) records, the standby nodes invoke the same functions of the smgr to modify the table size. Therefore, RSC updates are identically handled on the primary and standby nodes.
RSC updates on replica nodes
Replica nodes share physical storage with the primary node and stay synchronized with the primary node by using log indexes. Replica nodes do not perform write operations on the storage. Therefore, you cannot invoke a function of the smgr to update the RSC values in shared memory.
To determine whether file sizes have changed in the shared storage, replica nodes parse the sequence numbers of data blocks referenced in WAL records and compare the sequence numbers with the block count stored in the RSC. If a WAL record references a block sequence number higher than the cached block count, the cached block count is updated to reflect the actual block count. If a WAL record that indicates table truncation is parsed, the cached block count in the RSC is invalidated. The next time a function of the file system is invoked to obtain the number of blocks occupied by the truncated table, the RSC is repopulated with the actual block count.
Usage notes
Grand Unified Configuration (GUC) parameters
The following table describes the GUC parameters provided by the RSC.
Parameter | Description |
polar_enable_rel_size_cache | Specifies whether to enable the RSC feature. Valid values:
|
polar_enable_replica_rel_size_cache | Specifies whether to enable the RSC feature for replica nodes. Valid values:
|
polar_enable_standby_rel_size_cache | Specifies whether to enable the RSC feature for standby nodes. Valid values:
|
Performance test
When the RSC feature is disabled, the latency for querying the number of blocks occupied by a 32-GB table is approximately 55 microseconds.
SHOW polar_enable_rel_size_cache; polar_enable_rel_size_cache ----------------------------- off (1 row) SELECT polar_smgrperf_nblocks(32, true, false); NOTICE: testing logical file length with 32 GB INFO: iops=18341.1/s, lat=54.52us INFO: iops=17504.0/s, lat=57.13us INFO: iops=17960.8/s, lat=55.68us INFO: iops=17973.0/s, lat=55.64us INFO: iops=17603.5/s, lat=56.81us INFO: iops=17403.8/s, lat=57.46us INFO: iops=17506.2/s, lat=57.12us INFO: iops=18061.7/s, lat=55.37usWhen the RSC feature is enabled, the latency for querying the number of blocks occupied by a 32-GB table is approximately 0.07 microseconds.
SHOW polar_enable_rel_size_cache; polar_enable_rel_size_cache ----------------------------- on (1 row) SELECT polar_smgrperf_nblocks(32, true, false); NOTICE: testing logical file length with 32 GB INFO: iops=14155515.6/s, lat=0.07us INFO: iops=13897273.6/s, lat=0.07us INFO: iops=13869926.3/s, lat=0.07us INFO: iops=13779602.7/s, lat=0.07us INFO: iops=14159120.5/s, lat=0.07us INFO: iops=14147065.6/s, lat=0.07us INFO: iops=14124141.9/s, lat=0.07us INFO: iops=14162773.3/s, lat=0.07us