Implementing PostgreSQL Hook: Stats on Tables with Full Scans & Corresponding SQLs

This article provides a guide to implementing a PostgreSQL hook for tracking and gathering detailed statistics on tables undergoing full scans and their corresponding SQL statements.

By digoal

Background

How does PostgreSQL add the hook feature to do statistics on the SQL statements of fully scanned tables?

Demo 1

In PostgreSQL, you can create a custom plug-in to use the hook mechanism, thus getting the statistics of fully scanned tables and the corresponding SQL statements. PostgreSQL's hook mechanism allows developers to insert custom behavior at key operating points without modifying the core PostgreSQL code.

To implement this feature, you may need to use an executor hook or a planner hook depending on when you want to capture the information of a full table scan. One possible approach is to use ExecutorStart_hook which is called before the execution plan starts, or ExecutorRun_hook, which is called before the plan actually runs.

Here's a simplified example of how to use ExecutorStart_hook to add a hook to get statistical information of full table scans:

1. Create a new plug-in project and write custom hook code.

#include "postgres.h"  
#include "executor/executor.h"  
#include "nodes/plannodes.h"  
  
PG_MODULE_MAGIC;  
  
/* Declare the original ExecutorStart function pointer */  
static ExecutorStart_hook_type prev_ExecutorStart = NULL;  
  
/* Implement the custom ExecutorStart hook */  
void MyExecutorStart(QueryDesc *queryDesc, int eflags)  
{  
    /* Check the plan before calling the original ExecutorStart */  
    if (queryDesc->plannedstmt != NULL && queryDesc->plannedstmt->planTree != NULL)  
    {  
        Plan *plan = queryDesc->plannedstmt->planTree;  
        if (nodeTag(plan) == T_SeqScan)  
        {  
            /* A full table scan is found. You can make a record or do statistics here */  
            /* Example: Output to server logs */  
            elog(LOG, " Full table scan performed: %s", queryDesc->sourceText);  
        }  
    }  
  
    /* Call the original ExecutorStart function */  
    if (prev_ExecutorStart)  
    {  
        prev_ExecutorStart(queryDesc, eflags);  
    }  
    else  
    {  
        standard_ExecutorStart(queryDesc, eflags);  
    }  
}  
  
/* Initialization function when the module is loaded */  
void _PG_init(void)  
{  
    /* Save the original ExecutorStart function pointer */  
    prev_ExecutorStart = ExecutorStart_hook;  
      
    /* Set a new hook */  
    ExecutorStart_hook = MyExecutorStart;  
}  
  
/* The cleanup function when the module is uninstalled */  
void _PG_fini(void)  
{  
    /* Restore the original hook */  
    ExecutorStart_hook = prev_ExecutorStart;  
}

2. Compile your plug-in code as a shared library and make sure it is in the PostgreSQL library path.

3. Load the plug-in in PostgreSQL. You can preload the plug-in at database startup by setting the parameter shared_preload_libraries, or load it dynamically using the LOAD command.

-- Add to postgresql.conf 
shared_preload_libraries = 'my_custom_plugin'  
  
-- Or use an SQL command to dynamically load  
LOAD 'my_custom_plugin';

4. Restart the PostgreSQL service for the settings to take effect, or run the LOAD command.

pg_ctl restart

The hook in this example checks whether the current execution plan contains a full table scan (SeqScan) and writes the relevant information to the log before the execution plan starts.

Note that more complex situations need to be handled in actual use, such as nested nodes, multi-table scans, and different types of scans. You may need to traverse the entire plan tree to accurately detect all full table scans.

Before implementing such a custom hook, it is important to conduct adequate testing and code reviews to ensure that it does not affect the performance and stability of the database.

Demo 2

In PostgreSQL, if you want to capture and make statistical analysis on SQL queries of full table scans (Sequential Scans), and output information including database names, schema names, table names, and SQL statements, you may need to rely on several internal hook functions and combine system directory information. The following are the basic steps and sample code to show how to implement this feature by creating a custom plug-in that uses an executor hook.

Define your hook function: You need to define a ExecutorStart_hook, ExecutorRun_hook, or ExecutorFinish_hook. In this hook function, you will check whether the execution plan contains full table scans and collect the required information.
Traverse the execution plan and collect information: To detect whether there are full table scans, you need to traverse the nodes that execute the plan and check whether they are SeqScan nodes. If so, you can obtain the corresponding database name, schema name, and table name from the system directory table.
Output information: Once a full table scan is detected, you can log the information, store it in a database table, or send it to an external monitoring system.

The following is a basic sample code that shows how to define an ExecutorStart_hook and output information to the log when a full table scan is detected:

#include "postgres.h"  
#include "executor/executor.h"  
#include "nodes/plannodes.h"  
#include "utils/rel.h"  
  
PG_MODULE_MAGIC;  
  
void _PG_init(void);  
void _PG_fini(void);  
static void myExecutorStart(QueryDesc *queryDesc, int eflags);  
  
static ExecutorStart_hook_type prev_ExecutorStart = NULL;  
  
static void myExecutorStart(QueryDesc *queryDesc, int eflags) {  
    if (prev_ExecutorStart) {  
        prev_ExecutorStart(queryDesc, eflags);  
    } else {  
        standard_ExecutorStart(queryDesc, eflags);  
    }  
  
    Plan *plan = queryDesc->plannedstmt->planTree;  
    if (nodeTag(plan) == T_SeqScan) {  
        SeqScan *scan = (SeqScan *)plan;  
        Relation relation = scan->scanrelid;  
        Oid relid = relation->rd_id;  
  
        // Obtain the table information  
        char *relname = get_rel_name(relid);  
        Oid schema_oid = get_rel_namespace(relid);  
        char *schemaname = get_namespace_name(schema_oid);  
  
        // Obtain the database information  
        char *dbname = get_database_name(MyDatabaseId);  
  
        // Output the log information  
        elog(LOG, "Database: %s, Schema: %s, Table: %s, SQL: %s",   
             dbname, schemaname, relname, queryDesc->sourceText);  
    }  
}  
  
void _PG_init(void) {  
    prev_ExecutorStart = ExecutorStart_hook;  
    ExecutorStart_hook = myExecutorStart;  
}  
  
void _PG_fini(void) {  
    ExecutorStart_hook = prev_ExecutorStart;  
}

This code provides a basic framework. In real applications, you may need to do more error checking, traverse the entire plan tree to check all possible SeqScan nodes, and consider factors such as concurrency and performance impact.

In addition, to compile this code, you need to configure the PostgreSQL development environment, and after writing, you need to compile it into a dynamic library and place the library file in the PostgreSQL library directory. Finally, you need to set the shared_preload_libraries in the postgresql.conf to load the plugin and restart the PostgreSQL service.

Note that the actual situation may be more complicated because this example is just a starting point. In the generated execution plan, the full table scan may be encapsulated in more complex nodes, such as subquery nodes or CTE nodes. Therefore, it may be necessary to recursively traverse the entire plan tree to ensure that all full table scans are captured.

Before deploying such a custom plug-in, ensure that it is adequately tested to avoid negative impacts on the performance and database stability.

Demo 3

Indeed, the previous example code did not contain logic to traverse the execution tree. In practice, you usually need to recursively check each node in the execution plan tree to determine whether any child nodes have performed a full table scan (SeqScan). The following sample code provides a simplified function that recursively traverses the execution plan tree and a modified ExecutorStart_hook, so as to find and output information about the operation of full table scans.

#include "postgres.h"  
#include "executor/executor.h"  
#include "nodes/plannodes.h"  
#include "utils/rel.h"  
#include "catalog/pg_class.h"  
#include "catalog/namespace.h"  
#include "utils/lsyscache.h"  
  
PG_MODULE_MAGIC;  
  
void _PG_init(void);  
void _PG_fini(void);  
static void myExecutorStart(QueryDesc *queryDesc, int eflags);  
static void findSeqScan(Plan *plan, QueryDesc *queryDesc);  
  
static ExecutorStart_hook_type prev_ExecutorStart = NULL;  
  
/* Recursively traverse the execution plan tree to find the SeqScan node */  
static void findSeqScan(Plan *plan, QueryDesc *queryDesc) {  
    if (plan == NULL) {  
        return;  
    }  
  
    if (IsA(plan, SeqScan)) {  
        Oid relid = ((SeqScan *)plan)->scanrelid;  
        char *relname = get_rel_name(relid);  
        Oid schema_oid = get_rel_namespace(relid);  
        char *schemaname = get_namespace_name(schema_oid);  
        char *dbname = get_database_name(MyDatabaseId);  
  
        elog(LOG, "Detected SeqScan: Database: %s, Schema: %s, Table: %s, SQL: %s",  
             dbname, schemaname, relname, queryDesc->sourceText);  
    }  
  
    /* Traverse child nodes */  
    ListCell *cell;  
    foreach(cell, plan->initPlan) {  
        SubPlan *subplan = (SubPlan *) lfirst(cell);  
        Plan *subplan_plan = planner_subplan_get_plan(queryDesc->plannedstmt, subplan);  
        findSeqScan(subplan_plan, queryDesc);  
    }  
  
    findSeqScan(plan->lefttree, queryDesc);  
    findSeqScan(plan->righttree, queryDesc);  
}  
  
static void myExecutorStart(QueryDesc *queryDesc, int eflags) {  
    if (prev_ExecutorStart) {  
        prev_ExecutorStart(queryDesc, eflags);  
    } else {  
        standard_ExecutorStart(queryDesc, eflags);  
    }  
  
    /* Start recursively traversing the execution plan tree */  
    findSeqScan(queryDesc->plannedstmt->planTree, queryDesc);  
}  
  
void _PG_init(void) {  
    prev_ExecutorStart = ExecutorStart_hook;  
    ExecutorStart_hook = myExecutorStart;  
}  
  
void _PG_fini(void) {  
    ExecutorStart_hook = prev_ExecutorStart;  
}

In this code, we define a findSeqScan function to recursively traverse the execution plan tree. At each node, it checks if it is a SeqScan node. If so, it gets the name of the associated table, schema, and database, and records this information in logs.

In addition, for nodes that contain subqueries or subplans, we also recursively traverse these subnodes. The initPlan list contains SubPlan nodes in the execution plan, which may be subqueries or CTEs. By calling the planner_subplan_get_plan, we can get the Plan structure corresponding to each subplan and recursively traverse them.

Note that this sample code is simplified for clearer illustration. In a real deployment, you may need to deal with more node types and consider different cases such as parametric scans, functional scans, and so on. You also need to perform adequate testing to ensure that the hook logic does not interfere with the normal operation of the database and does not introduce unnecessary performance overhead.

Community

Implementing PostgreSQL Hook: Stats on Tables with Full Scans & Corresponding SQLs

Background

Demo 1

Demo 2

Demo 3

Read previous post:

Read next post:

digoal

You may also like

Comments

digoal

Related Products

PolarDB for PostgreSQL

Database for FinTech Solution

Oracle Database Migration Solution

Database Migration Solution