All Products
Search
Document Center

PolarDB:Graph analysis based on PolarDB: Insurance data analysis

Last Updated:Mar 30, 2026

This tutorial shows how to use PolarDB for PostgreSQL to detect insurance claim fraud using graph queries. Starting from a publicly available insurance dataset, you will set up the Apache AGE graph engine, import the data, and run graph queries to identify abnormal claim records and fraudulent groups.

The tutorial covers four query patterns:

  • K-nearest neighbors — Find all claims linked to the same insured patient as a known fraudulent claim

  • Path retrieval — Trace the connection path between two policyholders

  • Common neighbors — Identify shared policyholders between two claims

  • Collaborative recommendation — Surface other potentially fraudulent policies from the same policyholder

About the graph engine

PolarDB for PostgreSQL supports storage and querying of graph data through its Apache AGE extension. It lets you run both ANSI SQL and the openCypher graph query language against the same database cluster.

Apache AGE integrates directly with PolarDB for PostgreSQL as an extension — no separate graph database is needed.

image

Key capabilities:

  • Native PostgreSQL compatibility — AGE works with your existing PolarDB database. No reconstruction required. It inherits all PolarDB features: transaction support, concurrency control, and indexing.

  • Unified graph and relational queries — Run SQL and Cypher in a single query. This simplifies handling of complex data models that span relational and graph structures.

  • Cypher query language — AGE supports the Cypher query language, which provides an intuitive syntax for querying and manipulating graph data.

  • High performance — AGE combines PolarDB's query optimization with graph-specific indexes to handle large-scale graph data and complex traversals efficiently.

Scenario

Insurance claim fraud typically involves interconnected data on patients, diseases, and claims. By modeling these relationships as a graph, you can detect patterns that are difficult to find in relational tables alone — such as a group of policyholders with overlapping claims and shared social connections.

Data model

The tutorial uses a publicly available insurance dataset from IBM. You can find the dataset here.

The data model consists of the following graph elements:

Vertices (nodes):

Vertex Key properties Description
policyholder policyholder_id, fname, lname, risk_score (0–100), high_risk (0 or 1) The person who holds the insurance policy. A high_risk value of 1 indicates higher fraud risk.
claim claim_id, charge, claim_date, duration, diagnosis, type An insurance claim filed against a policy.
patient insured_id The insured person named in the claim.
incharge incharge_id The person in charge of processing the claim.
disease disease_name A diagnosis linked to a patient.

Edges (relationships):

Edge Direction Description
policyholder_of_claim claim → policyholder Links a claim to its policyholder.
insured_of_claim claim → patient Links a claim to the insured patient.
incharge_of_claim claim → incharge Links a claim to the person in charge.
has_disease patient → disease Links a patient to their diagnosis.
similar_claim claim → claim Links claims with a computed similarity score.
policyholder_connection policyholder → policyholder Social connection between policyholders, with a connection level score.
image

Prerequisites

Before you begin, ensure that you have:

  • A PolarDB for PostgreSQL 14 cluster with revision version 2.0.14.12.24.0 or later. For setup instructions, see Getting Started.

  • Access to a PostgreSQL client or similar tool for running SQL statements.

  • Python 3 installed locally, for running the data conversion script in the appendix.

Set up the database

  1. Create the AGE extension in your database.

    CREATE EXTENSION age;
  2. Add AGE to the search_path and session_preload_libraries for your database and user account.

    Note

    Compatibility issues may occur when using Data Management Service (DMS) to configure search_path. In that case, use PolarDB-Tools to run these statements instead.

    ALTER DATABASE <dbname> SET search_path = public,ag_catalog;
    ALTER USER <username> SET search_path = public,ag_catalog;
    
    ALTER DATABASE <dbname> SET session_preload_libraries TO 'age';
    ALTER USER <username> SET session_preload_libraries TO 'age';

Import data

The dataset is in CSV format and does not include the ID fields required by the graph schema. Before importing, run the Python script in the appendix to convert the CSV files into SQL statements.

After running the script, save the output as an SQL file and execute it using your PostgreSQL client.

The import process follows these steps:

  1. Create a graph named graph.

    SELECT create_graph('graph');
  2. Create vertex labels and insert vertices. The script generates statements for each vertex type. For example, for policyholder:

    SELECT create_vlabel('graph','policyholder');
    SELECT * FROM cypher('graph', $$ CREATE (:policyholder {policyholder_id:'PH3068',fname:'ADAM',lname:'OCHSENBEIN',risk_score:'88',high_risk:'1'}) $$ ) as (n agtype);
    SELECT * FROM cypher('graph', $$ CREATE (:policyholder {policyholder_id:'PH3069',fname:'MALINDA',lname:'MEHSERLE',risk_score:'42',high_risk:'0'}) $$ ) as (n agtype);
    SELECT * FROM cypher('graph', $$ CREATE (:policyholder {policyholder_id:'PH3070',fname:'SANDRA',lname:'KUHTA',risk_score:'20',high_risk:'0'}) $$ ) as (n agtype);
    ...

    And for claim:

    SELECT create_vlabel('graph','claim');
    SELECT * FROM cypher('graph', $$ CREATE (:claim {claim_id:'C3571',charge:'6517.53',claim_date:'2013-08-11 00:00:00',duration:'13',insured_id:'28523',diagnosis:'no exception',person_incharge_id:'PI23070',type:'services',policyholder_id:'PH9507'}) $$ ) as (n agtype);
    SELECT * FROM cypher('graph', $$ CREATE (:claim {claim_id:'C3572',charge:'49273.65',claim_date:'2017-02-10 00:00:00',duration:'3',insured_id:'1220',diagnosis:'no exception',person_incharge_id:'PI21197',type:'services',policyholder_id:'PH406'}) $$ ) as (n agtype);
    SELECT * FROM cypher('graph', $$ CREATE (:claim {claim_id:'C3573',charge:'52005.98',claim_date:'2014-06-29 00:00:00',duration:'27',insured_id:'23735',diagnosis:'no exception',person_incharge_id:'PI22361',type:'services',policyholder_id:'PH7911'}) $$ ) as (n agtype);
    ...
  3. Create edges. For example, connections between claims and policyholders:

    SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1528' AND b.policyholder_id = 'PH2963' CREATE (a)-[e:RELTYPE  ]->(b) RETURN e$$) as (e agtype);
    SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1529' AND b.policyholder_id = 'PH1353' CREATE (a)-[e:RELTYPE  ]->(b) RETURN e$$) as (e agtype);
    SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1530' AND b.policyholder_id = 'PH1071' CREATE (a)-[e:RELTYPE  ]->(b) RETURN e$$) as (e agtype);
    ...

Query examples

The following examples use claim C4377 as a known fraudulent claim. Each query demonstrates a different fraud detection pattern.

Statistics

Verify the dataset was imported correctly by checking node counts.

  • Total nodes:

    SELECT count(*) FROM cypher('graph', $$
    MATCH (v)
    RETURN v
    $$) as (v agtype);
    count
    --------
     120567
  • Claims:

    SELECT count(*) FROM cypher('graph', $$
    MATCH (v:claim)
    RETURN v
    $$) as (v agtype);
    count
    --------
     100001
  • Policyholders:

    SELECT count(*) FROM cypher('graph', $$
    MATCH (v:policyholder)
    RETURN v
    $$) as (v agtype);
    count
    -------
     10006
  • Persons in charge:

    SELECT count(*) FROM cypher('graph', $$
    MATCH (v:incharge)
    RETURN v
    $$) as (v agtype);
    count
    -------
     10001
  • Diseases:

    SELECT count(*) FROM cypher('graph', $$
    MATCH (v:disease)
    RETURN v
    $$) as (v agtype);
    count
    -------
       393
  • Patients:

    SELECT count(*) FROM cypher('graph', $$
    MATCH (v:patient)
    RETURN v
    $$) as (v agtype);
    count
    -------
       166

Filter and sort

Look up the policyholder, person in charge, and patient associated with claim C4377. This query uses SQL UNION to combine three separate Cypher traversals — one for each connected vertex type — into a single result set.

SELECT 'policyholder_id' as type, policyholder_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(policyholder:policyholder)
RETURN policyholder.policyholder_id
$$) AS (policyholder_id agtype)
UNION
SELECT 'incharge_id', incharge_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(v:incharge)
RETURN v.incharge_id
$$) AS (incharge_id agtype)
UNION
SELECT 'patient_id', patient_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(v:patient)
RETURN v.patient_id
$$) AS (patient_id agtype);
      type       | policyholder_id
-----------------+-----------------
 patient_id      | "11279"
 policyholder_id | "PH3759"
 incharge_id     | "PI26607"

K-nearest neighbors

Find claims with the same insured patient

Identify all other claims that share the same insured patient as fraudulent claim C4377. A patient appearing in multiple claims is a potential fraud signal.

The query navigates from C4377 to its patient vertex, then follows inbound edges to find other claims connected to that same patient.

SELECT 'claim_id', claim_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(p:patient)<-[]-(c:claim)
RETURN c.claim_id
$$) AS (claim_id agtype);
?column? | claim_id
----------+----------
 claim_id | "C28963"
 claim_id | "C3679"
 claim_id | "C96545"
 claim_id | "C26586"
 claim_id | "C26754"
 claim_id | "C87278"
 claim_id | "C87603"
 claim_id | "C69395"
 claim_id | "C67594"
 claim_id | "C96155"
 claim_id | "C10160"

Find connected policyholders via social relationships

Retrieve the policyholders connected to the policyholder of claim C4377 within up to 3 hops. This surfaces the policyholder's social network, helping identify potential fraud rings.

The query uses r*1..3 to traverse 1 to 3 levels of policyholder_connection edges outward from the policyholder of C4377.

SELECT 'policyholder_id', policyholder_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(a:policyholder)-[r*1..3]->(p:policyholder)
RETURN p.policyholder_id
$$) AS (policyholder_id agtype);
?column?     | policyholder_id
-----------------+-----------------
 policyholder_id | "PH52532"
 policyholder_id | "PH11283"
 policyholder_id | "PH11328"
 policyholder_id | "PH1"
 policyholder_id | "PH5"
 policyholder_id | "PH512"
 policyholder_id | "PH1569"
 policyholder_id | "PH4722"
 policyholder_id | "PH4731"

Path retrieval

Trace the exact connection path between policyholders PH3759 and PH4722 to understand how they are related.

The query matches any path of 1 to 3 hops between the two policyholders and returns the full path, including all intermediate nodes and edge properties.

SELECT *
FROM cypher('graph', $$
    MATCH path = (:policyholder {policyholder_id: 'PH3759'})-[r*1..3]->(:policyholder {policyholder_id: 'PH4722'})
    RETURN path
$$) AS (v agtype);
-------
 [{"id": 844424930136988, "label": "policyholder", "properties": {"fname": "KURTIS", "lname": "ALKEMA", "high_risk": "1", "risk_score": "78", "policyholder_id": "PH3759"}}::vertex, {"id": 2251799813685487, "label": "RELTYPE", "end_id": 844424930133473, "start_id": 844424930136988, "properties": {"level": "65"}}::edge, {"id": 844424930133473, "label": "policyholder", "properties": {"fname": "TERRA", "lname": "SWARB", "high_risk": "0", "risk_score": "25", "policyholder_id": "PH512"}}::vertex, {"id": 2251799813685546, "label": "RELTYPE", "end_id": 844424930138502, "start_id": 844424930133473, "properties": {"level": "62"}}::edge, {"id": 844424930138502, "label": "policyholder", "properties": {"fname": "VETA", "lname": "SEDLACK", "high_risk": "0", "risk_score": "31", "policyholder_id": "PH1569"}}::vertex, {"id": 2251799813685594, "label": "RELTYPE", "end_id": 844424930136281, "start_id": 844424930138502, "properties": {"level": "92"}}::edge, {"id": 844424930136281, "label": "policyholder", "properties": {"fname": "DEANNA", "lname": "BALSER", "high_risk": "0", "risk_score": "36", "policyholder_id": "PH4722"}}::vertex]::path

The path shows that PH3759 (KURTIS ALKEMA, high_risk: 1) connects to PH4722 (DEANNA BALSER) through two intermediate policyholders: PH512 and PH1569.

Common neighbors

Find policyholders shared between claim C4377 and claim C67594. A shared policyholder between two claims is a strong indicator of a coordinated fraud attempt.

The query matches policyholders that have inbound edges from both C4377 and C67594.

SELECT 'policyholder_id', policyholder_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(p:policyholder)<-[]-(:claim  {claim_id: 'C67594'})
RETURN p.policyholder_id
$$) AS (policyholder_id agtype);
    ?column?     | policyholder_id
-----------------+-----------------
 policyholder_id | "PH3759"

Both claims share policyholder PH3759, confirming a direct link between them.

Collaborative recommendation

Find other claims from the same policyholder

Retrieve all other claims filed under the same policyholder as fraudulent claim C4377. If one claim in a portfolio is fraudulent, other claims from the same policyholder warrant review.

The query navigates from C4377 to its policyholder, then follows inbound edges to find all other claims linked to that policyholder.

SELECT 'claim_id', claim_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(p:policyholder)<-[]-(c:claim)
RETURN c.claim_id
$$) AS (claim_id agtype);
?column? | claim_id
----------+----------
 claim_id | "C28963"
 claim_id | "C96545"
 claim_id | "C3679"
 claim_id | "C87603"
 claim_id | "C26754"
 claim_id | "C26586"
 claim_id | "C87278"
 claim_id | "C69395"
 claim_id | "C67594"
 claim_id | "C96155"
 claim_id | "C10160"

Find the most similar claims

Retrieve the top 20 claims most similar to fraudulent claim C4377, ranked by similarity score. The similar_claim edges carry a similarity_score property; higher scores indicate greater similarity.

WITH t AS (
SELECT claim_id, replace(trim(both '"' from to_jsonb(properties)::text), '\"', '"') AS similarity FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[e]->(c:claim)
RETURN properties(e), c.claim_id
$$) AS (properties agtype, claim_id agtype)
)
SELECT claim_id, replace((similarity::jsonb->'similarity_score')::text, '"','')::integer AS s
FROM t
ORDER BY s DESC
LIMIT 20;
 claim_id | s
----------+----
 "C67594" | 13
 "C69395" | 13
 "C10160" | 13
 "C87603" | 13
 "C28963" | 13
 "C3679"  | 13
 "C26754" | 13
 "C96155" | 13
 "C26586" | 13
 "C87278" | 13
 "C96545" | 13
 "C20113" |  8
 "C70759" |  8
 "C28785" |  8
 "C12793" |  8
 "C59736" |  8
 "C38059" |  8
 "C34068" |  8
 "C71827" |  8
 "C15760" |  8

The top 11 claims all score 13, indicating a cluster of highly similar claims that likely belong to the same fraud ring.

Conclusion

Apache AGE on PolarDB for PostgreSQL provides an efficient path to graph-based fraud detection without migrating data out of your relational database. By combining Cypher traversals with SQL aggregations, you can implement k-nearest neighbor searches, path tracing, common neighbor detection, and similarity-based clustering — all within a single database cluster.

Free trial

Visit the Free Trials & Special Offers page, select the PolarDB option, and try the graph computing features of GanosBase provided by PolarDB for PostgreSQL.

Appendix

The following Python script converts CSV files from the IBM dataset into SQL statements that can be imported into PolarDB using the AGE extension. Run the script against the analyzing-insurance-claims-using-ibm-db2-graph-master/data directory.

import csv
import os


def convert_vertex_csv(file_path, graph):
    file_name = os.path.splitext(os.path.basename(file_path))[0].lower()

    # create vlabel
    print("------------------------------------------------")
    print("-- Create vlabel")
    print("SELECT create_vlabel('{}','{}');".format(graph, file_name))

    with open(file_path, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        header = next(reader)
        for row in reader:
            p = ""
            for h in header:
                if p != "":
                    p += ","
                else:
                    p += "{"
                p += "{}:'{}'".format(h.lower(), row[header.index(h)].strip())
            if p != "":
               p += "}"
            print("SELECT * FROM cypher('{}', $$ CREATE (:{} {}) $$ ) as (n agtype);".format(graph, file_name, p))


def convert_edge_csv(file_path, graph, from_type, to_type):
    file_name = os.path.splitext(os.path.basename(file_path))[0].lower()

    with open(file_path, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        header = next(reader)
        for row in reader:
            p = ""
            for h in header:
                if (h.endswith("ID")):
                    continue;

                if p != "":
                    p += ","
                else:
                    p += "{"
                p += "{}:'{}'".format(h.lower(), row[header.index(h)].strip())

            if p != "":
               p += "}"

            print("SELECT * FROM cypher('{0}', $$ MATCH (a:{1}), (b:{2}) WHERE a.{1}_id = '{3}' AND "
                  "b.{2}_id = '{4}' CREATE (a)-[e:RELTYPE {5} ]->(b) RETURN e$$) as (e agtype);".format(graph, from_type, to_type, row[0].strip(), row[1].strip(), p))


def generate_graph_csv(directory, graph):

    print("------------------------------------------------")
    print("-- Create graph")
    print("SELECT create_graph('{}');".format(graph))

    print("------------------------------------------------")
    print("-- Create vertex")
    convert_vertex_csv(directory + "/POLICYHOLDER.csv", graph)
    convert_vertex_csv(directory + "/INCHARGE.csv", graph)
    convert_vertex_csv(directory + "/PATIENT.csv", graph)
    convert_vertex_csv(directory + "/CLAIM.csv", graph)
    convert_vertex_csv(directory + "/DISEASE.csv", graph)

    print("------------------------------------------------")
    print("-- Create edge")
    convert_edge_csv(directory + "/POLICYHOLDER_CONNECTION.csv", graph, 'policyholder','policyholder')
    convert_edge_csv(directory + "/INCHARGE_OF_CLAIM.csv", graph,'claim', 'incharge')
    convert_edge_csv(directory + "/CLAIM_SIMILARITY.csv", graph, 'claim','claim')
    convert_edge_csv(directory + "/POLICYHOLDER_OF_CLAIM.csv", graph,'claim', 'policyholder')
    convert_edge_csv(directory + "/INSURED_OF_CLAIM.csv", graph, 'claim','patient')
    convert_edge_csv(directory + "/HAS_DISEASE.csv", graph, 'patient','disease')


generate_graph_csv("analyzing-insurance-claims-using-ibm-db2-graph-master/data", "graph")