This tutorial shows how to use PolarDB for PostgreSQL to detect insurance claim fraud using graph queries. Starting from a publicly available insurance dataset, you will set up the Apache AGE graph engine, import the data, and run graph queries to identify abnormal claim records and fraudulent groups.
The tutorial covers four query patterns:
-
K-nearest neighbors — Find all claims linked to the same insured patient as a known fraudulent claim
-
Path retrieval — Trace the connection path between two policyholders
-
Common neighbors — Identify shared policyholders between two claims
-
Collaborative recommendation — Surface other potentially fraudulent policies from the same policyholder
About the graph engine
PolarDB for PostgreSQL supports storage and querying of graph data through its Apache AGE extension. It lets you run both ANSI SQL and the openCypher graph query language against the same database cluster.
Apache AGE integrates directly with PolarDB for PostgreSQL as an extension — no separate graph database is needed.
Key capabilities:
-
Native PostgreSQL compatibility — AGE works with your existing PolarDB database. No reconstruction required. It inherits all PolarDB features: transaction support, concurrency control, and indexing.
-
Unified graph and relational queries — Run SQL and Cypher in a single query. This simplifies handling of complex data models that span relational and graph structures.
-
Cypher query language — AGE supports the Cypher query language, which provides an intuitive syntax for querying and manipulating graph data.
-
High performance — AGE combines PolarDB's query optimization with graph-specific indexes to handle large-scale graph data and complex traversals efficiently.
Scenario
Insurance claim fraud typically involves interconnected data on patients, diseases, and claims. By modeling these relationships as a graph, you can detect patterns that are difficult to find in relational tables alone — such as a group of policyholders with overlapping claims and shared social connections.
Data model
The tutorial uses a publicly available insurance dataset from IBM. You can find the dataset here.
The data model consists of the following graph elements:
Vertices (nodes):
| Vertex | Key properties | Description |
|---|---|---|
policyholder |
policyholder_id, fname, lname, risk_score (0–100), high_risk (0 or 1) |
The person who holds the insurance policy. A high_risk value of 1 indicates higher fraud risk. |
claim |
claim_id, charge, claim_date, duration, diagnosis, type |
An insurance claim filed against a policy. |
patient |
insured_id |
The insured person named in the claim. |
incharge |
incharge_id |
The person in charge of processing the claim. |
disease |
disease_name |
A diagnosis linked to a patient. |
Edges (relationships):
| Edge | Direction | Description |
|---|---|---|
policyholder_of_claim |
claim → policyholder | Links a claim to its policyholder. |
insured_of_claim |
claim → patient | Links a claim to the insured patient. |
incharge_of_claim |
claim → incharge | Links a claim to the person in charge. |
has_disease |
patient → disease | Links a patient to their diagnosis. |
similar_claim |
claim → claim | Links claims with a computed similarity score. |
policyholder_connection |
policyholder → policyholder | Social connection between policyholders, with a connection level score. |
Prerequisites
Before you begin, ensure that you have:
-
A PolarDB for PostgreSQL 14 cluster with revision version 2.0.14.12.24.0 or later. For setup instructions, see Getting Started.
-
Access to a PostgreSQL client or similar tool for running SQL statements.
-
Python 3 installed locally, for running the data conversion script in the appendix.
Set up the database
-
Create the AGE extension in your database.
CREATE EXTENSION age; -
Add AGE to the
search_pathandsession_preload_librariesfor your database and user account.NoteCompatibility issues may occur when using Data Management Service (DMS) to configure
search_path. In that case, use PolarDB-Tools to run these statements instead.ALTER DATABASE <dbname> SET search_path = public,ag_catalog; ALTER USER <username> SET search_path = public,ag_catalog; ALTER DATABASE <dbname> SET session_preload_libraries TO 'age'; ALTER USER <username> SET session_preload_libraries TO 'age';
Import data
The dataset is in CSV format and does not include the ID fields required by the graph schema. Before importing, run the Python script in the appendix to convert the CSV files into SQL statements.
After running the script, save the output as an SQL file and execute it using your PostgreSQL client.
The import process follows these steps:
-
Create a graph named
graph.SELECT create_graph('graph'); -
Create vertex labels and insert vertices. The script generates statements for each vertex type. For example, for
policyholder:SELECT create_vlabel('graph','policyholder'); SELECT * FROM cypher('graph', $$ CREATE (:policyholder {policyholder_id:'PH3068',fname:'ADAM',lname:'OCHSENBEIN',risk_score:'88',high_risk:'1'}) $$ ) as (n agtype); SELECT * FROM cypher('graph', $$ CREATE (:policyholder {policyholder_id:'PH3069',fname:'MALINDA',lname:'MEHSERLE',risk_score:'42',high_risk:'0'}) $$ ) as (n agtype); SELECT * FROM cypher('graph', $$ CREATE (:policyholder {policyholder_id:'PH3070',fname:'SANDRA',lname:'KUHTA',risk_score:'20',high_risk:'0'}) $$ ) as (n agtype); ...And for
claim:SELECT create_vlabel('graph','claim'); SELECT * FROM cypher('graph', $$ CREATE (:claim {claim_id:'C3571',charge:'6517.53',claim_date:'2013-08-11 00:00:00',duration:'13',insured_id:'28523',diagnosis:'no exception',person_incharge_id:'PI23070',type:'services',policyholder_id:'PH9507'}) $$ ) as (n agtype); SELECT * FROM cypher('graph', $$ CREATE (:claim {claim_id:'C3572',charge:'49273.65',claim_date:'2017-02-10 00:00:00',duration:'3',insured_id:'1220',diagnosis:'no exception',person_incharge_id:'PI21197',type:'services',policyholder_id:'PH406'}) $$ ) as (n agtype); SELECT * FROM cypher('graph', $$ CREATE (:claim {claim_id:'C3573',charge:'52005.98',claim_date:'2014-06-29 00:00:00',duration:'27',insured_id:'23735',diagnosis:'no exception',person_incharge_id:'PI22361',type:'services',policyholder_id:'PH7911'}) $$ ) as (n agtype); ... -
Create edges. For example, connections between claims and policyholders:
SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1528' AND b.policyholder_id = 'PH2963' CREATE (a)-[e:RELTYPE ]->(b) RETURN e$$) as (e agtype); SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1529' AND b.policyholder_id = 'PH1353' CREATE (a)-[e:RELTYPE ]->(b) RETURN e$$) as (e agtype); SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1530' AND b.policyholder_id = 'PH1071' CREATE (a)-[e:RELTYPE ]->(b) RETURN e$$) as (e agtype); ...
Query examples
The following examples use claim C4377 as a known fraudulent claim. Each query demonstrates a different fraud detection pattern.
Statistics
Verify the dataset was imported correctly by checking node counts.
-
Total nodes:
SELECT count(*) FROM cypher('graph', $$ MATCH (v) RETURN v $$) as (v agtype);count -------- 120567 -
Claims:
SELECT count(*) FROM cypher('graph', $$ MATCH (v:claim) RETURN v $$) as (v agtype);count -------- 100001 -
Policyholders:
SELECT count(*) FROM cypher('graph', $$ MATCH (v:policyholder) RETURN v $$) as (v agtype);count ------- 10006 -
Persons in charge:
SELECT count(*) FROM cypher('graph', $$ MATCH (v:incharge) RETURN v $$) as (v agtype);count ------- 10001 -
Diseases:
SELECT count(*) FROM cypher('graph', $$ MATCH (v:disease) RETURN v $$) as (v agtype);count ------- 393 -
Patients:
SELECT count(*) FROM cypher('graph', $$ MATCH (v:patient) RETURN v $$) as (v agtype);count ------- 166
Filter and sort
Look up the policyholder, person in charge, and patient associated with claim C4377. This query uses SQL UNION to combine three separate Cypher traversals — one for each connected vertex type — into a single result set.
SELECT 'policyholder_id' as type, policyholder_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(policyholder:policyholder)
RETURN policyholder.policyholder_id
$$) AS (policyholder_id agtype)
UNION
SELECT 'incharge_id', incharge_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(v:incharge)
RETURN v.incharge_id
$$) AS (incharge_id agtype)
UNION
SELECT 'patient_id', patient_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(v:patient)
RETURN v.patient_id
$$) AS (patient_id agtype); type | policyholder_id
-----------------+-----------------
patient_id | "11279"
policyholder_id | "PH3759"
incharge_id | "PI26607"
K-nearest neighbors
Find claims with the same insured patient
Identify all other claims that share the same insured patient as fraudulent claim C4377. A patient appearing in multiple claims is a potential fraud signal.
The query navigates from C4377 to its patient vertex, then follows inbound edges to find other claims connected to that same patient.
SELECT 'claim_id', claim_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(p:patient)<-[]-(c:claim)
RETURN c.claim_id
$$) AS (claim_id agtype);?column? | claim_id
----------+----------
claim_id | "C28963"
claim_id | "C3679"
claim_id | "C96545"
claim_id | "C26586"
claim_id | "C26754"
claim_id | "C87278"
claim_id | "C87603"
claim_id | "C69395"
claim_id | "C67594"
claim_id | "C96155"
claim_id | "C10160"
Find connected policyholders via social relationships
Retrieve the policyholders connected to the policyholder of claim C4377 within up to 3 hops. This surfaces the policyholder's social network, helping identify potential fraud rings.
The query uses r*1..3 to traverse 1 to 3 levels of policyholder_connection edges outward from the policyholder of C4377.
SELECT 'policyholder_id', policyholder_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(a:policyholder)-[r*1..3]->(p:policyholder)
RETURN p.policyholder_id
$$) AS (policyholder_id agtype);?column? | policyholder_id
-----------------+-----------------
policyholder_id | "PH52532"
policyholder_id | "PH11283"
policyholder_id | "PH11328"
policyholder_id | "PH1"
policyholder_id | "PH5"
policyholder_id | "PH512"
policyholder_id | "PH1569"
policyholder_id | "PH4722"
policyholder_id | "PH4731"
Path retrieval
Trace the exact connection path between policyholders PH3759 and PH4722 to understand how they are related.
The query matches any path of 1 to 3 hops between the two policyholders and returns the full path, including all intermediate nodes and edge properties.
SELECT *
FROM cypher('graph', $$
MATCH path = (:policyholder {policyholder_id: 'PH3759'})-[r*1..3]->(:policyholder {policyholder_id: 'PH4722'})
RETURN path
$$) AS (v agtype);-------
[{"id": 844424930136988, "label": "policyholder", "properties": {"fname": "KURTIS", "lname": "ALKEMA", "high_risk": "1", "risk_score": "78", "policyholder_id": "PH3759"}}::vertex, {"id": 2251799813685487, "label": "RELTYPE", "end_id": 844424930133473, "start_id": 844424930136988, "properties": {"level": "65"}}::edge, {"id": 844424930133473, "label": "policyholder", "properties": {"fname": "TERRA", "lname": "SWARB", "high_risk": "0", "risk_score": "25", "policyholder_id": "PH512"}}::vertex, {"id": 2251799813685546, "label": "RELTYPE", "end_id": 844424930138502, "start_id": 844424930133473, "properties": {"level": "62"}}::edge, {"id": 844424930138502, "label": "policyholder", "properties": {"fname": "VETA", "lname": "SEDLACK", "high_risk": "0", "risk_score": "31", "policyholder_id": "PH1569"}}::vertex, {"id": 2251799813685594, "label": "RELTYPE", "end_id": 844424930136281, "start_id": 844424930138502, "properties": {"level": "92"}}::edge, {"id": 844424930136281, "label": "policyholder", "properties": {"fname": "DEANNA", "lname": "BALSER", "high_risk": "0", "risk_score": "36", "policyholder_id": "PH4722"}}::vertex]::path
The path shows that PH3759 (KURTIS ALKEMA, high_risk: 1) connects to PH4722 (DEANNA BALSER) through two intermediate policyholders: PH512 and PH1569.
Common neighbors
Find policyholders shared between claim C4377 and claim C67594. A shared policyholder between two claims is a strong indicator of a coordinated fraud attempt.
The query matches policyholders that have inbound edges from both C4377 and C67594.
SELECT 'policyholder_id', policyholder_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(p:policyholder)<-[]-(:claim {claim_id: 'C67594'})
RETURN p.policyholder_id
$$) AS (policyholder_id agtype); ?column? | policyholder_id
-----------------+-----------------
policyholder_id | "PH3759"
Both claims share policyholder PH3759, confirming a direct link between them.
Collaborative recommendation
Find other claims from the same policyholder
Retrieve all other claims filed under the same policyholder as fraudulent claim C4377. If one claim in a portfolio is fraudulent, other claims from the same policyholder warrant review.
The query navigates from C4377 to its policyholder, then follows inbound edges to find all other claims linked to that policyholder.
SELECT 'claim_id', claim_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(p:policyholder)<-[]-(c:claim)
RETURN c.claim_id
$$) AS (claim_id agtype);?column? | claim_id
----------+----------
claim_id | "C28963"
claim_id | "C96545"
claim_id | "C3679"
claim_id | "C87603"
claim_id | "C26754"
claim_id | "C26586"
claim_id | "C87278"
claim_id | "C69395"
claim_id | "C67594"
claim_id | "C96155"
claim_id | "C10160"
Find the most similar claims
Retrieve the top 20 claims most similar to fraudulent claim C4377, ranked by similarity score. The similar_claim edges carry a similarity_score property; higher scores indicate greater similarity.
WITH t AS (
SELECT claim_id, replace(trim(both '"' from to_jsonb(properties)::text), '\"', '"') AS similarity FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[e]->(c:claim)
RETURN properties(e), c.claim_id
$$) AS (properties agtype, claim_id agtype)
)
SELECT claim_id, replace((similarity::jsonb->'similarity_score')::text, '"','')::integer AS s
FROM t
ORDER BY s DESC
LIMIT 20; claim_id | s
----------+----
"C67594" | 13
"C69395" | 13
"C10160" | 13
"C87603" | 13
"C28963" | 13
"C3679" | 13
"C26754" | 13
"C96155" | 13
"C26586" | 13
"C87278" | 13
"C96545" | 13
"C20113" | 8
"C70759" | 8
"C28785" | 8
"C12793" | 8
"C59736" | 8
"C38059" | 8
"C34068" | 8
"C71827" | 8
"C15760" | 8
The top 11 claims all score 13, indicating a cluster of highly similar claims that likely belong to the same fraud ring.
Conclusion
Apache AGE on PolarDB for PostgreSQL provides an efficient path to graph-based fraud detection without migrating data out of your relational database. By combining Cypher traversals with SQL aggregations, you can implement k-nearest neighbor searches, path tracing, common neighbor detection, and similarity-based clustering — all within a single database cluster.
Free trial
Visit the Free Trials & Special Offers page, select the PolarDB option, and try the graph computing features of GanosBase provided by PolarDB for PostgreSQL.
Appendix
The following Python script converts CSV files from the IBM dataset into SQL statements that can be imported into PolarDB using the AGE extension. Run the script against the analyzing-insurance-claims-using-ibm-db2-graph-master/data directory.
import csv
import os
def convert_vertex_csv(file_path, graph):
file_name = os.path.splitext(os.path.basename(file_path))[0].lower()
# create vlabel
print("------------------------------------------------")
print("-- Create vlabel")
print("SELECT create_vlabel('{}','{}');".format(graph, file_name))
with open(file_path, 'r') as csvfile:
reader = csv.reader(csvfile, delimiter=',')
header = next(reader)
for row in reader:
p = ""
for h in header:
if p != "":
p += ","
else:
p += "{"
p += "{}:'{}'".format(h.lower(), row[header.index(h)].strip())
if p != "":
p += "}"
print("SELECT * FROM cypher('{}', $$ CREATE (:{} {}) $$ ) as (n agtype);".format(graph, file_name, p))
def convert_edge_csv(file_path, graph, from_type, to_type):
file_name = os.path.splitext(os.path.basename(file_path))[0].lower()
with open(file_path, 'r') as csvfile:
reader = csv.reader(csvfile, delimiter=',')
header = next(reader)
for row in reader:
p = ""
for h in header:
if (h.endswith("ID")):
continue;
if p != "":
p += ","
else:
p += "{"
p += "{}:'{}'".format(h.lower(), row[header.index(h)].strip())
if p != "":
p += "}"
print("SELECT * FROM cypher('{0}', $$ MATCH (a:{1}), (b:{2}) WHERE a.{1}_id = '{3}' AND "
"b.{2}_id = '{4}' CREATE (a)-[e:RELTYPE {5} ]->(b) RETURN e$$) as (e agtype);".format(graph, from_type, to_type, row[0].strip(), row[1].strip(), p))
def generate_graph_csv(directory, graph):
print("------------------------------------------------")
print("-- Create graph")
print("SELECT create_graph('{}');".format(graph))
print("------------------------------------------------")
print("-- Create vertex")
convert_vertex_csv(directory + "/POLICYHOLDER.csv", graph)
convert_vertex_csv(directory + "/INCHARGE.csv", graph)
convert_vertex_csv(directory + "/PATIENT.csv", graph)
convert_vertex_csv(directory + "/CLAIM.csv", graph)
convert_vertex_csv(directory + "/DISEASE.csv", graph)
print("------------------------------------------------")
print("-- Create edge")
convert_edge_csv(directory + "/POLICYHOLDER_CONNECTION.csv", graph, 'policyholder','policyholder')
convert_edge_csv(directory + "/INCHARGE_OF_CLAIM.csv", graph,'claim', 'incharge')
convert_edge_csv(directory + "/CLAIM_SIMILARITY.csv", graph, 'claim','claim')
convert_edge_csv(directory + "/POLICYHOLDER_OF_CLAIM.csv", graph,'claim', 'policyholder')
convert_edge_csv(directory + "/INSURED_OF_CLAIM.csv", graph, 'claim','patient')
convert_edge_csv(directory + "/HAS_DISEASE.csv", graph, 'patient','disease')
generate_graph_csv("analyzing-insurance-claims-using-ibm-db2-graph-master/data", "graph")