All Products
Search
Document Center

PolarDB:Graph analysis based on PolarDB: Insurance data analysis

Last Updated:Feb 28, 2025

In this topic, an insurance dataset that is publicly available is used to demonstrate how to use PolarDB to perform graph queries to identify abnormal claim records and fraudulent groups in insurance claim scenarios. For example, you can query policies involving the claimant or analyze the social relationships of the policyholder for potential fraudulent behaviors. PolarDB enhances relational databases with graph analysis capabilities, providing support for centralized data management and analysis in enterprises.

About the graph database engine

Graph analysis is a critical field of data science. It uses graph structures to represent data and perform various computing and analytic tasks. A graph structure consists of nodes (or vertices) and edges, with nodes representing entities and edges representing the relationships between the entities. Graph computing is commonly used in areas including social network analysis, recommendation systems, knowledge graphs, and path optimization.

PolarDB for PostgreSQL is highly compatible with Apache AGE and supports storage and query of knowledge graphs. It allows you to use both the standard ANSI SQL and openCypher graph query language to query data within the same database cluster.

  • Fully compatible with PolarDB for PostgreSQL

    AGE is an extension for PolarDB for PostgreSQL. It can be used in existing PolarDB databases without the need for database reconstruction. AGE leverages all features provided by PolarDB, including transaction support, concurrency control, and various indexing and performance optimization features.

  • Unified graph and relational queries

    AGE enables the concurrent processing of relational and graph data. It allows you to use both SQL and the graph query language in a single query to simplify the handling of complex data models, which improves operation efficiency.

  • Cypher query language supported

    AGE supports the Cypher query language, which is tailored for graph databases. Its syntax is simple and flexible, providing an intuitive approach to querying and manipulating graph data.

  • High performance

    By combining the optimization features provided by PolarDB and indexes tailored for graph data, AGE can efficiently manage large-scale graph data and complex graph queries.

In conclusion, based on the capabilities provided by AGE, PolarDB can handle graph queries in a simple and efficient manner.

image

Scenarios

Description

Insurance claim fraud typically involves data on patients, diseases, and claims held by insurance providers. By analyzing the connections between claim applications, diseases, and other related entities, you can identify abnormal claim records and expose potential fraudulent groups.

Data and models

The data is based on an insurance dataset that is publicly available. You can find the dataset here. It contains fundamental elements of the insurance industry. The data model can be abstracted into the following graph:

  • Vertices: policyholder, incharge, claim, patient, disease.

  • Edges: has_disease, policyholder_of_claim, incharge_of_claim, insured_of_claim, similar_claim, policyholder_connection.

  • Properties: name, high_risk, risk_score, disease_name, similarity_score, level, claim_date, charge, and more.

image

Best practices

Prepare the database

Note

The graph engine extension is supported only by PolarDB for PostgreSQL 14 whose revision version is 14.12.24.0 or later. For more information, see Getting Started.

  1. Create the extension.

    CREATE EXTENSION age;
  2. Add the extension to the database or the search path and preload library of your account.

    ALTER DATABASE <dbname> SET search_path = public,ag_catalog;
    ALTER USER <username> SET search_path = public,ag_catalog;
    
    ALTER DATABASE <dbname> SET session_preload_libraries TO 'age';
    ALTER USER <username> SET session_preload_libraries TO 'age';

Data import

  1. Create a graph by using the create_graph function in the ag_catalog namespace.

    SELECT create_graph('graph');
  2. Insert vertices and edges. The downloaded data file is in the CSV format and does not contain the necessary ID information. Convert the file before you import the data. A Python script is provided in Appendix to convert the data into vertices and edges that is supported by PolarDB. See the following conversion results:

    • Policyholder

      SELECT create_vlabel('graph','policyholder');
      SELECT * FROM cypher('graph', $$ CREATE (:policyholder {policyholder_id:'PH3068',fname:'ADAM',lname:'OCHSENBEIN',risk_score:'88',high_risk:'1'}) $$ ) as (n agtype);
      SELECT * FROM cypher('graph', $$ CREATE (:policyholder {policyholder_id:'PH3069',fname:'MALINDA',lname:'MEHSERLE',risk_score:'42',high_risk:'0'}) $$ ) as (n agtype);
      SELECT * FROM cypher('graph', $$ CREATE (:policyholder {policyholder_id:'PH3070',fname:'SANDRA',lname:'KUHTA',risk_score:'20',high_risk:'0'}) $$ ) as (n agtype);
      ...
    • Claim

      - Create vlabel
      SELECT create_vlabel('graph','claim');
      SELECT * FROM cypher('graph', $$ CREATE (:claim {claim_id:'C3571',charge:'6517.53',claim_date:'2013-08-11 00:00:00',duration:'13',insured_id:'28523',diagnosis:'no exception',person_incharge_id:'PI23070',type:'services',policyholder_id:'PH9507'}) $$ ) as (n agtype);
      SELECT * FROM cypher('graph', $$ CREATE (:claim {claim_id:'C3572',charge:'49273.65',claim_date:'2017-02-10 00:00:00',duration:'3',insured_id:'1220',diagnosis:'no exception',person_incharge_id:'PI21197',type:'services',policyholder_id:'PH406'}) $$ ) as (n agtype);
      SELECT * FROM cypher('graph', $$ CREATE (:claim {claim_id:'C3573',charge:'52005.98',claim_date:'2014-06-29 00:00:00',duration:'27',insured_id:'23735',diagnosis:'no exception',person_incharge_id:'PI22361',type:'services',policyholder_id:'PH7911'}) $$ ) as (n agtype);
      ...
    • Connections between policyholder and claim

      SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1528' AND b.policyholder_id = 'PH2963' CREATE (a)-[e:RELTYPE  ]->(b) RETURN e$$) as (e agtype);
      SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1529' AND b.policyholder_id = 'PH1353' CREATE (a)-[e:RELTYPE  ]->(b) RETURN e$$) as (e agtype);
      SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1530' AND b.policyholder_id = 'PH1071' CREATE (a)-[e:RELTYPE  ]->(b) RETURN e$$) as (e agtype);
      SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1531' AND b.policyholder_id = 'PH8102' CREATE (a)-[e:RELTYPE  ]->(b) RETURN e$$) as (e agtype);
      SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1532' AND b.policyholder_id = 'PH4768' CREATE (a)-[e:RELTYPE  ]->(b) RETURN e$$) as (e agtype);
      ...

    Save the transformed results as an SQL file and use client tools such as the PostgreSQL client to import the data.

Examples

Simple queries

Statistics

  • Query the number of nodes of various types.

    SELECT count(*) FROM cypher('graph', $$
    MATCH (v)
    RETURN v
    $$) as (v agtype);

    The following result is returned:

     count  
    --------
     120567
  • Query the number of claims

    SELECT count(*) FROM cypher('graph', $$
    MATCH (v:claim)
    RETURN v
    $$) as (v agtype);

    The following result is returned:

     count  
    --------
     100001
  • Query the number of policyholders

    SELECT count(*) FROM cypher('graph', $$
    MATCH (v:policyholder)
    RETURN v
    $$) as (v agtype);

    The following result is returned:

     count 
    -------
     10006
  • Query the number of incharges

    SELECT count(*) FROM cypher('graph', $$
    MATCH (v:incharge)
    RETURN v
    $$) as (v agtype);

    The following result is returned:

     count 
    -------
     10001
  • Query the number of diseases

    SELECT count(*) FROM cypher('graph', $$
    MATCH (v:disease)
    RETURN v
    $$) as (v agtype);

    The following result is returned:

     count 
    -------
       393
  • Query the number of patients

    SELECT count(*) FROM cypher('graph', $$
    MATCH (v:patient)
    RETURN v
    $$) as (v agtype);

    The following result is returned:

     count 
    -------
       166

Filter and sort queries

Query the policyholder, incharge, and patient status of claim C4377.

SELECT 'policyholder_id' as type, policyholder_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(policyholder:policyholder)
RETURN policyholder.policyholder_id
$$) AS (policyholder_id agtype)
UNION
SELECT 'incharge_id', incharge_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(v:incharge)
RETURN v.incharge_id
$$) AS (incharge_id agtype)
UNION
SELECT 'patient_id', patient_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(v:patient)
RETURN v.patient_id
$$) AS (patient_id agtype);

The following result is returned:

      type       | policyholder_id 
-----------------+-----------------
 patient_id      | "11279"
 policyholder_id | "PH3759"
 incharge_id     | "PI26607"

Common scenarios

K-Nearest Neighbors

  • Query the claims with the same insured patient as fraudulent claim C4377 to determine whether the claimant is involved in insurance fraud.

    SELECT 'claim_id', claim_id FROM cypher('graph', $$
    MATCH (:claim {claim_id: 'C4377'})-[]->(p:patient)<-[]-(c:claim)
    RETURN c.claim_id
    $$) AS (claim_id agtype);

    The following result is returned:

     ?column? | claim_id 
    ----------+----------
     claim_id | "C28963"
     claim_id | "C3679"
     claim_id | "C96545"
     claim_id | "C26586"
     claim_id | "C26754"
     claim_id | "C87278"
     claim_id | "C87603"
     claim_id | "C69395"
     claim_id | "C67594"
     claim_id | "C96155"
     claim_id | "C10160"
  • Query the social relationships of the policyholder of fraudulent claim C4377 to provide early warnings about their claims.

    SELECT 'policyholder_id', policyholder_id FROM cypher('graph', $$
    MATCH (:claim {claim_id: 'C4377'})-[]->(a:policyholder)-[r*1..3]->(p:policyholder)
    RETURN p.policyholder_id
    $$) AS (policyholder_id agtype);

    The following result is returned:

        ?column?     | policyholder_id 
    -----------------+-----------------
     policyholder_id | "PH52532"
     policyholder_id | "PH11283"
     policyholder_id | "PH11328"
     policyholder_id | "PH1"
     policyholder_id | "PH5"
     policyholder_id | "PH512"
     policyholder_id | "PH1569"
     policyholder_id | "PH4722"
     policyholder_id | "PH4731"

Path retrieval

Query the path between policyholder PH3759 and policyholder PH4722 to check their connections.

SELECT *
FROM cypher('graph', $$                        
    MATCH path = (:policyholder {policyholder_id: 'PH3759'})-[r*1..3]->(:policyholder {policyholder_id: 'PH4722'})
    RETURN path
$$) AS (v agtype);

The following result is returned:

-------
 [{"id": 844424930136988, "label": "policyholder", "properties": {"fname": "KURTIS", "lname": "ALKEMA", "high_risk": "1", "risk_score": "78", "policyholder_id": "PH3759"}}::vertex, {"id": 2251799813685487, "label": "RELTYPE", "end_id": 844424930133473, "start_id": 844424930136988, "properties": {"level": "65"}}::edge, {"id": 844424930133473, "label": "policyholder", "properties": {"fname": "TERRA", "lname": "SWARB", "high_risk": "0", "risk_score": "25", "policyholder_id": "PH512"}}::vertex, {"id": 2251799813685546, "label": "RELTYPE", "end_id": 844424930138502, "start_id": 844424930133473, "properties": {"level": "62"}}::edge, {"id": 844424930138502, "label": "policyholder", "properties": {"fname": "VETA", "lname": "SEDLACK", "high_risk": "0", "risk_score": "31", "policyholder_id": "PH1569"}}::vertex, {"id": 2251799813685594, "label": "RELTYPE", "end_id": 844424930136281, "start_id": 844424930138502, "properties": {"level": "92"}}::edge, {"id": 844424930136281, "label": "policyholder", "properties": {"fname": "DEANNA", "lname": "BALSER", "high_risk": "0", "risk_score": "36", "policyholder_id": "PH4722"}}::vertex]::path

Common neighbors

Query common neighbors of claim C4377 and claim C67594 to identify shared policyholders.

SELECT 'policyholder_id', policyholder_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(p:policyholder)<-[]-(:claim  {claim_id: 'C67594'})
RETURN p.policyholder_id
$$) AS (policyholder_id agtype);

The following result is returned:

    ?column?     | policyholder_id 
-----------------+-----------------
 policyholder_id | "PH3759"

Collaborative recommendation

  • Query policies that share the same policyholder as fraudulent claim C4377 to identify other potentially fraudulent policies.

    SELECT 'claim_id', claim_id FROM cypher('graph', $$
    MATCH (:claim {claim_id: 'C4377'})-[]->(p:policyholder)<-[]-(c:claim)
    RETURN c.claim_id
    $$) AS (claim_id agtype);

    The following result is returned:

     ?column? | claim_id 
    ----------+----------
     claim_id | "C28963"
     claim_id | "C96545"
     claim_id | "C3679"
     claim_id | "C87603"
     claim_id | "C26754"
     claim_id | "C26586"
     claim_id | "C87278"
     claim_id | "C69395"
     claim_id | "C67594"
     claim_id | "C96155"
     claim_id | "C10160"
  • Query the top 20 policies most similar to fraudulent claim C4377:

    WITH t AS (
    SELECT claim_id, replace(trim(both '"' from to_jsonb(properties)::text), '\"', '"') AS similarity FROM cypher('graph', $$
    MATCH (:claim {claim_id: 'C4377'})-[e]->(c:claim)
    RETURN properties(e), c.claim_id
    $$) AS (properties agtype, claim_id agtype)
    )
    SELECT claim_id, replace((similarity::jsonb->'similarity_score')::text, '"','')::integer AS s 
    FROM t
    ORDER BY s DESC
    LIMIT 20;

    The following result is returned:

     claim_id | s  
    ----------+----
     "C67594" | 13
     "C69395" | 13
     "C10160" | 13
     "C87603" | 13
     "C28963" | 13
     "C3679"  | 13
     "C26754" | 13
     "C96155" | 13
     "C26586" | 13
     "C87278" | 13
     "C96545" | 13
     "C20113" |  8
     "C70759" |  8
     "C28785" |  8
     "C12793" |  8
     "C59736" |  8
     "C38059" |  8
     "C34068" |  8
     "C71827" |  8
     "C15760" |  8

Conclusion

You can combine the graph analysis capabilities provided by PolarDB for PostgreSQL and the AGE extension to perform graph data analysis. PolarDB provides graph data computing and analysis features, including Cypher query language support and efficient graph data query processing, to facilitate centralized data management and analysis in enterprises.

Free trial

Visit the Free Trials & Special Offers page, select the PolarDB option, and try the graph computing features of GanosBase provided by PolarDB for PostgreSQL.

Appendix

Data conversion script for converting CSV data.

import csv
import os


def convert_vertex_csv(file_path, graph):
    file_name = os.path.splitext(os.path.basename(file_path))[0].lower()

    # create vlabel
    print("------------------------------------------------")
    print("-- Create vlabel")
    print("SELECT create_vlabel('{}','{}');".format(graph, file_name))

    with open(file_path, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        header = next(reader)
        for row in reader:
            p = ""
            for h in header:
                if p != "":
                    p += ","
                else:
                    p += "{"
                p += "{}:'{}'".format(h.lower(), row[header.index(h)].strip())
            if p != "":
               p += "}"
            print("SELECT * FROM cypher('{}', $$ CREATE (:{} {}) $$ ) as (n agtype);".format(graph, file_name, p))


def convert_edge_csv(file_path, graph, from_type, to_type):
    file_name = os.path.splitext(os.path.basename(file_path))[0].lower()

    with open(file_path, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        header = next(reader)
        for row in reader:
            p = ""
            for h in header:
                if (h.endswith("ID")):
                    continue;

                if p != "":
                    p += ","
                else:
                    p += "{"
                p += "{}:'{}'".format(h.lower(), row[header.index(h)].strip())

            if p != "":
               p += "}"

            print("SELECT * FROM cypher('{0}', $$ MATCH (a:{1}), (b:{2}) WHERE a.{1}_id = '{3}' AND "
                  "b.{2}_id = '{4}' CREATE (a)-[e:RELTYPE {5} ]->(b) RETURN e$$) as (e agtype);".format(graph, from_type, to_type, row[0].strip(), row[1].strip(), p))


def generate_graph_csv(directory, graph):

    print("------------------------------------------------")
    print("-- Create graph")
    print("SELECT create_graph('{}');".format(graph))

    print("------------------------------------------------")
    print("-- Create vertex")
    convert_vertex_csv(directory + "/POLICYHOLDER.csv", graph)
    convert_vertex_csv(directory + "/INCHARGE.csv", graph)
    convert_vertex_csv(directory + "/PATIENT.csv", graph)
    convert_vertex_csv(directory + "/CLAIM.csv", graph)
    convert_vertex_csv(directory + "/DISEASE.csv", graph)

    print("------------------------------------------------")
    print("-- Create edge")
    convert_edge_csv(directory + "/POLICYHOLDER_CONNECTION.csv", graph, 'policyholder','policyholder')
    convert_edge_csv(directory + "/INCHARGE_OF_CLAIM.csv", graph,'claim', 'incharge')
    convert_edge_csv(directory + "/CLAIM_SIMILARITY.csv", graph, 'claim','claim')
    convert_edge_csv(directory + "/POLICYHOLDER_OF_CLAIM.csv", graph,'claim', 'policyholder')
    convert_edge_csv(directory + "/INSURED_OF_CLAIM.csv", graph, 'claim','patient')
    convert_edge_csv(directory + "/HAS_DISEASE.csv", graph, 'patient','disease')


generate_graph_csv("analyzing-insurance-claims-using-ibm-db2-graph-master/data", "graph")