This topic describes how to rapidly import large volumes of data by using the AGE extension provided by PolarDB for PostgreSQL. The approach allows you to efficiently import hundreds of millions of nodes and edges into a graph database. This helps avoid the query performance issue during edge insertion.
Background information
PolarDB for PostgreSQL provides the AGE extension to facilitate the storage and querying of graph data.
To import and export graph database data, you need to convert node and edge data into tables in formats like CSV, import the node table, and then create edges based on the edge table. In Cypher, however, the MATCH...CREATE
statement that is used to create edges also queries corresponding nodes. This slows down the import speed especially when importing tens of millions or even more edges.
This topic describes a new method for importing graph data. This method directly generates and manages unique IDs of nodes on the graph, allowing edges to be inserted without performing the MATCH
step. The graph import efficiency can be significantly improved. This method includes the following steps:
Import graph nodes into a common table. A column or a set of columns of the table is used as the unique ID.
Import graph edges into a common table. An edge record contains the unique ID of the left node, the unique ID of the right node, and the edge properties.
Add the unique ID column to the node table. In the edge table, add the unique ID columns for the start and end nodes by using JOIN or calculation.
Import the node table into the graph based on the unique ID column.
Import the edge table into the graph based on the unique IDs.
Prerequisites
The PolarDB for PostgreSQL cluster runs PostgreSQL 14 (revision version 14.12.24.0 or later).
Execute the following statement to query the revision version of PolarDB for PostgreSQL:
SELECT version();
If you need to upgrade the revision version, see Upgrade the version.
Best practice
The example in this topic involves the following elements:
Two node types named
v_user
andv_product
.v_user
: Because nodes of different types may have the same uid, the unique ID is created from both the type column and the uid column.v_product
: The product_id column is used as the unique ID. The values of this column are complex strings that cannot be converted into numeric IDs. Sequence numbers must be added manually.
An edge type named e_own.
Tables (named xxx_raw) into which the node and edge data is imported.
A graph named Toys into which the nodes and edges are imported.
Preparation for data import
Create the AGE extension.
CREATE EXTENSION age;
Add the extension to the database or the search path and preload libraries of the current user:
ALTER DATABASE <dbname> SET search_path = "$user", public, ag_catalog; ALTER DATABASE <dbname> SET session_preload_libraries TO 'age';
Create a graph with two node types and one edge type (v for vertex and e for edge).
---Create a graph SELECT create_graph('toys'); ---Create node types SELECT create_vlabel('toys','v_user'); SELECT create_vlabel('toys','v_product'); ---Create an edge type SELECT create_elabel('toys','e_own');
Create the node table to store the node data for import.
CREATE TABLE public.v_user_raw(type text, uid text, name text, age integer); INSERT INTO v_user_raw VALUES ('A','U1', 'Alice', '33'), ('B','U1', 'Bob', '21'); CREATE TABLE public.v_product_raw(product_id text, price double precision); INSERT INTO v_product_raw VALUES ('INAKLIDAS', '50'), ('ENKUCLKSD', '80'), ('IIUIHAKLS', '320'), ('SDVDSUHEE', '340');
Create the edge table to store the edge data for import.
CREATE TABLE public.e_own_raw(user_type text, user_uid text, product_id text, buy_price text); INSERT INTO e_own_raw VALUES ('A', 'U1', 'INAKLIDAS', '45'), ('B', 'U1', 'ENKUCLKSD', '70'), ('B', 'U1', 'INAKLIDAS', '50'), ('B', 'U1', 'SDVDSUHEE', '330');
Create unique IDs
In this example,
v_user
nodes contain only 10 types (from A to J). The unique ID is a combination ofuid
andtype
. Register this mapping relationship as a function.CREATE OR REPLACE FUNCTION v_user_idgen(type text, uid text) RETURNS bigint AS $$ SELECT (ASCII(type) - ASCII('A')) + substring(uid, 2)::bigint * 10 $$ language SQL;
Add ID columns to relevant node and edge tables:
ALTER TABLE v_user_raw ADD COLUMN _id bigint UNIQUE CHECK(_id < 281474976710656 AND _id>=0); UPDATE v_user_raw SET _id = v_user_idgen(type, uid); ALTER TABLE e_own_raw ADD COLUMN _fromid bigint CHECK(_fromid < 281474976710656 AND _fromid>=0); UPDATE e_own_raw SET _fromid = v_user_idgen(user_type, user_uid);
For
v_product
nodes, the product_id strings cannot be directly converted to numeric values of the bigint type. You can use the bigserial type provided by PostgreSQL to automatically generate sequence numbers and add the corresponding IDs to the edge table by using the join operation. Then, create an index on the columns that form the unique ID for acceleration.NoteTo create an index on multiple columns that form the unique ID, use the
CREATE INDEX ON v_product_raw(column1, column2,...);
syntax.ALTER TABLE v_product_raw ADD COLUMN _id bigserial UNIQUE CHECK(_id < 281474976710656 AND _id>=0); CREATE INDEX ON v_product_raw(product_id); ALTER TABLE e_own_raw ADD COLUMN _toid bigint CHECK(_toid < 281474976710656 AND _toid>=0); UPDATE e_own_raw SET _toid = v_product_raw._id FROM v_product_raw WHERE v_product_raw.product_id = e_own_raw.product_id;
The preceding methods have the following differences:
The registered mapping function allows for easier subsequent management. You can use the same function for easier management of the mapping between properties and nodes.
By using the bigserial type to automatically generate sequence numbers, you don't need to convert the original column to the numeric type. But you need to perform a join operation. Also, you are not able to use a simple function to map properties and nodes to perform certain operations.
Quick data import
Create auxiliary functions for generating internal unique IDs for PolarDB for PostgreSQL.
CREATE OR REPLACE FUNCTION age_name_to_idx_start(graph_name text, kind_name text, label_name text) RETURNS bigint AS 'SELECT id::bigint<<48 FROM ag_catalog.ag_label WHERE kind = kind_name and name = label_name and graph = (SELECT graphid FROM ag_catalog.ag_graph WHERE name = graph_name)' language SQL IMMUTABLE STRICT PARALLEL SAFE; CREATE OR REPLACE FUNCTION age_name_to_seq(graph_name text, kind_name text, label_name text) RETURNS text AS 'SELECT graph_name || ''.'' || seq_name::text FROM ag_catalog.ag_label WHERE kind = kind_name and name = label_name and graph = (SELECT graphid FROM ag_catalog.ag_graph WHERE name = graph_name)' language SQL IMMUTABLE STRICT PARALLEL SAFE;
Import nodes.
(Optional) Before importing nodes, you can create unique constraints to prevent duplicate IDs. These indexes can accelerate subsequent queries. You can keep them after creation.
CREATE UNIQUE INDEX ON toys.v_user(id); CREATE UNIQUE INDEX ON toys.v_product(id);
When importing nodes, select the columns from the _raw tables to add to the nodes. You can exclude newly added columns like _id from the nodes.
INSERT INTO toys."v_user" SELECT (age_name_to_idx_start('toys', 'v', 'v_user') + _id)::text::graphid, row_to_json((select x FROM (select type, uid, name, age) x))::text::agtype FROM v_user_raw; SELECT setval(age_name_to_seq('toys', 'v', 'v_user'), (SELECT max(_id) + 1 FROM v_user_raw)); INSERT INTO toys."v_product" SELECT (age_name_to_idx_start('toys', 'v', 'v_product') + _id)::text::graphid, row_to_json((select x FROM (select product_id, price) x))::text::agtype FROM v_product_raw; SELECT setval(age_name_to_seq('toys', 'v', 'v_product'), (SELECT max(_id) + 1 FROM v_product_raw));
Import edges. Edges are generated based on the _fromid and _toid tables. You can exclude the properties of the nodes on both sides and keep only the buy_price property.
INSERT INTO toys."e_own" SELECT (age_name_to_idx_start('toys', 'e', 'e_own') + nextval(age_name_to_seq('toys', 'e', 'e_own')))::text::graphid, (age_name_to_idx_start('toys', 'v', 'v_user') + _fromid)::text::graphid, (age_name_to_idx_start('toys', 'v', 'v_product') + _toid)::text::graphid, row_to_json((select x FROM (select buy_price) x))::text::agtype FROM e_own_raw;
Test verification
Query nodes.
SELECT * FROM cypher('toys', $$ MATCH (v:v_user) RETURN v $$) as (v agtype);
Sample result:
v -------------------------------------------------------------------------------------------------------------------------- {"id": 844424930131978, "label": "v_user", "properties": {"age": 33, "uid": "U1", "name": "Alice", "type": "A"}}::vertex {"id": 844424930131979, "label": "v_user", "properties": {"age": 21, "uid": "U1", "name": "Bob", "type": "B"}}::vertex (2 rows)
SELECT * FROM cypher('toys', $$ MATCH (v:v_product) RETURN v $$) as (v agtype);
Sample result:
v ------------------------------------------------------------------------------------------------------------------- {"id": 1125899906842625, "label": "v_product", "properties": {"price": "50", "product_id": "INAKLIDAS"}}::vertex {"id": 1125899906842626, "label": "v_product", "properties": {"price": "80", "product_id": "ENKUCLKSD"}}::vertex {"id": 1125899906842627, "label": "v_product", "properties": {"price": "320", "product_id": "IIUIHAKLS"}}::vertex {"id": 1125899906842628, "label": "v_product", "properties": {"price": "340", "product_id": "SDVDSUHEE"}}::vertex (4 rows)
Query edges.
Query the discount Alice uses for each purchase.
SELECT buy, sell, sell::int - buy::int as discount FROM cypher('toys', $$ MATCH (v1:v_user)-[e:e_own]-(v2:v_product) WHERE v1.name = 'Bob' RETURN e.buy_price, v2.price $$) as (buy agtype, sell agtype);
Sample result:
buy | sell | discount -------+-------+---------- "70" | "80" | 10 "50" | "50" | 0 "330" | "340" | 10 (3 rows)
Subsequent data import and update
After creating nodes by using the
CREATE
statement in Cypher, you cannot use the quick data import method any more. TheCREATE
statement may disrupt the unique ID mapping generated by the xxx_idgen function or JOIN operation.After the import is complete, the xxx_raw tables and xxx_idgen functions can be removed, unless you want to use this method to import data again. The graph data is not affected. During the import, skip the nodes and edges that are already imported to prevent duplicated nodes and edges.
Use this method to insert a new node.
INSERT INTO v_user_raw VALUES ('A','U2', 'Alien', '99', v_user_idgen('A', 'U2')); INSERT INTO toys.v_user SELECT (age_name_to_idx_start('toys', 'v', 'v_user') + _id)::text::graphid, row_to_json((select x FROM (select type, uid, name, age) x))::text::agtype FROM (SELECT (('A','U2', 'Alien', '97', v_user_idgen('A', 'U2'))::v_user_raw).*) raw; SELECT setval(age_name_to_seq('toys', 'v', 'v_user'), (SELECT max(_id) + 1 FROM v_user_raw)); INSERT INTO v_product_raw(product_id, price) VALUES ('AIEEEEEEE', '999'); INSERT INTO toys."v_product" SELECT (age_name_to_idx_start('toys', 'v', 'v_product') + _id)::text::graphid, row_to_json((select x FROM (select product_id, price) x))::text::agtype FROM v_product_raw WHERE product_id = 'AIEEEEEEE'; SELECT setval(age_name_to_seq('toys', 'v', 'v_product'), (SELECT max(_id) + 1 FROM v_product_raw));
Use this method to insert a new edge. You need to remove the original edge table and perform the edge import steps again.
DELETE FROM e_own_raw; INSERT INTO e_own_raw VALUES ('A', 'U2', 'AIEEEEEEE', '9999'); UPDATE e_own_raw SET _fromid = v_user_idgen(user_type, user_uid); UPDATE e_own_raw SET _toid = v_product_raw._id FROM v_product_raw WHERE v_product_raw.product_id = e_own_raw.product_id; INSERT INTO toys."e_own" SELECT (age_name_to_idx_start('toys', 'e', 'e_own') + nextval(age_name_to_seq('toys', 'e', 'e_own')))::text::graphid, (age_name_to_idx_start('toys', 'v', 'v_user') + _fromid)::text::graphid, (age_name_to_idx_start('toys', 'v', 'v_product') + _toid)::text::graphid, row_to_json((select x FROM (select buy_price) x))::text::agtype FROM e_own_raw;
Conclusion
You can use tables for quick import of external nodes and edges into the graph database. The performance of this method is over 2 times higher than that of the MATCH ... CREATE ...
method.
Free trial
Visit the PolarDB free trial page and select "PolarDB" to try the graph computing capabilities of GanosBase provided by PolarDB for PostgreSQL.