Hologres is a one-stop real-time data warehouse engine that is compatible with the PostgreSQL protocol. It supports writing, updating, and querying massive amounts of data in both real-time and offline scenarios. This topic explains how to choose the optimal write mode for Spark batch writing to Hologres, using performance test results to guide your decision.
For performance test results for real-time data writing and updates, see Best practices for stress testing in data write, update, and point query scenarios.
This topic focuses on batch write scenarios. FIXED COPY and INSERT INTO VALUES modes are included for comparison only. For streaming write best practices, refer to the dedicated streaming documentation.
Choose a write mode
Hologres supports three write modes for batch data: COPY, FIXED COPY, and INSERT INTO VALUES.
Default recommendation: Use COPY mode unless you need real-time data visibility or cannot tolerate a table lock during import.
Use the following decision tree to select the right mode for your scenario.
Write mode comparison
The following table compares the three modes across key dimensions.
<table> <thead> <tr> <td><p><b>Comparison item</b></p></td> <td><p><b>COPY</b></p></td> <td><p><b>FIXED COPY</b></p></td> <td><p><b>INSERT INTO VALUES</b></p></td> </tr> </thead> <colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup> <tbody> <tr> <td><p><b>Introduction</b></p></td> <td><p>Batch import to a table without a primary key using COPY</p></td> <td><p>Batch import to a table with a primary key using COPY</p></td> <td><p>A stream import mode developed based on the COPY protocol</p></td> <td><p>A basic stream import mode</p></td> </tr> <tr> <td><p><b>Typical scenarios</b></p></td> <td> <ul> <li><p>Spark batch import</p></li> <li><p>Open source Flink batch import</p></li> </ul></td> <td> <ul> <li><p>Flink import</p></li> <li><p>DataWorks data integration import</p></li> <li><p>Spark stream import</p></li> </ul></td> <td> <ul> <li><p>Flink import</p></li> <li><p>DataWorks data integration import</p></li> <li><p>Spark stream import</p></li> </ul></td> </tr> <tr> <td><p><b>Lock granularity</b></p></td> <td><p>Row locks</p></td> <td><p>Table lock</p></td> <td><p>Row locks</p></td> <td><p>Row locks</p></td> </tr> <tr> <td><p><b>Data visibility</b></p></td> <td><p>Visible after the COPY operation is complete</p></td> <td><p>Visible after the COPY operation is complete</p></td> <td><p>Real-time visibility</p></td> <td><p>Real-time visibility</p></td> </tr> <tr> <td><p><b>Performance</b></p></td> <td><p>High</p></td> <td><p>High</p></td> <td><p>Medium</p></td> <td><p>Medium</p></td> </tr> <tr> <td><p><b>Hologres resource consumption</b></p></td> <td><p>Low</p></td> <td><p>Low</p></td> <td><p>High</p></td> <td><p>High</p></td> </tr> <tr> <td><p><b>Client resource consumption</b></p></td> <td><p>Low</p></td> <td><p>High</p></td> <td><p>Low</p></td> <td><p>Middle</p></td> </tr> <tr> <td><p><b>Supported policies for primary key conflicts</b></p></td> <td><p>Not applicable</p></td> <td> <ul> <li><p>NONE</p></li> <li><p>UPDATE: Full row updates are supported from V3.0.4. Partial updates are supported from V3.1.1.</p></li> <li><p>IGNORE: Supported from V3.0.4.</p></li> </ul></td> <td> <ul> <li><p>NONE (an error is reported if a conflict occurs)</p></li> <li><p>UPDATE</p></li> <li><p>IGNORE</p></li> </ul></td> </tr> </tbody> </table>
Mode details
-
COPY mode: The traditional COPY mode provides the best write performance and lowest Hologres resource consumption. Use this for Spark or open source Flink batch imports when you do not need real-time data visibility. Suitable for tables without a primary key.
-
FIXED COPY mode: A stream import mode based on the COPY protocol. It acquires a table lock during import, which prevents concurrent writes from multiple tasks. Data becomes visible after the COPY operation completes. Suitable for tables with a primary key.
For primary key conflict policies, UPDATE is more resource-intensive than IGNORE because it performs a full row or partial row overwrite. Choose IGNORE when you only need to skip duplicate rows.
-
INSERT INTO VALUES mode: Converts batch imports into traditional stream writes. It no longer has significant advantages in batch writing scenarios. However, because only INSERT mode supports data retraction (deletion), use it for scenarios such as writing binary logging data.
Configure for your integration
Spark batch writes
Upgrade your Hologres instance to V2.2.25 or later. Set the write.mode write parameter to auto (the default). The connector automatically selects the optimal mode.
Flink batch writes
For Flink batch writes, determine which mode to use with the decision tree above, then configure the following parameters:
-
jdbccopywritemode: Set toTRUEto prevent INSERT mode. -
bulkload: Set toTRUEfor COPY mode orFALSEfor FIXED COPY mode.
Batch write performance test
This test uses the open source Hologres-Spark-Connector to measure write performance across different modes, table storage formats, and primary key conflict strategies.
Prerequisites
The Hologres instance and the EMR-Spark cluster must be in the same region and use the same Virtual Private Cloud (VPC).
Before running the test, prepare the following:
-
A Hologres instance running V2.2.25 or later with a database created
-
An EMR-Spark cluster running Spark V3.3.0 or later. See Create a cluster for setup steps
-
The
hologres-connector-spark-3.xconnector JAR, downloaded from the Maven Central Repository -
The OSS-HDFS service activated for the EMR-Spark cluster
The test environment used in this topic:
| Service | Version | Specifications |
|---|---|---|
| Hologres | V3.0.30 | 64 cores, 256 GB (1 CU = 1 core and 4 GB memory) |
| EMR-Spark | EMR-5.18.1, Spark-3.5.3 | 8 cores, 32 GB x 8 (1 master node, 7 core nodes) |
| Spark-Connector | 1.5.2 | Not applicable |
Prepare the test data
Step 1: Generate raw data
-
Log on to the master node of the EMR-Spark cluster. See Methods for connecting to an ECS instance.
-
Download TPC-H_Tools_v3.0.0.zip, copy the package to the master node, decompress it, and navigate to the
TPC-H_Tools_v3.0.0/TPC-H_Tools_v3.0.0/dbgendirectory. -
Run the following command to generate the 1 TB
customer.tbltest file (the raw .tbl file is 23 GB) in thedbgendirectory.Table information Description Number of fields 8 Field types INT, BIGINT, TEXT, DECIMAL Number of rows 150,000,000 Number of shards 40 ./dbgen -s 1000 -T cThe customer table in the TPC-H 1 TB test dataset has the following structure:
Step 2: Upload data to Spark
Run the following command to upload customer.tbl to the Spark cluster, where <spark_resource> is the Root Path of Cluster Storage you configured when creating the cluster.
hadoop fs -put customer.tbl <spark_resource>
Step 3: Create destination tables in Hologres
Create tables in each storage format to test all scenarios.
Columnar and row-oriented
CREATE TABLE test_table_mixed (
C_CUSTKEY BIGINT PRIMARY KEY,
C_NAME TEXT,
C_ADDRESS TEXT,
C_NATIONKEY INT,
C_PHONE TEXT,
C_ACCTBAL DECIMAL(15, 2),
C_MKTSEGMENT TEXT,
C_COMMENT TEXT
)
WITH (
orientation = 'column,row'
);
Columnar (with primary key)
CREATE TABLE test_table_column (
C_CUSTKEY BIGINT PRIMARY KEY,
C_NAME TEXT,
C_ADDRESS TEXT,
C_NATIONKEY INT,
C_PHONE TEXT,
C_ACCTBAL DECIMAL(15, 2),
C_MKTSEGMENT TEXT,
C_COMMENT TEXT
)
WITH (
orientation = 'column'
);
Columnar (without primary key)
CREATE TABLE test_table_column_no_pk (
C_CUSTKEY BIGINT,
C_NAME TEXT,
C_ADDRESS TEXT,
C_NATIONKEY INT,
C_PHONE TEXT,
C_ACCTBAL DECIMAL(15, 2),
C_MKTSEGMENT TEXT,
C_COMMENT TEXT
)
WITH (
orientation = 'column'
);
Row-oriented (with primary key)
CREATE TABLE test_table_row (
C_CUSTKEY BIGINT PRIMARY KEY,
C_NAME TEXT,
C_ADDRESS TEXT,
C_NATIONKEY INT,
C_PHONE TEXT,
C_ACCTBAL DECIMAL(15, 2),
C_MKTSEGMENT TEXT,
C_COMMENT TEXT
)
WITH (
orientation = 'row'
);
Row-oriented (without primary key)
CREATE TABLE test_table_row_no_pk (
C_CUSTKEY BIGINT,
C_NAME TEXT,
C_ADDRESS TEXT,
C_NATIONKEY INT,
C_PHONE TEXT,
C_ACCTBAL DECIMAL(15, 2),
C_MKTSEGMENT TEXT,
C_COMMENT TEXT
)
WITH (
orientation = 'row'
);
Run the performance test
Test configuration
-
Log on to the master node of the EMR-Spark cluster. See Methods for connecting to an ECS instance. Upload the Spark-Connector JAR and run the following command to start the spark-sql interactive interface.
Adjust
spark.sql.files.maxPartitionBytesto control the concurrency for Spark to read HDFS files. This test uses a concurrency of 40.# Start the spark-sql interactive interface spark-sql --jars <path>/hologres-connector-spark-3.x-1.5.2-jar-with-dependencies.jar \ --conf spark.executor.instances=40 \ --conf spark.executor.cores=1 \ --conf spark.executor.memory=4g \ --conf spark.sql.files.maxPartitionBytes=644245094Replace
<path>with the directory containing thehologres-connector-spark-3.x-1.5.2-jar-with-dependencies.jarfile. -
In the spark-sql interface, run the following SQL to write data using a temporary table. A temporary table lets you adjust parameters across multiple test runs.
In production, use a Catalog to load Hologres tables directly, which simplifies the process.
Placeholder Description Example <spark_resources>The Root Path of Cluster Storage configured when creating the EMR-Spark cluster. Find it on the Basic Information page of your cluster in the EMR on ECS console. — <hologres_vpc_endpoint>The VPC Domain Name of the Hologres instance. Find it in the network information section of the Hologres Management Console. Format: <Instance ID>-cn-hangzhou-vpc-st.hologres.aliyuncs.com:80— <database_name>The name of the database in the Hologres instance. — <accesskey_id>The AccessKey ID with read permissions on the Hologres database. — <accesskey_secret>The AccessKey secret with read permissions on the Hologres database. — <table_name>The name of the destination table in Hologres. test_table_columnValue Write mode autoDefault. The connector selects the optimal mode. insertINSERT INTO VALUES streamFIXED COPY bulk_loadCOPY import to a table without a primary key bulk_load_on_conflictCOPY import to a table with a primary key Parameter Description write.insert.thread_sizeWrite concurrency. Takes effect only in INSERT mode. write.insert.batch_sizeWrite batch size. Takes effect only in INSERT mode. write.on_conflict_actionINSERT_OR_REPLACE(default): updates on primary key conflict.INSERT_OR_IGNORE: skips the row on primary key conflict.-- Create a temporary table in CSV format. CREATE TEMPORARY VIEW csvtable ( c_custkey BIGINT, c_name STRING, c_address STRING, c_nationkey INT, c_phone STRING, c_acctbal DECIMAL(15, 2), c_mktsegment STRING, c_comment STRING) USING csv OPTIONS ( path "<spark_resources>/customer.tbl", sep "|" ); CREATE TEMPORARY VIEW hologresTable ( c_custkey BIGINT, c_name STRING, c_address STRING, c_nationkey INT, c_phone STRING, c_acctbal DECIMAL(15, 2), c_mktsegment STRING, c_comment STRING) USING hologres OPTIONS ( jdbcurl "jdbc:postgresql://<hologres_vpc_endpoint>/<database_name>", username "<accesskey_id>", password "<accesskey_secret>", table "<table_name>", direct_connect "false", write.mode "auto", write.insert.thread_size "3", write.insert.batch_size "2048" ); INSERT INTO hologresTable SELECT * FROM csvTable;Replace the following placeholders:
write.modeparameter values: Additional write parameters: For the full parameter reference, see Parameter descriptions.
Test scenarios
| Dimension | Variants tested |
|---|---|
| Table storage format | Row-oriented, columnar, columnar and row-oriented |
| Data update method | Append-only (no primary key), initial write to empty table (primary key), full row update (primary key) |
| Write mode | insert, stream, bulk_load, bulk_load_on_conflict |
Test metrics
Each test run captures the following four metrics:
-
Total job duration: The total runtime of the Spark job, measured from when you run the
INSERTstatement in the spark-sql interface to when it completes. -
Average data write duration: The average duration of the write operation only, excluding Spark cluster scheduling, data reading, and data redistribution overhead. Run the following SQL on the HoloWeb page of the Hologres Management Console to get the shard concurrency (count) and average write duration (avg_duration_ms, in milliseconds):
SELECT COUNT(*), AVG(duration) AS avg_duration_ms FROM hologres.hg_query_log WHERE query_start >= '<start_time>' AND query_start <= '<end_time>' AND query LIKE '%<test_table_column>%' AND command_tag = 'COPY';Replace
<start_time>and<end_time>with the write start and end times (for example,2025-04-11 15:00:00), and<test_table_column>with the destination table name. -
Hologres load: CPU utilization of the Hologres instance. Get this from the Monitoring Information page of the instance in the Hologres Management Console.
-
Spark load: EMR node CPU utilization during the write window. In the EMR on ECS console, click your cluster ID, go to Monitoring and Diagnostics > Metric Monitoring, set Dashboard to HOST, filter by nodeGroupId and hostname for the core nodes, set the time range to the write window, and query the CPU utilization metric.
Test results
Write mode values in the table correspond to:insert= INSERT INTO VALUES,stream= FIXED COPY,bulk_load= COPY import into a table without a primary key,bulk_load_on_conflict= COPY import into a table with a primary key.
|
Storage format |
Primary key |
Write mode |
Total job duration |
Average data write duration |
Hologres load |
Spark load |
|
Columnar |
Without primary key |
insert |
241.61s |
232.70s |
92% |
15% |
|
stream |
228.11s |
222.34s |
100% |
36% |
||
|
bulk_load |
88.72s |
57.16s |
97% |
47% |
||
|
With primary key ignore |
insert |
190.96s |
172.60s |
90% |
14% |
|
|
stream |
149.60s |
142.16s |
100% |
14% |
||
|
bulk_load_on_conflict |
115.96s |
42.92s |
60% |
75% |
||
|
With primary key replace |
insert |
600.40s |
574.31s |
91% |
5% |
|
|
stream |
550.29s |
540.32s |
100% |
5% |
||
|
bulk_load_on_conflict |
188.05s |
109.77s |
93% |
78% |
||
|
Row-oriented |
Without primary key |
insert |
132.38s |
123.79s |
94% |
22% |
|
stream |
114.41s |
103.81s |
100% |
17% |
||
|
bulk_load |
68.20s |
41.22s |
98% |
32% |
||
|
With primary key ignore |
insert |
190.48s |
170.49s |
89% |
15% |
|
|
stream |
185.46s |
172.48s |
85% |
14% |
||
|
bulk_load_on_conflict |
117.81s |
47.69s |
58% |
75% |
||
|
With primary key replace |
insert |
177.97s |
170.78s |
93% |
15% |
|
|
stream |
142.44s |
130.16s |
100% |
20% |
||
|
bulk_load_on_conflict |
137.69s |
65.18s |
92% |
78% |
||
|
Columnar and row-oriented |
With primary key ignore |
insert |
172.19s |
158.74s |
86% |
16% |
|
stream |
150.63s |
149.76s |
100% |
12% |
||
|
bulk_load_on_conflict |
128.83s |
42.09s |
59% |
79% |
||
|
With primary key replace |
insert |
690.37s |
662.00s |
92% |
5% |
|
|
stream |
625.84s |
623.08s |
100% |
4% |
||
|
bulk_load_on_conflict |
202.07s |
121.58s |
93% |
80% |
Key observations:
-
bulk_loadandbulk_load_on_conflict(COPY mode) consistently deliver the shortest total job duration across all storage formats, typically 1.3–3x faster thaninsertorstreamfor the same scenario. -
For replace (full row update) scenarios, the performance gap is most pronounced:
bulk_load_on_conflictcompletes in 188s vs. 600s forinserton columnar tables with primary keys. -
COPY modes shift work to the Spark client (Spark load: 32–80%) and reduce Hologres CPU load (58–98%), compared to INSERT and FIXED COPY (Spark load: 4–36%, Hologres load: 85–100%).
-
For tables without a primary key,
bulk_loadis the clear choice: shortest job duration with moderate resource usage on both sides.