Hologres is a one-stop real-time data warehouse engine compatible with the PostgreSQL protocol. It supports massive data write, update, and query in real-time and offline integrated scenarios. This topic describes how to choose an appropriate write mode for different batch write scenarios based on performance testing results of using Spark to batch write data to Hologres.
For real-time data write and update, see Best practices for performance tests on data writes, data updates, and point queries for performance testing plans and results.
Comparison of batch write modes
For batch data write scenarios, Hologres supports different write modes, including the traditional COPY mode, FIXED COPY streaming import mode developed based on the COPY protocol, and INSERT INTO VALUES mode that converts batch import into streaming import.
The following table provides a detailed comparison of the three write modes.
Comparison item | COPY (without primary key) | COPY (with primary key) | FIXED COPY | INSERT INTO VALUES
Introduction | COPY-based batch import to tables without primary keys | COPY-based batch import to tables with primary keys | Streaming import mode developed based on the COPY protocol | Basic streaming import mode
Typical scenarios | | | |
Lock granularity | Row lock | Table lock | Row lock | Row lock
Data visibility | Visible after the COPY operation completes | Visible after the COPY operation completes | Visible in real time | Visible in real time
Performance | High | High | Medium | Medium
Hologres resource consumption | Low | Low | High | High
Client resource consumption | Low | High | Low | Medium
Primary key conflict policies | N/A | | |
Select a batch write mode
In batch data write scenarios, the characteristics of the three modes are as follows:
COPY: the traditional COPY mode, with the optimal write performance and lowest Hologres resource consumption.
FIXED COPY: a streaming import mode developed based on the COPY protocol. It only generates row locks, and data is visible in real time.
INSERT INTO VALUES: converts batch import into traditional streaming write. This mode has no obvious advantages in batch write scenarios.
Note: Although the INSERT INTO VALUES mode has no obvious advantage in batch write scenarios, it is the only mode that supports data retraction (deletion), which is still required in scenarios such as binary log data writing.
If you have no special requirements for real-time data visibility, lock granularity, or data source load, you can prioritize the COPY mode for batch writing. Note that when a table lock is generated, the table cannot support multiple data write tasks simultaneously.
Use Spark for batch writing: We recommend that you upgrade your Hologres instance to V2.2.25 or later and set the connector write parameter `write.mode` to `auto` (the default value). The system then automatically selects the optimal write mode.
Use Flink for batch writing: First determine whether to use the COPY mode or the FIXED COPY mode according to the decision tree below, and then configure the following parameters:
jdbccopywritemode: Set this parameter to TRUE, which means the INSERT INTO VALUES mode is not used.
bulkload: TRUE indicates the COPY mode, and FALSE indicates the FIXED COPY mode. Configure this parameter based on your business requirements.
You can select an appropriate write mode according to the following decision tree.
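As a rough illustration only (this is an inferred sketch based on the comparison table above, not the official decision tree), the choice between the two COPY-based modes can be expressed as:

```python
def choose_write_mode(needs_realtime_visibility: bool,
                      table_has_primary_key: bool,
                      table_lock_acceptable: bool) -> str:
    """Pick a Hologres batch write mode from the trade-offs in the
    comparison table: COPY has the best throughput, but COPY into a
    table with a primary key takes a table lock; FIXED COPY keeps row
    locks and real-time visibility at lower throughput."""
    if needs_realtime_visibility:
        return "FIXED COPY"  # row locks only, data visible in real time
    if table_has_primary_key and not table_lock_acceptable:
        return "FIXED COPY"  # COPY into a table with a primary key would lock the table
    return "COPY"            # best throughput, lowest Hologres resource consumption

print(choose_write_mode(False, False, True))  # COPY
print(choose_write_mode(True, True, True))    # FIXED COPY
```

The INSERT INTO VALUES mode is omitted here because, as noted above, it has no advantage in pure batch write scenarios.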
Batch write performance testing
The testing process uses the open-source Hologres Spark connector (hologres-connector-spark-3.x) developed by the Hologres team.
Preparations
Prepare basic environments
You need to prepare the following environment:
The Hologres instance and EMR Spark cluster must reside in the same region and use the same VPC.
Purchase a Hologres instance of V2.2.25 or later, and create a database.
Create an EMR Spark cluster. The Spark version must be 3.3.0 or later. For more information, see Create a cluster.
Download the Spark-Connector package.
You can download the connector JAR package hologres-connector-spark-3.x, which is required for Spark to read data from and write data to Hologres, from the Maven Central Repository.
The following environment information is used in this topic.
Service | Version | Specifications |
Hologres | V3.0.30 | 64 cores, 256 GB (1 CU = 1 core and 4 GB) |
EMR Spark | EMR V5.18.1, Spark 3.5.3 | 8 cores, 32 GB × 8 (1 master node and 7 core nodes) Important You need to activate the OSS-HDFS service. |
Spark-Connector | 1.5.2 | N/A |
Prepare test data
Prepare raw data.
Log on to the master node of the EMR Spark cluster. For more information, see Log on to the master node of a cluster by connecting to the specific ECS instance.
Click TPC-H_Tools_v3.0.0.zip to download the TPC-H tool, copy it to the ECS instance of the master node, extract it, and then go to the TPC-H_Tools_v3.0.0/TPC-H_Tools_v3.0.0/dbgen directory.
Run the following command to generate the customer.tbl test data file of the TPC-H 1 TB dataset in the dbgen directory. The generated .tbl file is about 23 GB in size.

```shell
./dbgen -s 1000 -T c
```
The following table provides the detailed information about the customer table in the TPC-H 1 TB dataset.
Table information | Description
Number of fields | 8
Field types | INT, BIGINT, TEXT, DECIMAL
Number of data rows | 150,000,000
Number of table shards | 40
Import test data to Spark.
Run the following command to upload the customer.tbl file to the Spark cluster:

```shell
hadoop fs -put customer.tbl <spark_resource>
```

spark_resource indicates the path that is configured for the Root Storage Directory of Cluster parameter when you create the EMR Spark cluster.
Create tables with different storage formats in Hologres. Sample SQL statements:
Row-column hybrid storage

```sql
CREATE TABLE test_table_mixed (
    C_CUSTKEY BIGINT PRIMARY KEY,
    C_NAME TEXT,
    C_ADDRESS TEXT,
    C_NATIONKEY INT,
    C_PHONE TEXT,
    C_ACCTBAL DECIMAL(15, 2),
    C_MKTSEGMENT TEXT,
    C_COMMENT TEXT
) WITH (
    orientation = 'column,row'
);
```

Column-oriented storage (with primary key)

```sql
CREATE TABLE test_table_column (
    C_CUSTKEY BIGINT PRIMARY KEY,
    C_NAME TEXT,
    C_ADDRESS TEXT,
    C_NATIONKEY INT,
    C_PHONE TEXT,
    C_ACCTBAL DECIMAL(15, 2),
    C_MKTSEGMENT TEXT,
    C_COMMENT TEXT
) WITH (
    orientation = 'column'
);
```

Column-oriented storage (without primary key)

```sql
CREATE TABLE test_table_column_no_pk (
    C_CUSTKEY BIGINT,
    C_NAME TEXT,
    C_ADDRESS TEXT,
    C_NATIONKEY INT,
    C_PHONE TEXT,
    C_ACCTBAL DECIMAL(15, 2),
    C_MKTSEGMENT TEXT,
    C_COMMENT TEXT
) WITH (
    orientation = 'column'
);
```

Row-oriented storage (with primary key)

```sql
CREATE TABLE test_table_row (
    C_CUSTKEY BIGINT PRIMARY KEY,
    C_NAME TEXT,
    C_ADDRESS TEXT,
    C_NATIONKEY INT,
    C_PHONE TEXT,
    C_ACCTBAL DECIMAL(15, 2),
    C_MKTSEGMENT TEXT,
    C_COMMENT TEXT
) WITH (
    orientation = 'row'
);
```

Row-oriented storage (without primary key)

```sql
CREATE TABLE test_table_row_no_pk (
    C_CUSTKEY BIGINT,
    C_NAME TEXT,
    C_ADDRESS TEXT,
    C_NATIONKEY INT,
    C_PHONE TEXT,
    C_ACCTBAL DECIMAL(15, 2),
    C_MKTSEGMENT TEXT,
    C_COMMENT TEXT
) WITH (
    orientation = 'row'
);
```
Performance testing
Test configuration
This section tests the import performance of the different write modes.
Log on to the master node of the EMR Spark cluster (for the logon method, see Log on to the master node of a cluster by connecting to the specific ECS instance), upload the downloaded Spark-Connector package, and then run the following command to enter the spark-sql command-line interface (CLI).
Note: By adjusting the value of the spark.sql.files.maxPartitionBytes parameter, you can control the concurrency with which Spark reads HDFS files. The concurrency in this example is 40.

```shell
# Enter the spark-sql CLI.
spark-sql --jars <path>/hologres-connector-spark-3.x-1.5.2-jar-with-dependencies.jar \
  --conf spark.executor.instances=40 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=4g \
  --conf spark.sql.files.maxPartitionBytes=644245094
```

path indicates the root path where the hologres-connector-spark-3.x-1.5.2-jar-with-dependencies.jar package is located.
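A value like the 644245094 used above (which works out to roughly 24 GiB / 40) can be derived from the source file size and the desired read parallelism. The helper below is a convenience sketch, not part of the connector or of Spark:

```python
import math

def max_partition_bytes(file_size_bytes: int, target_partitions: int) -> int:
    """Return a spark.sql.files.maxPartitionBytes value that splits a
    file of the given size into about target_partitions read tasks."""
    return math.ceil(file_size_bytes / target_partitions)

# The 23 GB customer.tbl file split into 40 read tasks:
print(max_partition_bytes(23 * 1024**3, 40))  # 617401549 bytes (~589 MiB) per task
```

Larger values produce fewer, bigger read partitions; smaller values increase read concurrency at the cost of more tasks.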
Execute the following SQL statements in the spark-sql CLI to write data by creating temporary tables.
Because parameters need to be adjusted multiple times during testing to measure the write performance of the different modes, this example uses temporary tables.
Note: In actual business scenarios, you can directly use catalogs to load Hologres tables, which is more convenient.

```sql
-- Create a temporary table in CSV format.
CREATE TEMPORARY VIEW csvtable (
    c_custkey BIGINT,
    c_name STRING,
    c_address STRING,
    c_nationkey INT,
    c_phone STRING,
    c_acctbal DECIMAL(15, 2),
    c_mktsegment STRING,
    c_comment STRING)
USING csv OPTIONS (
    path "<spark_resources>/customer.tbl",
    sep "|"
);

-- Create a temporary table that maps to the destination Hologres table.
CREATE TEMPORARY VIEW hologresTable (
    c_custkey BIGINT,
    c_name STRING,
    c_address STRING,
    c_nationkey INT,
    c_phone STRING,
    c_acctbal DECIMAL(15, 2),
    c_mktsegment STRING,
    c_comment STRING)
USING hologres OPTIONS (
    jdbcurl "jdbc:postgresql://<hologres_vpc_endpoint>/<database_name>",
    username "<accesskey_id>",
    password "<accesskey_secret>",
    table "<table_name>",
    direct_connect "false",
    write.mode "auto",
    write.insert.thread_size "3",
    write.insert.batch_size "2048"
);

-- Write the data.
INSERT INTO hologresTable SELECT * FROM csvtable;
```
The following table describes the parameters.
Parameter | Description
spark_resources | The path that is configured for the Root Storage Directory of Cluster parameter when you create the EMR Spark cluster. You can log on to the EMR on ECS console, click the target cluster ID, and obtain the root storage path of the cluster in the Cluster Information section of the Basic Information tab.
hologres_vpc_endpoint | The VPC endpoint of the Hologres instance. You can log on to the Hologres console, click the target instance ID, and obtain the VPC endpoint in the Network Information section of the Instance Details page. For example, a VPC endpoint in the China (Hangzhou) region is in the format of <Instance ID>-cn-hangzhou-vpc-st.hologres.aliyuncs.com:80.
database_name | The name of the database in the Hologres instance.
accesskey_id | The AccessKey ID of an account with read and write permissions on the corresponding Hologres database.
accesskey_secret | The AccessKey secret of an account with read and write permissions on the corresponding Hologres database.
table_name | The name of the destination Hologres table.
write.mode | The write mode. Valid values: auto (default): the connector automatically selects the optimal mode. insert: use the INSERT INTO VALUES mode. stream: use the FIXED COPY mode for streaming writing. bulk_load: use the COPY mode for batch import to tables without primary keys. bulk_load_on_conflict: use the COPY mode for batch import to tables with primary keys.
write.insert.thread_size | The write concurrency. This parameter takes effect only in the INSERT INTO VALUES mode.
write.insert.batch_size | The batch write size. This parameter takes effect only in the INSERT INTO VALUES mode.
write.on_conflict_action | The primary key conflict policy. INSERT_OR_REPLACE (default): update the row on a primary key conflict. INSERT_OR_IGNORE: ignore the new row on a primary key conflict.
For more parameter information, see Parameters.
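For illustration only, the two primary key conflict policies behave like the following dictionary-based sketch (in reality Hologres applies the policy server-side during the write):

```python
def apply_writes(table: dict, rows: list, on_conflict: str) -> dict:
    """Emulate write.on_conflict_action semantics on a {pk: value} table:
    INSERT_OR_REPLACE overwrites the existing row on a primary key
    conflict; INSERT_OR_IGNORE keeps the existing row."""
    for pk, value in rows:
        if pk in table and on_conflict == "INSERT_OR_IGNORE":
            continue          # conflict: keep the old row
        table[pk] = value     # insert, or replace on conflict
    return table

base = {1: "old"}
print(apply_writes(dict(base), [(1, "new"), (2, "x")], "INSERT_OR_REPLACE"))  # {1: 'new', 2: 'x'}
print(apply_writes(dict(base), [(1, "new"), (2, "x")], "INSERT_OR_IGNORE"))   # {1: 'old', 2: 'x'}
```

This difference explains the "ignore" and "replace" rows in the test results below: replacing on conflict does strictly more work than ignoring.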
Test scenarios
Test scenario | Available options
Table storage format | Column-oriented storage, row-oriented storage, and row-column hybrid storage
Data update method | No primary key, with primary key (ignore on conflict), and with primary key (replace on conflict)
Data writing method | insert (INSERT INTO VALUES), stream (FIXED COPY), bulk_load (COPY without primary keys), and bulk_load_on_conflict (COPY with primary keys)
Test results
The test results include the following fields.
Field | Description
Total job execution duration | The total execution duration of the Spark job, that is, the time taken to complete the INSERT operation in the spark-sql CLI on the master node of the EMR Spark cluster. The following figure shows how to obtain the time.
Average data write duration | The average time spent only on data writing, excluding the time for Spark cluster scheduling, data reading, and data redistribution. You can execute the following SQL statement in HoloWeb of the Hologres console to obtain the shard concurrency (count) and the average data write duration (avg_duration_ms, in milliseconds). Parameters:
Hologres load | The CPU utilization of the Hologres instance. You can obtain the CPU utilization on the Monitoring Information page of the instance in the Hologres console.
Spark load | The EMR node load. You can log on to the EMR on ECS console, click the cluster ID, and then click the Monitoring and Diagnostics tab. Then, click the Metric Monitoring tab, configure the following parameters, and query the CPU utilization metric, which is the Spark load.
The following table provides the detailed test results.
Storage format | Primary key | Write mode | Total job execution duration | Average data write duration | Hologres load | Spark load
Column-oriented storage | No primary key | insert | 241.61s | 232.70s | 92% | 15%
Column-oriented storage | No primary key | stream | 228.11s | 222.34s | 100% | 36%
Column-oriented storage | No primary key | bulk_load | 88.72s | 57.16s | 97% | 47%
Column-oriented storage | With primary key (ignore) | insert | 190.96s | 172.60s | 90% | 14%
Column-oriented storage | With primary key (ignore) | stream | 149.60s | 142.16s | 100% | 14%
Column-oriented storage | With primary key (ignore) | bulk_load_on_conflict | 115.96s | 42.92s | 60% | 75%
Column-oriented storage | With primary key (replace) | insert | 600.40s | 574.31s | 91% | 5%
Column-oriented storage | With primary key (replace) | stream | 550.29s | 540.32s | 100% | 5%
Column-oriented storage | With primary key (replace) | bulk_load_on_conflict | 188.05s | 109.77s | 93% | 78%
Row-oriented storage | No primary key | insert | 132.38s | 123.79s | 94% | 22%
Row-oriented storage | No primary key | stream | 114.41s | 103.81s | 100% | 17%
Row-oriented storage | No primary key | bulk_load | 68.20s | 41.22s | 98% | 32%
Row-oriented storage | With primary key (ignore) | insert | 190.48s | 170.49s | 89% | 15%
Row-oriented storage | With primary key (ignore) | stream | 185.46s | 172.48s | 85% | 14%
Row-oriented storage | With primary key (ignore) | bulk_load_on_conflict | 117.81s | 47.69s | 58% | 75%
Row-oriented storage | With primary key (replace) | insert | 177.97s | 170.78s | 93% | 15%
Row-oriented storage | With primary key (replace) | stream | 142.44s | 130.16s | 100% | 20%
Row-oriented storage | With primary key (replace) | bulk_load_on_conflict | 137.69s | 65.18s | 92% | 78%
Row-column hybrid storage | With primary key (ignore) | insert | 172.19s | 158.74s | 86% | 16%
Row-column hybrid storage | With primary key (ignore) | stream | 150.63s | 149.76s | 100% | 12%
Row-column hybrid storage | With primary key (ignore) | bulk_load_on_conflict | 128.83s | 42.09s | 59% | 79%
Row-column hybrid storage | With primary key (replace) | insert | 690.37s | 662.00s | 92% | 5%
Row-column hybrid storage | With primary key (replace) | stream | 625.84s | 623.08s | 100% | 4%
Row-column hybrid storage | With primary key (replace) | bulk_load_on_conflict | 202.07s | 121.58s | 93% | 80%
The mappings of write modes are as follows:
insert: INSERT INTO VALUES
stream: FIXED COPY
bulk_load: COPY-based import to tables without primary keys
bulk_load_on_conflict: COPY-based import to tables with primary keys
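Using the total job execution durations from the results table, a quick back-of-the-envelope comparison shows why the COPY mode is recommended. For column-oriented tables without a primary key, for example:

```python
# Total job execution durations (seconds) for column-oriented storage
# without a primary key, taken from the results table above.
durations = {"insert": 241.61, "stream": 228.11, "bulk_load": 88.72}

baseline = durations["insert"]
for mode, seconds in durations.items():
    # bulk_load finishes about 2.7x faster than insert in this scenario
    print(f"{mode}: {seconds:.2f}s ({baseline / seconds:.2f}x vs insert)")
```

The gap is consistent across the other scenarios: the COPY-based modes win on total duration, at the cost of higher Spark-side (client) load.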