
Hologres: Best practices for batch writing data to Hologres

Last Updated:Apr 22, 2025

Hologres is a one-stop real-time data warehouse engine compatible with the PostgreSQL protocol. It supports writes, updates, and queries on massive amounts of data in integrated real-time and offline scenarios. This topic describes how to choose an appropriate write mode for different batch write scenarios, based on the results of performance tests that use Spark to batch write data to Hologres.

For performance test plans and results for real-time data writes and updates, see Best practices for performance tests on data writes, data updates, and point queries.

Comparison of batch write modes

For batch data write scenarios, Hologres supports several write modes: the traditional COPY mode, the FIXED COPY streaming import mode developed based on the COPY protocol, and the INSERT INTO VALUES mode, which converts batch import into streaming import.

The following table provides a detailed comparison of the three write modes. COPY behaves differently depending on whether the destination table has a primary key, so it is split into two columns.

| Comparison item | COPY (without primary key) | COPY (with primary key) | FIXED COPY | INSERT INTO VALUES |
| --- | --- | --- | --- | --- |
| Introduction | COPY-based batch import to tables without primary keys | COPY-based batch import to tables with primary keys | Streaming import mode developed based on the COPY protocol | Basic streaming import mode |
| Typical scenarios | Batch import with Spark; batch import with Apache Flink | Batch import with Spark; batch import with Apache Flink | Import with Flink; import with DataWorks Data Integration; streaming import with Spark | Import with Flink; import with DataWorks Data Integration; streaming import with Spark |
| Lock granularity | Row lock | Table lock | Row lock | Row lock |
| Data visibility | Visible after the COPY operation completes | Visible after the COPY operation completes | Visible in real time | Visible in real time |
| Performance | High | High | Medium | Medium |
| Hologres resource consumption | Low | Low | High | High |
| Client resource consumption | Low | High | Low | Medium |
| Primary key conflict policies | N/A | NONE; UPDATE (supports full row update in V3.0.4 and later); IGNORE (supported in V3.0.4 and later) | NONE (an error is returned if a conflict occurs); UPDATE; IGNORE | NONE (an error is returned if a conflict occurs); UPDATE; IGNORE |

Select a batch write mode

In batch data write scenarios, the characteristics of the three modes are as follows:

  • COPY: the traditional COPY mode, with the optimal write performance and lowest Hologres resource consumption.

  • FIXED COPY: a streaming import mode developed based on the COPY protocol. It only generates row locks, and data is visible in real time.

  • INSERT INTO VALUES: converts batch import into traditional streaming write. This mode has no obvious advantages in batch write scenarios.

    Note

    The INSERT INTO VALUES mode has no obvious advantages in batch write scenarios only. However, it is the only mode that supports data retraction (deletion), which is still required in scenarios such as writing binary log data.

If you have no special requirements for real-time data visibility, lock granularity, or data source load, prioritize the COPY mode for batch writing. Note that while a table lock is held, the table cannot accept other concurrent write tasks.

  • Use Spark for batch writing: We recommend that you upgrade your Hologres instance to V2.2.25 or later, and set the write parameter write.mode of the connector to auto (default value). The system will automatically select the optimal write mode.

  • Use Flink for batch writing: First determine whether to use the COPY mode or the FIXED COPY mode according to the decision tree below, and then configure the following parameters:

    • jdbccopywritemode: Set this parameter to TRUE, which means that the INSERT INTO VALUES mode is not used.

    • bulkload: TRUE indicates the COPY mode, and FALSE indicates the FIXED COPY mode. Configure this parameter based on your business requirements.

You can select an appropriate write mode according to the following decision tree.

(Figure: decision tree for selecting a write mode)
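The decision tree can be sketched as a small helper function. The boolean criteria names below are illustrative (they are not connector parameters); the returned values are the connector write.mode values used later in this topic. This is a sketch of the selection logic described above, not an official API.

```python
def choose_write_mode(needs_realtime_visibility: bool,
                      needs_retraction: bool,
                      has_primary_key: bool,
                      concurrent_writers: bool) -> str:
    """Sketch of the write-mode decision logic described in this topic.

    Returns one of the Spark connector write.mode values:
    insert, stream, bulk_load, or bulk_load_on_conflict.
    """
    # Only the INSERT INTO VALUES mode supports data retraction
    # (deletion), for example when writing binary log data.
    if needs_retraction:
        return "insert"
    # COPY on a table with a primary key takes a table lock, so the
    # table cannot accept other concurrent write tasks; FIXED COPY
    # takes only row locks and makes data visible in real time.
    if needs_realtime_visibility or (has_primary_key and concurrent_writers):
        return "stream"
    # Otherwise prefer COPY: best write performance and lowest
    # Hologres resource consumption.
    return "bulk_load_on_conflict" if has_primary_key else "bulk_load"
```

For example, a first full load into an empty primary-key table with no other writers maps to bulk_load_on_conflict, matching the recommendation above.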

Batch write performance testing

The tests in this topic use the open-source Spark connector for Hologres (hologres-connector-spark).

Preparations

Prepare basic environments

You need to prepare the following environment:

Important

The Hologres instance and EMR Spark cluster must reside in the same region and use the same VPC.

  • Purchase a Hologres instance of V2.2.25 or later, and create a database.

  • Create an EMR Spark cluster. The Spark version must be 3.3.0 or later. For more information, see Create a cluster.

  • Download the Spark-Connector package.

    You can download the connector JAR package hologres-connector-spark-3.x, which is required for Spark to read data from and write data to Hologres, from the Maven Central Repository.

The following environment information is used in this topic.

| Service | Version | Specifications |
| --- | --- | --- |
| Hologres | V3.0.30 | 64 cores, 256 GB (1 CU = 1 core and 4 GB) |
| EMR Spark | EMR V5.18.1, Spark 3.5.3 | 8 cores, 32 GB × 8 (1 master node and 7 core nodes). Important: you need to activate the OSS-HDFS service. |
| Spark-Connector | 1.5.2 | N/A |

Prepare test data

  1. Prepare raw data.

    1. Log on to the master node of the EMR Spark cluster. For more information, see Log on to the master node of a cluster by connecting to the specific ECS instance.

    2. Click TPC-H_Tools_v3.0.0.zip to download the TPC-H tool, copy it to the ECS instance of the master node, extract it, and then enter the TPC-H_Tools_v3.0.0/TPC-H_Tools_v3.0.0/dbgen directory.

    3. Execute the following command to generate the customer.tbl test data file in the dbgen directory at the TPC-H 1 TB scale. The generated .tbl file is 23 GB in size.

      ./dbgen -s 1000 -T c

      The following table provides the detailed information of the customer table in the TPC-H 1 TB dataset.

      | Table information | Description |
      | --- | --- |
      | Number of fields | 8 |
      | Field types | INT, BIGINT, TEXT, DECIMAL |
      | Number of data rows | 150,000,000 |
      | Number of table shards | 40 |

  2. Import test data to Spark.

    Execute the following command to upload the customer.tbl file to the Spark cluster:

    hadoop fs -put customer.tbl <spark_resource>

    spark_resource indicates the path that is configured for the Root Storage Directory of Cluster parameter when you create the EMR Spark cluster.

  3. Create tables with different storage formats in Hologres. Sample SQL statements:

    Row-column hybrid storage

    CREATE TABLE test_table_mixed (
        C_CUSTKEY BIGINT PRIMARY KEY,
        C_NAME TEXT,
        C_ADDRESS TEXT,
        C_NATIONKEY INT,
        C_PHONE TEXT,
        C_ACCTBAL DECIMAL(15, 2),
        C_MKTSEGMENT TEXT,
        C_COMMENT TEXT
    )
    WITH (
        orientation = 'column,row'
    );

    Column-oriented storage (with primary key)

    CREATE TABLE test_table_column (
        C_CUSTKEY BIGINT PRIMARY KEY,
        C_NAME TEXT,
        C_ADDRESS TEXT,
        C_NATIONKEY INT,
        C_PHONE TEXT,
        C_ACCTBAL DECIMAL(15, 2),
        C_MKTSEGMENT TEXT,
        C_COMMENT TEXT
    )
    WITH (
        orientation = 'column'
    );

    Column-oriented storage (without primary key)

    CREATE TABLE test_table_column_no_pk (
        C_CUSTKEY BIGINT,
        C_NAME TEXT,
        C_ADDRESS TEXT,
        C_NATIONKEY INT,
        C_PHONE TEXT,
        C_ACCTBAL DECIMAL(15, 2),
        C_MKTSEGMENT TEXT,
        C_COMMENT TEXT
    )
    WITH (
        orientation = 'column'
    );

    Row-oriented storage (with primary key)

    CREATE TABLE test_table_row (
        C_CUSTKEY BIGINT PRIMARY KEY,
        C_NAME TEXT,
        C_ADDRESS TEXT,
        C_NATIONKEY INT,
        C_PHONE TEXT,
        C_ACCTBAL DECIMAL(15, 2),
        C_MKTSEGMENT TEXT,
        C_COMMENT TEXT
    )
    WITH (
        orientation = 'row'
    );

    Row-oriented storage (without primary key)

    CREATE TABLE test_table_row_no_pk (
        C_CUSTKEY BIGINT,
        C_NAME TEXT,
        C_ADDRESS TEXT,
        C_NATIONKEY INT,
        C_PHONE TEXT,
        C_ACCTBAL DECIMAL(15, 2),
        C_MKTSEGMENT TEXT,
        C_COMMENT TEXT
    )
    WITH (
        orientation = 'row'
    );
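As a sanity check on the dataset prepared above: per the TPC-H specification, the customer table contains 150,000 rows per unit of scale factor, so ./dbgen -s 1000 -T c yields the 150,000,000 rows listed earlier.

```python
# TPC-H customer table cardinality: 150,000 rows per unit of scale factor (SF).
CUSTOMER_ROWS_PER_SF = 150_000

scale_factor = 1000  # corresponds to ./dbgen -s 1000 -T c
rows = CUSTOMER_ROWS_PER_SF * scale_factor
print(rows)  # 150000000, matching the table information above
```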

Performance testing

Test configuration

This section tests the data import performance of the different write modes.

  1. Log on to the master node of the EMR Spark cluster (for more information, see Log on to the master node of a cluster by connecting to the specific ECS instance), upload the downloaded Spark-Connector package, and then run the following command to start the spark-sql command-line interface (CLI).

    Note

    By adjusting the value of the spark.sql.files.maxPartitionBytes parameter, you can control the concurrency of Spark reading HDFS files. The concurrency in this example is 40.

    # Enter the spark-sql CLI.
    
    spark-sql --jars <path>/hologres-connector-spark-3.x-1.5.2-jar-with-dependencies.jar \
    --conf spark.executor.instances=40 \
    --conf spark.executor.cores=1 \
    --conf spark.executor.memory=4g \
    --conf spark.sql.files.maxPartitionBytes=644245094

    path indicates the root path where the hologres-connector-spark-3.x-1.5.2-jar-with-dependencies.jar package is located.
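Spark splits the input file into read tasks of at most spark.sql.files.maxPartitionBytes bytes each, so the read concurrency is roughly the file size divided by that value, rounded up. A sketch of this approximation with round numbers (it ignores secondary factors such as spark.sql.files.openCostInBytes):

```python
import math

def read_partitions(file_size_bytes: int, max_partition_bytes: int) -> int:
    # Approximation: Spark creates roughly one read task per
    # maxPartitionBytes-sized slice of the input file.
    return math.ceil(file_size_bytes / max_partition_bytes)

# Round-number example: a 10 GiB file read with 256 MiB partitions
# yields 40 read tasks, comparable to the concurrency of 40 above.
print(read_partitions(10 * 1024**3, 256 * 1024**2))  # 40
```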

  2. Execute the following SQL statements in the spark-sql CLI to write data by creating temporary tables.

    Because parameters must be adjusted multiple times to test the write performance of the different modes, this example uses temporary tables.

    Note

    In actual business scenarios, you can directly use catalogs to load Hologres tables, which is more convenient.

    -- Create a temporary table in CSV format
    CREATE TEMPORARY VIEW csvtable (
      c_custkey BIGINT,
      c_name STRING,
      c_address STRING,
      c_nationkey INT,
      c_phone STRING,
      c_acctbal DECIMAL(15, 2),
      c_mktsegment STRING,
      c_comment STRING)
    USING csv OPTIONS (
      path "<spark_resources>/customer.tbl", sep "|"
    );
    
    CREATE TEMPORARY VIEW hologresTable (
      c_custkey BIGINT,
      c_name STRING,
      c_address STRING,
      c_nationkey INT,
      c_phone STRING,
      c_acctbal DECIMAL(15, 2),
      c_mktsegment STRING,
      c_comment STRING)
    USING hologres OPTIONS (
      jdbcurl "jdbc:postgresql://<hologres_vpc_endpoint>/<database_name>",
      username "<accesskey_id>", 
      password "<accesskey_secret>", 
      table "<table_name>",
      direct_connect "false",
      write.mode "auto",
      write.insert.thread_size "3",
      write.insert.batch_size "2048"
    );
    
    INSERT INTO hologresTable SELECT * FROM csvTable;

    The following table describes the parameters.

    | Parameter | Description |
    | --- | --- |
    | spark_resources | The path that is configured for the Root Storage Directory of Cluster parameter when you create the EMR Spark cluster. You can log on to the EMR on ECS console, click the target cluster ID, and obtain the root storage path of the cluster in the Cluster Information section of the Basic Information tab. |
    | hologres_vpc_endpoint | The VPC endpoint of the Hologres instance. You can log on to the Hologres console, click the target instance ID, and obtain the VPC endpoint in the Network Information section of the Instance Details page. For example, the VPC endpoint in the China (Hangzhou) region is in the format of <Instance ID>-cn-hangzhou-vpc-st.hologres.aliyuncs.com:80. |
    | database_name | The name of the database in the Hologres instance. |
    | accesskey_id | The AccessKey ID of an account that has read and write permissions on the Hologres database. |
    | accesskey_secret | The AccessKey secret of an account that has read and write permissions on the Hologres database. |
    | table_name | The name of the destination Hologres table. |
    | write.mode | The write mode. Valid values: auto (default value; the connector automatically selects the optimal mode), insert (the INSERT INTO VALUES mode), stream (the FIXED COPY streaming mode), bulk_load (the COPY mode for batch import to tables without primary keys), and bulk_load_on_conflict (the COPY mode for batch import to tables with primary keys). |
    | write.insert.thread_size | The write concurrency. Takes effect only in the INSERT INTO VALUES mode. |
    | write.insert.batch_size | The batch write size. Takes effect only in the INSERT INTO VALUES mode. |
    | write.on_conflict_action | The primary key conflict policy. Valid values: INSERT_OR_REPLACE (default value; update on primary key conflicts) and INSERT_OR_IGNORE (ignore on primary key conflicts). |

    For more parameter information, see Parameters.
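Because the test repeats the write with different write.mode values, it can help to script the view DDL instead of editing it by hand. The following is a minimal, hypothetical helper (not part of the connector) that renders the CREATE TEMPORARY VIEW statement from column and option dicts; the angle-bracket placeholders are the same ones described above and must be replaced with real values.

```python
def hologres_view_ddl(view_name: str, columns: dict, options: dict) -> str:
    """Render a CREATE TEMPORARY VIEW statement for the Hologres
    Spark connector from column and option dicts (sketch only)."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    opts = ",\n  ".join(f'{key} "{value}"' for key, value in options.items())
    return (f"CREATE TEMPORARY VIEW {view_name} (\n  {cols})\n"
            f"USING hologres OPTIONS (\n  {opts}\n);")

ddl = hologres_view_ddl(
    "hologresTable",
    {"c_custkey": "BIGINT", "c_name": "STRING"},
    {
        "jdbcurl": "jdbc:postgresql://<hologres_vpc_endpoint>/<database_name>",
        "table": "<table_name>",
        "write.mode": "bulk_load",  # swap per test run: insert / stream / ...
    },
)
print(ddl)
```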

Test scenarios

| Test scenario | Available options |
| --- | --- |
| Table storage format | Row-oriented storage; column-oriented storage; row-column hybrid storage |
| Data update method | Append-only write to tables without primary keys; first write to empty tables with primary keys; full row update to tables with primary keys |
| Data writing method | insert (INSERT INTO VALUES); stream (FIXED COPY); bulk_load (COPY-based import to tables without primary keys); bulk_load_on_conflict (COPY-based import to tables with primary keys) |

Test results

The test results include the following fields.

Field

Description

Total job execution duration

The total execution duration of the Spark job.

This field indicates the time taken to complete the INSERT operation in the spark-sql CLI on the master node of the EMR Spark cluster. (Figure: spark-sql output showing the execution time.)

Average data write duration

The average time spent on data writing only, excluding the time for Spark cluster scheduling, data reading, and data redistribution.

You can execute the following SQL statement in HoloWeb of the Hologres console to obtain the Shard concurrency (count) and average data write duration (avg_duration_ms, in milliseconds).

SELECT
    COUNT(*), AVG(duration) AS avg_duration_ms
FROM
    hologres.hg_query_log
WHERE
    query_start >= '<start_time>' AND query_start <= '<end_time>'
    AND query LIKE '%<hologres_table_name>%' AND command_tag = 'COPY';

Parameters:

  • start_time: The start time of data writing, such as 2025-04-11 15:00:00.

  • end_time: The end time of data writing, such as 2025-04-11 15:10:00.

  • hologres_table_name: The name of the destination Hologres table, such as test_table_column.

Hologres load

The CPU utilization of the Hologres instance.

You can obtain the CPU utilization on the Monitoring Information page of the instance in the Hologres console.

Spark load

The EMR node load.

You can log on to the EMR on ECS console, click the cluster ID, and then click the Monitoring and Diagnostics tab. Then, click the Metric Monitoring tab, and configure the following parameters:

  • Dashboard: Select HOST.

  • nodeGroupId: Select the core node group ID of the cluster.

  • hostname: Select each core node.

  • Select Time: Select the time period when data writing was performed in Spark.

Then query the CPU utilization metric, which is the Spark load.

The following table provides the detailed test results.

| Storage format | Primary key | Write mode | Total job execution duration | Average data write duration | Hologres load | Spark load |
| --- | --- | --- | --- | --- | --- | --- |
| Column-oriented storage | No primary key | insert | 241.61s | 232.70s | 92% | 15% |
| Column-oriented storage | No primary key | stream | 228.11s | 222.34s | 100% | 36% |
| Column-oriented storage | No primary key | bulk_load | 88.72s | 57.16s | 97% | 47% |
| Column-oriented storage | With primary key (Ignore) | insert | 190.96s | 172.60s | 90% | 14% |
| Column-oriented storage | With primary key (Ignore) | stream | 149.60s | 142.16s | 100% | 14% |
| Column-oriented storage | With primary key (Ignore) | bulk_load_on_conflict | 115.96s | 42.92s | 60% | 75% |
| Column-oriented storage | With primary key (Replace) | insert | 600.40s | 574.31s | 91% | 5% |
| Column-oriented storage | With primary key (Replace) | stream | 550.29s | 540.32s | 100% | 5% |
| Column-oriented storage | With primary key (Replace) | bulk_load_on_conflict | 188.05s | 109.77s | 93% | 78% |
| Row-oriented storage | No primary key | insert | 132.38s | 123.79s | 94% | 22% |
| Row-oriented storage | No primary key | stream | 114.41s | 103.81s | 100% | 17% |
| Row-oriented storage | No primary key | bulk_load | 68.20s | 41.22s | 98% | 32% |
| Row-oriented storage | With primary key (Ignore) | insert | 190.48s | 170.49s | 89% | 15% |
| Row-oriented storage | With primary key (Ignore) | stream | 185.46s | 172.48s | 85% | 14% |
| Row-oriented storage | With primary key (Ignore) | bulk_load_on_conflict | 117.81s | 47.69s | 58% | 75% |
| Row-oriented storage | With primary key (Replace) | insert | 177.97s | 170.78s | 93% | 15% |
| Row-oriented storage | With primary key (Replace) | stream | 142.44s | 130.16s | 100% | 20% |
| Row-oriented storage | With primary key (Replace) | bulk_load_on_conflict | 137.69s | 65.18s | 92% | 78% |
| Row-column hybrid storage | With primary key (Ignore) | insert | 172.19s | 158.74s | 86% | 16% |
| Row-column hybrid storage | With primary key (Ignore) | stream | 150.63s | 149.76s | 100% | 12% |
| Row-column hybrid storage | With primary key (Ignore) | bulk_load_on_conflict | 128.83s | 42.09s | 59% | 79% |
| Row-column hybrid storage | With primary key (Replace) | insert | 690.37s | 662.00s | 92% | 5% |
| Row-column hybrid storage | With primary key (Replace) | stream | 625.84s | 623.08s | 100% | 4% |
| Row-column hybrid storage | With primary key (Replace) | bulk_load_on_conflict | 202.07s | 121.58s | 93% | 80% |

Note

The mappings of write modes are as follows:

  • insert: INSERT INTO VALUES

  • stream: FIXED COPY

  • bulk_load: COPY-based import to tables without primary keys

  • bulk_load_on_conflict: COPY-based import to tables with primary keys
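Reading the average data write durations above, the COPY modes are roughly 3 to 5 times faster than the streaming modes. For example, for a column-oriented table without a primary key:

```python
# Average data write durations (seconds) from the results above:
# column-oriented storage, no primary key.
durations = {"insert": 232.70, "stream": 222.34, "bulk_load": 57.16}

speedup_vs_insert = durations["insert"] / durations["bulk_load"]
speedup_vs_stream = durations["stream"] / durations["bulk_load"]
print(f"{speedup_vs_insert:.1f}x, {speedup_vs_stream:.1f}x")  # 4.1x, 3.9x
```

Note that the bulk-load rows also show higher Spark load: the COPY modes shift more of the work to the client side, which matches the client resource consumption column in the mode comparison table.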