All Products
Search
Document Center

Hologres:Best practices for batch write

Last Updated:Mar 26, 2026

Hologres is a one-stop real-time data warehouse engine that is compatible with the PostgreSQL protocol. It supports writing, updating, and querying massive amounts of data in both real-time and offline scenarios. This topic explains how to choose the optimal write mode for Spark batch writing to Hologres, using performance test results to guide your decision.

For performance test results for real-time data writing and updates, see Best practices for stress testing in data write, update, and point query scenarios.

This topic focuses on batch write scenarios. FIXED COPY and INSERT INTO VALUES modes are included for comparison only. For streaming write best practices, refer to the dedicated streaming documentation.

Choose a write mode

Hologres supports three write modes for batch data: COPY, FIXED COPY, and INSERT INTO VALUES.

Default recommendation: Use COPY mode unless you need real-time data visibility or cannot tolerate a table lock during import.

Use the following decision tree to select the right mode for your scenario.

Decision tree for selecting a batch write mode

Write mode comparison

The following table compares the three modes across key dimensions.

<table> <thead> <tr> <td><p><b>Comparison item</b></p></td> <td><p><b>COPY</b></p></td> <td><p><b>FIXED COPY</b></p></td> <td><p><b>INSERT INTO VALUES</b></p></td> </tr> </thead> <colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup> <tbody> <tr> <td><p><b>Introduction</b></p></td> <td><p>Batch import to a table without a primary key using COPY</p></td> <td><p>Batch import to a table with a primary key using COPY</p></td> <td><p>A stream import mode developed based on the COPY protocol</p></td> <td><p>A basic stream import mode</p></td> </tr> <tr> <td><p><b>Typical scenarios</b></p></td> <td> <ul> <li><p>Spark batch import</p></li> <li><p>Open source Flink batch import</p></li> </ul></td> <td> <ul> <li><p>Flink import</p></li> <li><p>DataWorks data integration import</p></li> <li><p>Spark stream import</p></li> </ul></td> <td> <ul> <li><p>Flink import</p></li> <li><p>DataWorks data integration import</p></li> <li><p>Spark stream import</p></li> </ul></td> </tr> <tr> <td><p><b>Lock granularity</b></p></td> <td><p>Row locks</p></td> <td><p>Table lock</p></td> <td><p>Row locks</p></td> <td><p>Row locks</p></td> </tr> <tr> <td><p><b>Data visibility</b></p></td> <td><p>Visible after the COPY operation is complete</p></td> <td><p>Visible after the COPY operation is complete</p></td> <td><p>Real-time visibility</p></td> <td><p>Real-time visibility</p></td> </tr> <tr> <td><p><b>Performance</b></p></td> <td><p>High</p></td> <td><p>High</p></td> <td><p>Medium</p></td> <td><p>Medium</p></td> </tr> <tr> <td><p><b>Hologres resource consumption</b></p></td> <td><p>Low</p></td> <td><p>Low</p></td> <td><p>High</p></td> <td><p>High</p></td> </tr> <tr> <td><p><b>Client resource consumption</b></p></td> <td><p>Low</p></td> <td><p>High</p></td> <td><p>Low</p></td> <td><p>Middle</p></td> </tr> <tr> <td><p><b>Supported policies for primary key conflicts</b></p></td> <td><p>Not applicable</p></td> <td> <ul> <li><p>NONE</p></li> <li><p>UPDATE: Full row updates are supported from V3.0.4. Partial updates are supported from V3.1.1.</p></li> <li><p>IGNORE: Supported from V3.0.4.</p></li> </ul></td> <td> <ul> <li><p>NONE (an error is reported if a conflict occurs)</p></li> <li><p>UPDATE</p></li> <li><p>IGNORE</p></li> </ul></td> </tr> </tbody> </table>

Mode details

  • COPY mode: The traditional COPY mode provides the best write performance and lowest Hologres resource consumption. Use this for Spark or open source Flink batch imports when you do not need real-time data visibility. Suitable for tables without a primary key.

  • FIXED COPY mode: A stream import mode based on the COPY protocol. It acquires a table lock during import, which prevents concurrent writes from multiple tasks. Data becomes visible after the COPY operation completes. Suitable for tables with a primary key.

    For primary key conflict policies, UPDATE is more resource-intensive than IGNORE because it performs a full row or partial row overwrite. Choose IGNORE when you only need to skip duplicate rows.
  • INSERT INTO VALUES mode: Converts batch imports into traditional stream writes. It no longer has significant advantages in batch writing scenarios. However, because only INSERT mode supports data retraction (deletion), use it for scenarios such as writing binary logging data.

Configure for your integration

Spark batch writes

Upgrade your Hologres instance to V2.2.25 or later. Set the write.mode write parameter to auto (the default). The connector automatically selects the optimal mode.

Flink batch writes

For Flink batch writes, determine which mode to use with the decision tree above, then configure the following parameters:

  • jdbccopywritemode: Set to TRUE to prevent INSERT mode.

  • bulkload: Set to TRUE for COPY mode or FALSE for FIXED COPY mode.

Batch write performance test

This test uses the open source Hologres-Spark-Connector to measure write performance across different modes, table storage formats, and primary key conflict strategies.

Prerequisites

Important

The Hologres instance and the EMR-Spark cluster must be in the same region and use the same Virtual Private Cloud (VPC).

Before running the test, prepare the following:

The test environment used in this topic:

Service Version Specifications
Hologres V3.0.30 64 cores, 256 GB (1 CU = 1 core and 4 GB memory)
EMR-Spark EMR-5.18.1, Spark-3.5.3 8 cores, 32 GB x 8 (1 master node, 7 core nodes)
Spark-Connector 1.5.2 Not applicable

Prepare the test data

Step 1: Generate raw data

  1. Log on to the master node of the EMR-Spark cluster. See Methods for connecting to an ECS instance.

  2. Download TPC-H_Tools_v3.0.0.zip, copy the package to the master node, decompress it, and navigate to the TPC-H_Tools_v3.0.0/TPC-H_Tools_v3.0.0/dbgen directory.

  3. Run the following command to generate the 1 TB customer.tbl test file (the raw .tbl file is 23 GB) in the dbgen directory.

    Table information Description
    Number of fields 8
    Field types INT, BIGINT, TEXT, DECIMAL
    Number of rows 150,000,000
    Number of shards 40
    ./dbgen -s 1000 -T c

    The customer table in the TPC-H 1 TB test dataset has the following structure:

Step 2: Upload data to Spark

Run the following command to upload customer.tbl to the Spark cluster, where <spark_resource> is the Root Path of Cluster Storage you configured when creating the cluster.

hadoop fs -put customer.tbl <spark_resource>

Step 3: Create destination tables in Hologres

Create tables in each storage format to test all scenarios.

Columnar and row-oriented

CREATE TABLE test_table_mixed (
    C_CUSTKEY BIGINT PRIMARY KEY,
    C_NAME TEXT,
    C_ADDRESS TEXT,
    C_NATIONKEY INT,
    C_PHONE TEXT,
    C_ACCTBAL DECIMAL(15, 2),
    C_MKTSEGMENT TEXT,
    C_COMMENT TEXT
)
WITH (
    orientation = 'column,row'
);

Columnar (with primary key)

CREATE TABLE test_table_column (
    C_CUSTKEY BIGINT PRIMARY KEY,
    C_NAME TEXT,
    C_ADDRESS TEXT,
    C_NATIONKEY INT,
    C_PHONE TEXT,
    C_ACCTBAL DECIMAL(15, 2),
    C_MKTSEGMENT TEXT,
    C_COMMENT TEXT
)
WITH (
    orientation = 'column'
);

Columnar (without primary key)

CREATE TABLE test_table_column_no_pk (
    C_CUSTKEY BIGINT,
    C_NAME TEXT,
    C_ADDRESS TEXT,
    C_NATIONKEY INT,
    C_PHONE TEXT,
    C_ACCTBAL DECIMAL(15, 2),
    C_MKTSEGMENT TEXT,
    C_COMMENT TEXT
)
WITH (
    orientation = 'column'
);

Row-oriented (with primary key)

CREATE TABLE test_table_row (
    C_CUSTKEY BIGINT PRIMARY KEY,
    C_NAME TEXT,
    C_ADDRESS TEXT,
    C_NATIONKEY INT,
    C_PHONE TEXT,
    C_ACCTBAL DECIMAL(15, 2),
    C_MKTSEGMENT TEXT,
    C_COMMENT TEXT
)
WITH (
    orientation = 'row'
);

Row-oriented (without primary key)

CREATE TABLE test_table_row_no_pk (
    C_CUSTKEY BIGINT,
    C_NAME TEXT,
    C_ADDRESS TEXT,
    C_NATIONKEY INT,
    C_PHONE TEXT,
    C_ACCTBAL DECIMAL(15, 2),
    C_MKTSEGMENT TEXT,
    C_COMMENT TEXT
)
WITH (
    orientation = 'row'
);

Run the performance test

Test configuration

  1. Log on to the master node of the EMR-Spark cluster. See Methods for connecting to an ECS instance. Upload the Spark-Connector JAR and run the following command to start the spark-sql interactive interface.

    Adjust spark.sql.files.maxPartitionBytes to control the concurrency for Spark to read HDFS files. This test uses a concurrency of 40.
    # Start the spark-sql interactive interface
    spark-sql --jars <path>/hologres-connector-spark-3.x-1.5.2-jar-with-dependencies.jar \
    --conf spark.executor.instances=40 \
    --conf spark.executor.cores=1 \
    --conf spark.executor.memory=4g \
    --conf spark.sql.files.maxPartitionBytes=644245094

    Replace <path> with the directory containing the hologres-connector-spark-3.x-1.5.2-jar-with-dependencies.jar file.

  2. In the spark-sql interface, run the following SQL to write data using a temporary table. A temporary table lets you adjust parameters across multiple test runs.

    In production, use a Catalog to load Hologres tables directly, which simplifies the process.
    Placeholder Description Example
    <spark_resources> The Root Path of Cluster Storage configured when creating the EMR-Spark cluster. Find it on the Basic Information page of your cluster in the EMR on ECS console.
    <hologres_vpc_endpoint> The VPC Domain Name of the Hologres instance. Find it in the network information section of the Hologres Management Console. Format: <Instance ID>-cn-hangzhou-vpc-st.hologres.aliyuncs.com:80
    <database_name> The name of the database in the Hologres instance.
    <accesskey_id> The AccessKey ID with read permissions on the Hologres database.
    <accesskey_secret> The AccessKey secret with read permissions on the Hologres database.
    <table_name> The name of the destination table in Hologres. test_table_column
    Value Write mode
    auto Default. The connector selects the optimal mode.
    insert INSERT INTO VALUES
    stream FIXED COPY
    bulk_load COPY import to a table without a primary key
    bulk_load_on_conflict COPY import to a table with a primary key
    Parameter Description
    write.insert.thread_size Write concurrency. Takes effect only in INSERT mode.
    write.insert.batch_size Write batch size. Takes effect only in INSERT mode.
    write.on_conflict_action INSERT_OR_REPLACE (default): updates on primary key conflict. INSERT_OR_IGNORE: skips the row on primary key conflict.
    -- Create a temporary table in CSV format.
    CREATE TEMPORARY VIEW csvtable (
      c_custkey BIGINT,
      c_name STRING,
      c_address STRING,
      c_nationkey INT,
      c_phone STRING,
      c_acctbal DECIMAL(15, 2),
      c_mktsegment STRING,
      c_comment STRING)
    USING csv OPTIONS (
      path "<spark_resources>/customer.tbl", sep "|"
    );
    
    CREATE TEMPORARY VIEW hologresTable (
      c_custkey BIGINT,
      c_name STRING,
      c_address STRING,
      c_nationkey INT,
      c_phone STRING,
      c_acctbal DECIMAL(15, 2),
      c_mktsegment STRING,
      c_comment STRING)
    USING hologres OPTIONS (
      jdbcurl "jdbc:postgresql://<hologres_vpc_endpoint>/<database_name>",
      username "<accesskey_id>",
      password "<accesskey_secret>",
      table "<table_name>",
      direct_connect "false",
      write.mode "auto",
      write.insert.thread_size "3",
      write.insert.batch_size "2048"
    );
    
    INSERT INTO hologresTable SELECT * FROM csvTable;

    Replace the following placeholders: write.mode parameter values: Additional write parameters: For the full parameter reference, see Parameter descriptions.

Test scenarios

Dimension Variants tested
Table storage format Row-oriented, columnar, columnar and row-oriented
Data update method Append-only (no primary key), initial write to empty table (primary key), full row update (primary key)
Write mode insert, stream, bulk_load, bulk_load_on_conflict

Test metrics

Each test run captures the following four metrics:

  • Total job duration: The total runtime of the Spark job, measured from when you run the INSERT statement in the spark-sql interface to when it completes.

  • Average data write duration: The average duration of the write operation only, excluding Spark cluster scheduling, data reading, and data redistribution overhead. Run the following SQL on the HoloWeb page of the Hologres Management Console to get the shard concurrency (count) and average write duration (avg_duration_ms, in milliseconds):

    SELECT
     COUNT(*), AVG(duration) AS avg_duration_ms
    FROM
     hologres.hg_query_log
    WHERE
     query_start >= '<start_time>' AND query_start <= '<end_time>'
     AND query LIKE '%<test_table_column>%' AND command_tag = 'COPY';

    Replace <start_time> and <end_time> with the write start and end times (for example, 2025-04-11 15:00:00), and <test_table_column> with the destination table name.

  • Hologres load: CPU utilization of the Hologres instance. Get this from the Monitoring Information page of the instance in the Hologres Management Console.

  • Spark load: EMR node CPU utilization during the write window. In the EMR on ECS console, click your cluster ID, go to Monitoring and Diagnostics > Metric Monitoring, set Dashboard to HOST, filter by nodeGroupId and hostname for the core nodes, set the time range to the write window, and query the CPU utilization metric.

Test results

Write mode values in the table correspond to: insert = INSERT INTO VALUES, stream = FIXED COPY, bulk_load = COPY import into a table without a primary key, bulk_load_on_conflict = COPY import into a table with a primary key.

Storage format

Primary key

Write mode

Total job duration

Average data write duration

Hologres load

Spark load

Columnar

Without primary key

insert

241.61s

232.70s

92%

15%

stream

228.11s

222.34s

100%

36%

bulk_load

88.72s

57.16s

97%

47%

With primary key

ignore

insert

190.96s

172.60s

90%

14%

stream

149.60s

142.16s

100%

14%

bulk_load_on_conflict

115.96s

42.92s

60%

75%

With primary key

replace

insert

600.40s

574.31s

91%

5%

stream

550.29s

540.32s

100%

5%

bulk_load_on_conflict

188.05s

109.77s

93%

78%

Row-oriented

Without primary key

insert

132.38s

123.79s

94%

22%

stream

114.41s

103.81s

100%

17%

bulk_load

68.20s

41.22s

98%

32%

With primary key

ignore

insert

190.48s

170.49s

89%

15%

stream

185.46s

172.48s

85%

14%

bulk_load_on_conflict

117.81s

47.69s

58%

75%

With primary key

replace

insert

177.97s

170.78s

93%

15%

stream

142.44s

130.16s

100%

20%

bulk_load_on_conflict

137.69s

65.18s

92%

78%

Columnar and row-oriented

With primary key

ignore

insert

172.19s

158.74s

86%

16%

stream

150.63s

149.76s

100%

12%

bulk_load_on_conflict

128.83s

42.09s

59%

79%

With primary key

replace

insert

690.37s

662.00s

92%

5%

stream

625.84s

623.08s

100%

4%

bulk_load_on_conflict

202.07s

121.58s

93%

80%

Key observations:

  • bulk_load and bulk_load_on_conflict (COPY mode) consistently deliver the shortest total job duration across all storage formats, typically 1.3–3x faster than insert or stream for the same scenario.

  • For replace (full row update) scenarios, the performance gap is most pronounced: bulk_load_on_conflict completes in 188s vs. 600s for insert on columnar tables with primary keys.

  • COPY modes shift work to the Spark client (Spark load: 32–80%) and reduce Hologres CPU load (58–98%), compared to INSERT and FIXED COPY (Spark load: 4–36%, Hologres load: 85–100%).

  • For tables without a primary key, bulk_load is the clear choice: shortest job duration with moderate resource usage on both sides.