Hologres is a one-stop real-time data warehouse engine compatible with the PostgreSQL protocol. It supports massive data write, update, and query in real-time and offline integrated scenarios. This topic describes how to choose an appropriate write mode for different batch write scenarios based on performance testing results of using Spark to batch write data to Hologres.
For real-time data write and update, see Best practices for performance tests on data writes, data updates, and point queries for performance testing plans and results.
Comparison of batch write modes
For batch data write scenarios, Hologres supports different write modes, including the traditional COPY mode, FIXED COPY streaming import mode developed based on the COPY protocol, and INSERT INTO VALUES mode that converts batch import into streaming import.
The following table provides a detailed comparison of the three write modes.
Comparison item | COPY (without primary key) | COPY (with primary key) | FIXED COPY | INSERT INTO VALUES
Introduction | COPY-based batch import to tables without primary keys | COPY-based batch import to tables with primary keys | Streaming import mode developed based on the COPY protocol | Basic streaming import mode
Typical scenarios | | | |
Lock granularity | Row lock | Table lock | Row lock | Row lock
Data visibility | Visible after the COPY operation completes | Visible after the COPY operation completes | Visible in real time | Visible in real time
Performance | High | High | Medium | Medium
Hologres resource consumption | Low | Low | High | High
Client resource consumption | Low | High | Low | Medium
Primary key conflict policies | N/A | | |
Select a batch write mode
In batch data write scenarios, the characteristics of the three modes are as follows:
COPY: the traditional COPY mode, with the optimal write performance and lowest Hologres resource consumption.
FIXED COPY: a streaming import mode developed based on the COPY protocol. It only generates row locks, and data is visible in real time.
INSERT INTO VALUES: converts batch import into traditional streaming write. This mode has no obvious advantages in batch write scenarios.
Note: Although the INSERT INTO VALUES mode has no obvious advantage in batch write scenarios, it is the only mode that supports data retraction (deletion), which is still required in scenarios such as binary log data writing.
If you have no special requirements for real-time data visibility, lock granularity, or data source load, you can prioritize the COPY mode for batch writing. Note that when a table lock is generated, the table cannot support multiple data write tasks simultaneously.
Use Spark for batch writing: We recommend that you upgrade your Hologres instance to V2.2.25 or later and set the connector write parameter `write.mode` to `auto` (the default value). The system then automatically selects the optimal write mode.
Use Flink for batch writing: First determine whether to use the COPY mode or the FIXED COPY mode according to the decision tree below, and then configure the following parameters:
jdbccopywritemode: Set this parameter to TRUE, which means the INSERT INTO VALUES mode is not used.
bulkload: TRUE indicates the COPY mode, and FALSE indicates the FIXED COPY mode. Configure this parameter based on your business requirements.
You can select an appropriate write mode according to the following decision tree.
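As a rough illustration only (this is an inferred sketch based on the comparison table above, not the official decision tree), the choice between the two COPY-based modes can be expressed as:

```python
def choose_write_mode(needs_realtime_visibility: bool,
                      table_has_primary_key: bool,
                      table_lock_acceptable: bool) -> str:
    """Pick a Hologres batch write mode from the trade-offs in the
    comparison table: COPY has the best throughput, but COPY into a
    table with a primary key takes a table lock; FIXED COPY keeps row
    locks and real-time visibility at lower throughput."""
    if needs_realtime_visibility:
        return "FIXED COPY"  # row locks only, data visible in real time
    if table_has_primary_key and not table_lock_acceptable:
        return "FIXED COPY"  # COPY into a table with a primary key would lock the table
    return "COPY"            # best throughput, lowest Hologres resource consumption

print(choose_write_mode(False, False, True))  # COPY
print(choose_write_mode(True, True, True))    # FIXED COPY
```

The INSERT INTO VALUES mode is omitted here because, as noted above, it has no advantage in pure batch write scenarios.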
Batch write performance testing
The testing process uses the open-source Hologres Spark connector (hologres-connector-spark-3.x) developed by the Hologres team.
Preparations
Prepare basic environments
You need to prepare the following environment:
The Hologres instance and EMR Spark cluster must reside in the same region and use the same VPC.
Purchase a Hologres instance of V2.2.25 or later, and create a database.
Create an EMR Spark cluster. The Spark version must be 3.3.0 or later. For more information, see Create a cluster.
Download the Spark-Connector package.
You can download the connector JAR package hologres-connector-spark-3.x, which is required for Spark to read data from and write data to Hologres, from the Maven Central Repository.
The following environment information is used in this topic.
Service | Version | Specifications |
Hologres | V3.0.30 | 64 cores, 256 GB (1 CU = 1 core and 4 GB) |
EMR Spark | EMR V5.18.1, Spark 3.5.3 | 8 cores, 32 GB × 8 (1 master node and 7 core nodes) Important You need to activate the OSS-HDFS service. |
Spark-Connector | 1.5.2 | N/A |
Prepare test data
Prepare raw data.
Log on to the master node of the EMR Spark cluster. For more information, see Log on to the master node of a cluster by connecting to the specific ECS instance.
Click TPC-H_Tools_v3.0.0.zip to download the TPC-H tool, copy it to the ECS instance of the master node, extract it, and then go to the TPC-H_Tools_v3.0.0/TPC-H_Tools_v3.0.0/dbgen directory.
Run the following command to generate the customer.tbl test data file of the TPC-H 1 TB dataset in the dbgen directory. The generated .tbl file is about 23 GB in size.

```shell
./dbgen -s 1000 -T c
```
The following table provides the detailed information about the customer table in the TPC-H 1 TB dataset.
Table information | Description
Number of fields | 8
Field types | INT, BIGINT, TEXT, DECIMAL
Number of data rows | 150,000,000
Number of table shards | 40
Import test data to Spark.
Run the following command to upload the customer.tbl file to the Spark cluster:

```shell
hadoop fs -put customer.tbl <spark_resource>
```

spark_resource indicates the path that is configured for the Root Storage Directory of Cluster parameter when you create the EMR Spark cluster.
Create tables with different storage formats in Hologres. Sample SQL statements:
Row-column hybrid storage

```sql
CREATE TABLE test_table_mixed (
    C_CUSTKEY BIGINT PRIMARY KEY,
    C_NAME TEXT,
    C_ADDRESS TEXT,
    C_NATIONKEY INT,
    C_PHONE TEXT,
    C_ACCTBAL DECIMAL(15, 2),
    C_MKTSEGMENT TEXT,
    C_COMMENT TEXT
) WITH (
    orientation = 'column,row'
);
```

Column-oriented storage (with primary key)

```sql
CREATE TABLE test_table_column (
    C_CUSTKEY BIGINT PRIMARY KEY,
    C_NAME TEXT,
    C_ADDRESS TEXT,
    C_NATIONKEY INT,
    C_PHONE TEXT,
    C_ACCTBAL DECIMAL(15, 2),
    C_MKTSEGMENT TEXT,
    C_COMMENT TEXT
) WITH (
    orientation = 'column'
);
```

Column-oriented storage (without primary key)

```sql
CREATE TABLE test_table_column_no_pk (
    C_CUSTKEY BIGINT,
    C_NAME TEXT,
    C_ADDRESS TEXT,
    C_NATIONKEY INT,
    C_PHONE TEXT,
    C_ACCTBAL DECIMAL(15, 2),
    C_MKTSEGMENT TEXT,
    C_COMMENT TEXT
) WITH (
    orientation = 'column'
);
```

Row-oriented storage (with primary key)

```sql
CREATE TABLE test_table_row (
    C_CUSTKEY BIGINT PRIMARY KEY,
    C_NAME TEXT,
    C_ADDRESS TEXT,
    C_NATIONKEY INT,
    C_PHONE TEXT,
    C_ACCTBAL DECIMAL(15, 2),
    C_MKTSEGMENT TEXT,
    C_COMMENT TEXT
) WITH (
    orientation = 'row'
);
```

Row-oriented storage (without primary key)

```sql
CREATE TABLE test_table_row_no_pk (
    C_CUSTKEY BIGINT,
    C_NAME TEXT,
    C_ADDRESS TEXT,
    C_NATIONKEY INT,
    C_PHONE TEXT,
    C_ACCTBAL DECIMAL(15, 2),
    C_MKTSEGMENT TEXT,
    C_COMMENT TEXT
) WITH (
    orientation = 'row'
);
```
Performance testing
Test configuration
This section tests the import performance of the different write modes.
Log on to the master node of the EMR Spark cluster (for the logon method, see Log on to the master node of a cluster by connecting to the specific ECS instance), upload the downloaded Spark-Connector package, and then run the following command to enter the spark-sql command-line interface (CLI).
Note: By adjusting the value of the spark.sql.files.maxPartitionBytes parameter, you can control the concurrency with which Spark reads HDFS files. The concurrency in this example is 40.

```shell
# Enter the spark-sql CLI.
spark-sql --jars <path>/hologres-connector-spark-3.x-1.5.2-jar-with-dependencies.jar \
  --conf spark.executor.instances=40 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=4g \
  --conf spark.sql.files.maxPartitionBytes=644245094
```

path indicates the root path where the hologres-connector-spark-3.x-1.5.2-jar-with-dependencies.jar package is located.
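A value like the 644245094 used above (which works out to roughly 24 GiB / 40) can be derived from the source file size and the desired read parallelism. The helper below is a convenience sketch, not part of the connector or of Spark:

```python
import math

def max_partition_bytes(file_size_bytes: int, target_partitions: int) -> int:
    """Return a spark.sql.files.maxPartitionBytes value that splits a
    file of the given size into about target_partitions read tasks."""
    return math.ceil(file_size_bytes / target_partitions)

# The 23 GB customer.tbl file split into 40 read tasks:
print(max_partition_bytes(23 * 1024**3, 40))  # 617401549 bytes (~589 MiB) per task
```

Larger values produce fewer, bigger read partitions; smaller values increase read concurrency at the cost of more tasks.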
Execute the following SQL statements in the spark-sql CLI to write data by creating temporary tables.
Because parameters need to be adjusted multiple times during testing to measure the write performance of the different modes, this example uses temporary tables.
Note: In actual business scenarios, you can directly use catalogs to load Hologres tables, which is more convenient.

```sql
-- Create a temporary table in CSV format.
CREATE TEMPORARY VIEW csvtable (
    c_custkey BIGINT,
    c_name STRING,
    c_address STRING,
    c_nationkey INT,
    c_phone STRING,
    c_acctbal DECIMAL(15, 2),
    c_mktsegment STRING,
    c_comment STRING)
USING csv OPTIONS (
    path "<spark_resources>/customer.tbl",
    sep "|"
);

-- Create a temporary table that maps to the destination Hologres table.
CREATE TEMPORARY VIEW hologresTable (
    c_custkey BIGINT,
    c_name STRING,
    c_address STRING,
    c_nationkey INT,
    c_phone STRING,
    c_acctbal DECIMAL(15, 2),
    c_mktsegment STRING,
    c_comment STRING)
USING hologres OPTIONS (
    jdbcurl "jdbc:postgresql://<hologres_vpc_endpoint>/<database_name>",
    username "<accesskey_id>",
    password "<accesskey_secret>",
    table "<table_name>",
    direct_connect "false",
    write.mode "auto",
    write.insert.thread_size "3",
    write.insert.batch_size "2048"
);

-- Write the data.
INSERT INTO hologresTable SELECT * FROM csvtable;
```
The following table describes the parameters.
Parameter | Description
spark_resources | The path that is configured for the Root Storage Directory of Cluster parameter when you create the EMR Spark cluster. You can log on to the EMR on ECS console, click the target cluster ID, and obtain the root storage path of the cluster in the Cluster Information section of the Basic Information tab.
hologres_vpc_endpoint | The VPC endpoint of the Hologres instance. You can log on to the Hologres console, click the target instance ID, and obtain the VPC endpoint in the Network Information section of the Instance Details page. For example, a VPC endpoint in the China (Hangzhou) region is in the format of <Instance ID>-cn-hangzhou-vpc-st.hologres.aliyuncs.com:80.
database_name | The name of the database in the Hologres instance.
accesskey_id | The AccessKey ID of an account with read and write permissions on the corresponding Hologres database.
accesskey_secret | The AccessKey secret of an account with read and write permissions on the corresponding Hologres database.
table_name | The name of the destination Hologres table.
write.mode | The write mode. Valid values: auto (default): the connector automatically selects the optimal mode. insert: use the INSERT INTO VALUES mode. stream: use the FIXED COPY mode for streaming writing. bulk_load: use the COPY mode for batch import to tables without primary keys. bulk_load_on_conflict: use the COPY mode for batch import to tables with primary keys.
write.insert.thread_size | The write concurrency. This parameter takes effect only in the INSERT INTO VALUES mode.
write.insert.batch_size | The batch write size. This parameter takes effect only in the INSERT INTO VALUES mode.
write.on_conflict_action | The primary key conflict policy. INSERT_OR_REPLACE (default): update the row on a primary key conflict. INSERT_OR_IGNORE: ignore the new row on a primary key conflict.
For more parameter information, see Parameters.
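For illustration only, the two primary key conflict policies behave like the following dictionary-based sketch (in reality Hologres applies the policy server-side during the write):

```python
def apply_writes(table: dict, rows: list, on_conflict: str) -> dict:
    """Emulate write.on_conflict_action semantics on a {pk: value} table:
    INSERT_OR_REPLACE overwrites the existing row on a primary key
    conflict; INSERT_OR_IGNORE keeps the existing row."""
    for pk, value in rows:
        if pk in table and on_conflict == "INSERT_OR_IGNORE":
            continue          # conflict: keep the old row
        table[pk] = value     # insert, or replace on conflict
    return table

base = {1: "old"}
print(apply_writes(dict(base), [(1, "new"), (2, "x")], "INSERT_OR_REPLACE"))  # {1: 'new', 2: 'x'}
print(apply_writes(dict(base), [(1, "new"), (2, "x")], "INSERT_OR_IGNORE"))   # {1: 'old', 2: 'x'}
```

This difference explains the "ignore" and "replace" rows in the test results below: replacing on conflict does strictly more work than ignoring.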
Test scenarios
Test scenario | Available options
Table storage format | Column-oriented storage, row-oriented storage, and row-column hybrid storage
Data update method | No primary key, with primary key (ignore on conflict), and with primary key (replace on conflict)
Data writing method | insert (INSERT INTO VALUES), stream (FIXED COPY), bulk_load (COPY without primary keys), and bulk_load_on_conflict (COPY with primary keys)
Test results
The test results include the following fields.
Field | Description
Total job execution duration | The total execution duration of the Spark job, that is, the time taken to complete the INSERT operation in the spark-sql CLI on the master node of the EMR Spark cluster. The following figure shows how to obtain the time.
Average data write duration | The average time spent only on data writing, excluding the time for Spark cluster scheduling, data reading, and data redistribution. You can execute the following SQL statement in HoloWeb of the Hologres console to obtain the shard concurrency (count) and the average data write duration (avg_duration_ms, in milliseconds). Parameters:
Hologres load | The CPU utilization of the Hologres instance. You can obtain the CPU utilization on the Monitoring Information page of the instance in the Hologres console.
Spark load | The EMR node load. You can log on to the EMR on ECS console, click the cluster ID, and then click the Monitoring and Diagnostics tab. Then, click the Metric Monitoring tab, configure the following parameters, and query the CPU utilization metric, which is the Spark load.
The following table provides the detailed test results.
Storage format | Primary key | Write mode | Total job execution duration | Average data write duration | Hologres load | Spark load
Column-oriented storage | No primary key | insert | 241.61s | 232.70s | 92% | 15%
Column-oriented storage | No primary key | stream | 228.11s | 222.34s | 100% | 36%
Column-oriented storage | No primary key | bulk_load | 88.72s | 57.16s | 97% | 47%
Column-oriented storage | With primary key (ignore) | insert | 190.96s | 172.60s | 90% | 14%
Column-oriented storage | With primary key (ignore) | stream | 149.60s | 142.16s | 100% | 14%
Column-oriented storage | With primary key (ignore) | bulk_load_on_conflict | 115.96s | 42.92s | 60% | 75%
Column-oriented storage | With primary key (replace) | insert | 600.40s | 574.31s | 91% | 5%
Column-oriented storage | With primary key (replace) | stream | 550.29s | 540.32s | 100% | 5%
Column-oriented storage | With primary key (replace) | bulk_load_on_conflict | 188.05s | 109.77s | 93% | 78%
Row-oriented storage | No primary key | insert | 132.38s | 123.79s | 94% | 22%
Row-oriented storage | No primary key | stream | 114.41s | 103.81s | 100% | 17%
Row-oriented storage | No primary key | bulk_load | 68.20s | 41.22s | 98% | 32%
Row-oriented storage | With primary key (ignore) | insert | 190.48s | 170.49s | 89% | 15%
Row-oriented storage | With primary key (ignore) | stream | 185.46s | 172.48s | 85% | 14%
Row-oriented storage | With primary key (ignore) | bulk_load_on_conflict | 117.81s | 47.69s | 58% | 75%
Row-oriented storage | With primary key (replace) | insert | 177.97s | 170.78s | 93% | 15%
Row-oriented storage | With primary key (replace) | stream | 142.44s | 130.16s | 100% | 20%
Row-oriented storage | With primary key (replace) | bulk_load_on_conflict | 137.69s | 65.18s | 92% | 78%
Row-column hybrid storage | With primary key (ignore) | insert | 172.19s | 158.74s | 86% | 16%
Row-column hybrid storage | With primary key (ignore) | stream | 150.63s | 149.76s | 100% | 12%
Row-column hybrid storage | With primary key (ignore) | bulk_load_on_conflict | 128.83s | 42.09s | 59% | 79%
Row-column hybrid storage | With primary key (replace) | insert | 690.37s | 662.00s | 92% | 5%
Row-column hybrid storage | With primary key (replace) | stream | 625.84s | 623.08s | 100% | 4%
Row-column hybrid storage | With primary key (replace) | bulk_load_on_conflict | 202.07s | 121.58s | 93% | 80%
The mappings of write modes are as follows:
insert: INSERT INTO VALUES
stream: FIXED COPY
bulk_load: COPY-based import to tables without primary keys
bulk_load_on_conflict: COPY-based import to tables with primary keys
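Using the total job execution durations from the results table, a quick back-of-the-envelope comparison shows why the COPY mode is recommended. For column-oriented tables without a primary key, for example:

```python
# Total job execution durations (seconds) for column-oriented storage
# without a primary key, taken from the results table above.
durations = {"insert": 241.61, "stream": 228.11, "bulk_load": 88.72}

baseline = durations["insert"]
for mode, seconds in durations.items():
    # bulk_load finishes about 2.7x faster than insert in this scenario
    print(f"{mode}: {seconds:.2f}s ({baseline / seconds:.2f}x vs insert)")
```

The gap is consistent across the other scenarios: the COPY-based modes win on total duration, at the cost of higher Spark-side (client) load.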