Run a TPC-H performance test for OLAP queries on EMR Serverless StarRocks

Reference results

The following results were collected on a backend (BE) with 8 compute units (CUs), that is, 8 CPU cores and 32 GB of memory, using a dataset at scale factor (SF) 1 (1 GB of raw data).

All 22 queries completed in approximately 10 seconds total.

Query	Time (seconds)
q1	0.339
q2	0.213
q3	0.262
q4	0.210
q5	0.778
q6	0.103
q7	0.346
q8	3.229
q9	0.611
q10	0.557
q11	0.100
q12	0.151
q13	0.347
q14	0.098
q15	0.175
q16	0.142
q17	0.248
q18	0.518
q19	0.130
q20	0.400
q21	0.600
q22	0.065
Total	~10.026

Test environment:

Component	Specification
BE compute units	8 CUs (8 CPU cores, 32 GB memory)
ECS instance type	`ecs.g6e.4xlarge`
ECS operating system	CentOS 7.9
ECS data disk	Enterprise SSD (ESSD)
Scale factor	SF 1 (1 GB raw data)

TPC-H overview

TPC-H is a decision support benchmark. It consists of 22 business-oriented ad hoc queries against a simulated sales data warehouse of 8 tables. The benchmark measures query response time, from query submission until the results are returned.

Use the scale factor (SF) to control data volume: SF 1 corresponds to 1 GB of raw data. Depending on the SF you choose, the total dataset size ranges from 1 GB to 3 TB.

When sizing your disk, also account for index storage on top of the raw data volume.

For the full specification, see TPC Benchmark H Standard Specification.
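
Because most TPC-H table cardinalities scale linearly with the SF, you can estimate row counts before generating any data and later compare them with `SELECT COUNT(*)` results. The following is a minimal sketch: the function name `tpch_row_counts` is hypothetical, the base cardinalities are those given in the TPC-H specification, and the `lineitem` figure is approximate.

```shell
# tpch_row_counts SCALE_FACTOR
# Prints the estimated number of rows per TPC-H table for an integer SF.
# Hypothetical helper; base cardinalities come from the TPC-H specification.
tpch_row_counts() {
  tpch_sf=$1
  printf '%-10s %s\n' region   5                          # fixed size
  printf '%-10s %s\n' nation   25                         # fixed size
  printf '%-10s %s\n' supplier $(( 10000   * tpch_sf ))
  printf '%-10s %s\n' customer $(( 150000  * tpch_sf ))
  printf '%-10s %s\n' part     $(( 200000  * tpch_sf ))
  printf '%-10s %s\n' partsupp $(( 800000  * tpch_sf ))
  printf '%-10s %s\n' orders   $(( 1500000 * tpch_sf ))
  printf '%-10s %s\n' lineitem $(( 6000000 * tpch_sf ))   # approximate
}

# Estimated row counts for SF 1
tpch_row_counts 1
```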

Prerequisites

Before you begin, make sure you have:

- An Elastic Compute Service (ECS) instance with the following specifications:
  - Instance type: ecs.g6e.4xlarge
  - Operating system: CentOS 7.9
  - Data disk: Enterprise SSD (ESSD), sized for your SF value and index storage
- An EMR Serverless StarRocks instance. For reproducible results, create a new instance for each test rather than resizing an existing one.
- The ECS instance and the EMR Serverless StarRocks instance deployed in the same virtual private cloud (VPC) and region.

For setup instructions, see Create an ECS instance and Create an EMR Serverless StarRocks instance.

Run the TPC-H benchmark

Step 1: Download and configure the test package

Log in to the ECS instance. For instructions, see Connect to an ECS instance.

Download and decompress the benchmark package:

```
wget https://emr-olap.oss-cn-beijing.aliyuncs.com/packages/starrocks-benchmark-for-serverless.tar.gz
tar xzvf starrocks-benchmark-for-serverless.tar.gz
```

Go to the package directory:
```
cd starrocks-benchmark-for-serverless
```

Edit the configuration file:

```
vim group_vars/all
```

The configuration file contains the following parameters:

```
# mysql client config
login_host: fe-c-8764bab92bc6****-internal.starrocks.aliyuncs.com
login_port: 9030
login_user: admin
login_password: xxxx

# oss config
bucket: ""
endpoint: ""
access_key_id: ""
access_key_secret: ""

# benchmark config
scale_factor: 1
work_dir_root: /mnt/disk1/starrocks-benchmark/workdirs
dataset_generate_root_path: /mnt/disk1/starrocks-benchmark/datasets
```

Connection parameters (required):

Parameter	Description
`login_host`	Internal endpoint of the frontend (FE) on your EMR Serverless StarRocks instance. Find it on the Instance Details tab under FE Details > internal endpoint. Use the internal endpoint, not the public endpoint.
`login_port`	Query port of the FE. Default: `9030`. Find it on the Instance Details tab under FE Details > query port.
`login_user`	Initial username for logging in to the instance.
`login_password`	Password for logging in to the instance.
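
Before starting a long run, it can help to confirm that the connection parameters work from the ECS instance. The sketch below assumes that the `mysql` command-line client is installed on the ECS instance and that `group_vars/all` uses plain `key: value` lines as shown above. Both helper names, `get_var` and `check_fe_connection`, are hypothetical and not part of the benchmark package.

```shell
# get_var KEY FILE
# Prints the value of the first top-level "KEY: value" line in FILE,
# stripping one pair of surrounding double quotes if present.
get_var() {
  sed -n "s/^$1:[[:space:]]*//p" "$2" | head -n 1 | sed 's/^"\(.*\)"$/\1/'
}

# check_fe_connection FILE
# Runs a trivial query against the FE described in FILE.
# -p makes the client prompt for the password instead of reading it from FILE.
check_fe_connection() {
  mysql -h "$(get_var login_host "$1")" \
        -P "$(get_var login_port "$1")" \
        -u "$(get_var login_user "$1")" \
        -p -e 'SELECT 1;'
}
```

From the package directory, `check_fe_connection group_vars/all` should return a single row if the endpoint, port, and credentials are correct. A timeout usually points to a network problem, for example when the two instances are not in the same VPC.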

Object Storage Service (OSS) parameters (optional):

If specified, the generated dataset is stored in OSS.

Parameter	Description
`bucket`	Name of your OSS bucket.
`endpoint`	Endpoint for accessing OSS.
`access_key_id`	AccessKey ID of your Alibaba Cloud account.
`access_key_secret`	AccessKey secret of your Alibaba Cloud account.

Benchmark parameters:

Parameter	Default	Description
`scale_factor`	`1`	Data volume to generate. Unit: GB. 1 SF = 1 GB of raw data.
`work_dir_root`	`/mnt/disk1/starrocks-benchmark/workdirs`	Root directory for storing SQL statements and other test artifacts.
`dataset_generate_root_path`	`/mnt/disk1/starrocks-benchmark/datasets`	Path where the generated dataset is stored. If an OSS bucket is specified, it is mounted to this path.
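
When the dataset is generated on the local data disk rather than in OSS, a quick free-space check before the run can prevent a failed load. The following is a rough sketch: the helper names are hypothetical, and the default headroom multiplier of 3 is an arbitrary illustrative value, not a documented requirement. Choose a multiplier that suits your SF and index storage.

```shell
# available_gb PATH
# Prints the free space, in whole GB, on the file system that holds PATH.
available_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# has_enough_space AVAILABLE_GB SCALE_FACTOR [HEADROOM]
# Succeeds when AVAILABLE_GB >= SCALE_FACTOR * HEADROOM.
# HEADROOM defaults to 3 (illustrative assumption only).
has_enough_space() {
  [ "$1" -ge $(( $2 * ${3:-3} )) ]
}

# Example (run on the ECS instance):
#   has_enough_space "$(available_gb /mnt/disk1)" 1 || echo "Not enough free space" >&2
```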

Step 2: Run the test

Run the end-to-end TPC-H test:

```
bin/run_tpch.sh
```

This command creates the database, tables, and 22 SQL queries, generates the dataset, loads the data, and runs all queries.

You can also run individual phases:

Reload the dataset only:
```
bin/run_tpch.sh reload
```
Run the query test only:
```
bin/run_tpch.sh query
```
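
Because the query phase can be run without reloading data, one way to compare repeated measurements is to run it several times in a row. The wrapper below is a minimal sketch; `repeat_run` is a hypothetical helper, not part of the benchmark package.

```shell
# repeat_run COUNT COMMAND [ARGS...]
# Runs COMMAND COUNT times, printing a banner before each run.
# Stops and returns 1 as soon as one run fails.
repeat_run() {
  rr_count=$1
  shift
  rr_i=1
  while [ "$rr_i" -le "$rr_count" ]; do
    echo "=== run $rr_i of $rr_count ==="
    "$@" || return 1
    rr_i=$(( rr_i + 1 ))
  done
}

# Example (run from the package directory):
#   repeat_run 3 bin/run_tpch.sh query
```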

Step 3: View results

After bin/run_tpch.sh completes, query results are printed to the terminal. Each line shows the query name and the time taken.
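
To total the per-query times, you can save the terminal output and sum the time column. This sketch assumes that each result line starts with the query name (for example `q1`) followed by whitespace and the time in seconds. Adjust the pattern if your output differs. The function name `sum_query_times` is hypothetical.

```shell
# sum_query_times [FILE...]
# Sums the second field of every line whose first field looks like "q<number>".
# Reads standard input when no FILE is given.
sum_query_times() {
  awk '$1 ~ /^q[0-9]+$/ && $2 ~ /^[0-9.]+$/ { total += $2; n++ }
       END { printf "%d queries, %.3f s total\n", n, total }' "$@"
}

# Example:
#   bin/run_tpch.sh query | tee tpch_query.out
#   sum_query_times tpch_query.out
```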

The working directory path is also printed at the end. Switch to that directory to inspect query statements, table creation SQL, and run logs:

```
<work_dir>/
├── config          # Configurations for run.sh and run_mysql.sh
├── logs            # Most recent run logs
│   ├── *.sql.err
│   ├── *.sql.out
│   └── run.log
├── queries         # The 22 TPC-H SQL queries
│   ├── ddl
│   │   └── create_tables.sql
│   └── *.sql
├── run_mysql.sh
├── run.sh          # Script that runs the full TPC-H query set
└── tpch_tools      # dbgen toolkit
```

To browse logs directly:

```
cd <work_dir>/logs
```

In the reference test, the working directory is /mnt/disk1/starrocks-benchmark/workdirs/tpc_h/sf1.
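
To spot failed queries quickly, you can list the `*.sql.err` files in the logs directory that are not empty. This is a sketch under the assumption that a non-empty `.sql.err` file means the corresponding query wrote error output. The function name `list_query_errors` is hypothetical.

```shell
# list_query_errors LOG_DIR
# Prints the path of every non-empty *.sql.err file in LOG_DIR, one per line.
list_query_errors() {
  for err_file in "$1"/*.sql.err; do
    # -s is true only for files that exist and have a size greater than zero
    [ -s "$err_file" ] && echo "$err_file"
  done
  return 0
}

# Example for the reference test:
#   list_query_errors /mnt/disk1/starrocks-benchmark/workdirs/tpc_h/sf1/logs
```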