E-MapReduce:TPC-H performance test

Last Updated: Jul 22, 2024

TPC Benchmark-H (TPC-H) is a decision support benchmark for databases. TPC-H is published by the Transaction Processing Performance Council (TPC) and is widely used in academia and industry. It is also an important reference for selecting databases. This topic describes how to perform a TPC-H performance test in an E-MapReduce (EMR) StarRocks cluster.

TPC-H overview

The following description is quoted from the official documentation of TPC-H:

TPC-H is a decision support benchmark. It consists of a suite of business-oriented ad hoc queries and concurrent data modifications. The queries and the data that populates the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.

For more information, see TPC BENCHMARK™ H Standard Specification.

Note

The TPC-H performance test described in this topic is implemented based on the TPC-H benchmark test but cannot meet all requirements of the TPC-H benchmark test. Therefore, the test results cannot be compared with the published results of the TPC-H benchmark test.

Prerequisites

  • An EMR StarRocks cluster is created. For more information, see Create a StarRocks cluster.

  • Optional. An Object Storage Service (OSS) bucket is created. For more information, see Create a bucket.

    Note

    If you want to store a generated test dataset in OSS, you must create a bucket in OSS.

Procedure

  1. Obtain a toolkit.

    1. Log on to the StarRocks cluster in SSH mode. For more information, see Log on to a cluster.

    2. Run the following command to switch the user to emr-user:

      su - emr-user
    3. Run the following command to grant the emr-user user ownership of its home directory:

      sudo chown emr-user:emr-user /home/emr-user
    4. Run the following command to download the starrocks-benchmark.tgz package:

      wget https://emr-public.oss-cn-beijing.aliyuncs.com/packages/starrocks-benchmark.tgz
    5. Run the following command to decompress the starrocks-benchmark.tgz package:

      tar xzf starrocks-benchmark.tgz
  2. Configure parameters.

    1. Run the following command to go to the starrocks-benchmark directory:

      cd starrocks-benchmark
    2. Optional. Run the vim group_vars/all command to modify the configuration parameters.

      The following parameters can be configured.

      OSS configurations. These parameters are optional. If you configure them, the generated test dataset is stored in OSS.

      • bucket: The name of the OSS bucket.

      • endpoint: The endpoint that is used to access OSS.

      • access_key_id: The AccessKey ID of your Alibaba Cloud account.

      • access_key_secret: The AccessKey secret of your Alibaba Cloud account.

      Benchmark configurations.

      • scale_factor: The scale factor of the dataset. Default value: 1.

      • work_dir_root: The root directory of the working directory. Default value: /mnt/disk1/starrocks-benchmark/workdirs.

      • dataset_generate_root_path: The path in which the generated test dataset is stored. Default value: /mnt/disk1/starrocks-benchmark/datasets.

        Note

        If an OSS bucket is specified, the bucket is mounted to this path.
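
      The following snippet is a minimal example of what group_vars/all might look like after editing. All values are placeholders, and the exact file syntax may differ; keep the syntax of the template that is shipped in the downloaded package.

      # Example group_vars/all (placeholder values)
      bucket: my-tpch-bucket                                    # Optional. OSS bucket for the generated dataset.
      endpoint: oss-cn-beijing-internal.aliyuncs.com            # Optional. OSS endpoint.
      access_key_id: <your AccessKey ID>                        # Optional.
      access_key_secret: <your AccessKey secret>                # Optional.
      scale_factor: 100                                         # Size of the generated dataset.
      work_dir_root: /mnt/disk1/starrocks-benchmark/workdirs
      dataset_generate_root_path: /mnt/disk1/starrocks-benchmark/datasets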

  3. Perform the test.

    Run the following command to perform the TPC-H performance test:

    bin/run_tpch.sh

    The entire test process is performed automatically, including generation of the test dataset, creation of the tables, loading of the data, and execution of the SQL queries.
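
    For large scale factors, the end-to-end run can take a long time. If you want to keep the test running after you close the SSH session, you can use a standard shell pattern such as the following. This is not part of the toolkit, and the name of the output file is arbitrary:

    # Run the test in the background and follow its output
    nohup bin/run_tpch.sh > run_tpch.out 2>&1 &
    tail -f run_tpch.out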

    Other operations:

    • Run the following command to reload the dataset:

      bin/run_tpch.sh reload
      Note

      This command is used in scenarios in which the dataset fails to be loaded and needs to be reloaded.

    • Run the following command to perform the TPC-H query test:

      bin/run_tpch.sh query
  4. View the test result.

    • Check the overview of the test results.

      After the bin/run_tpch.sh command is run, the test result is displayed. Sample test result:

      TASK [tpc_h : debug] ****************************************************************************************************************************************
      ok: [10.1.**.**] => {
          "command_output.stdout_lines": [
              "[info] 2022-03-01 09:51:23.295 | Run sql queries started.",
              "[info] 2022-03-01 09:51:23.330 | Run q10.sql started.",
              "[info] 2022-03-01 09:51:23.913 | Run q10.sql finished. Time taken: 0:00:00, .557 seconds",
              "[info] 2022-03-01 09:51:23.923 | Run q11.sql started.",
              "[info] 2022-03-01 09:51:24.026 | Run q11.sql finished. Time taken: 0:00:00, .100 seconds",
              "[info] 2022-03-01 09:51:24.038 | Run q12.sql started.",
              "[info] 2022-03-01 09:51:24.192 | Run q12.sql finished. Time taken: 0:00:00, .151 seconds",
              "[info] 2022-03-01 09:51:24.204 | Run q13.sql started.",
              "[info] 2022-03-01 09:51:24.553 | Run q13.sql finished. Time taken: 0:00:00, .347 seconds",
              "[info] 2022-03-01 09:51:24.563 | Run q14.sql started.",
              "[info] 2022-03-01 09:51:24.665 | Run q14.sql finished. Time taken: 0:00:00, .098 seconds",
              "[info] 2022-03-01 09:51:24.675 | Run q15.sql started.",
              "[info] 2022-03-01 09:51:24.852 | Run q15.sql finished. Time taken: 0:00:00, .175 seconds",
              "[info] 2022-03-01 09:51:24.864 | Run q16.sql started.",
              "[info] 2022-03-01 09:51:25.008 | Run q16.sql finished. Time taken: 0:00:00, .142 seconds",
              "[info] 2022-03-01 09:51:25.018 | Run q17.sql started.",
              "[info] 2022-03-01 09:51:25.269 | Run q17.sql finished. Time taken: 0:00:00, .248 seconds",
              "[info] 2022-03-01 09:51:25.280 | Run q18.sql started.",
              "[info] 2022-03-01 09:51:25.800 | Run q18.sql finished. Time taken: 0:00:00, .518 seconds",
              "[info] 2022-03-01 09:51:25.810 | Run q19.sql started.",
              "[info] 2022-03-01 09:51:25.943 | Run q19.sql finished. Time taken: 0:00:00, .130 seconds",
              "[info] 2022-03-01 09:51:25.953 | Run q1.sql started.",
              "[info] 2022-03-01 09:51:26.295 | Run q1.sql finished. Time taken: 0:00:00, .339 seconds",
              "[info] 2022-03-01 09:51:26.305 | Run q20.sql started.",
              "[info] 2022-03-01 09:51:26.708 | Run q20.sql finished. Time taken: 0:00:00, .400 seconds",
              "[info] 2022-03-01 09:51:26.720 | Run q21.sql started.",
              "[info] 2022-03-01 09:51:27.323 | Run q21.sql finished. Time taken: 0:00:00, .600 seconds",
              "[info] 2022-03-01 09:51:27.334 | Run q22.sql started.",
              "[info] 2022-03-01 09:51:27.403 | Run q22.sql finished. Time taken: 0:00:00, .065 seconds",
              "[info] 2022-03-01 09:51:27.415 | Run q2.sql started.",
              "[info] 2022-03-01 09:51:27.632 | Run q2.sql finished. Time taken: 0:00:00, .213 seconds",
              "[info] 2022-03-01 09:51:27.648 | Run q3.sql started.",
              "[info] 2022-03-01 09:51:27.917 | Run q3.sql finished. Time taken: 0:00:00, .262 seconds",
              "[info] 2022-03-01 09:51:27.936 | Run q4.sql started.",
              "[info] 2022-03-01 09:51:28.150 | Run q4.sql finished. Time taken: 0:00:00, .210 seconds",
              "[info] 2022-03-01 09:51:28.172 | Run q5.sql started.",
              "[info] 2022-03-01 09:51:28.954 | Run q5.sql finished. Time taken: 0:00:00, .778 seconds",
              "[info] 2022-03-01 09:51:28.976 | Run q6.sql started.",
              "[info] 2022-03-01 09:51:29.080 | Run q6.sql finished. Time taken: 0:00:00, .103 seconds",
              "[info] 2022-03-01 09:51:29.096 | Run q7.sql started.",
              "[info] 2022-03-01 09:51:29.445 | Run q7.sql finished. Time taken: 0:00:00, .346 seconds",
              "[info] 2022-03-01 09:51:29.460 | Run q8.sql started.",
              "[info] 2022-03-01 09:51:32.692 | Run q8.sql finished. Time taken: 0:00:03, 3.229 seconds",
              "[info] 2022-03-01 09:51:32.703 | Run q9.sql started.",
              "[info] 2022-03-01 09:51:33.318 | Run q9.sql finished. Time taken: 0:00:00, .611 seconds",
              "[info] 2022-03-01 09:51:33.324 | Run sql queries finished. Time taken: 0:00:10, 10.026 seconds"
          ]
      }
      TASK [tpc_h : debug] ********************************************************************************************************************************************************************************
      ok: [10.1.0.91] => {
          "work_dir": "/mnt/disk1/starrocks-benchmark/workdirs/tpc_h/sf1"
      }
    • View the details of the test result.

      After the bin/run_tpch.sh command is successfully run, the system creates the working directory for the entire TPC-H performance test and returns the path of the <work_dir> directory. You can go to this directory to view information such as the query statements, table creation statements, and run logs.

      Note

      In this example, the path of the <work_dir> directory is /mnt/disk1/starrocks-benchmark/workdirs/tpc_h/sf1.

      The following code shows the directory structure:

      <work_dir>/
      ├── config # The configuration of the run.sh and run_mysql.sh scripts. 
      ├── logs # The most recent run log of the TPC-H performance test. 
      │ ├── *.sql.err
      │ ├── *.sql.out
      │ └── run.log
      ├── queries # The 22 SQL queries involved in the TPC-H performance test. 
      │ ├── ddl
      │ │ └── create_tables.sql # The TPC-H table creation statements. 
      │ └── *.sql
      ├── run_mysql.sh
      ├── run.sh # The script that runs the full set of TPC-H queries. 
      └── tpch_tools # The dbgen toolkit.
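
      Because StarRocks is compatible with the MySQL protocol, you can also rerun a single query from the queries directory with a MySQL client, for example to investigate a slow query. The following command is only a sketch: the frontend host, user, and database name are placeholders that depend on your cluster, and 9030 is the default query port of the StarRocks frontend. The run.sh and run_mysql.sh scripts already automate this for the full suite.

      # Rerun a single TPC-H query against the cluster (placeholder values)
      mysql -h <fe_host> -P 9030 -u <user> -D <database> < <work_dir>/queries/q1.sql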

      You can also run the cd <work_dir>/logs command to go to the logs directory in the working directory, and then run the following command to view the test results and the detailed execution information of the SQL statements:

      cat run.log
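
      To get a quick per-query summary, you can filter the timing lines from the log and check whether any query produced error output. These are generic shell commands, not part of the toolkit, and they assume that run.log contains the same "Time taken" lines as the sample output shown above:

      # Per-query execution times recorded in the run log
      grep "Time taken" run.log

      # Error output of individual queries, if any
      ls -l *.sql.err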