Run TPC-DS benchmark on an EMR cluster using OSS-HDFS - Object Storage Service

Alibaba Cloud E-MapReduce (EMR) is the first big data system certified to run the TPC-DS 100 TB benchmark. This topic walks you through the complete process: setting up an EMR cluster, generating TPC-DS data on OSS-HDFS, and running all 99 SQL statements using either Apache Hive or Apache Spark.

Use cases

Big data performance evaluation: Use TPC-DS as a decision support benchmark when evaluating OSS-HDFS as Hadoop-compatible storage, especially for large-scale data analysis and query workloads.
Data lake architecture verification: Run TPC-DS in EMR clusters to get objective, standardized results for your OSS-based data lake—covering complex query processing, extract, transform, and load (ETL) jobs, and data warehouse performance.
Capacity planning: When scaling out a cluster, upgrading hardware, or adjusting storage policies, TPC-DS gives you a consistent yardstick to measure how performance changes with data volume and compute resources.
Cost-effectiveness analysis: Measure OSS-HDFS performance against TPC-DS to determine whether OSS-HDFS fits your workload and budget.

Background

TPC-DS is a standard benchmark designed and maintained by the Transaction Processing Performance Council (TPC). The official TPC-DS tools target standalone environments and are not suited for large-scale distributed clusters. To run TPC-DS on EMR, you need the following:

Hive TPC-DS benchmark testing tool: Developed by Hortonworks, this tool is customized for Hive and Spark in the Hadoop ecosystem. It supports generating and running TPC-DS and TPC-H SQL in a cluster environment.
EMR cluster version 5.15.1 or later: This version supports the Hortonworks Data Platform (HDP) 3 series, which corresponds to Hive 3.1.

Prerequisites

Before you begin, make sure you have:

An Alibaba Cloud account with permissions to create EMR clusters
An OSS bucket with OSS-HDFS enabled
SSH access to the emr-master node

Step 1: Create an EMR cluster and download the testing tool

Create an EMR cluster

Create an EMR cluster of version 5.15.1 or later. Configure the following key parameters. For other parameters, see Create a cluster.

Important

The total data disk capacity of your core nodes must be more than three times the dataset size. For best query performance, use a big data or local SSD instance type for the emr-core node. A general-purpose instance with 4 vCPUs and 16 GiB of memory works for small-scale testing.

Category	Parameter	Value
Software configurations	Business Scenario	New Data Lake
	Metadata	DLF Unified Metadata
	Root Storage Directory of Cluster	Select a bucket with OSS-HDFS enabled
Hardware configurations	Node Group	Assign Public Network IP

EMR cluster creation page

Install Git and Maven

Log on to the emr-master node via SSH. For details, see Log on to a cluster.
Install Git:
```
   sudo yum install -y git
```
Install Apache Maven:
1. Download the latest binary archive (for example, apache-maven-3.9.6-bin.tar.gz) from the Apache Maven Project page and upload it to the emr-master node.
2. Extract the archive:
```
tar zxf apache-maven-3.9.6-bin.tar.gz
```
3. Configure environment variables:
```
cd apache-maven-3.9.6
export MAVEN_HOME=`pwd`
export PATH=`pwd`/bin:$PATH
```

Download the Hive TPC-DS benchmark testing tool

Download the tool using one of the following methods:
- Clone from GitHub (may be slow from regions in the Chinese mainland):
```
git clone https://github.com/hortonworks/hive-testbench.git
```
- Download the ZIP package: Download hive-testbench-hdp3.zip to your local machine, then upload it to the emr-master node.
Extract the ZIP package:
```
   unzip hive-testbench-hdp3.zip
```

Step 2: Compile and package the data generator

(Optional) Configure the Alibaba Cloud Maven mirror to speed up compilation in regions in the Chinese mainland. With this mirror, compilation finishes in 2–3 minutes.

   mkdir -p ~/.m2/
   cp $MAVEN_HOME/conf/settings.xml ~/.m2/

Add the following mirror entry to ~/.m2/settings.xml:

   <mirror>
       <id>aliyun</id>
       <mirrorOf>central</mirrorOf>
       <name>Nexus aliyun</name>
       <url>http://maven.aliyun.com/nexus/content/groups/public</url>
   </mirror>

Switch to the tool directory:
```
   cd hive-testbench-hdp3
```
Download the tpcds-extern.patch file, upload it to the current directory, and copy it to the patches directory:
```
   cp tpcds-extern.patch ./tpcds-gen/patches/all/
```
Compile and package the data generator:
```
   ./tpcds-build.sh
```

Step 3: Generate and load data

TPC-DS uses a scale factor (SF) to control dataset size, measured in GB. SF=1 produces a 1 GB dataset, SF=100 produces 100 GB, and SF=1000 produces 1 TB.

This example uses SF=3 (3 GB) as a quick demonstration—data generation and loading takes approximately 40–50 minutes at this scale.

Important

Core node total data disk capacity must exceed three times the dataset size. For SF=3, ensure you have at least 9 GB of total core node disk space.

Set the scale factor:
```
   SF=3
```
Check whether the target Hive database already exists:
```
   hive -e "desc database tpcds_bin_partitioned_orc_$SF"
```
If the database exists, drop it before proceeding. Leaving it in place causes errors in subsequent steps:
```
   hive -e "drop database tpcds_bin_partitioned_orc_$SF cascade"
```
Update the Hive service URL in the setup script. The default URL in tpcds-setup.sh points to a ZooKeeper-based discovery endpoint that does not match the EMR cluster configuration. Replace it with the direct endpoint:
```
   sed -i 's/localhost:2181\/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2?tez.queue.name=default/master-1-1:10000\//' tpcds-setup.sh
```
This changes the URL from jdbc:hive2://localhost:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2?tez.queue.name=default to jdbc:hive2://master-1-1:10000/.
Fix a Hive parameter compatibility issue. The parameter hive.optimize.sort.dynamic.partition.threshold=0 is not supported in Hive 2 or Hive 3 and causes job failures. Replace it with the supported equivalent:
```
   sed -i 's/hive.optimize.sort.dynamic.partition.threshold=0/hive.optimize.sort.dynamic.partition=true/' settings/*.sql
```
Generate and load data. On completion, the TPC-DS data tables are created in the tpcds_bin_partitioned_orc_$SF database and the data files are saved to the root storage path of your OSS-HDFS bucket.
```
   ./tpcds-setup.sh $SF
```
Collect Hive table statistics. Without statistics, Hive's query optimizer cannot choose optimal execution plans, which significantly degrades query performance. For SF=3, this step takes approximately 20–30 minutes.
```
   hive -f ./hive-testbench-hdp3/ddl-tpcds/bin_partitioned/analyze.sql \
       --hiveconf hive.execution.engine=tez \
       --database tpcds_bin_partitioned_orc_$SF
```

Step 4: Run TPC-DS SQL statements

TPC-DS has 99 SQL queries. Run them using Hive or Spark.

Use Hive to run TPC-DS queries

All 99 Hive SQL files are in the sample-queries-tpcds directory. With SF=3, each query returns results within 5 minutes.

Note

TPC-DS queries and data are randomly generated. Some SQL statements may return no records—this is expected.

Run a single query:

cd sample-queries-tpcds
hive --database tpcds_bin_partitioned_orc_$SF
set hive.execution.engine=tez;
source query10.sql;

Run all 99 queries in sequence:

cd ~/hive-testbench-hdp3
echo 'set hive.execution.engine=tez;' > sample-queries-tpcds/testbench.settings
./runSuite.pl tpcds $SF

Use Spark to run TPC-DS queries

The tool provides sample Spark SQL files in the spark-queries-tpcds directory. This example uses Spark Beeline connected to Spark Thrift Server.

Note

EMR Spark supports storing tables across multiple storage systems (HDFS and OSS) and storing metadata in DLF.

Collect Hive table statistics via Spark Beeline. Without statistics, the query optimizer cannot select optimal execution plans, which degrades query performance.
```
   cd ~/hive-testbench-hdp3
   spark-beeline -u jdbc:hive2://master-1-1:10001/tpcds_bin_partitioned_orc_$SF \
     -f ./ddl-tpcds/bin_partitioned/analyze.sql
```
Switch to the Spark SQL directory:
```
   cd spark-queries-tpcds/
```

Run a single query:

   spark-beeline -u jdbc:hive2://master-1-1:10001/tpcds_bin_partitioned_orc_$SF -f q1.sql

Run all 99 queries in sequence:
Important
In q30.sql, the column name c_last_review_date_sk is incorrectly written as c_last_review_date. This causes query 30 to fail—this is a known issue in the tool.
```
   for q in `ls *.sql`; do
     spark-beeline -u jdbc:hive2://master-1-1:10001/tpcds_bin_partitioned_orc_$SF -f $q > $q.out
   done
```

FAQ

Spark SQL job submissions fail when running all 99 queries

The default memory allocation for Spark Thrift Server is too low for large datasets. If job submissions fail, increase the spark_thrift_daemon_memory parameter and restart Spark Thrift Server:

In the EMR console, click EMR on ECS in the left-side navigation pane.
In the top navigation bar, select the region where your cluster resides and select a resource group.
Find your cluster and click Services in the Actions column.
On the Services tab, find Spark3 and click Configure.
Search for spark_thrift_daemon_memory and increase its value based on your dataset size. Click Save, enter an execution reason in the dialog box, and click Save again.
Restart Spark: choose More > Restart in the upper-right corner of the Spark service page, enter an execution reason, and click OK. In the Confirm dialog box, click OK.