All Products
Search
Document Center

Object Storage Service:Run a TPC-DS benchmark on EMR with OSS-HDFS

Last Updated:May 07, 2026

The TPC-DS benchmark is a widely recognized industry standard for measuring the performance and efficiency of large-scale data processing systems. Alibaba Cloud E-MapReduce (EMR) is the first big data system certified to run the 100 TB TPC-DS benchmark. This guide walks you through how to run the 99 TPC-DS SQL queries on an EMR cluster using OSS-HDFS, and how to apply optimization strategies for better performance.

Use cases

  • Big data performance evaluation

    TPC-DS is a standard decision-support benchmark for evaluating the performance of big data workloads that use OSS-HDFS for data storage. It is especially useful for large-scale data analysis and query optimization workloads.

  • Data lake architecture validation

    For data lake architectures built on Alibaba Cloud Object Storage Service (OSS), running the TPC-DS benchmark on an EMR cluster provides objective, standardized results to verify the efficiency of your architecture for complex query processing, ETL jobs, and data warehouse performance.

  • Understand performance trends

    When you scale out a cluster, upgrade hardware, or adjust your storage strategy, the TPC-DS benchmark helps you understand how your system's overall performance changes with data volume and computing resources.

  • Cost-benefit analysis

    OSS offers cost advantages over traditional HDFS. By measuring the performance of OSS-HDFS with the TPC-DS benchmark, you can analyze its cost-effectiveness for your specific business scenarios and make more economical decisions.

Background

TPC-DS is a global standard for evaluating data management systems, designed and maintained by the TPC. However, its official toolset is primarily for generating data and running SQL queries on a single machine, which is not suitable for performance evaluation in large-scale distributed environments. To run the benchmark in a big data analytics scenario, you need the following tools and an EMR cluster:

  • Hive TPC-DS benchmark tool

    This tool, developed by Hortonworks, is customized for components like Hive and Spark in the Hadoop ecosystem. It effectively simulates big data query challenges and supports generating and executing complex SQL queries from the TPC-DS and TPC-H standards.

  • EMR cluster running version 5.15.1 or later

    You must use an EMR cluster that runs version 5.15.1 or later. These versions are compatible with the Hortonworks Data Platform (HDP) 3 series, which corresponds to Hive 3.1.

Step 1: Create an EMR cluster and download the TPC-DS tool

  1. Create an EMR cluster that runs version 5.15.1 or later.

    Dingtalk_20240125172541.jpg

    When you create the EMR cluster, note the following configurations. For information about other parameters, see Create a cluster.

    Category

    Parameter

    Description

    Software settings

    Business scenario

    Select New Data Lake.

    Metadata

    Select DLF Unified Metadata.

    Root storage directory of cluster

    Select a bucket for which OSS-HDFS is enabled.

    Hardware settings

    Node group

    For the master node, enable Assign Public Network IP.

    For optimal performance, select a big data or local SSD instance type for the core nodes. If you want to quickly complete the process with a small dataset, you can also select a general-purpose instance type with 4 vCPUs and 16 GiB of memory for the core nodes.

    Important

    Determine the cluster size based on the dataset you choose to run. Ensure that the total data disk capacity of the core nodes is more than three times the size of your dataset. For more information about datasets, see Step 3: Generate and load data.

  2. Connect to the master node of the EMR cluster over SSH. For more information, see Connect to a cluster.

  3. Install Git and Maven.

    1. Run the following command to install Git.

      sudo yum install -y git
    2. Download the latest binary tar.gz archive, such as apache-maven-3.9.6-bin.tar.gz, from the Apache Maven Project page.

    3. Upload the downloaded file to the master node of the EMR cluster and decompress it.

      tar zxf apache-maven-3.9.6-bin.tar.gz
    4. Configure the environment variables.

      1. Go to the apache-maven-3.9.6 directory.

        cd apache-maven-3.9.6
      2. Configure the environment variables.

        export MAVEN_HOME=`pwd`
        export PATH=`pwd`/bin:$PATH
  4. Download the TPC-DS benchmark tool.

    1. Download the tool.

      • Download from GitHub

        Access to GitHub from the Chinese mainland can be slow. If the download fails, you can download the tool locally.

        git clone https://github.com/hortonworks/hive-testbench.git
      • Download the hive-testbench-hdp3.zip file locally.

    2. Upload the ZIP file to the master node of the EMR cluster.

    3. On the master node of the EMR cluster, run the following command to decompress the ZIP file:

      unzip hive-testbench-hdp3.zip

Step 2: Compile and package the data generator

  1. (Optional) Configure an Alibaba Cloud mirror.

    If you are in mainland China, you can use an Alibaba Cloud mirror to speed up Maven compilation. This reduces the compilation and packaging time for the data generator to 2 to 3 minutes.

    1. Run the following command to create a directory.

      mkdir -p ~/.m2/
    2. Run the following command to copy the Maven configuration file to the new directory.

      cp $MAVEN_HOME/conf/settings.xml ~/.m2/
    3. In the ~/.m2/settings.xml file, add the mirror information as shown below.

      <mirror>
          <id>aliyun</id>
          <mirrorOf>central</mirrorOf>
          <name>Nexus aliyun</name>
          <url>http://maven.aliyun.com/nexus/content/groups/public</url>
      </mirror>
  2. Go to the hive-testbench-hdp3 directory.

    cd hive-testbench-hdp3
  3. Download the tpcds-extern.patch file, upload it to the current directory, and then run the following command to copy it to the tpcds-gen/patches/all/ directory:

    cp tpcds-extern.patch ./tpcds-gen/patches/all/
  4. Use the TPC-DS toolset to compile and package the data generator.

    ./tpcds-build.sh

Step 3: Generate and load data

  1. Set the scale factor (SF).

    The scale factor (SF) defines the dataset size, where SF=1 corresponds to 1 GB. For example, SF=1 means 1 GB, SF=100 means 100 GB, and SF=1000 means 1 TB. This tutorial uses a small dataset with a recommended SF of 3. Run the following command:

    SF=3
    Important

    Ensure that the total data disk capacity is at least three times the dataset size. Otherwise, errors may occur in subsequent steps.

  2. Check and clean up the Hive database.

    1. Check whether the Hive database exists.

      hive -e "desc database tpcds_bin_partitioned_orc_$SF"
    2. (Optional) Clean up an existing Hive database.

      Important

      If the tpcds_bin_partitioned_orc_$SF Hive database already exists, you must run the following command to drop it. Otherwise, subsequent steps will fail. If the database does not exist, skip this step.

      hive -e "drop database tpcds_bin_partitioned_orc_$SF cascade"
  3. Configure the Hive service URL.

    The default Hive service URL in the tpcds-setup.sh script is inconsistent with the EMR cluster environment. Therefore, you must replace the Hive service URL in the script with the one from your EMR cluster. The specific command is as follows.

    sed -i 's/localhost:2181\/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2?tez.queue.name=default/master-1-1:10000\//' tpcds-setup.sh

    The default Hive service URL configured in the script is: jdbc:hive2://localhost:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2?tez.queue.name=default. The Hive service URL after it is replaced by using the preceding command is: jdbc:hive2://master-1-1:10000/.

  4. Fix an open source tool configuration issue.

    Some parameters are not supported in open source versions such as Hive 2 and Hive 3. Using the TPC-DS tool without this change may cause job failures. Run the following command to replace the parameters.

    sed -i 's/hive.optimize.sort.dynamic.partition.threshold=0/hive.optimize.sort.dynamic.partition=true/' settings/*.sql
  5. Generate and load the data.

    When SF is 3, this step takes about 40 to 50 minutes. If the process runs correctly, the TPC-DS data tables are loaded into the tpcds_bin_partitioned_orc_$SF database. The generated data is automatically saved to the root storage directory that you specified when you created the EMR cluster, which is the root directory of the bucket where OSS-HDFS is enabled. Run the following command to load the data:

    ./tpcds-setup.sh $SF
  6. Get Hive table statistics.

    Run the Hive SQL ANALYZE command to gather Hive table statistics, which can speed up subsequent queries. With SF=3, this step takes about 20 to 30 minutes.

    hive -f ./ddl-tpcds/bin_partitioned/analyze.sql \
        --hiveconf hive.execution.engine=tez \
        --database tpcds_bin_partitioned_orc_$SF

Step 4: Run the TPC-DS SQL queries

This step describes how to run TPC-DS SQL queries using Hive and Spark.

Run TPC-DS queries with Hive

  1. Run a single SQL query using the following commands.

    All 99 TPC-DS SQL files are stored in the sample-queries-tpcds working directory, including files such as query10.sql and query11.sql. At a scale factor (SF) of 3, all SQL queries can return results within 5 minutes.

    Important

    Because TPC-DS queries and data are randomly generated, it is normal for some SQL queries to return zero results.

    cd sample-queries-tpcds
    hive --database tpcds_bin_partitioned_orc_$SF
    set hive.execution.engine=tez;
    source query10.sql;
  2. To run all 99 SQL queries sequentially with the provided script, run the following commands:

    cd ~/hive-testbench-hdp3
    # Generate a Hive configuration file and specify Tez as the execution engine.
    echo 'set hive.execution.engine=tez;' > sample-queries-tpcds/testbench.settings
    ./runSuite.pl tpcds $SF

批量执行SQL

Run TPC-DS queries with Spark

This section demonstrates how to run TPC-DS SQL queries on the dataset from Step 3 using Spark Beeline connected to a Spark Thrift server.

Note

E-MapReduce Spark supports data tables stored on various media, such as HDFS and OSS, and supports metadata from Data Lake Formation (DLF).

  1. Run the Spark Beeline ANALYZE command to collect Hive table statistics. This accelerates subsequent SQL queries.

    cd ~/hive-testbench-hdp3
    spark-beeline -u jdbc:hive2://master-1-1:10001/tpcds_bin_partitioned_orc_$SF \
      -f ./ddl-tpcds/bin_partitioned/analyze.sql
  2. Switch to the directory where the sample Spark SQL queries are located.

    cd spark-queries-tpcds/
  3. Run a single SQL query using the following command.

    spark-beeline -u jdbc:hive2://master-1-1:10001/tpcds_bin_partitioned_orc_$SF -f q1.sql
  4. Run all 99 SQL queries sequentially using a script.

    The TPC-DS toolset does not include a script for running Spark SQL queries in a batch. You can use the following sample script.

    for q in `ls *.sql`; do
      spark-beeline -u jdbc:hive2://master-1-1:10001/tpcds_bin_partitioned_orc_$SF -f $q > $q.out
    done
    Important

    In the q30.sql file, the column name c_last_review_date_sk is misspelled as c_last_review_date. Therefore, this SQL script is expected to fail.

FAQ

Spark SQL OOM errors

The default memory of the Spark Thrift server is not suitable for testing large datasets. If a Spark SQL job fails to submit, it might be due to an out-of-memory (OOM) error in the Spark Thrift server. To resolve this, increase the value of the spark_thrift_daemon_memory parameter for the Spark service and then restart the Spark Thrift server.

  1. Go to the Spark service page.

    1. Log on to the E-MapReduce console.

    2. In the top navigation bar, select your region and resource group.

    3. In the row for the target cluster, click Services in the Actions column.

    4. On the Services tab, click Configure in the Spark3 service section.

  2. Adjust the value of the spark_thrift_daemon_memory parameter.

    1. Search for the spark_thrift_daemon_memory parameter.

    2. Adjust the value based on the size of your dataset.

      You can increase the default value.

    3. Click Save.

    4. In the dialog box that appears, enter an execution reason and click Save.

  3. Restart Spark.

    1. On the Spark service page, choose More > Restart in the upper-right corner.

    2. In the dialog box that appears, enter an Execution Reason and click OK.

    3. In the Confirm dialog box, click OK.