TPC-DS is one of the most well-known benchmarks used to measure the performance of big data systems. Alibaba Cloud E-MapReduce (EMR) has refreshed the official TPC-DS best results multiple times and is the only big data system certified as capable of running the TPC-DS 100 TB benchmark. This topic describes how to run the 99 TPC-DS SQL queries and achieve the best performance.

Background information

TPC-DS is a standard benchmark formulated by the Transaction Processing Performance Council (TPC), the best-known organization that defines benchmarks for data management systems and publishes the official benchmark results. The official TPC-DS tools include only a query generator and a standalone data generator, which are not suitable for big data scenarios. The steps in this topic use the following tool and EMR version:
  • Hive TPC-DS benchmark testing tool

    This tool is the most commonly used testing tool in the industry. It is developed by Hortonworks and allows you to use Hive and Spark to run benchmarks such as TPC-DS or TPC-H.

  • EMR V4.8.0

    The Hive TPC-DS benchmark testing tool is developed based on Hortonworks HDP 3, which corresponds to Hive 3.1. Therefore, the most suitable EMR version is V4.X. In this topic, an EMR cluster of V4.8.0 is used as an example. If you use an EMR cluster of V4.8.0 or later, you can perform the operations in this topic to run the TPC-DS benchmark.

Procedure

Step 1: Create an EMR cluster and download the Hive TPC-DS benchmark testing tool

  1. Create a cluster of EMR V4.8.0. For more information, see Create a cluster.
    Take note of the following configuration requirements:
    • Cluster Type: Select Hadoop.
    • Instance: If you want to achieve the best performance, we recommend that you select a big data or local SSD instance type for the core nodes. If you want to use a small amount of data to complete all processes in a short period of time, you can also select a general-purpose instance type that has 4 vCPUs and 16 GiB of memory for the core nodes.
      Notice You can determine the cluster size based on the dataset that you want to use. Make sure that the total capacity of the data disks of the core nodes is more than three times the size of the dataset. A capacity check sketch is provided at the end of this step. For information about datasets, see Step 3: Generate and load data.
    • Type: We recommend that you select Data Lake Metadata.
    • Assign Public IP Address: Turn on this switch.
  2. Log on to the created EMR cluster in SSH mode. For more information, see Connect to the master node of an EMR cluster in SSH mode.
  3. Install Git and Maven.
    sudo yum install -y git maven
  4. Download the Hive TPC-DS benchmark testing tool.
    • Download the tool from GitHub.
      git clone https://github.com/hortonworks/hive-testbench.git
      Notice If you run the command on a node that resides in a region in mainland China, the download may be slow. If the download fails, you can use the following method.
    • Download the ZIP package, upload the package to the EMR cluster, and then decompress the package.
      Perform the following steps:
      1. Download the hive-testbench-hdp3.zip package.
      2. Change the name of the downloaded package from hive-testbench-hdp3.zip to hive-testbench.zip.
      3. Upload the ZIP package to the EMR cluster. For example, if you use a Linux operating system, you can run the following command to upload the ZIP package:
        scp hive-testbench.zip root@xx.xx.xx.xx:/root/
        Note xx.xx.xx.xx indicates the public IP address of the master node of the EMR cluster. For more information about how to obtain the public IP address, see Obtain the public IP address of the master node.
      4. Decompress the uploaded ZIP package.
        unzip hive-testbench.zip
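If you want to confirm that the cluster meets the disk-capacity requirement mentioned in the Notice above, you can check the aggregate HDFS capacity after the cluster starts. The following is a minimal sketch; it only prints the configured capacity, which you then compare with three times the planned dataset size (for example, 9 GB for SF=3 or 3 TB for SF=1000).

  # Print the configured HDFS capacity, which is aggregated over the data disks of
  # the core nodes. The first matching line is the cluster total; compare it with
  # 3 x the size of the dataset that you plan to generate in Step 3.
  hdfs dfsadmin -report | grep "Configured Capacity"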

Step 2: Compile and package a data generator

  1. Optional. Configure an Alibaba Cloud Maven mirror.
    You can use a Maven mirror provided by Alibaba Cloud to accelerate compilation in regions in mainland China. With the mirror, the data generator can be compiled and packaged in 2 to 3 minutes. A sketch for verifying the mirror configuration is provided at the end of this step.
    1. Run the following command to create a directory:
      mkdir -p ~/.m2/
    2. Run the following command to copy the configuration file of Maven to the new directory:
      cp /usr/share/maven/conf/settings.xml ~/.m2/
    3. Add the following mirror configuration to the <mirrors> element of the ~/.m2/settings.xml file:
      <mirror>
          <id>aliyun</id>
          <mirrorOf>central</mirrorOf>
          <name>Nexus aliyun</name>
          <url>http://maven.aliyun.com/nexus/content/groups/public</url>
      </mirror>
  2. Switch to the hive-testbench directory.
    cd hive-testbench
  3. Use the toolset of TPC-DS to compile and package a data generator.
    ./tpcds-build.sh
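If the build is slow or fails to resolve dependencies, you can confirm that Maven picked up the mirror that you configured in the optional substep above, and check that the data generator was produced. The following is a minimal sketch; the mvn help:effective-settings goal is standard Maven, and the tpcds-gen/target/ path is an assumption based on the default Maven layout of the tpcds-gen module.

  # Print the effective Maven settings and show the configured mirrors.
  mvn help:effective-settings | grep -A 5 "<mirrors>"

  # After ./tpcds-build.sh finishes, the compiled data generator JAR is expected
  # under tpcds-gen/target/. The exact file name may differ.
  ls ~/hive-testbench/tpcds-gen/target/*.jar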

Step 3: Generate and load data

  1. Specify a scale factor (SF).

    An SF is used to specify the size of a dataset. The size is measured in GB. For example, SF=1 indicates a 1 GB dataset, SF=100 indicates a 100 GB dataset, and SF=1000 indicates a 1 TB dataset. In this example, a small dataset is used, and SF is set to 3. Command:

    SF=3
    Notice Make sure that the total capacity of the data disks of the core nodes is more than three times the size of the dataset. Otherwise, an error is reported in subsequent operations.
  2. Check and clean up the Hive database that you want to use.
    1. Check whether the Hive database that you want to use exists.
      hive -e "desc database tpcds_bin_partitioned_orc_$SF"
    2. Optional. If the database exists, clean up the database.
      Notice If the tpcds_bin_partitioned_orc_$SF database exists, you must run the following command to clean up the database. Otherwise, an error is reported in subsequent operations. If the database does not exist, skip this step.
      hive -e "drop database tpcds_bin_partitioned_orc_$SF cascade"
  3. Configure a Hive service URL.
    The default Hive service URL configured in the tpcds-setup.sh script file is inconsistent with the Hive service URL configured in the EMR cluster. You must run the following command to replace the default Hive service URL with the Hive service URL configured in the EMR cluster:
    sed -i 's/localhost:2181\/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2?tez.queue.name=default/emr-header-1:10000\//' tpcds-setup.sh

    The default Hive service URL in the script file is jdbc:hive2://localhost:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2?tez.queue.name=default. After you run the preceding command, the Hive service URL is changed to jdbc:hive2://emr-header-1:10000/.

  4. Generate and load data.
    If SF is set to 3, data can be generated and loaded in approximately 40 to 50 minutes. If this step is successful, the generated TPC-DS tables are loaded into the tpcds_bin_partitioned_orc_$SF database. A verification sketch is provided at the end of this step. You can store the generated data in HDFS or Object Storage Service (OSS):
    • Store the generated data in HDFS
      By default, EMR clusters store TPC-DS data in HDFS. Run the following command to generate and load data:
      ./tpcds-setup.sh $SF
    • Store the generated data in OSS
      The compute-storage separated architecture of EMR simplifies the operations of storing data in OSS. You can perform the following steps to store data in OSS:
      1. Run the following commands to modify the database path:
        hive --hivevar SF=$SF
        create database if not exists tpcds_bin_partitioned_orc_${SF};
        alter database tpcds_bin_partitioned_orc_${SF} set location 'oss://<bucket-name>/warehouse/tpcds_bin_partitioned_orc_${SF}.db';

        You must replace <bucket-name> in the preceding command with the name of an OSS bucket that resides in the same region as the EMR cluster.

      2. Run the following commands to generate and load data:
        cd ~/hive-testbench
        ./tpcds-setup.sh $SF
        Note The generated data is stored in the specified OSS bucket. If you use Data Lake Formation (DLF) to store the metadata of Hive tables, you can release the EMR cluster based on your business requirements after the data is generated. After the cluster is released, you can query the generated TPC-DS data from other EMR clusters in the same region as the released cluster.
  5. Obtain Hive table statistics.
    We recommend that you use the Hive SQL ANALYZE command to obtain Hive table statistics. This helps speed up subsequent SQL queries. If SF is set to 3, the Hive table statistics can be obtained in approximately 20 to 30 minutes.
    hive -f ./ddl-tpcds/bin_partitioned/analyze.sql \
        --hiveconf hive.execution.engine=tez \
        --database tpcds_bin_partitioned_orc_$SF
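After the data is generated and analyzed, you can spot-check the result before you run the benchmark queries. The following is a minimal sketch; store_sales is the largest TPC-DS fact table, and all three commands only read metadata or row counts.

  # List the tables in the generated database. TPC-DS defines 24 tables in total.
  hive -e "show tables in tpcds_bin_partitioned_orc_$SF"

  # Show the storage location of the database, which is an HDFS path or an OSS path
  # depending on the option that you chose when you loaded the data.
  hive -e "desc database tpcds_bin_partitioned_orc_$SF"

  # Count the rows of the largest fact table. For SF=3, store_sales holds roughly
  # on the order of millions of rows.
  hive --database tpcds_bin_partitioned_orc_$SF \
      --hiveconf hive.execution.engine=tez \
      -e "select count(*) from store_sales"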

Step 4: Execute TPC-DS SQL statements

You can use Hive or Spark to execute TPC-DS SQL statements.

  • Use Hive to execute TPC-DS SQL statements
    • Execute a single SQL statement.
      TPC-DS provides a total of 99 SQL files, such as query10.sql and query11.sql. All the files are placed in the sample-queries-tpcds directory. If SF is set to 3, each TPC-DS SQL statement returns results within 5 minutes.
      Notice TPC-DS queries and TPC-DS data are randomly generated. Therefore, some SQL statements may return no records.
      cd sample-queries-tpcds
      hive --database tpcds_bin_partitioned_orc_$SF
      set hive.execution.engine=tez;
      source query10.sql;
    • Execute all 99 SQL statements in sequence by using the script file provided in the TPC-DS toolset. Example:
      cd ~/hive-testbench
      # Generate a Hive configuration file and set the Hive execution engine to Tez. 
      echo 'set hive.execution.engine=tez;' > sample-queries-tpcds/testbench.settings
      ./runSuite.pl tpcds $SF
  • Use Spark to execute TPC-DS SQL statements

    The toolset of TPC-DS provides some sample Spark SQL statements in the spark-queries-tpcds directory. You can use a command-line tool, such as Spark SQL or Spark Beeline, to execute the sample statements. In this step, Spark Beeline, which is connected to Spark Thrift Server, is used as an example. This example shows how to execute TPC-DS SQL statements to query the TPC-DS dataset that is generated in Step 3.

    Note EMR Spark allows you to store tables in multiple storage media, such as HDFS and OSS, and allows you to store metadata in DLF.
    1. Use Spark Beeline to run the ANALYZE command and obtain Hive table statistics. This helps speed up subsequent SQL queries.
      cd ~/hive-testbench
      spark-beeline -u jdbc:hive2://emr-header-1:10001/tpcds_bin_partitioned_orc_$SF \
        -f ./ddl-tpcds/bin_partitioned/analyze.sql
    2. Switch to the directory in which the sample Spark SQL statements are placed.
      cd spark-queries-tpcds/
    3. Execute SQL statements.
      • Execute a single SQL statement.
        spark-beeline -u jdbc:hive2://emr-header-1:10001/tpcds_bin_partitioned_orc_$SF -f q1.sql
      • Execute all the 99 SQL statements in sequence.
        The toolset of TPC-DS does not contain a script file that can be used to execute all the Spark SQL statements in one run. You can use the following simple script for reference. An extended variant that also records per-query runtimes is provided at the end of this step:
        for q in `ls *.sql`; do
          spark-beeline -u jdbc:hive2://emr-header-1:10001/tpcds_bin_partitioned_orc_$SF -f $q > $q.out
        done
        Notice
        • In the q30.sql file, the column name c_last_review_date_sk is written as c_last_review_date by mistake. Therefore, the thirtieth SQL statement fails.
        • If an error is reported when you use the script to execute all 99 SQL statements in sequence, see the FAQ section of this topic for a possible solution.
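If you also want to track how long each query takes, the loop above can be extended to record the elapsed time and exit status of every query. The following is a minimal sketch that reuses the spark-beeline connection string shown above; the runtimes.log file name is arbitrary.

  #!/bin/bash
  # Run all 99 Spark SQL queries and record the elapsed time and exit status of each.
  # Assumes that SF is set and that the current directory is
  # ~/hive-testbench/spark-queries-tpcds/, as in the preceding steps.
  URL="jdbc:hive2://emr-header-1:10001/tpcds_bin_partitioned_orc_$SF"
  for q in $(ls *.sql); do
    start=$(date +%s)
    spark-beeline -u "$URL" -f "$q" > "$q.out" 2>&1
    status=$?
    end=$(date +%s)
    echo "$q: $((end - start))s, exit code $status" >> runtimes.log
  done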

FAQ

Q: What do I do if the following error is reported when I use a script to execute all the 99 Spark SQL statements in sequence: Error: org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe (Write failed) (state=08S01,code=0) Closing: 0: jdbc:hive2://emr-header-1:10001/tpcds_bin_partitioned_orc_**?

A: The default memory configured for Spark Thrift Server may be insufficient for testing large datasets. If a Spark SQL job fails to be submitted during testing, Spark Thrift Server may have run out of memory. To address this issue, set the spark_thrift_daemon_memory parameter of the Spark service to a larger value and restart Spark Thrift Server. Perform the following steps (a quick connectivity check is provided after the procedure):
  1. Go to the Spark service page.
    1. Log on to the Alibaba Cloud EMR console.
    2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
    3. Click the Cluster Management tab.
    4. On the Cluster Management page, find your cluster and click Details in the Actions column.
    5. In the left-side navigation pane, choose Cluster Service > Spark.
  2. Set the spark_thrift_daemon_memory parameter to a larger value.
    1. On the Spark service page, click the Configure tab.
    2. In the Configuration Filter section, enter spark_thrift_daemon_memory in the search box and click the Search icon.
    3. Adjust the value of the parameter based on the size of your dataset.

      The default value is 1g. You can set this parameter to 2g or a larger value.

    4. In the upper-right corner of the Service Configuration section, click Save.
    5. In the Confirm Changes dialog box, specify Description and click OK.
  3. Restart Spark Thrift Server.
    1. In the upper-right corner of the Spark service page, choose Actions > Restart ThriftServer.
    2. In the Cluster Activities dialog box, specify Description and click OK.
    3. In the Confirm message, click OK.
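After Spark Thrift Server is restarted, you can confirm that it accepts connections again before you rerun the script. The following is a minimal sketch; the -e option is standard Beeline syntax.

  # Run a trivial query against the restarted Spark Thrift Server.
  spark-beeline -u jdbc:hive2://emr-header-1:10001/ -e "select 1"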