This topic describes how to import test datasets of open source ClickHouse into ApsaraDB for ClickHouse and use the datasets to test the performance of ApsaraDB for ClickHouse. The test datasets are available on the website of open source ClickHouse.

ApsaraDB for ClickHouse is a high-performance column-oriented analytical database service. It provides excellent performance in scenarios such as aggregate data analysis and wide table analysis. The website of open source ClickHouse provides a number of open source test datasets. In this topic, two benchmark test datasets are used to illustrate how to import data and test performance. The purpose is to provide a method for verifying the performance of ApsaraDB for ClickHouse.

Prepare a test environment

Prepare a server that runs the Linux operating system. Install clickhouse-client on the server based on the edition of your ApsaraDB for ClickHouse cluster.

Import the OnTime dataset and test performance

Step 1: Create a table

Create a table in your ApsaraDB for ClickHouse cluster by using one of the following scripts. Select a script based on the edition of your cluster.

Script for creating a table in an ApsaraDB for ClickHouse cluster of the Single-replica Edition

Script for creating a table in an ApsaraDB for ClickHouse cluster of the High-availability Edition

Step 2: Download data

Copy the following shell commands and save them in a script file named download.sh. Then, run the sh download.sh command.

Notice The OnTime dataset covers an extremely large time range. The download of the data takes a relatively long time. In addition, an error will be reported during the download process, indicating that some data is lost.
for s in `seq 1987 2018`
do
for m in `seq 1 12`
do
wget https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_${s}_${m}.zip
done
done

Step 3: Import the data

Run the following shell commands to configure the ck_url, ck_user, ck_pass, and ck_port environment variables and import the data. Specify the actual values of the environment variables in the commands.

export ck_url='XXX'
export ck_user='XXX'
export ck_pass='XXX'
export ck_port='XXX'
for i in *.zip; do echo $i; unzip -cq $i '*.csv' | sed 's/\.00//g' | clickhouse-client -h $ck_url -u $ck_user --password $ck_pass --port $ck_port --query="INSERT INTO ontime FORMAT CSVWithNames"; done

Step 4: Test performance

--Q0

SELECT avg(c1)
FROM
(
    SELECT Year, Month, count(*) AS c1
    FROM ontime
    GROUP BY Year, Month
);

--Q1 The number of flights per day from the year 2000 to 2008

SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC;

--Q2 The number of flights delayed by more than 10 minutes, grouped by the day of the week, for 2000-2008

SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC;

--Q3 The number of delays by airport for 2000-2008

SELECT Origin, count(*) AS c
FROM ontime
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY Origin
ORDER BY c DESC
LIMIT 10;

--Q4 The number of delays by carrier for 2007

SELECT Carrier, count(*)
FROM ontime
WHERE DepDelay>10 AND Year=2007
GROUP BY Carrier
ORDER BY count(*) DESC;

--Q5 The percentage of delays by carrier for 2007

set any_join_distinct_right_table_keys=1;
SELECT Carrier, c, c2, c*100/c2 as c3
FROM
(
   SELECT
       Carrier,
       count(*) AS c
   FROM ontime
   WHERE DepDelay>10
       AND Year=2007
   GROUP BY Carrier
)
ANY INNER JOIN
(
   SELECT
       Carrier,
       count(*) AS c2
   FROM ontime
   WHERE Year=2007
   GROUP BY Carrier
) USING Carrier
ORDER BY c3 DESC;

--Q6 The previous request for a broader range of years, 2000-2008

SELECT Carrier, c, c2, c*100/c2 as c3
FROM
(
   SELECT
       Carrier,
       count(*) AS c
   FROM ontime
   WHERE DepDelay>10
       AND Year>=2000 AND Year<=2008
   GROUP BY Carrier
)
ANY INNER JOIN
(
   SELECT
       Carrier,
       count(*) AS c2
   FROM ontime
   WHERE Year>=2000 AND Year<=2008
   GROUP BY Carrier
) USING Carrier
ORDER BY c3 DESC;

--Q7 Percentage of flights delayed for more than 10 minutes, by year

SELECT Year, c1/c2
FROM
(
   select
       Year,
       count(*)*100 as c1
   from ontime
   WHERE DepDelay>10
   GROUP BY Year
)
ANY INNER JOIN
(
   select
       Year,
       count(*) as c2
   from ontime
   GROUP BY Year
) USING (Year)
ORDER BY Year;

--Q8 The most popular destinations by the number of directly connected cities for various year ranges

SELECT DestCityName, uniqExact(OriginCityName) AS u
FROM ontime
WHERE Year>=2000 and Year<=2010
GROUP BY DestCityName
ORDER BY u DESC
LIMIT 10;

--Q9

SELECT Year, count(*) AS c1
FROM ontime
GROUP BY Year;

--Q10

SELECT
  min(Year), max(Year), Carrier, count(*) AS cnt,
  sum(ArrDelayMinutes>30) AS flights_delayed,
  round(sum(ArrDelayMinutes>30)/count(*),2) AS rate
FROM ontime
WHERE
  DayOfWeek NOT IN (6,7) AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
  AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
  AND FlightDate < '2010-01-01'
GROUP by Carrier
HAVING cnt>100000 and max(Year)>1990s
ORDER by rate DESC
LIMIT 1000;

Import the Star Schema dataset and test performance

Step 1: Create tables

Create tables in your ApsaraDB for ClickHouse cluster by using one of the following scripts. Select a script based on the edition of your cluster.

Script for creating tables in an ApsaraDB for ClickHouse cluster of the Single-replica Edition

Script for creating tables in an ApsaraDB for ClickHouse cluster of the High-availability Edition

Step 2: Generate data

Run the following shell commands to clone the data generation project, compile the project by using make, and then generate data files. In this example, four data files are generated: customer.tbl, lineorder.tbl, part.tbl, and supplier.tbl.

Notice In this example, the -s parameter is set to 100, which means that around 600 million rows of data are to be generated by dbgen. In this case, the volume of the data is around 67 GB. To adjust the volume of data to be generated, you can set the value of the -s parameter based on actual needs.
git clone git@github.com:vadimtk/ssb-dbgen.git
cd ssb-dbgen
make
./dbgen -s 100 -T c
./dbgen -s 100 -T l
./dbgen -s 100 -T p
./dbgen -s 100 -T s
./dbgen -s 100 -T d

Step 3: Import the data

Run the following shell commands to configure the ck_url, ck_user, ck_pass, and ck_port environment variables and import the data. Specify the actual values of the environment variables in the commands.

export ck_url=''
export ck_user=''
export ck_pass=''
export ck_port=''
clickhouse-client -h $ck_url -u $ck_user --password $ck_pass --port $ck_port --query "INSERT INTO customer FORMAT CSV" < customer.tbl
clickhouse-client -h $ck_url -u $ck_user --password $ck_pass --port $ck_port --query "INSERT INTO part FORMAT CSV" < part.tbl
clickhouse-client -h $ck_url -u $ck_user --password $ck_pass --port $ck_port --query "INSERT INTO supplier FORMAT CSV" < supplier.tbl
clickhouse-client -h $ck_url -u $ck_user --password $ck_pass --port $ck_port --query "INSERT INTO lineorder FORMAT CSV" < lineorder.tbl

Step 4: Test performance

-- Data cleansing based on extract, transform, and load (ETL)

INSERT INTO lineorder_flat
SELECT
    l.LO_ORDERKEY AS LO_ORDERKEY,
    l.LO_LINENUMBER AS LO_LINENUMBER,
    l.LO_CUSTKEY AS LO_CUSTKEY,
    l.LO_PARTKEY AS LO_PARTKEY,
    l.LO_SUPPKEY AS LO_SUPPKEY,
    l.LO_ORDERDATE AS LO_ORDERDATE,
    l.LO_ORDERPRIORITY AS LO_ORDERPRIORITY,
    l.LO_SHIPPRIORITY AS LO_SHIPPRIORITY,
    l.LO_QUANTITY AS LO_QUANTITY,
    l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,
    l.LO_ORDTOTALPRICE AS LO_ORDTOTALPRICE,
    l.LO_DISCOUNT AS LO_DISCOUNT,
    l.LO_REVENUE AS LO_REVENUE,
    l.LO_SUPPLYCOST AS LO_SUPPLYCOST,
    l.LO_TAX AS LO_TAX,
    l.LO_COMMITDATE AS LO_COMMITDATE,
    l.LO_SHIPMODE AS LO_SHIPMODE,
    c.C_NAME AS C_NAME,
    c.C_ADDRESS AS C_ADDRESS,
    c.C_CITY AS C_CITY,
    c.C_NATION AS C_NATION,
    c.C_REGION AS C_REGION,
    c.C_PHONE AS C_PHONE,
    c.C_MKTSEGMENT AS C_MKTSEGMENT,
    s.S_NAME AS S_NAME,
    s.S_ADDRESS AS S_ADDRESS,
    s.S_CITY AS S_CITY,
    s.S_NATION AS S_NATION,
    s.S_REGION AS S_REGION,
    s.S_PHONE AS S_PHONE,
    p.P_NAME AS P_NAME,
    p.P_MFGR AS P_MFGR,
    p.P_CATEGORY AS P_CATEGORY,
    p.P_BRAND AS P_BRAND,
    p.P_COLOR AS P_COLOR,
    p.P_TYPE AS P_TYPE,
    p.P_SIZE AS P_SIZE,
    p.P_CONTAINER AS P_CONTAINER
FROM lineorder AS l
INNER JOIN customer AS c ON c.C_CUSTKEY = l.LO_CUSTKEY
INNER JOIN supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY
INNER JOIN part AS p ON p.P_PARTKEY = l.LO_PARTKEY;

--Q1.1

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE toYear(LO_ORDERDATE) = 1993 AND LO_DISCOUNT BETWEEN 1 AND 3 AND LO_QUANTITY < 25;

--Q1.2

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE toYYYYMM(LO_ORDERDATE) = 199401 AND LO_DISCOUNT BETWEEN 4 AND 6 AND LO_QUANTITY BETWEEN 26 AND 35;

--Q1.3

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE toISOWeek(LO_ORDERDATE) = 6 AND toYear(LO_ORDERDATE) = 1994 
  AND LO_DISCOUNT BETWEEN 5 AND 7 AND LO_QUANTITY BETWEEN 26 AND 35;

--Q2.1

SELECT
    sum(LO_REVENUE),
    toYear(LO_ORDERDATE) AS year,
    P_BRAND
FROM lineorder_flat
WHERE P_CATEGORY = 'MFGR#12' AND S_REGION = 'AMERICA'
GROUP BY
    year,
    P_BRAND
ORDER BY
    year,
    P_BRAND;

--Q2.2

SELECT
    sum(LO_REVENUE),
    toYear(LO_ORDERDATE) AS year,
    P_BRAND
FROM lineorder_flat
WHERE P_BRAND >= 'MFGR#2221' AND P_BRAND <= 'MFGR#2228' AND S_REGION = 'ASIA'
GROUP BY
    year,
    P_BRAND
ORDER BY
    year,
    P_BRAND;

--Q2.3

SELECT
    sum(LO_REVENUE),
    toYear(LO_ORDERDATE) AS year,
    P_BRAND
FROM lineorder_flat
WHERE P_BRAND = 'MFGR#2239' AND S_REGION = 'EUROPE'
GROUP BY
    year,
    P_BRAND
ORDER BY
    year,
    P_BRAND;

--Q3.1

SELECT
    C_NATION,
    S_NATION,
    toYear(LO_ORDERDATE) AS year,
    sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE C_REGION = 'ASIA' AND S_REGION = 'ASIA' AND year >= 1992 AND year <= 1997
GROUP BY
    C_NATION,
    S_NATION,
    year
ORDER BY
    year ASC,
    revenue DESC;

--Q3.2

SELECT
    C_CITY,
    S_CITY,
    toYear(LO_ORDERDATE) AS year,
    sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE C_NATION = 'UNITED STATES' AND S_NATION = 'UNITED STATES' AND year >= 1992 AND year <= 1997
GROUP BY
    C_CITY,
    S_CITY,
    year
ORDER BY
    year ASC,
    revenue DESC;

--Q3.3

SELECT
    C_CITY,
    S_CITY,
    toYear(LO_ORDERDATE) AS year,
    sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE (C_CITY = 'UNITED KI1' OR C_CITY = 'UNITED KI5') AND (S_CITY = 'UNITED KI1' OR S_CITY = 'UNITED KI5') AND year >= 1992 AND year <= 1997
GROUP BY
    C_CITY,
    S_CITY,
    year
ORDER BY
    year ASC,
    revenue DESC;

--Q3.4

SELECT
    C_CITY,
    S_CITY,
    toYear(LO_ORDERDATE) AS year,
    sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE (C_CITY = 'UNITED KI1' OR C_CITY = 'UNITED KI5') AND (S_CITY = 'UNITED KI1' OR S_CITY = 'UNITED KI5') AND toYYYYMM(LO_ORDERDATE) = 199712
GROUP BY
    C_CITY,
    S_CITY,
    year
ORDER BY
    year ASC,
    revenue DESC;

--Q4.1

SELECT
    toYear(LO_ORDERDATE) AS year,
    C_NATION,
    sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM lineorder_flat
WHERE C_REGION = 'AMERICA' AND S_REGION = 'AMERICA' AND (P_MFGR = 'MFGR#1' OR P_MFGR = 'MFGR#2')
GROUP BY
    year,
    C_NATION
ORDER BY
    year ASC,
    C_NATION ASC;

--Q4.2

SELECT
    toYear(LO_ORDERDATE) AS year,
    S_NATION,
    P_CATEGORY,
    sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM lineorder_flat
WHERE C_REGION = 'AMERICA' AND S_REGION = 'AMERICA' AND (year = 1997 OR year = 1998) AND (P_MFGR = 'MFGR#1' OR P_MFGR = 'MFGR#2')
GROUP BY
    year,
    S_NATION,
    P_CATEGORY
ORDER BY
    year ASC,
    S_NATION ASC,
    P_CATEGORY ASC;

--Q4.3

SELECT
    toYear(LO_ORDERDATE) AS year,
    S_CITY,
    P_BRAND,
    sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM lineorder_flat
WHERE S_NATION = 'UNITED STATES' AND (year = 1997 OR year = 1998) AND P_CATEGORY = 'MFGR#14'
GROUP BY
    year,
    S_CITY,
    P_BRAND
ORDER BY
    year ASC,
    S_CITY ASC,
    P_BRAND ASC;