This topic describes how to execute a SELECT statement to query data in ApsaraDB for ClickHouse.

Syntax

The following sample code shows the syntax of a SELECT statement:

SELECT [DISTINCT] expr_list
    [FROM [db.]table | (subquery) | table_function] [FINAL]
    [SAMPLE sample_coeff]
    [ARRAY JOIN ...]
    [GLOBAL] ANY|ALL INNER|LEFT JOIN (subquery)|table USING columns_list
    [PREWHERE expr]
    [WHERE expr]
    [GROUP BY expr_list] [WITH TOTALS]
    [HAVING expr]
    [ORDER BY expr_list]
    [LIMIT [n, ]m]
    [UNION ALL ...]
    [INTO OUTFILE filename]
    [FORMAT format]
    [LIMIT n BY columns]

All clauses are optional, except for the required list of expressions (expr_list) after SELECT. The following sections of this topic describe some of the clauses. For more information about the clauses, see the SELECT Query topic of the open source ClickHouse documentation.

The following sample code provides a simple SELECT query statement:

SELECT
    OriginCityName,
    DestCityName,
    count(*) AS flights,
    bar(flights, 0, 20000, 40)
FROM ontime_distributed 
WHERE Year = 1988
GROUP BY OriginCityName, DestCityName 
ORDER BY flights DESC 
LIMIT 20;

SAMPLE clause

You can use the SAMPLE clause in SELECT statements to perform approximate queries. Approximate queries are supported only for MergeTree tables. To perform approximate queries on a table, you must specify a sampling expression when you create the table.

The syntax of the SAMPLE clause is SAMPLE k, where k can be a decimal number from 0 to 1 or a large positive integer.

When k is a decimal number from 0 to 1, the query uses k as a percentage to select sample data. For example, if the SAMPLE clause is SAMPLE 0.1, the query retrieves 10% of the data. When k is a large positive integer, the query uses k as the maximum number of rows of sample data. For example, if the SAMPLE clause is SAMPLE 10000000, the query retrieves up to 10,000,000 rows of data as sample data.

The following sample code provides examples on how to manage resource queues:

SELECT
    Title,
    count() * 10 AS PageViews
FROM hits_distributed
SAMPLE 0.1
WHERE
    CounterID = 34
    AND toDate(EventDate) >= toDate('2013-01-29')
    AND toDate(EventDate) <= toDate('2013-02-04')
    AND NOT DontCountHits
    AND NOT Refresh
    AND Title != ''
GROUP BY Title
ORDER BY PageViews DESC LIMIT 1000;

In this example, k is 0.1 and the query retrieves 10% of the data. However, the query does not automatically verify the aggregation results. To ensure the precision of query results, the result of the count() function is multiplied by 10.

When you perform an approximate query by specifying a large positive integer in the SAMPLE clause, neither information about which data is processed nor the number by which the result of the count() function needs to be multiplied is specified. Therefore, in some scenarios, a large positive integer may not be a suitable value for k in the SAMPLE clause.

The SAMPLE clause consistently works for tables. If all the data of a table can be queried and you use the same sample rate to perform approximate queries, the query results are the same. The condition is that the sampling expression of the table remains unchanged. In other words, the SAMPLE clause samples data the same way regardless of the time when the query is performed, the server on which the query is performed, and the table to be queried.

For example, the user IDs that you obtain by using the SAMPLE clause can be the same as those that you obtain by executing a SELECT statement without the SAMPLE clause. This means that you can use the SAMPLE clause in IN subqueries or use sampling results to associate with other queries.

ARRAY JOIN clause

The ARRAY JOIN clause is used in queries to join arrays and nested data structures. This clause is similar to the Array.join() function but has wider use.

The ARRAY JOIN clause is essentially equivalent to an INNER JOIN array. The following sample code provides an example:

:) CREATE TABLE arrays_test (s String, arr Array(UInt8)) ENGINE = Memory

CREATE TABLE arrays_test
(
    s String,
    arr Array(UInt8)
) ENGINE = Memory

Ok.

0 rows in set. Elapsed: 0.001 sec.

:) INSERT INTO arrays_test VALUES ('Hello', [1,2]), ('World', [3,4,5]), ('Goodbye', [])

INSERT INTO arrays_test VALUES

Ok.

3 rows in set. Elapsed: 0.001 sec.

:) SELECT * FROM arrays_test

SELECT *
FROM arrays_test

┌─s───────┬─arr─────┐
│ Hello   │ [1,2]   │
│ World   │ [3,4,5] │
│ Goodbye │ []      │
└─────────┴─────────┘

3 rows in set. Elapsed: 0.001 sec.

:) SELECT s, arr FROM arrays_test ARRAY JOIN arr

SELECT s, arr
FROM arrays_test
ARRAY JOIN arr

┌─s─────┬─arr─┐
│ Hello │   1 │
│ Hello │   2 │
│ World │   3 │
│ World │   4 │
│ World │   5 │
└───────┴─────┘

5 rows in set. Elapsed: 0.001 sec.

JOIN clause

For information about how to process NULL values in the JOIN clause, see the join_use_nulls section of the Settings topic, the Nullable topic, and the NULL section of the Syntax topic of the open source ClickHouse documentation.

WHERE clause

If you use the WHERE clause in a SELECT statement, you must specify an expression of the UINT8 type in the WHERE clause. In most cases, the expression is a logical comparison expression. The expression is used to filter data before the data is converted.

If your table engine supports indexes, the engine checks whether the expression contains indexes.

PREWHERE clause

The PREWHERE clause is similar to the WHERE clause. The difference lies in the way of reading data from the queried table. When you use the PREWHERE clause in a SELECT statement, the fields that are specified in the PREWHERE clause are first read from the queried table. Then, based on the results of the PREWHERE clause, other required fields are read from the queried table.

The filter condition may contain a few fields that are suitable for regular filtering but not index filtering. In this case, the PREWHERE clause is helpful because it can facilitate data reading.

Therefore, when you need to query a large number of fields at a time, we recommend that you use the PREWHERE clause and specify a few of the fields in the filter condition.

Note The PREWHERE clause is supported only by engines in the MergeTree family. If you use both the PREWHERE clause and the WHERE clause in a SELECT statement, the PREWHERE clause is executed before the WHERE clause. Indexed fields are not suitable for the PREWHERE clause. If a field is indexed, only the data block that meets the index is read. Assume that you set the optimize_move_to_prewhere parameter to 1 and do not use the PREWHERE clause in a SELECT statement. The system automatically checks whether a part of the filter condition in the WHERE clause is suitable for the PREWHERE clause. If so, the system moves the part from the WHERE clause to the PREWHERE clause.

WITH TOTALS modifier

If you specify the WITH TOTALS modifier in a SELECT statement, the query results contain an extra row of calculated values. The extra row contains the default values of all the keys. The default values are zero or empty values. The extra row also contains all the aggregation results of all the queried data rows.

In query results, this row is separate from the other rows only when the output format is JSON, TabSeparated, or Pretty.