This topic describes how to execute a SELECT statement to query data in ApsaraDB for ClickHouse.
The following sample code shows the syntax of a SELECT statement:
SELECT [DISTINCT] expr_list [FROM [db.]table | (subquery) | table_function] [FINAL] [SAMPLE sample_coeff] [ARRAY JOIN ...] [GLOBAL] ANY|ALL INNER|LEFT JOIN (subquery)|table USING columns_list [PREWHERE expr] [WHERE expr] [GROUP BY expr_list] [WITH TOTALS] [HAVING expr] [ORDER BY expr_list] [LIMIT [n, ]m] [UNION ALL ...] [INTO OUTFILE filename] [FORMAT format] [LIMIT n BY columns]
All clauses are optional, except for the required list of expressions (expr_list) after SELECT. The following sections of this topic describe some of the clauses. For more information about the clauses, see the SELECT Query topic of the open source ClickHouse documentation.
The following sample code provides a simple SELECT query statement:
SELECT OriginCityName, DestCityName, count(*) AS flights, bar(flights, 0, 20000, 40) FROM ontime_distributed WHERE Year = 1988 GROUP BY OriginCityName, DestCityName ORDER BY flights DESC LIMIT 20;
You can use the SAMPLE clause in SELECT statements to perform approximate queries. Approximate queries are supported only for MergeTree tables. To perform approximate queries on a table, you must specify a sampling expression when you create the table.
The syntax of the SAMPLE clause is SAMPLE k, where k can be a decimal number from 0 to 1 or a large positive integer.
When k is a decimal number from 0 to 1, the query uses k as a percentage to select sample data. For example, if the SAMPLE clause is SAMPLE 0.1, the query retrieves 10% of the data. When k is a large positive integer, the query uses k as the maximum number of rows of sample data. For example, if the SAMPLE clause is SAMPLE 10000000, the query retrieves up to 10,000,000 rows of data as sample data.
The following sample code provides examples on how to manage resource queues:
SELECT Title, count() * 10 AS PageViews FROM hits_distributed SAMPLE 0.1 WHERE CounterID = 34 AND toDate(EventDate) >= toDate('2013-01-29') AND toDate(EventDate) <= toDate('2013-02-04') AND NOT DontCountHits AND NOT Refresh AND Title != '' GROUP BY Title ORDER BY PageViews DESC LIMIT 1000;
In this example, k is 0.1 and the query retrieves 10% of the data. However, the query does not automatically verify the aggregation results. To ensure the precision of query results, the result of the count() function is multiplied by 10.
When you perform an approximate query by specifying a large positive integer in the SAMPLE clause, neither information about which data is processed nor the number by which the result of the count() function needs to be multiplied is specified. Therefore, in some scenarios, a large positive integer may not be a suitable value for k in the SAMPLE clause.
The SAMPLE clause consistently works for tables. If all the data of a table can be queried and you use the same sample rate to perform approximate queries, the query results are the same. The condition is that the sampling expression of the table remains unchanged. In other words, the SAMPLE clause samples data the same way regardless of the time when the query is performed, the server on which the query is performed, and the table to be queried.
For example, the user IDs that you obtain by using the SAMPLE clause can be the same as those that you obtain by executing a SELECT statement without the SAMPLE clause. This means that you can use the SAMPLE clause in IN subqueries or use sampling results to associate with other queries.
ARRAY JOIN clause
The ARRAY JOIN clause is used in queries to join arrays and nested data structures. This clause is similar to the Array.join() function but has wider use.
The ARRAY JOIN clause is essentially equivalent to an INNER JOIN array. The following sample code provides an example:
:) CREATE TABLE arrays_test (s String, arr Array(UInt8)) ENGINE = Memory CREATE TABLE arrays_test ( s String, arr Array(UInt8) ) ENGINE = Memory Ok. 0 rows in set. Elapsed: 0.001 sec. :) INSERT INTO arrays_test VALUES ('Hello', [1,2]), ('World', [3,4,5]), ('Goodbye', ) INSERT INTO arrays_test VALUES Ok. 3 rows in set. Elapsed: 0.001 sec. :) SELECT * FROM arrays_test SELECT * FROM arrays_test ┌─s───────┬─arr─────┐ │ Hello │ [1,2] │ │ World │ [3,4,5] │ │ Goodbye │  │ └─────────┴─────────┘ 3 rows in set. Elapsed: 0.001 sec. :) SELECT s, arr FROM arrays_test ARRAY JOIN arr SELECT s, arr FROM arrays_test ARRAY JOIN arr ┌─s─────┬─arr─┐ │ Hello │ 1 │ │ Hello │ 2 │ │ World │ 3 │ │ World │ 4 │ │ World │ 5 │ └───────┴─────┘ 5 rows in set. Elapsed: 0.001 sec.
For information about how to process NULL values in the JOIN clause, see the join_use_nulls section of the Settings topic, the Nullable topic, and the NULL section of the Syntax topic of the open source ClickHouse documentation.
If you use the WHERE clause in a SELECT statement, you must specify an expression of the UINT8 type in the WHERE clause. In most cases, the expression is a logical comparison expression. The expression is used to filter data before the data is converted.
If your table engine supports indexes, the engine checks whether the expression contains indexes.
The PREWHERE clause is similar to the WHERE clause. The difference lies in the way of reading data from the queried table. When you use the PREWHERE clause in a SELECT statement, the fields that are specified in the PREWHERE clause are first read from the queried table. Then, based on the results of the PREWHERE clause, other required fields are read from the queried table.
The filter condition may contain a few fields that are suitable for regular filtering but not index filtering. In this case, the PREWHERE clause is helpful because it can facilitate data reading.
Therefore, when you need to query a large number of fields at a time, we recommend that you use the PREWHERE clause and specify a few of the fields in the filter condition.
WITH TOTALS modifier
If you specify the WITH TOTALS modifier in a SELECT statement, the query results contain an extra row of calculated values. The extra row contains the default values of all the keys. The default values are zero or empty values. The extra row also contains all the aggregation results of all the queried data rows.
In query results, this row is separate from the other rows only when the output format is JSON, TabSeparated, or Pretty.