All Products
Search
Document Center

ApsaraDB for ClickHouse:SELECT

Last Updated:Jun 08, 2023

This topic describes how to execute a SELECT statement to query data in ApsaraDB for ClickHouse.

Syntax

The following sample code shows the syntax of a SELECT statement:

SELECT [DISTINCT] expr_list
    [FROM [db.]table | (subquery) | table_function] [FINAL]
    [SAMPLE sample_coeff]
    [ARRAY JOIN ...]
    [GLOBAL] ANY|ALL INNER|LEFT JOIN (subquery)|table USING columns_list
    [PREWHERE expr]
    [WHERE expr]
    [GROUP BY expr_list] [WITH TOTALS]
    [HAVING expr]
    [ORDER BY expr_list]
    [LIMIT [n, ]m]
    [UNION ALL ...]
    [INTO OUTFILE filename]
    [FORMAT format]
    [LIMIT n BY columns]

All clauses are optional, except for the required list of expressions (expr_list) immediately after SELECT. The following sections of this topic describe some of the clauses. For more information about the clauses, see SELECT Query.

The following sample code provides a simple SELECT query statement:

SELECT
    OriginCityName,
    DestCityName,
    count(*) AS flights,
    bar(flights, 0, 20000, 40)
FROM ontime_distributed 
WHERE Year = 1988
GROUP BY OriginCityName, DestCityName 
ORDER BY flights DESC 
LIMIT 20;

SAMPLE clause

You can use the SAMPLE clause in SELECT statements to perform approximated queries. Approximated queries are supported only for MergeTree tables for which sampling expressions are specified when the tables are created.

The syntax of the SAMPLE clause is SAMPLE k, where k can be a decimal number from 0 to 1 or a large positive integer.

When k is a decimal number from 0 to 1, the query uses k as a percentage to select sample data. For example, if the SAMPLE clause is SAMPLE 0.1, the query retrieves 10% of the data. When k is a large positive integer, the query uses k as the maximum number of rows of sample data. For example, if the SAMPLE clause is SAMPLE 10000000, the query retrieves a minimum of 10,000,000 rows of data as sample data.

Example:

SELECT
    Title,
    count() * 10 AS PageViews
FROM hits_distributed
SAMPLE 0.1
WHERE
    CounterID = 34
    AND toDate(EventDate) >= toDate('2013-01-29')
    AND toDate(EventDate) <= toDate('2013-02-04')
    AND NOT DontCountHits
    AND NOT Refresh
    AND Title != ''
GROUP BY Title
ORDER BY PageViews DESC LIMIT 1000;

In this example, k is 0.1 and the query retrieves 10% of the data. In this case, the result of the count() function is the counting result of the 10% of the data. To obtain the counting result of all data, the result of the count() function must be multiplied by 10.

When you perform an approximated query by specifying a large positive integer in the SAMPLE clause, for example, SAMPLE 10000000, you cannot determine the proportion of samples in full data. In this case, you cannot obtain an approximated counting result of full data by multiplying the counting result of the samples. Therefore, a large positive integer may not be a suitable value for k in the SAMPLE clause.

If all the source data of tables is available and you use the same sampling expression when you create the tables, the query results of the tables are the same. In other words, the SAMPLE clause samples data the same way regardless of the time when the query is performed, the server on which the query is performed, and the table to be queried. If you use the same sample rate to perform approximated queries, the query results are the same.

For example, the user IDs that you obtain by using the SAMPLE clause can be the same as those that you obtain by executing a SELECT statement without the SAMPLE clause. This means that you can use the SAMPLE clause in IN subqueries or use sampling results in other queries.

ARRAY JOIN clause

The ARRAY JOIN clause is used in queries to join arrays or nested data structures. This clause is similar to the Array.join() function but is applicable to more scenarios.

The ARRAY JOIN clause is essentially equivalent to an INNER JOIN array.

Example:

Create a test table named arrays_test and insert data into the table.

CREATE TABLE arrays_test
(
    s String,
    arr Array(UInt8)
) ENGINE = Memory
INSERT INTO arrays_test VALUES ('Hello', [1,2]), ('World', [3,4,5]), ('Goodbye', [])

Query the arrays_test table.

SELECT * FROM arrays_test;

The following result is returned:

┌─s───────┬─arr─────┐
│ Hello   │ [1,2]   │
│ World   │ [3,4,5] │
│ Goodbye │ []      │
└─────────┴─────────┘

Use the ARRAY JOIN clause to perform a query on the arrays_test table.

SELECT s, arr FROM arrays_test ARRAY JOIN arr;

The following result is returned:

┌─s─────┬─arr─┐
│ Hello │   1 │
│ Hello │   2 │
│ World │   3 │
│ World │   4 │
│ World │   5 │
└───────┴─────┘

NULL values in the JOIN clause

For information about how to process NULL values in the JOIN clause, see the join_use_nulls section of the Settings topic, the Nullable topic, and the NULL section of the Syntax topic of the open source ClickHouse documentation.

WHERE clause

The WHERE clause is used to filter data. You must specify an expression of the UINT8 type in the WHERE clause. The expression is usually a conditional expression.

If your table engine supports indexes, the engine checks whether the expression contains indexes.

PREWHERE clause

The PREWHERE clause is supported only by engines in the MergeTree family. The PREWHERE clause is similar to the WHERE clause. The difference lies in the way of reading data from the queried table. When you use the PREWHERE clause in a SELECT statement, the fields that are specified in the PREWHERE clause are first read from the queried table. Then, based on the results of the PREWHERE clause, other required fields are read from the queried table.

The filter condition may contain a few fields that are suitable for regular filtering but not index filtering. In this case, the PREWHERE clause is helpful because it can facilitate data reading. For example, when you need to query a large number of fields at a time, we recommend that you use the PREWHERE clause and specify a few of the fields in the filter condition.

Note
  • If you use both the PREWHERE clause and the WHERE clause in a SELECT statement, the PREWHERE clause is executed before the WHERE clause.

  • Indexed fields are not suitable for the PREWHERE clause. If a field is indexed, only the data block that meets the index is read.

  • Assume that you set the optimize_move_to_prewhere parameter to 1 and do not use the PREWHERE clause in a SELECT statement. The system automatically checks whether a part of the filter condition in the WHERE clause is suitable for the PREWHERE clause. If so, the system moves the part from the WHERE clause to the PREWHERE clause.

WITH TOTALS modifier

If you specify the WITH TOTALS modifier in a SELECT statement, the query results contain an extra row of calculated values. The extra row contains the default values of all the keys. The default values are zero or empty values. The extra row also contains all the aggregation results of all the queried data rows.

In query results, this row is separate from the other rows only when the output format is JSON, TabSeparated, or Pretty.

Example:

Create a test table t1 and insert data into the table.

CREATE TABLE default.t1
(
    `a` Int32,
    `b` Int32,
    `c` String,
    `d` Date
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY (a, b)
SETTINGS index_granularity = 8192;
INSERT INTO t3(a, b, c, d) INSERT INTO t1(a, b, c, d) VALUES (3,4,'a','2022-09-01'),(12,7,'b','2022-07-01'),(21,4,'c','2022-05-01'),(11,1,'d','2022-06-01');

Group data in the test table t1 based on column d and return the result of grouping and summarizing of data in column a.

select sum(a) from t1 group by d;

The following result is returned:

┌─sum(a)─┐
│   21   │
│   11   │
│   12   │
│   3    │
└────────┘

Use the WITH TOTALS modifier to group the data in the test table t1 based on column d and return the result of grouping and summarizing of data in column a.

 select sum(a) from t1 group by d with totals;

The following result is returned:

┌─sum(a)─┐
│   21   │
│   11   │
│   12   │
│   3    │
└────────┘
Totals:
┌─sum(a)─┐
│   47   │
└────────┘