Execute a SELECT statement to query data - ApsaraDB for ClickHouse

SELECT statements are used to query data from one or multiple tables. This topic describes the syntax of a SELECT statement and some clauses.

Syntax

In a SELECT statement, all clauses, except for the required list of expressions (expr_list), are optional. For more information, see SELECT Query.

Syntax:

SELECT [DISTINCT] expr_list
    [FROM [db.]table | (subquery) | table_function] [FINAL]
    [SAMPLE sample_coeff]
    [ARRAY JOIN ...]
    [GLOBAL] ANY|ALL INNER|LEFT JOIN (subquery)|table USING columns_list
    [PREWHERE expr]
    [WHERE expr]
    [GROUP BY expr_list] [WITH TOTALS]
    [HAVING expr]
    [ORDER BY expr_list]
    [LIMIT [n, ]m]
    [UNION ALL ...]
    [INTO OUTFILE filename]
    [FORMAT format]
    [LIMIT n BY columns]

Example:

SELECT
    OriginCityName,
    DestCityName,
    count(*) AS flights,
    bar(flights, 0, 20000, 40)
FROM ontime_distributed 
WHERE Year = 1988
GROUP BY OriginCityName, DestCityName 
ORDER BY flights DESC 
LIMIT 20;

SAMPLE clause

You can use the SAMPLE clause to perform approximated queries.

Important

The SAMPLE clause is supported only for *MergeTree tables for which sampling expressions are specified when the tables are created.

The syntax of the SAMPLE clause is SAMPLE k, where k can be a decimal number or a sufficiently large integer.

When k is a decimal number:
- The value range of k is 0 < k < 1.
- The query is executed on a sample from k fraction of data. For example, SAMPLE 0.1 indicates that the query retrieves 10% of the data.
- This setting can be used to obtain an approximate result from all the data. Calculation formula: Number of results that are obtained by using the SAMPLE clause/k.
When k is a sufficiently large integer:
- The query is executed on a sample of up to k rows. For example, SAMPLE 10000000 indicates that the query retrieves a maximum of 10,000,000 rows of data as sample data.
- This setting cannot be used to obtain an approximate result from all the data.

Example:

In the following example, the query retrieves 10% of the data.

The result of the count() function is the counting result of the 10% of the data. To obtain the approximated counting result of all data, the result of the count() function must be multiplied by 10.

SELECT
    Title,
    count() * 10 AS PageViews
FROM hits_distributed
SAMPLE 0.1
WHERE
    CounterID = 34
    AND toDate(EventDate) >= toDate('2013-01-29')
    AND toDate(EventDate) <= toDate('2013-02-04')
    AND NOT DontCountHits
    AND NOT Refresh
    AND Title != ''
GROUP BY Title
ORDER BY PageViews DESC LIMIT 1000;

ARRAY JOIN clause

The ARRAY JOIN clause is used to execute JOIN with an array or nested data structure. This clause is similar to the arrayJoin function but is applicable to more scenarios.

ARRAY JOIN with an array data structure: The data of an array column in a row is returned in multiple rows and each row has a column containing an individual array element of that initial column. This way, further query or aggregation operations can be performed on each array element of that initial column.
ARRAY JOIN with a nested data structure: The ARRAY JOIN clause can not only split an array, but can also retain the structural relationship of all the elements in the array. Therefore, you can query nested fields.

Example:

Create a test table named arrays_test.

CREATE TABLE arrays_test
(
    s String,
    arr Array(UInt8)
) ENGINE = Memory

Insert test data into the table.

INSERT INTO arrays_test VALUES ('Hello', [1,2]), ('World', [3,4,5]), ('Goodbye', [])

Query the arrays_test table.

SELECT * FROM arrays_test;

The following result is returned:

┌─s───────┬─arr─────┐
│ Hello   │ [1,2]   │
│ World   │ [3,4,5] │
│ Goodbye │ []      │
└─────────┴─────────┘

Use the ARRAY JOIN clause to perform a query on the arrays_test table.

SELECT s, arr FROM arrays_test ARRAY JOIN arr;

The following result is returned:

┌─s─────┬─arr─┐
│ Hello │   1 │
│ Hello │   2 │
│ World │   3 │
│ World │   4 │
│ World │   5 │
└───────┴─────┘

JOIN clause

The JOIN clause combines rows from two or more tables based on a specific join condition to create a more complex and useful dataset.

For more information about how to handle NULL values in a JOIN clause, see join_use_nulls, Nullable, and Null.
When you perform the JOIN or IN subquery on multiple distributed tables in the multi-node cluster of Community-compatible Edition, the following error message may appear: Exception: Double-distributed IN/JOIN subqueries is denied (distributed_product_mode = 'deny'). We recommend that you do not perform the JOIN or IN subquery on distributed tables after you create the tables. For more information about how to perform subqueries, see FAQ.

WHERE clause

The WHERE clause is used to filter data.

If you use a WHERE clause, it must contain an expression of the UINT8 type. The expression is usually a conditional expression.
If your table engine supports indexes, the WHERE expression is evaluated on the ability to use indexes.

PREWHERE clause

The PREWHERE clause is similar to the WHERE clause and is also used to filter data. For more information, see PREWHERE Clause.

Note

The PREWHERE clause is supported only by engines in the MergeTree family.
If you use both the PREWHERE clause and the WHERE clause in a SELECT statement, the PREWHERE clause precedes the WHERE clause.
Indexed fields are not suitable for the PREWHERE clause. If a field is indexed, only the data block that meets the index is read.
Assume that you set the optimize_move_to_prewhere parameter to 1 and do not use the PREWHERE clause in a SELECT statement. The system automatically checks whether a part of the filter condition in the WHERE clause is suitable for the PREWHERE clause. If so, the system moves the part from the WHERE clause to the PREWHERE clause.

Application scenario

When filter conditions that provide strong data filtration involve a few fields that are suitable for regular filtering but not index filtering, you can use the PREWHERE clause to reduce the volume of data to read. For example, when you need to query a large number of fields at a time, we recommend that you use the PREWHERE clause and specify a few of the fields in the filter condition.

WITH TOTALS modifier

If you specify the WITH TOTALS modifier in a GROUP BY clause, the query results contain an extra row of calculated values.

Note

The extra row has key columns containing default values (zeros or empty lines) and columns of aggregate functions with the values calculated across all the rows.
In query results, this row is separate from the other rows only when the output format is JSON, TabSeparated, or Pretty.

Example:

Create a test table named t1.

CREATE TABLE default.t1
(
    `a` Int32,
    `b` Int32,
    `c` String,
    `d` Date
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY (a, b)
SETTINGS index_granularity = 8192;

Insert test data into the table.

INSERT INTO t3(a, b, c, d) INSERT INTO t1(a, b, c, d) VALUES (3,4,'a','2022-09-01'),(12,7,'b','2022-07-01'),(21,4,'c','2022-05-01'),(11,1,'d','2022-06-01');

Query data in the test table t1, group the data based on column d, and return the result of grouping and summarizing of data in column a.

select sum(a) from t1 group by d;

The following result is returned:

┌─sum(a)─┐
│   21   │
│   11   │
│   12   │
│   3   │
└───┴────┘

Query the data in the test table t1. Use the WITH TOTALS modifier to group the data based on column d and return the result of grouping and summarizing of data in column a.

 select sum(a) from t1 group by d with totals;

The following result is returned:

┌─sum(a)─┐
│   21   │
│   11   │
│   12   │
│   3   │
└───┴────┘
Totals:
┌─sum(a)─┐
│   47   │
└───┴────┘