SAMPLE - MaxCompute - Alibaba Cloud Documentation Center

Samples all values that are read from column_name based on x and y, and filters out the rows that do not meet sampling conditions.

Syntax

boolean sample(<x>, <y>, [<column_name1>, <column_name2>[,...]])

Parameters

x and y: x is required. x and y are integer constants that are greater than 0. Their values are of the BIGINT type. The two parameters indicate that the values fall into x portions based on the hash function and the yth portion is used.
y is optional. If no value is specified for the y parameter, the first portion is used, and you do not need to specify column_name.
If x or y is of another data type, the value of x or y is less than or equal to 0, or y is greater than x, an error is returned. If the value of x or y is null, null is returned.
column_name: optional. The name of the column on which sampling is performed. If no value is specified for this parameter, random sampling is performed based on the values of x and y. The name of the column can be of any data type, and the column value can be null. Implicit conversions are not performed. If column_name is set to null, an error is returned.
Note
- To prevent data skew due to the null value, uniform hashing is performed on the null values in column_name in x portions. If no value is specified for the column_name parameter and the amount of data is small, the output is not necessarily uniform. In this case, we recommend that you specify a value for column_name to obtain a uniform output.
- Random sampling can be performed only on columns of the following data types: BIGINT, DATETIME, BOOLEAN, DOUBLE, STRING, BINARY, CHAR, and VARCHAR.

Return value

A value of the BOOLEAN type is returned.

Examples:

In this example, the mf_sample table is used. This table contains the following data:

+------------+------+------+------+------+
| id         | col1 | col2 | col3 | col4 |
+------------+------+------+------+------+
| 3          | eee  | rrr  | tttt | ggggg |
| 4          | yyy  | uuuu | iiiii | ccccccc |
| 1          | "abc" | "bcd" | "rthg" | "ahgjeog" |
| 2          | "a1bc" | "bc1d" | "rt1hg" | "ahgjeog" |
+------------+------+------+------+------+

Randomly hash the data in the table to two portions, and take the first portion.

select * from mf_sample where sample(2,1);

The following result is returned:

+------------+------+------+------+------+
| id         | col1 | col2 | col3 | col4 |
+------------+------+------+------+------+
| 3          | eee  | rrr  | tttt | ggggg |
| 1          | "abc" | "bcd" | "rthg" | "ahgjeog" |
+------------+------+------+------+------+

Randomly hash the values in the id column of the table to two portions, and take the first portion.

select * from mf_sample where sample(2,1,id);

The following result is returned:

+------------+------+------+------+------+
| id         | col1 | col2 | col3 | col4 |
+------------+------+------+------+------+
| 4          | yyy  | uuuu | iiiii | ccccccc |
| 2          | "a1bc" | "bc1d" | "rt1hg" | "ahgjeog" |
+------------+------+------+------+------+

Related functions

For more information, see Other functions.