All Products
Search
Document Center

MaxCompute:SAMPLE

Last Updated:Aug 23, 2023

Samples all values that are read from column_name based on x and y, and filters out the rows that do not meet sampling conditions.

Syntax

boolean sample(<x>, <y>, [<column_name1>, <column_name2>[,...]])

Parameters

  • x and y: x is required. x and y are integer constants that are greater than 0. Their values are of the BIGINT type. The two parameters indicate that the values fall into x portions based on the hash function and the yth portion is used.

    y is optional. If no value is specified for the y parameter, the first portion is used, and you do not need to specify column_name.

    If x or y is of another data type, the value of x or y is less than or equal to 0, or y is greater than x, an error is returned. If the value of x or y is null, null is returned.

  • column_name: optional. The name of the column on which sampling is performed. If no value is specified for this parameter, random sampling is performed based on the values of x and y. The name of the column can be of any data type, and the column value can be null. Implicit conversions are not performed. If column_name is set to null, an error is returned.

    Note
    • To prevent data skew due to the null value, uniform hashing is performed on the null values in column_name in x portions. If no value is specified for the column_name parameter and the amount of data is small, the output is not necessarily uniform. In this case, we recommend that you specify a value for column_name to obtain a uniform output.

    • Random sampling can be performed only on columns of the following data types: BIGINT, DATETIME, BOOLEAN, DOUBLE, STRING, BINARY, CHAR, and VARCHAR.

Return value

A value of the BOOLEAN type is returned.

Examples:

In this example, the mf_sample table is used. This table contains the following data:

+------------+------+------+------+------+
| id         | col1 | col2 | col3 | col4 |
+------------+------+------+------+------+
| 3          | eee  | rrr  | tttt | ggggg |
| 4          | yyy  | uuuu | iiiii | ccccccc |
| 1          | "abc" | "bcd" | "rthg" | "ahgjeog" |
| 2          | "a1bc" | "bc1d" | "rt1hg" | "ahgjeog" |
+------------+------+------+------+------+
  • Randomly hash the data in the table to two portions, and take the first portion.

    select * from mf_sample where sample(2,1);

    The following result is returned:

    +------------+------+------+------+------+
    | id         | col1 | col2 | col3 | col4 |
    +------------+------+------+------+------+
    | 3          | eee  | rrr  | tttt | ggggg |
    | 1          | "abc" | "bcd" | "rthg" | "ahgjeog" |
    +------------+------+------+------+------+
  • Randomly hash the values in the id column of the table to two portions, and take the first portion.

    select * from mf_sample where sample(2,1,id);

    The following result is returned:

    +------------+------+------+------+------+
    | id         | col1 | col2 | col3 | col4 |
    +------------+------+------+------+------+
    | 4          | yyy  | uuuu | iiiii | ccccccc |
    | 2          | "a1bc" | "bc1d" | "rt1hg" | "ahgjeog" |
    +------------+------+------+------+------+

Related functions

For more information, see Other functions.