All Products
Search
Document Center

MaxCompute:CLUSTER_SAMPLE

Last Updated:Aug 09, 2023

Samples random rows of data. If true is returned, the specified row of data is sampled.

Limits

Before you use window functions, take note of the following limits:

  • Window functions are supported only in SELECT statements.

  • A window function cannot contain nested window functions or nested aggregate functions.

  • You cannot use window functions together with aggregate functions of the same level.

Syntax

boolean cluster_sample(bigint <N>) OVER ([partition_clause])
boolean cluster_sample(bigint <N>, bigint <M>) OVER ([partition_clause])

Description

  • cluster_sample(bigint <N>): specifies that N random rows of data are sampled.

  • cluster_sample(bigint <N>, bigint <M>): specifies that rows are sampled based on a specified ratio (M/N). The number of rows that are sampled is calculated by using the following formula: partition_row_count × M/N. partition_row_count specifies the number of rows in a partition.

Parameters

  • N: required. A constant of the BIGINT type. If N is set to NULL, NULL is returned.

  • M: required. A constant of the BIGINT type. If M is set to NULL, NULL is returned.

  • partition_clause: optional. For more information, see frame_clause.

Return value

The return value is a value of the BOOLEAN type.

Sample data

This section provides sample source data and examples for you to understand how to use the functions. Create a table named emp and insert the sample data into the table. Sample statement:

create table if not exists emp
   (empno bigint,
    ename string,
    job string,
    mgr bigint,
    hiredate datetime,
    sal bigint,
    comm bigint,
    deptno bigint);
tunnel upload emp.txt emp; -- Replace emp.txt with the actual path (path and name) to which you upload the data file.

The emp.txt file contains the following sample data:

7369,SMITH,CLERK,7902,1980-12-17 00:00:00,800,,20
7499,ALLEN,SALESMAN,7698,1981-02-20 00:00:00,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22 00:00:00,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02 00:00:00,2975,,20
7654,MARTIN,SALESMAN,7698,1981-09-28 00:00:00,1250,1400,30
7698,BLAKE,MANAGER,7839,1981-05-01 00:00:00,2850,,30
7782,CLARK,MANAGER,7839,1981-06-09 00:00:00,2450,,10
7788,SCOTT,ANALYST,7566,1987-04-19 00:00:00,3000,,20
7839,KING,PRESIDENT,,1981-11-17 00:00:00,5000,,10
7844,TURNER,SALESMAN,7698,1981-09-08 00:00:00,1500,0,30
7876,ADAMS,CLERK,7788,1987-05-23 00:00:00,1100,,20
7900,JAMES,CLERK,7698,1981-12-03 00:00:00,950,,30
7902,FORD,ANALYST,7566,1981-12-03 00:00:00,3000,,20
7934,MILLER,CLERK,7782,1982-01-23 00:00:00,1300,,10
7948,JACCKA,CLERK,7782,1981-04-12 00:00:00,5000,,10
7956,WELAN,CLERK,7649,1982-07-20 00:00:00,2450,,10
7956,TEBAGE,CLERK,7748,1982-12-30 00:00:00,1300,,10

Examples

  • Sample five data entries. Sample statement:

    select deptno, sal    
    		from (        
          select deptno, sal, cluster_sample(5) over () as flag
          from emp
          ) sub 
         where flag = true;

    The following result is returned:

    +------------+------------+
    | deptno     | sal        |
    +------------+------------+
    | 30         | 2850       |
    | 30         | 1500       |
    | 20         | 3000       |
    | 10         | 2450       |
    | 10         | 1300       |
    +------------+------------+
  • Sample about 20% of data entries. Sample statement:

    select deptno, sal
        from (
            select deptno, sal, cluster_sample(5, 1) over () as flag
            from emp
            ) sub
        where flag = true;

    The following result is returned:

    +------------+------------+
    | deptno     | sal        |
    +------------+------------+
    | 10         | 2450       |
    | 20         | 3000       |
    | 20         | 3000       |
    +------------+------------+
  • Sample about 20% of data entries in each group. Sample statement:

    select deptno, sal
        from (
            select deptno, sal, cluster_sample(5, 1) over (partition by deptno) as flag
            from emp
            ) sub
        where flag = true;

    The following result is returned:

    +------------+------------+
    | deptno     | sal        |
    +------------+------------+
    | 10         | 1300       |
    | 20         | 2975       |
    | 30         | 950        |
    +------------+------------+

Related functions

CLUSTER_SAMPLE is a window function. For more information about functions related to sum calculation and data sorting in a specified window column, see Window functions.