All Products
Search
Document Center

MaxCompute:TABLESAMPLE

Last Updated:Sep 21, 2023

You can use the TABLESAMPLE feature provided by MaxCompute to sample table data. This feature provides three sampling methods: bucket-based sampling, percentage-based sampling, and random sampling. If you use percentage-based sampling, records are returned based on a specified percentage. If you use random sampling, a specified number of random records are returned. This topic describes the syntax of TABLESAMPLE and provides examples on how to use TABLESAMPLE.

Syntax

  • Bucket-based sampling

    TABLESAMPLE (BUCKET <x> OUT OF <y> [ON <col_name> | rand()])

    Parameter description

    • x and y: required. Data in the source table is divided into y buckets that are numbered from 1. Data in the x-th bucket is sampled.

    • col_name: the names of the columns whose data you want to sample. If the table is not a clustered table, the col_name or rand() function must be used. If the rand() function is used, input data is divided into buckets at random. You can specify a maximum of 10 columns in the ON clause.

  • Percentage-based sampling

    TABLESAMPLE (<n> PERCENT)

    In the preceding syntax, n specifies a sampling percentage value. If percentage-based sampling is used, n% of data is sampled. This indicates that the percentage of returned records to all records in the source table is about n%. The percentage value is not absolutely accurate.

  • Random sampling

    TABLESAMPLE (<m> ROWS)

    m specifies the number of records that are returned. If the total number of records in the source table is less than the value of m, all records in the source table are returned. The maximum value of m is 10000.

Sample data

The following sample tables are used:

  • BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020

    The BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020 table is a table in MaxCompute public datasets. For more information about public datasets, see Overview.

  • tblsample_test

    The tblsample_test table is a clustered table. The following CREATE TABLE and INSERT OVERWRITE statements are used to create the table and insert data into the table.

    -- Create the tblsample_test table.
    CREATE TABLE tblsample_test(a bigint, b string, c string) 
      CLUSTERED BY (a, c) SORTED by (a, c) INTO 32 BUCKETS;
    -- Insert data into the table.
    insert overwrite table tblsample_test values(1,"b1","c1"),
        (2,"b2","c2"),
        (3,"b3","c3"),
        (4,"b4","c4");
    -- Query data from the table.
    select * from tblsample_test;
    -- The following result is returned:
    +------------+---+-----+
    | a          | b | c   |
    +------------+---+-----+
    | 2          | b2 | c2 |
    | 4          | b4 | c4 |
    | 3          | b3 | c3 |
    | 1          | b1 | c1 |
    +------------+---+---+

Examples

  • Example 1: Divide input data into buckets based on column values and sample data from a specified bucket.

    -- Sample data from the BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020 table.
    SELECT isp_code,
    			 phoneno,
    			 province
      FROM BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020 
    TABLESAMPLE (BUCKET 1 OUT of 1000000 ON isp_code, 
                          									phoneno, 
                          									province) s;
                                            
    -- The following result is returned:
    +----------+---------+----------+
    | isp_code | phoneno | province |
    +----------+---------+----------+
    | 185      | 1853500 | Shanxi   |
    | 187      | 1878332 | Sichuan   |
    +----------+---------+----------+

  • Example 2: Use the RAND() function to divide input data into buckets at random and sample data from a specified bucket.

    -- Sample data from the BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020 table.
    SELECT isp_code,
    			 phoneno,
           province
      FROM BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020
    TABLESAMPLE (BUCKET 3 OUT OF 500000 ON RAND()) s;
    
    -- The following result is returned:
    +----------+---------+----------+
    | isp_code | phoneno | province |
    +----------+---------+----------+
    | 131      | 1312224 | Shanghai   |
    | 135      | 1353936 | Guangdong   |
    | 158      | 1586377 | Shandong   |
    +----------+---------+----------+

  • Example 3: Sample data from the tblsample_test table without specifying the ON clause.

    select a, b from tblsample_test TABLESAMPLE (BUCKET 1 OUT OF 2) as ts;
    
    -- The following result is returned:
    +------------+---+
    | a          | b |
    +------------+---+
    | 2          | b2|
    +------------+---+
    Note

    Data in the clustered table is stored in buckets. Therefore, data is directly retrieved from a bucket during sampling. This way, sampling performance is improved. The tblsample_test table is divided into 32 buckets when the table is created. When you sample data, sampling performance can be improved if the value of y is a multiple of 32, such as 32, 64, or 128 or 32 is divisible by the value of y, such as 16, 8, or 4.

  • Example 4: Sample n% of data from the tblsample_test table. In this example, the percentage of sampled records to all records in the source table is about n%. The percentage value is not absolutely accurate.

    -- Sample 50% of data from the tblsample_test table.
    SELECT * FROM tblsample_test TABLESAMPLE (50 PERCENT) s;
    
    -- The following result is returned:
    +------------+---+---+
    | a          | b | c |
    +------------+---+---+
    | 2          | b2 | c2 |
    | 3          | b3 | c3 |
    +------------+---+---+

  • Example 5: Sample n records from the tblsample_test table.

    select * FROM tblsample_test TABLESAMPLE (2 ROWS);
    
    -- The following result is returned:
    +------------+---+---+
    | a          | b | c |
    +------------+---+---+
    | 2          | b2 | c2 |
    | 3          | b3 | c3 |
    +------------+---+---+