You can use the TABLESAMPLE feature provided by MaxCompute to sample table data. This feature provides three sampling methods: bucket-based sampling, percentage-based sampling, and random sampling. If you use percentage-based sampling, records are returned based on a specified percentage. If you use random sampling, a specified number of random records are returned. This topic describes the syntax of TABLESAMPLE and provides examples on how to use TABLESAMPLE.
Syntax
Bucket-based sampling
TABLESAMPLE (BUCKET <x> OUT OF <y> [ON <col_name> | rand()])Parameter description
x and y: required. Data in the source table is divided into y buckets that are numbered from 1. Data in the x-th bucket is sampled.
col_name: the names of the columns whose data you want to sample. If the table is not a clustered table, the
col_nameorrand()function must be used. If therand()function is used, input data is divided into buckets at random. You can specify a maximum of 10 columns in theONclause.
Percentage-based sampling
TABLESAMPLE (<n> PERCENT)In the preceding syntax,
nspecifies a sampling percentage value. If percentage-based sampling is used,n%of data is sampled. This indicates that the percentage of returned records to all records in the source table is aboutn%. The percentage value is not absolutely accurate.Random sampling
TABLESAMPLE (<m> ROWS)mspecifies the number of records that are returned. If the total number of records in the source table is less than the value ofm, all records in the source table are returned. The maximum value ofmis 10000.
Sample data
The following sample tables are used:
BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020
The
BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020table is a table in MaxCompute public datasets. For more information about public datasets, see Overview.tblsample_test
The
tblsample_testtable is a clustered table. The following CREATE TABLE and INSERT OVERWRITE statements are used to create the table and insert data into the table.-- Create the tblsample_test table. CREATE TABLE tblsample_test(a bigint, b string, c string) CLUSTERED BY (a, c) SORTED by (a, c) INTO 32 BUCKETS; -- Insert data into the table. insert overwrite table tblsample_test values(1,"b1","c1"), (2,"b2","c2"), (3,"b3","c3"), (4,"b4","c4"); -- Query data from the table. select * from tblsample_test; -- The following result is returned: +------------+---+-----+ | a | b | c | +------------+---+-----+ | 2 | b2 | c2 | | 4 | b4 | c4 | | 3 | b3 | c3 | | 1 | b1 | c1 | +------------+---+---+
Examples
Example 1: Divide input data into buckets based on column values and sample data from a specified bucket.
-- Sample data from the BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020 table. SELECT isp_code, phoneno, province FROM BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020 TABLESAMPLE (BUCKET 1 OUT of 1000000 ON isp_code, phoneno, province) s; -- The following result is returned: +----------+---------+----------+ | isp_code | phoneno | province | +----------+---------+----------+ | 185 | 1853500 | Shanxi | | 187 | 1878332 | Sichuan | +----------+---------+----------+Example 2: Use the
RAND()function to divide input data into buckets at random and sample data from a specified bucket.-- Sample data from the BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020 table. SELECT isp_code, phoneno, province FROM BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020 TABLESAMPLE (BUCKET 3 OUT OF 500000 ON RAND()) s; -- The following result is returned: +----------+---------+----------+ | isp_code | phoneno | province | +----------+---------+----------+ | 131 | 1312224 | Shanghai | | 135 | 1353936 | Guangdong | | 158 | 1586377 | Shandong | +----------+---------+----------+Example 3: Sample data from the tblsample_test table without specifying the
ONclause.select a, b from tblsample_test TABLESAMPLE (BUCKET 1 OUT OF 2) as ts; -- The following result is returned: +------------+---+ | a | b | +------------+---+ | 2 | b2| +------------+---+NoteData in the clustered table is stored in buckets. Therefore, data is directly retrieved from a bucket during sampling. This way, sampling performance is improved. The
tblsample_testtable is divided into 32 buckets when the table is created. When you sample data, sampling performance can be improved if the value ofyis a multiple of 32, such as32, 64, or 128or 32 is divisible by the value of y, such as16, 8, or 4.Example 4: Sample
n%of data from the tblsample_test table. In this example, the percentage of sampled records to all records in the source table is aboutn%. The percentage value is not absolutely accurate.-- Sample 50% of data from the tblsample_test table. SELECT * FROM tblsample_test TABLESAMPLE (50 PERCENT) s; -- The following result is returned: +------------+---+---+ | a | b | c | +------------+---+---+ | 2 | b2 | c2 | | 3 | b3 | c3 | +------------+---+---+Example 5: Sample
nrecords from the tblsample_test table.select * FROM tblsample_test TABLESAMPLE (2 ROWS); -- The following result is returned: +------------+---+---+ | a | b | c | +------------+---+---+ | 2 | b2 | c2 | | 3 | b3 | c3 | +------------+---+---+