You can use the TABLESAMPLE feature provided by MaxCompute to sample table data. This feature provides three sampling methods: bucket-based sampling, percentage-based sampling, and random sampling. If you use percentage-based sampling, records are returned based on a specified percentage. If you use random sampling, a specified number of random records are returned. This topic describes the syntax of TABLESAMPLE and provides examples on how to use TABLESAMPLE.
Syntax
Bucket-based sampling
TABLESAMPLE (BUCKET <x> OUT OF <y> [ON <col_name> | rand()])
Parameter description
x and y: required. Data in the source table is divided into y buckets that are numbered from 1. Data in the x-th bucket is sampled.
col_name: the names of the columns whose data you want to sample. If the table is not a clustered table, the
col_name
orrand()
function must be used. If therand()
function is used, input data is divided into buckets at random. You can specify a maximum of 10 columns in theON
clause.
Percentage-based sampling
TABLESAMPLE (<n> PERCENT)
In the preceding syntax,
n
specifies a sampling percentage value. If percentage-based sampling is used,n%
of data is sampled. This indicates that the percentage of returned records to all records in the source table is aboutn%
. The percentage value is not absolutely accurate.Random sampling
TABLESAMPLE (<m> ROWS)
m
specifies the number of records that are returned. If the total number of records in the source table is less than the value ofm
, all records in the source table are returned. The maximum value ofm
is 10000.
Sample data
The following sample tables are used:
BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020
The
BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020
table is a table in MaxCompute public datasets. For more information about public datasets, see Overview.tblsample_test
The
tblsample_test
table is a clustered table. The following CREATE TABLE and INSERT OVERWRITE statements are used to create the table and insert data into the table.-- Create the tblsample_test table. CREATE TABLE tblsample_test(a bigint, b string, c string) CLUSTERED BY (a, c) SORTED by (a, c) INTO 32 BUCKETS; -- Insert data into the table. insert overwrite table tblsample_test values(1,"b1","c1"), (2,"b2","c2"), (3,"b3","c3"), (4,"b4","c4"); -- Query data from the table. select * from tblsample_test; -- The following result is returned: +------------+---+-----+ | a | b | c | +------------+---+-----+ | 2 | b2 | c2 | | 4 | b4 | c4 | | 3 | b3 | c3 | | 1 | b1 | c1 | +------------+---+---+
Examples
Example 1: Divide input data into buckets based on column values and sample data from a specified bucket.
-- Sample data from the BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020 table. SELECT isp_code, phoneno, province FROM BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020 TABLESAMPLE (BUCKET 1 OUT of 1000000 ON isp_code, phoneno, province) s; -- The following result is returned: +----------+---------+----------+ | isp_code | phoneno | province | +----------+---------+----------+ | 185 | 1853500 | Shanxi | | 187 | 1878332 | Sichuan | +----------+---------+----------+
Example 2: Use the
RAND()
function to divide input data into buckets at random and sample data from a specified bucket.-- Sample data from the BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020 table. SELECT isp_code, phoneno, province FROM BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020 TABLESAMPLE (BUCKET 3 OUT OF 500000 ON RAND()) s; -- The following result is returned: +----------+---------+----------+ | isp_code | phoneno | province | +----------+---------+----------+ | 131 | 1312224 | Shanghai | | 135 | 1353936 | Guangdong | | 158 | 1586377 | Shandong | +----------+---------+----------+
Example 3: Sample data from the tblsample_test table without specifying the
ON
clause.select a, b from tblsample_test TABLESAMPLE (BUCKET 1 OUT OF 2) as ts; -- The following result is returned: +------------+---+ | a | b | +------------+---+ | 2 | b2| +------------+---+
NoteData in the clustered table is stored in buckets. Therefore, data is directly retrieved from a bucket during sampling. This way, sampling performance is improved. The
tblsample_test
table is divided into 32 buckets when the table is created. When you sample data, sampling performance can be improved if the value ofy
is a multiple of 32, such as32, 64, or 128
or 32 is divisible by the value of y, such as16, 8, or 4
.Example 4: Sample
n%
of data from the tblsample_test table. In this example, the percentage of sampled records to all records in the source table is aboutn%
. The percentage value is not absolutely accurate.-- Sample 50% of data from the tblsample_test table. SELECT * FROM tblsample_test TABLESAMPLE (50 PERCENT) s; -- The following result is returned: +------------+---+---+ | a | b | c | +------------+---+---+ | 2 | b2 | c2 | | 3 | b3 | c3 | +------------+---+---+
Example 5: Sample
n
records from the tblsample_test table.select * FROM tblsample_test TABLESAMPLE (2 ROWS); -- The following result is returned: +------------+---+---+ | a | b | c | +------------+---+---+ | 2 | b2 | c2 | | 3 | b3 | c3 | +------------+---+---+