Use TABLESAMPLE to sample rows from a table without scanning the full dataset. Three sampling methods are available:
TABLESAMPLE (BUCKET x OUT OF y [ON col_name | RAND()])— returns one bucket out of y equal-sized partitionsTABLESAMPLE (n PERCENT)— returns approximately n% of rowsTABLESAMPLE (m ROWS)— returns up to m rows at random
Percentage-based sampling returns approximate results. The actual percentage may differ from the specified value.
Syntax
Bucket-based sampling
TABLESAMPLE (BUCKET <x> OUT OF <y> [ON <col_name> | RAND()])| Parameter | Description |
|---|---|
x | The bucket to sample. Buckets are numbered starting from 1. |
y | The total number of buckets to divide the data into. |
col_name | One or more columns used to assign rows to buckets. Specify up to 10 columns in a single ON clause. Required if the table is not a clustered table. |
RAND() | Assigns rows to buckets randomly instead of by column value. Use this when no specific column distribution is needed. |
For non-clustered tables, you must include either ON col_name or ON RAND().
Percentage-based sampling
TABLESAMPLE (<n> PERCENT)n is the sampling percentage. The result contains approximately n% of the rows in the source table. The percentage is not exact.
Row-based sampling
TABLESAMPLE (<m> ROWS)m is the number of rows to return. If the source table has fewer rows than m, all rows are returned. The maximum value of m is 10,000.
Performance considerations
Clustered tables scan faster. Data in a clustered table is pre-organized into buckets, so bucket-based sampling reads directly from the target bucket without scanning unrelated data.
Choose y values that align with the table's bucket count. For a clustered table with 32 buckets, set
yto a multiple of 32 (such as 32, 64, or 128) or a divisor of 32 (such as 16, 8, or 4) to avoid partial bucket reads.
Sample data
The following examples use two tables.
BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020 — a table from the MaxCompute public datasets. For details, see Overview.
tblsample_test — a clustered table created with the following statements:
-- Create the tblsample_test table.
CREATE TABLE tblsample_test(a bigint, b string, c string)
CLUSTERED BY (a, c) SORTED by (a, c) INTO 32 BUCKETS;
-- Insert data into the table.
INSERT OVERWRITE TABLE tblsample_test VALUES
(1,"b1","c1"),
(2,"b2","c2"),
(3,"b3","c3"),
(4,"b4","c4");
-- Verify the inserted data.
SELECT * FROM tblsample_test;
-- Result:
-- +------------+----+-----+
-- | a | b | c |
-- +------------+----+-----+
-- | 2 | b2 | c2 |
-- | 4 | b4 | c4 |
-- | 3 | b3 | c3 |
-- | 1 | b1 | c1 |
-- +------------+----+-----+Examples
Example 1: Sample by column values
Divide rows into buckets based on column values and return rows from a specific bucket.
-- Sample bucket 1 out of 1,000,000 from the public dataset, bucketed by isp_code, phoneno, and province.
SELECT isp_code,
phoneno,
province
FROM BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020
TABLESAMPLE (BUCKET 1 OUT OF 1000000 ON isp_code, phoneno, province) s;
-- Result:
-- +----------+---------+----------+
-- | isp_code | phoneno | province |
-- +----------+---------+----------+
-- | 185 | 1853500 | Shanxi |
-- | 187 | 1878332 | Sichuan |
-- +----------+---------+----------+Example 2: Sample with random bucket assignment
Use RAND() to assign rows to buckets randomly, then return a specific bucket.
-- Sample bucket 3 out of 500,000 using random bucket assignment.
SELECT isp_code,
phoneno,
province
FROM BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020
TABLESAMPLE (BUCKET 3 OUT OF 500000 ON RAND()) s;
-- Result:
-- +----------+---------+----------+
-- | isp_code | phoneno | province |
-- +----------+---------+----------+
-- | 131 | 1312224 | Shanghai |
-- | 135 | 1353936 | Guangdong |
-- | 158 | 1586377 | Shandong |
-- +----------+---------+----------+Example 3: Sample from a clustered table without specifying ON
For clustered tables, omit the ON clause. Rows are retrieved directly from the pre-organized bucket.
SELECT a, b FROM tblsample_test TABLESAMPLE (BUCKET 1 OUT OF 2) AS ts;
-- Result:
-- +------------+----+
-- | a | b |
-- +------------+----+
-- | 2 | b2 |
-- +------------+----+The tblsample_test table has 32 buckets. Using y = 2 (a divisor of 32) lets MaxCompute retrieve data from exact bucket boundaries, which improves performance. See Performance considerations.
Example 4: Sample by percentage
Return approximately 50% of rows from tblsample_test.
SELECT * FROM tblsample_test TABLESAMPLE (50 PERCENT) s;
-- Result:
-- +------------+----+----+
-- | a | b | c |
-- +------------+----+----+
-- | 2 | b2 | c2 |
-- | 3 | b3 | c3 |
-- +------------+----+----+Example 5: Sample a fixed number of rows
Return 2 rows from tblsample_test.
SELECT * FROM tblsample_test TABLESAMPLE (2 ROWS);
-- Result:
-- +------------+----+----+
-- | a | b | c |
-- +------------+----+----+
-- | 2 | b2 | c2 |
-- | 3 | b3 | c3 |
-- +------------+----+----+