All Products
Search
Document Center

MaxCompute:TABLESAMPLE

Last Updated:Mar 25, 2026

Use TABLESAMPLE to sample rows from a table without scanning the full dataset. Three sampling methods are available:

  • TABLESAMPLE (BUCKET x OUT OF y [ON col_name | RAND()]) — returns one bucket out of y equal-sized partitions

  • TABLESAMPLE (n PERCENT) — returns approximately n% of rows

  • TABLESAMPLE (m ROWS) — returns up to m rows at random

Percentage-based sampling returns approximate results. The actual percentage may differ from the specified value.

Syntax

Bucket-based sampling

TABLESAMPLE (BUCKET <x> OUT OF <y> [ON <col_name> | RAND()])
ParameterDescription
xThe bucket to sample. Buckets are numbered starting from 1.
yThe total number of buckets to divide the data into.
col_nameOne or more columns used to assign rows to buckets. Specify up to 10 columns in a single ON clause. Required if the table is not a clustered table.
RAND()Assigns rows to buckets randomly instead of by column value. Use this when no specific column distribution is needed.

For non-clustered tables, you must include either ON col_name or ON RAND().

Percentage-based sampling

TABLESAMPLE (<n> PERCENT)

n is the sampling percentage. The result contains approximately n% of the rows in the source table. The percentage is not exact.

Row-based sampling

TABLESAMPLE (<m> ROWS)

m is the number of rows to return. If the source table has fewer rows than m, all rows are returned. The maximum value of m is 10,000.

Performance considerations

  • Clustered tables scan faster. Data in a clustered table is pre-organized into buckets, so bucket-based sampling reads directly from the target bucket without scanning unrelated data.

  • Choose y values that align with the table's bucket count. For a clustered table with 32 buckets, set y to a multiple of 32 (such as 32, 64, or 128) or a divisor of 32 (such as 16, 8, or 4) to avoid partial bucket reads.

Sample data

The following examples use two tables.

BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020 — a table from the MaxCompute public datasets. For details, see Overview.

tblsample_test — a clustered table created with the following statements:

-- Create the tblsample_test table.
CREATE TABLE tblsample_test(a bigint, b string, c string)
  CLUSTERED BY (a, c) SORTED by (a, c) INTO 32 BUCKETS;

-- Insert data into the table.
INSERT OVERWRITE TABLE tblsample_test VALUES
  (1,"b1","c1"),
  (2,"b2","c2"),
  (3,"b3","c3"),
  (4,"b4","c4");

-- Verify the inserted data.
SELECT * FROM tblsample_test;
-- Result:
-- +------------+----+-----+
-- | a          | b  | c   |
-- +------------+----+-----+
-- | 2          | b2 | c2  |
-- | 4          | b4 | c4  |
-- | 3          | b3 | c3  |
-- | 1          | b1 | c1  |
-- +------------+----+-----+

Examples

Example 1: Sample by column values

Divide rows into buckets based on column values and return rows from a specific bucket.

-- Sample bucket 1 out of 1,000,000 from the public dataset, bucketed by isp_code, phoneno, and province.
SELECT isp_code,
       phoneno,
       province
  FROM BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020
TABLESAMPLE (BUCKET 1 OUT OF 1000000 ON isp_code, phoneno, province) s;

-- Result:
-- +----------+---------+----------+
-- | isp_code | phoneno | province |
-- +----------+---------+----------+
-- | 185      | 1853500 | Shanxi   |
-- | 187      | 1878332 | Sichuan  |
-- +----------+---------+----------+

Example 2: Sample with random bucket assignment

Use RAND() to assign rows to buckets randomly, then return a specific bucket.

-- Sample bucket 3 out of 500,000 using random bucket assignment.
SELECT isp_code,
       phoneno,
       province
  FROM BIGDATA_PUBLIC_DATASET.life_service.phoneno_basic_info_2020
TABLESAMPLE (BUCKET 3 OUT OF 500000 ON RAND()) s;

-- Result:
-- +----------+---------+----------+
-- | isp_code | phoneno | province |
-- +----------+---------+----------+
-- | 131      | 1312224 | Shanghai  |
-- | 135      | 1353936 | Guangdong |
-- | 158      | 1586377 | Shandong  |
-- +----------+---------+----------+

Example 3: Sample from a clustered table without specifying ON

For clustered tables, omit the ON clause. Rows are retrieved directly from the pre-organized bucket.

SELECT a, b FROM tblsample_test TABLESAMPLE (BUCKET 1 OUT OF 2) AS ts;

-- Result:
-- +------------+----+
-- | a          | b  |
-- +------------+----+
-- | 2          | b2 |
-- +------------+----+
Note

The tblsample_test table has 32 buckets. Using y = 2 (a divisor of 32) lets MaxCompute retrieve data from exact bucket boundaries, which improves performance. See Performance considerations.

Example 4: Sample by percentage

Return approximately 50% of rows from tblsample_test.

SELECT * FROM tblsample_test TABLESAMPLE (50 PERCENT) s;

-- Result:
-- +------------+----+----+
-- | a          | b  | c  |
-- +------------+----+----+
-- | 2          | b2 | c2 |
-- | 3          | b3 | c3 |
-- +------------+----+----+

Example 5: Sample a fixed number of rows

Return 2 rows from tblsample_test.

SELECT * FROM tblsample_test TABLESAMPLE (2 ROWS);

-- Result:
-- +------------+----+----+
-- | a          | b  | c  |
-- +------------+----+----+
-- | 2          | b2 | c2 |
-- | 3          | b3 | c3 |
-- +------------+----+----+