
MaxCompute:Range clustering

Last Updated:Oct 17, 2023

Range clustering is a new data clustering method that distributes data in a globally sorted order. Range clustering can prevent data skew issues that may be caused by hash clustering. Range clustering also allows you to create two levels of indexes. Range clustering is suitable for scenarios such as range queries based on cluster keys and multi-key queries. This topic describes how to use range clustering in MaxCompute.

Background Information

Hash-clustered tables have the following advantages:

  • If you want to query data based on a specific column value, you can use the hash algorithm to directly locate the hash bucket. This process is called bucket pruning. If data in the bucket is stored in a sorted manner, you can further use indexes to locate the data. This reduces the amount of scanned data and improves query efficiency.

  • If you join two tables by a specific column in each table and the column in one of the tables is hashed, the shuffle step can be removed. This helps save computing resources.

For more information about the hash clustering feature, see Hash clustering.

Hash clustering has the following limits:

  • If you use the hash algorithm to create buckets, data skew issues may occur. Similar to join skew issues, data skew issues are inherent in the hash algorithm. If input data is unevenly distributed across buckets, data skew issues may occur. As a result, the amount of data in buckets greatly differs. If you use hash clustering, each bucket is a unit for concurrent processing in most cases. If the amount of data in buckets is different, long tails are likely to occur.

  • Bucket pruning supports only equality queries. If you want to query data based on non-equality conditions, such as the condition that the values of a column are greater than 0, the buckets in which the queried data is stored cannot be located. In this case, you must query data from all the buckets.

  • For queries based on multiple cluster keys, the query performance can be improved only if all cluster keys are available and all the query conditions are equality conditions.

    For example, assume that you want to query data from the table that is created by using the following statement. The query performance can be improved only if C1=x AND C2=y is used as the query condition. If you use only C1=x or only C2=y as the query condition, the query cannot be accelerated based on hash clustering. This is because the hash value is computed over the combination of all cluster keys. If only some of the keys are specified, the bucket in which the queried data is stored cannot be located, and bucket pruning cannot be performed.

    CREATE TABLE T2 (C1 int, C2 int, C3 string)
       CLUSTERED BY (C1, C2)
       SORTED BY (C1, C2)
       INTO 1024 BUCKETS;
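To illustrate why both keys are needed, the following is a minimal Python sketch. Python's built-in `hash` only stands in for MaxCompute's internal hash function, and the bucket count is taken from the statement above:

```python
NUM_BUCKETS = 1024

def bucket_of(c1, c2):
    # Hash clustering hashes the combination of all cluster keys;
    # Python's hash() is only a stand-in for MaxCompute's hash function.
    return hash((c1, c2)) % NUM_BUCKETS

# With C1 = x AND C2 = y both known, the target bucket is computable,
# so bucket pruning works. With only C1 = x, hash((x, ?)) is
# undetermined, so every bucket must be scanned.
```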

To address these limitations, MaxCompute provides a new data clustering method called range clustering.

Feature description

Range clustering divides data into multiple disjoint ranges based on a full sort of the cluster keys. Each range is treated as a bucket, and the bucketing must meet both of the following conditions:

  • Duplicate values are stored in the same bucket.

  • The number of values in each bucket is approximately equal.

The following sample statement creates a table named T.

CREATE TABLE T (C1 int) 
    RANGE CLUSTERED BY (C1) 
    SORTED BY (C1)
    INTO 3 BUCKETS;

The values in the C1 column are { 1, 8, -3, 2, 4, 1, 1, 3, 8, 20, -8, 9 }.

After range clustering is enabled, the following buckets are obtained.

  • Bucket 0 : { -8, -3, 1, 1, 1 }

  • Bucket 1 : { 2, 3, 4 }

  • Bucket 2 : { 8, 8, 9, 20 }

Note
  • The ranges represented by buckets may be non-contiguous. For example, the range of Bucket 1 is [2, 4] and the range of Bucket 2 is [8, 20]. No values fall in the range (4, 8).

  • Range clustering aims to make bucket sizes approximate rather than making range sizes approximate. When you compute data, each bucket is a unit for concurrent processing. The same bucket size prevents long tail issues. However, data distribution in each range may not be the same. Therefore, consistent bucket sizes do not represent consistent range sizes.
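The bucketing of the sample data above can be reproduced with a simplified greedy sketch. MaxCompute derives the ranges from sampling and histograms, so this is an illustration of the two conditions, not the actual algorithm:

```python
def range_cluster(values, num_buckets):
    """Sort the data, then cut it at roughly equal-size positions,
    moving each cut forward so duplicate values stay in one bucket."""
    data = sorted(values)
    n = len(data)
    buckets, start = [], 0
    for b in range(1, num_buckets):
        cut = round(b * n / num_buckets)
        while 0 < cut < n and data[cut] == data[cut - 1]:
            cut += 1  # keep duplicates together
        buckets.append(data[start:cut])
        start = cut
    buckets.append(data[start:])
    return buckets

buckets = range_cluster([1, 8, -3, 2, 4, 1, 1, 3, 8, 20, -8, 9], 3)
# buckets == [[-8, -3, 1, 1, 1], [2, 3, 4], [8, 8, 9, 20]]
```

The output matches the three buckets shown above: duplicates never straddle a cut, and bucket sizes are as close to equal as the duplicates allow.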

The range clustering process is automatically implemented by MaxCompute. You do not need to manually specify each range. In big data scenarios, manual configuration of ranges is not efficient or feasible. MaxCompute automatically sorts and samples data, creates a histogram based on the data distribution of each range, and then combines and calculates the histogram of each range. This way, MaxCompute achieves the optimal performance of range clustering.

When you create a table, you can specify both RANGE CLUSTERED BY and SORTED BY to ensure that data is globally sorted. Then, MaxCompute automatically creates two levels of indexes: a global index and file indexes. The indexes are used to quickly locate and search for key values, as shown in the following figure.

Range clustering has the following advantages over hash clustering:

  • Range queries are supported.

    For example, if the query condition is c < 3, the system can exclude Bucket 2 and Bucket 3 based on the global index, and then query data from Bucket 0 and Bucket 1. For hash clustering, bucket pruning can be performed only for equality queries.

  • Multi-key queries are supported.

    For example, if you specify RANGE CLUSTERED BY (c1, c2, c3) SORTED BY (c1, c2, c3) when you create a table, range clustering and data storage are implemented in the order of c1, c2, and c3. This way, you can query the table data based on complex conditions, such as c1 = 100 AND c2 > 0 or c1 = 100 AND c2 = 50 AND c3 < 5. This type of query cannot be implemented based on hash clustering.

    Important

    For a query based on multiple keys, the keys in the query condition must follow the declared order of the cluster keys, and only the last key in the condition can specify a value range.

  • Range clustering provides an efficient implementation of global sorting.

    Before range clustering is used, MaxCompute can use only one instance to globally sort data, which results in low efficiency. After range clustering is used, data in each range can be concurrently sorted and then combined, which significantly improves efficiency.
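The multi-key lookup described above can be modeled with sorted tuples and binary search. This sketch uses Python's `bisect` on hypothetical (c1, c2, c3) rows; it only shows why a fixed prefix plus a range on the next key maps to one contiguous span of globally sorted data:

```python
import bisect

# hypothetical rows, stored sorted by (c1, c2, c3)
rows = sorted([(100, -1, 9), (100, 5, 2), (100, 7, 1), (200, 3, 4)])

# c1 = 100 AND c2 > 0: equality on the prefix, range on the next key
INF = float("inf")
lo = bisect.bisect_right(rows, (100, 0, INF))
hi = bisect.bisect_right(rows, (100, INF, INF))
matches = rows[lo:hi]
# matches == [(100, 5, 2), (100, 7, 1)]
```

Because the data is sorted lexicographically, all qualifying rows are contiguous, which is what lets the two levels of indexes narrow the scan.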

Usage notes

The syntax of range clustering is similar to that of hash clustering. The differences are that the RANGE keyword is added and that the number of buckets is optional for range clustering.

Create a range-clustered table

You can use the CREATE TABLE statement to create a range-clustered table. In this statement, you must specify the RANGE CLUSTERED BY parameter. The INTO number_of_buckets BUCKETS and SORTED BY parameters are optional. In most cases, we recommend that you specify the same columns in SORTED BY and RANGE CLUSTERED BY to achieve the best optimization results.

  • Syntax

    CREATE TABLE [IF NOT EXISTS] <table_name>
                 [(<col_name> data_type [comment <col_comment>], ...)]
                 [comment table_comment]
                 [PARTITIONED BY (<col_name> data_type [comment <col_comment>], ...)]
                 [RANGE CLUSTERED BY (<col_name> [, <col_name>, ...])
                 [SORTED BY (<col_name> [ASC | DESC]
                 [, <col_name> [ASC | DESC] ...])]
                 [INTO <number_of_buckets> BUCKETS]]
                 [AS select_statement]
  • Examples

    • Non-partitioned table

      CREATE TABLE T1 (a string, b string, c int)
                   RANGE CLUSTERED BY (c)
                   SORTED BY (c)
                   INTO 1024 BUCKETS;
    • Partitioned table

      CREATE TABLE T1 (a string, b string, c int)
                   PARTITIONED BY (dt int)
                   RANGE CLUSTERED BY (c)
                   SORTED BY (c)
                   INTO 1024 BUCKETS;
  • Parameters

    • RANGE CLUSTERED BY

      Specifies the keys for range clustering. After you specify this parameter, MaxCompute sorts and samples the data in the specified columns and divides it into appropriate ranges based on the number of buckets that you specify. To prevent data skew and hot spots and to improve the performance of concurrent queries, we recommend that you specify columns that have large value ranges and few duplicate values in RANGE CLUSTERED BY. To optimize query performance, we recommend that you specify commonly used aggregate keys or filter keys in RANGE CLUSTERED BY.

    • SORTED BY

      Specifies how to sort fields in a bucket. To achieve optimal query performance, we recommend that you specify the same values in SORTED BY and RANGE CLUSTERED BY. After you specify SORTED BY, MaxCompute automatically generates a global index and file indexes and uses the indexes to accelerate queries.

    • INTO number_of_buckets BUCKETS

      Unlike hash clustering, INTO number_of_buckets BUCKETS is optional for range clustering. If you do not specify this parameter, MaxCompute automatically determines the number of buckets based on the amount of data. In most cases, we recommend that you specify the number of buckets based on actual situations.

      Similar to hash clustering, we recommend that you set the number of buckets based on the bucket size (512 MB to 1 GB) for range clustering. For excessively large tables, a large number of buckets are required. However, we recommend that the number of buckets in a table does not exceed 4,000.
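The sizing rule above can be expressed as a small calculation. The 768 MB midpoint and the helper below are illustrative assumptions, not a MaxCompute API:

```python
def recommended_buckets(table_bytes, target_bucket_bytes=768 * 1024**2):
    """Aim for roughly 512 MB-1 GB per bucket (768 MB midpoint assumed),
    capped at the recommended maximum of 4,000 buckets."""
    return max(1, min(4000, round(table_bytes / target_bucket_bytes)))

recommended_buckets(1024**4)  # 1 TB -> 1365 buckets
```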

Change the range clustering properties of a table

For partitioned tables, you can execute the ALTER TABLE statement to add or remove the range clustering properties.

  • Syntax

    -- Change a table to a range-clustered table.
    ALTER TABLE <table_name> [RANGE CLUSTERED BY (<col_name> [, <col_name>, ...])
                             [SORTED BY (<col_name> [ASC | DESC] [, <col_name> [ASC | DESC] ...])]
                             [INTO <number_of_buckets> BUCKETS];
    -- Change a range-clustered table to a non-range-clustered table.
    ALTER TABLE <table_name> NOT CLUSTERED;
  • Usage notes

    • The ALTER TABLE statement can only modify the clustering properties of a partitioned table. For a non-partitioned table, the clustering properties cannot be changed after the clustering properties are added to the table.

    • The ALTER TABLE statement takes effect only for the new partitions of a table, which include the new partitions generated by using the INSERT OVERWRITE statement. New partitions are stored based on the clustering properties. The storage formats of existing partitions remain unchanged.

    • The ALTER TABLE statement takes effect only for the new partitions of a table. Therefore, you cannot specify partitions in this statement.

The ALTER TABLE statement is suitable for existing tables. After the range clustering properties are added, new partitions are stored based on the range clustering properties.

Explicitly verify table properties

After you create a range-clustered table, you can execute the following statement to view the table properties. The range clustering properties are displayed in Extended Info of the returned result.

DESC EXTENDED <table_name>;

The following figure shows an example of the returned result. For a partitioned table, you can also execute the following statement to view the clustering properties of a partition.

DESC EXTENDED <table_name> partition(<pt_spec>);

The following figure shows an example of the returned result.

Scenarios

Optimization of queries by filtering

If range clustering is enabled for a table, data in the table is globally sorted. MaxCompute automatically creates a global index and file indexes based on the sorted data. This improves the efficiency of data filtering based on the data storage characteristics. You can use range clustering to optimize equality queries and range queries.

For example, for a simple query condition id < 3, the optimizer extracts the condition and converts it into the value range (-∞, 3). The system can then use the global index for bucket pruning to exclude Bucket 2 and Bucket 3, whose data is not within this value range, and use the index of each file in Bucket 0 and Bucket 1 to quickly locate the data. This process is called predicate pushdown, as shown in the following figure.

The following sample statement is a TPC-H Query 6 statement that queries a 100 GB dataset on which range clustering is performed. TPC-H Query 6 performs an aggregate operation based on range filtering, so range clustering can use the two levels of indexes to quickly locate the data. This significantly reduces the query execution duration and the consumed CPU and memory resources.

select sum(l_extendedprice * l_discount) as revenue
  from tpch_lineitem l
 where l_shipdate >= '1994-01-01'
   and l_shipdate < '1995-01-01'
   and l_discount >= 0.05
   and l_discount <= 0.07
   and l_quantity < 24;
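The pruning step described above can be sketched as an interval-overlap test against a global index. The index contents below are hypothetical (bucket ID mapped to the minimum and maximum key of the bucket):

```python
def prune_buckets(global_index, lo, hi):
    """Keep only buckets whose [min, max] key range overlaps the
    open query range (lo, hi); all other buckets are pruned."""
    return [b for b, (bmin, bmax) in sorted(global_index.items())
            if bmax > lo and bmin < hi]

# hypothetical global index: bucket id -> (min key, max key)
index = {0: (-8, 1), 1: (2, 4), 2: (8, 20)}

# the predicate id < 3 becomes the range (-inf, 3)
candidates = prune_buckets(index, float("-inf"), 3)
# candidates == [0, 1]; Bucket 2 is pruned
```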

Multi-key queries

In this example, the following statement is used to change the mf_tab table to a range-clustered table. This helps you better understand multi-key queries.

ALTER TABLE mf_project.mf_tab 
            RANGE CLUSTERED BY (project_name, name)
            SORTED BY (project_name, name)
            INTO 1024 BUCKETS;

After the table is changed to a range-clustered table, you can perform aggregate queries at the project level. Sample statement:

SELECT COUNT(*)
  from mf_project.mf_tab
 WHERE project_name="xxxdw"
   AND ds="20180115"
   AND type="TABLE";

You can also use multiple keys to precisely locate a table. Sample statement:

SELECT count(*)
  from mf_project.mf_tab
 WHERE project_name="xxxdw"
   AND name="adm_ctu_cle_kba_midun_trade_dd"
   AND type="TABLE";

You can also use multiple keys for range queries. The following statement queries the tables whose names start with adm.

SELECT count(*)
  from mf_project.mf_tab
 WHERE project_name="xxxdw"
   AND name>="adm"
   AND name < "adn"
   AND type="TABLE";

All of the preceding queries can fully utilize the global sorting feature of range clustering and perform predicate pushdown to reduce the number of I/O operations for table scanning and save CPU and memory resources that are consumed for data filtering and computing.

If multiple keys are used for range clustering, specific requirements must be met. For RANGE CLUSTERED BY k0, k1, ..., kn in a table creation statement, if km is used for data queries, k0, k1, ..., km-1 must all be specified in conditions and all the conditions must be equality conditions. This way, optimal performance of index-based query acceleration can be achieved.

For example, k1, k2 are cluster keys in a table named T.

  • If the query condition is k1 < 5, index-based query acceleration can be achieved.

  • If the query condition is k1 = 10 AND k2 = 20, index-based query acceleration can be achieved.

  • If the query condition is k1 = 10 AND k2 < 0, index-based query acceleration can be achieved.

  • If the query condition is k2 < 0, index-based query acceleration cannot be achieved. This is because k1 is not specified in the query condition.

  • If the query condition is k1 < 0 AND k2 > 0, index-based query acceleration can be used to obtain the data that meets the condition k1 < 0. For the data that meets the condition k2 > 0, table scanning is required.
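These rules can be captured in a small predicate. The helper below illustrates the prefix rule only; it is not MaxCompute's actual planner logic:

```python
def can_use_index(cluster_keys, eq_keys, range_key=None):
    """Index acceleration applies if the equality keys form a prefix of
    the cluster keys and the optional range key is the next cluster key."""
    prefix = list(cluster_keys[:len(eq_keys)])
    if prefix != list(eq_keys):
        return False
    if range_key is not None:
        nxt = len(eq_keys)
        if nxt >= len(cluster_keys) or cluster_keys[nxt] != range_key:
            return False
    return True

keys = ["k1", "k2"]
can_use_index(keys, [], "k1")       # k1 < 5          -> True
can_use_index(keys, ["k1"], "k2")   # k1=10 AND k2<0  -> True
can_use_index(keys, [], "k2")       # k2 < 0          -> False
```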

GROUP BY optimization

If range clustering is enabled for a table, data in the table is globally sorted. The keys with the same values are placed in the same bucket during range clustering. This physical property of the data can be used to remove the shuffle step during aggregate operations.

For example, a table named T is created by using the following CREATE TABLE statement. If you query data from the table, you can perform the GROUP BY operation on the table data in a map stage.

CREATE TABLE T (department int, team string, employee string)
       RANGE CLUSTERED BY (department, team)
       SORTED BY (department, team)
       INTO 1024 BUCKETS;

SELECT COUNT(*) from T GROUP BY department, team;
Note

To achieve optimal performance of GROUP BY, you must specify the same keys in GROUP BY and RANGE CLUSTERED BY.
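Because range clustering guarantees that all rows for a given (department, team) key live in one bucket, each bucket can be aggregated independently. A Python sketch with hypothetical bucket contents:

```python
from collections import Counter

# hypothetical bucket contents: rows of (department, team);
# equal keys never span buckets in a range-clustered table
bucket0 = [(1, "a"), (1, "a"), (1, "b")]
bucket1 = [(2, "a"), (3, "c")]

# COUNT(*) per key finishes inside the map stage, bucket by bucket;
# no shuffle is needed because the buckets' key sets are disjoint
partials = [Counter(rows) for rows in (bucket0, bucket1)]
result = {k: c for p in partials for k, c in p.items()}
# result == {(1, "a"): 2, (1, "b"): 1, (2, "a"): 1, (3, "c"): 1}
```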

Aggregate optimization

The following statement shows the data structure of the table foo.

create table foo(a bigint, b bigint, c bigint)
       range clustered by (a,b)
       sorted by(a,b) into 3 buckets;

Data stored in buckets of the table foo falls in the following ranges:

Bucket 0: [1,1 : 3,3]
Bucket 1: [5,5 : 7,7]
Bucket 2: [8,8 : 9,9]

The preceding bucket ranges are specified in the format of Bucket N: [lower bound values : upper bound values]. If data is aggregated by Column a, the following bucket ranges are used instead:

Bucket 0: [1 : 3]
Bucket 1: [5 : 7]
Bucket 2: [8 : 9]

You can directly generate an execution plan for aggregate operations by Column a and Column b, start three instances to aggregate data in each bucket, and then return the output result.

However, if the values of Column a are distributed across multiple buckets, an invalid result is returned. Example:

Bucket 0: [1,1 : 3,3]
Bucket 1: [3,5 : 7,7]
Bucket 2: [7,8 : 9,9]

Column a has two values, 3 and 7, that are each stored in two buckets. To obtain a valid result, you must place the tuples that have the same value in Column a into the same instance and then aggregate the tuples. To do this, buckets are re-created, as shown in the following figure. The spacing between two red dashed lines specifies the range of data that each instance reads.

Histograms are required for this re-bucketing. For a range-clustered table whose cluster keys and sort keys are the same, the worker that corresponds to each bucket samples one tuple for every 10,000 rows when you insert data into the table and records the values of the cluster keys. The sampled values are saved in a histogram, and the histogram of each bucket is stored in cluster metadata files. This type of histogram is called an equi-depth histogram.

Note

Tuples are sampled only if cluster keys and sort keys are the same.

After the histogram of each bucket is obtained, you can re-create buckets for each worker based on the following rules:

  • The tuples that have the same grouping key are stored in the same bucket.

  • Data is evenly distributed across buckets.

Based on the lower bound values of each new bucket, each worker can read data in a valid range and return a valid result.
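The re-bucketing can be sketched by regrouping the overlapping buckets on the grouping key. The (a, b) tuples below are hypothetical and mirror the overlapping ranges shown above, where a=3 and a=7 straddle buckets:

```python
from itertools import groupby
from operator import itemgetter

# hypothetical (a, b) tuples; the values a=3 and a=7 straddle buckets
bucket0 = [(1, 1), (2, 4), (3, 3)]
bucket1 = [(3, 5), (5, 2), (7, 7)]
bucket2 = [(7, 8), (8, 1), (9, 9)]

# regroup so every value of column a lands in exactly one group;
# each group can then be aggregated by a single instance
all_rows = sorted(bucket0 + bucket1 + bucket2)
counts = {a: len(list(rows))
          for a, rows in groupby(all_rows, key=itemgetter(0))}
# counts == {1: 1, 2: 1, 3: 2, 5: 1, 7: 2, 8: 1, 9: 1}
```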

The following content uses the partsupp table with 1 TB of data in a TPC-H dataset to test performance improvements. Execute the following statement to transform the partsupp table into a range-clustered table:

CREATE TABLE partsupp ( PS_PARTKEY BIGINT NOT NULL,
                        PS_SUPPKEY BIGINT NOT NULL,
                        PS_AVAILQTY BIGINT NOT NULL,
                        PS_SUPPLYCOST  DECIMAL(15,2)  NOT NULL,
                        PS_COMMENT     VARCHAR(199) NOT NULL)
             RANGE CLUSTERED BY(PS_PARTKEY, PS_SUPPKEY)
             SORTED BY(PS_PARTKEY, PS_SUPPKEY) INTO 128 BUCKETS;

Execute the following query statement to perform testing:

SELECT ps_partkey, count(*) c FROM partsupp GROUP BY ps_partkey;
  • Run the following command to disable optimization:

    set odps.optimizer.enable.range.partial.repartitioning=false;

    Sample output:

  • Run the following command to enable optimization:

    set odps.optimizer.enable.range.partial.repartitioning=true;

    Sample output:

The test result indicates that the query speed is improved by 57%, the CPU utilization is decreased by 52%, and the memory usage is decreased by 71% after the optimization is enabled. Performance improvements vary based on the amount of data and query types.

Join optimization of range-clustered tables

  • In this example, create two tables by using the following statements:

    create table t1(a bigint, b bigint, c bigint, d bigint)
           range clustered by(a,b,c)
           sorted by(a,b,c) into 3 buckets;
    
    create table t2(a bigint, b bigint, c bigint, d bigint)
           range clustered by(a,b,c)
           sorted by(a,b,c) into 3 buckets;

    Then, insert different data into the two tables.

    For two hash-clustered tables that need to be joined, if the number of buckets is the same for the two tables, the data in the buckets of the tables can be joined directly. However, this rule does not apply to range-clustered tables. For two range-clustered tables that need to be joined, data in the buckets cannot be directly joined based on bucket IDs even if the number of buckets is the same, because the boundaries of the buckets in the two tables may differ. If you join two range-clustered tables, an execution plan that has the shuffle step is always generated, as shown in the following figure.

    To optimize joins between two range-clustered tables, MaxCompute re-creates buckets for the two tables by aligning their boundaries. This way, the boundaries of the data that each instance reads are redefined.

  • Create two tables.

    create table t1(a bigint, b bigint, c bigint, d bigint)
           range clustered by(a,b,c)
           sorted by(a,b,c) into 5 buckets;
    
    create table t2(a bigint, b bigint, c bigint, d bigint)
           range clustered by(a,b,c)
           sorted by(a,b,c) into 3 buckets;

    After you insert a certain amount of data into the tables, the bucket boundaries are defined, as shown in the following figure.

    Sample query 1:

    SELECT * FROM t1 JOIN t2 ON t1.a=t2.a AND t1.b=t2.b AND t1.c=t2.c;

    The optimizer aligns the boundaries of the table that has more buckets with the boundaries of the other table, and obtains a new boundary for each table, as shown in the following figure. This way, an execution plan that does not have the shuffle step is generated.

    Sample query 2:

    SELECT * FROM t1 JOIN t2 ON t1.a=t2.a AND t1.b=t2.b;

    The optimizer creates buckets for each table based on Column a and Column b, and then aligns and re-creates buckets to obtain boundaries based on which data is read from each bucket of each table. This way, an execution plan that does not have the shuffle step is generated, as shown in the preceding figure.

  • Performance testing

    • Table transformation

      Use a TPC-H Query 2 statement to perform testing on two tables named PART and PARTSUPP. Each of the tables contains 1 TB of data. Transform the two tables into range-clustered tables and leave the other tables unchanged.

      CREATE TABLE PARTSUPP ( PS_PARTKEY BIGINT NOT NULL,
                              PS_SUPPKEY BIGINT NOT NULL,
                              PS_AVAILQTY BIGINT NOT NULL,
                              PS_SUPPLYCOST  DECIMAL(15,2)  NOT NULL,
                              PS_COMMENT VARCHAR(199) NOT NULL)
                   RANGE CLUSTERED BY(PS_PARTKEY, PS_SUPPKEY)
                   SORTED BY(PS_PARTKEY, PS_SUPPKEY)
                   INTO 128 BUCKETS;
      CREATE TABLE PART ( P_PARTKEY BIGINT NOT NULL,
                          P_NAME VARCHAR(55) NOT NULL,
                          P_MFGR CHAR(25) NOT NULL,
                          P_BRAND CHAR(10) NOT NULL,
                          P_TYPE  VARCHAR(25) NOT NULL,
                          P_SIZE   BIGINT NOT NULL,
                          P_CONTAINER   CHAR(10) NOT NULL,
                          P_RETAILPRICE DECIMAL(15,2) NOT NULL,
                          P_COMMENT VARCHAR(23) NOT NULL)
                   RANGE CLUSTERED BY(P_PARTKEY)
                   SORTED BY(P_PARTKEY)
                   INTO 64 BUCKETS;

      Query data by using the following TPC-H Query 2 statement:

      select s_acctbal,
             s_name,
             n_name,
             p_partkey,
             p_mfgr,
             s_address,
             s_phone,
             s_comment
        from part,
             supplier,
             partsupp,
             nation,
             region
       where p_partkey = ps_partkey
         and s_suppkey = ps_suppkey
         and p_size = 15
         and p_type like '%BRASS'
         and s_nationkey = n_nationkey
         and n_regionkey = r_regionkey
         and r_name = 'EUROPE'
         and ps_supplycost = (select min(ps_supplycost)
                                from partsupp, supplier, nation, region
                               where p_partkey = ps_partkey
                                 and s_suppkey = ps_suppkey
                                 and s_nationkey = n_nationkey
                                 and n_regionkey = r_regionkey
                                 and r_name = 'EUROPE')
        order by s_acctbal desc, n_name, s_name, p_partkey limit 100;
    • Test results

      • Run the following command to disable optimization:

        set odps.optimizer.enable.range.partial.repartitioning=false;

        Sample output:

      • Run the following command to enable optimization:

        set odps.optimizer.enable.range.partial.repartitioning=true;

        Sample output:

      After the optimization is enabled, two stages are removed, the query speed is improved by about 21.4%, the CPU utilization is decreased by about 35.4%, and the memory usage is decreased by about 54.6%.
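Conceptually, the boundary alignment described in this section merges the cut points of the two tables so that each merged range can be joined by one instance without a shuffle. The cut points below are hypothetical:

```python
def align_boundaries(cuts_t1, cuts_t2):
    """Merge the range cut points of two range-clustered tables; each
    resulting range is read from both tables by the same join instance."""
    return sorted(set(cuts_t1) | set(cuts_t2))

# hypothetical bucket upper bounds: t1 has 5 buckets, t2 has 3
aligned = align_boundaries([2, 4, 6, 8], [3, 6])
# aligned == [2, 3, 4, 6, 8]
```

Rows from both tables that fall between two adjacent aligned cut points are guaranteed to join only with each other, which is what removes the shuffle step.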

Global sorting acceleration

Range clustering can also be used to accelerate global sorting. In common scenarios in which ORDER BY is used, all data is distributed to a single instance to guarantee a global order, which cannot fully utilize concurrent processing. The partitioning step of range clustering can be used to implement concurrent global sorting: sample the data, divide it into ranges, sort each range in parallel, and then concatenate the sorted ranges to obtain the globally sorted result.

After global sorting is complete, the table or partition still consists of multiple buckets. During data consumption, files must be read in bucket ID order to preserve the global sort order.

By default, global sorting acceleration is disabled for range-clustered tables. To enable global sorting acceleration, you can run the following command:

set odps.optimizer.distribute.ordering.enable=true;
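The accelerated ORDER BY can be sketched as: range-partition on sampled cut points, sort each partition independently (concurrently in practice), then concatenate in range order. The cut points below are hypothetical:

```python
import bisect

def parallel_global_sort(values, cuts):
    """Split values by range cut points, sort each partition on its own
    (each could run on a separate instance), then concatenate."""
    parts = [[] for _ in range(len(cuts) + 1)]
    for v in values:
        parts[bisect.bisect_left(cuts, v)].append(v)
    out = []
    for p in parts:
        out.extend(sorted(p))  # independent sorts, concatenated in order
    return out

parallel_global_sort([5, 1, 9, 3, 7], [4, 8])  # -> [1, 3, 5, 7, 9]
```

Because every value in a partition is less than or equal to every value in the next partition, concatenating the independently sorted partitions yields the global order.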

Limits and usage notes

Compared with hash clustering, range clustering has the following limits:

  • The data generation costs of range clustering are higher than those of hash clustering. Hash clustering is only a simple operation for data hashing and sorting. However, for range clustering, data sampling, sorting, and histogram combination are required. The overall consumption, including execution durations, CPU costs, and memory costs, is higher than the overall consumption of hash clustering. Therefore, if hash clustering can be used to resolve issues, you do not need to use range clustering.

  • Range clustering is not supported in dynamic partition or INSERT INTO scenarios.

  • Range clustering is supported only for the following join operations: inner join, left outer join, right outer join, and semi join. Range clustering is not supported for anti-join or full outer join.

  • For range-clustered tables, the keys specified in RANGE CLUSTERED BY must be the same as those specified in SORTED BY. For example, if range clustered by (a,b) sorted by (a,b) is specified for a table named foo when you create the table, the optimization that is described in this topic can be achieved on the table. However, if range clustered by(a,b) sorted by (b,a) is specified for a table named bar, the optimization that is described in this topic cannot be achieved on the table.

  • Keys specified in JOIN or GROUP BY must be the prefixes or all of the keys that are specified in RANGE CLUSTERED BY. For example, range clustered by(a,b,c) sorted by(a,b,c) is specified in a table creation statement. The optimization that is described in this topic can be achieved on the table only if a, a,b, or a,b,c is specified as the key in JOIN or GROUP BY. The optimization that is described in this topic cannot be achieved if b or a,c is specified as the key in JOIN or GROUP BY.

  • For a range-clustered partitioned table, if you want to read data from two or more partitions of the table, the optimization that is described in this topic cannot be achieved. The optimization that is described in this topic can be achieved only on partitioned tables that have a single partition and non-partitioned tables.