MaxCompute Unleashed - Part 7: Grouping Set, Cube and Rollup

MaxCompute (formerly known as ODPS) is a leading distributed big data processing platform developed by Alibaba Cloud. It is widely utilized, especially within the Alibaba Group, and supports the core businesses of multiple business units (BUs). ODPS V2.0, continuously optimizing performance, aims to enhance the user experience and expression capabilities of SQL, as well as improve the productivity of ODPS developers.

Building upon the SQL engine of ODPS V2.0, MaxCompute simplifies SQL compilation and enhances the language's expressiveness. We present a series of articles titled Unleash the Power of MaxCompute to explore the capabilities of MaxCompute (ODPS V2.0).

The previous article introduced User Defined Type. This article focuses on MaxCompute's support for GROUPING SETS.

Scenarios

In many cases, data needs to be aggregated and analyzed across multiple dimensions due to business requirements. For example, there may be a need to aggregate data by column A, column B, or both column A and column B simultaneously. This often leads to writing numerous UNION ALL statements, resulting in repetitive code and difficult maintenance.

The problem of this scenario can be effectively solved by utilizing Grouping Sets.

Examples in this article are demonstrated using MaxCompute Studio. For users who haven't installed MaxCompute Studio, please refer to the installation guide to import and test MaxCompute projects and create new projects.

Feature Introduction

The GROUPING SETS feature in MaxCompute is an extension of the GROUP BY clause in the SELECT statement. The function allows multiple ways to group results without having to use multiple SELECT statements. In this case, the engine of MaxCompute can generate more efficient execution plans with higher performance.

The example is as follows:

• Prepare the data. The data source, the requests table, records requests received by a system.

create table requests LIFECYCLE 20 as
select * from values
    (1, 'windows', 'PC', 'Beijing'),
    (2, 'windows', 'PC', 'Shijiazhuang'),
    (3, 'linux', 'Phone', 'Beijing'),
    (4, 'windows', 'PC', 'Beijing'),
    (5, 'ios', 'Phone', 'Shijiazhuang'),
    (6, 'linux', 'PC', 'Beijing'),
    (7, 'windows', 'Phone', 'Shijiazhuang')
as t(id, os, device, city);

• Requirements: Calculate the number of requests in the city dimension and the os and device dimensions. And calculate the total number of requests at the same time.

• In the previous solution, multiple SELECT statements must be used to calculate multiple groups, and the groups must be connected with UNION ALL:

  SELECT NULL, NULL, NULL, COUNT(*)
  FROM requests
  UNION ALL
  SELECT os, device, NULL, COUNT(*)
  FROM requests GROUP BY os, device
  UNION ALL
  SELECT NULL, NULL, city, COUNT(*)
  FROM requests GROUP BY city;

From the execution graph of MaxCompute Studio, we can see that the physical execution plan performs aggregation three times and then performs UNION.

• The GROUPING SETS syntax can achieve the same logic while using fewer codes and consuming fewer cluster resources:

  SELECT os,device, city ,COUNT(*)
  FROM requests
  GROUP BY os, device, city GROUPING SETS((os, device), (city), ());

• The syntax of GROUPING SETS is similar to that of GROUP BY. However, you must perform extra GROUP BY as required. Take ((os, device), (city), ()) in the preceding SQL statement as an example. Please note that this is a two-layer parenthesis, and each inner parenthesis executes a GROUP BY. An empty parenthesis indicates that the GROUP BY list is empty, that is, COUNT all columns.

• Observing the execution graph of MaxCompute Studio, we find that the physical execution plan contains only one Reduce phase and does not need to perform UNION.

• Both methods produce the same results, as follows:

Note:

• If no expression is used in the GROUPIING SET, the system uses NULL as a placeholder so that these result sets can be used for UNION. Take the city column in rows 1 to 5 as an example.

• Compared with performing UNION on the implementation of multiple group by, the GROUPINGSETS method is superior in total resource consumption. However, using GROUPINGSETS reduces the number of Reducer stages. As shown in the preceding example, Three Reducers (R2_1, R3_1, R4_1) are changed to one Reducer (R2_1). As a result, the total number of Reducer instances is reduced, which may increase the end-to-end time of the task. In this case, we recommend that you use the odps.sql.reducer.instances to manually increase the number of Reducer instances. As shown in the preceding example, you can use set odps.sql.reducer.instances=3; to keep the number of instances unchanged.

CUBE and ROLLUP

CUBE and ROLLUP can be considered as special GROUPING SETS functions.

CUBE enumerates all possible combinations of the specified columns as GROUPING SETS. ROLLUP hierarchically generates GROUPING SETS.

For example:

• GROUP BY CUBE(a, b, c) is equivalent to GROUPING SETS( (a,b,c), (a,b), (a,c), (b,c), (a), (b), (c), () ).

• GROUP BY ROLLUP(a,b,c) is equivalent to GROUPING SETS( (a,b,c), (a,b), (a), () ).

CUBE performs a full combination of GROUP BY columns, that is, N columns will generate 2 ^ N combinations. So, it is not recommended that N exceeds 5.

GROUPING() and GROUPING_ID()

As mentioned earlier, the system uses NULL as a placeholder in the GROUPING SETS result. When this happens, the placeholder NULL cannot be distinguished from the real NULL in the data. To solve this problem, MaxCompute provides the GROUPING function.

The GROUPING function takes a column name as a parameter. If the row corresponding to the result is aggregated by using the parameter column, 0 is returned. It indicates that NULL is from the input data. Otherwise, 1 is returned. It indicates that NULL is a placeholder in GROUPING SETS.

In addition, MaxCompute provides the GROUPING_ID function. This function accepts one or more column names as parameters. As a result, the GROUPING results of these columns are formed into integers by using BitMap. For example:

SELECT 
    a,b,c ,COUNT(*),
    GROUPING(a) ga, GROUPING(b) gb, GROUPING(c) gc, GROUPING_ID(a,b,c) groupingid
FROM VALUES (1,2,3) as t(a,b,c)
GROUP BY CUBE(a,b,c);

Result:

By default, the columns that are not used in GROUP BY are filled with NULL. We can use the GROUPING function to output more meaningful values. For example:

SELECT 
  IF(GROUPING(os) == 0, os, 'ALL') as os,
  IF(GROUPING(device) == 0, device, 'ALL') as device, 
  IF(GROUPING(city) == 0, city, 'ALL') as city ,
  COUNT(*) as count
FROM requests
GROUP BY os, device, city GROUPING SETS((os, device), (city), ());

Output result:

Precautions

CUBE performs a complete combination of GROUP BY columns, which means that N columns will produce 2^N combinations. Currently, we have set a maximum limit of 13 columns for GROUP BY.

Summary

GROUPING SETS enhance the aggregation capabilities of GROUP BY, providing better usability, compatibility, and performance to meet your needs. SQL experts will notice that most of the mentioned features are supported by standard SQL.

MaxCompute will continue to improve its compatibility with standard SQL and commonly used industry products. Stay Tune.

>> Next article

Community

MaxCompute Unleashed - Part 7: Grouping Set, Cube and Rollup

Feature Introduction

CUBE and ROLLUP

GROUPING() and GROUPING_ID()

Precautions

Summary

Read previous post:

Read next post:

Alibaba Cloud MaxCompute

You may also like

Comments

Alibaba Cloud MaxCompute

Related Products

MaxCompute

Big Data Consulting for Data Technology Solution

Big Data Consulting Services for Retail Solution

Realtime Compute for Apache Flink