In some cases, you need to execute the UNION ALL clause multiple times to aggregate and analyze data from multiple dimensions. For example, you need to aggregate column a, aggregate column b, and then aggregate columns a and b. To address this issue, you can use the GROUPING SETS clause.
GROUPING SETS
is an extension of the GROUP BY
clause in a SELECT
statement. The GROUPING SETS clause allows you to group your results in multiple
ways, without the need to use multiple SELECT
statements. This allows the MaxCompute engine to generate more efficient execution
plans with higher performance.
Example
- Prepare data.
create table requests LIFECYCLE 20 as select * from values (1, 'windows', 'PC', 'Beijing'), (2, 'windows', 'PC', 'Shijiazhuang'), (3, 'linux', 'Phone', 'Beijing'), (4, 'windows', 'PC', 'Beijing'), (5, 'ios', 'Phone', 'Shijiazhuang'), (6, 'linux', 'PC', 'Beijing'), (7, 'windows', 'Phone', 'Shijiazhuang') as t(id, os, device, city);
- Use one of the following methods to group data:
- Execute multiple
SELECT
statements.SELECT NULL, NULL, NULL, COUNT(*) FROM requests UNION ALL SELECT os, device, NULL, COUNT(*) FROM requests GROUP BY os, device UNION ALL SELECT null, null, city, COUNT(*) FROM requests GROUP BY city;
- Execute a single SELECT statement with the
GROUPING SETS
clause.SELECT os,device, city ,COUNT(*) FROM requests GROUP BY os, device, city GROUPING SETS((os, device), (city), ());
The following result is returned:+----+--------+------+------------+ | os | device | city | cnt | +----+--------+------+------------+ | NULL | NULL | NULL | 7 | | NULL | NULL | Beijing | 4 | | NULL | NULL | Shijiazhuang | 3 | | ios | Phone | NULL | 1 | | linux | PC | NULL | 1 | | linux | Phone | NULL | 1 | | windows | PC | NULL | 3 | | windows | Phone | NULL | 1 | +----+--------+------+------------+
Note NULL is used as placeholders for the expressions that are not used in GROUPING SETS. This way, you can perform UNION operations on the result sets, such as the city column in rows 1 to 5. - Execute multiple
CUBE and ROLLUP
CUBE and ROLLUP are special GROUPING SETS
functions. CUBE
lists all possible combinations of specific columns as grouping sets. ROLLUP
aggregates data by level to generate grouping sets.
GROUP BY CUBE(a, b, c)
-- Equivalent to the following statement:
GROUPING SETS((a,b,c),(a,b),(a,c),(b,c),(a),(b),(c),())
GROUP BY ROLLUP(a, b, c)
-- Equivalent to the following statement:
GROUPING SETS((a,b,c),(a,b),(a), ())
GROUP BY CUBE ( (a, b), (c, d) )
-- Equivalent to the following statement:
GROUPING SETS (
( a, b, c, d ),
( a, b ),
( c, d ),
( )
)
GROUP BY ROLLUP ( a, (b, c), d )
-- Equivalent to the following statement:
GROUPING SETS (
( a, b, c, d ),
( a, b, c ),
( a ),
( )
)
GROUP BY a, CUBE (b, c), GROUPING SETS ((d), (e))
-- Equivalent to the following statement:
GROUP BY GROUPING SETS (
(a, b, c, d), (a, b, c, e),
(a, b, d), (a, b, e),
(a, c, d), (a, c, e),
(a, d), (a, e)
)
GROUP BY grouping sets((b), (c),rollup(a,b,c))
-- Equivalent to the following statement:
GROUP BY GROUPING SETS (
(b), (c),
(a,b,c), (a,b), (a), ()
)
GROUPING and GROUPING_ID
NULL is used as placeholders in the results of GROUPING SETS
. As a result, the NULL placeholders cannot be distinguished from the NULL values.
To address this issue, MaxCompute provides the GROUPING
function. GROUPING
allows you to specify the name of a column as a parameter. If specific rows are aggregated
based on such a column, 0 is returned, which indicates that NULL is a value. Otherwise,
1 is returned, which indicates that NULL is a placeholder in GROUPING SETS
.
GROUPING_ID
allows you to specify the names of one or more columns as parameters. The results
of GROUPING
in these columns are formed into integers by using bitmap.
SELECT a,b,c ,COUNT(*),
GROUPING(a) ga, GROUPING(b) gb, GROUPING(c) gc, GROUPING_ID(a,b,c) groupingid
FROM VALUES (1,2,3) as t(a,b,c)
GROUP BY CUBE(a,b,c);
+------------+------------+------------+------------+------------+------------+------------+------------+
| a | b | c | _c3 | ga | gb | gc | groupingid |
+------------+------------+------------+------------+------------+------------+------------+------------+
| NULL | NULL | NULL | 1 | 1 | 1 | 1 | 7 |
| NULL | NULL | 3 | 1 | 1 | 1 | 0 | 6 |
| NULL | 2 | NULL | 1 | 1 | 0 | 1 | 5 |
| NULL | 2 | 3 | 1 | 1 | 0 | 0 | 4 |
| 1 | NULL | NULL | 1 | 0 | 1 | 1 | 3 |
| 1 | NULL | 3 | 1 | 0 | 1 | 0 | 2 |
| 1 | 2 | NULL | 1 | 0 | 0 | 1 | 1 |
| 1 | 2 | 3 | 1 | 0 | 0 | 0 | 0 |
+------------+------------+------------+------------+------------+------------+------------+------------+
SELECT
IF(GROUPING(os) == 0, os, 'ALL') as os,
IF(GROUPING(device) == 0, device, 'ALL') as device,
IF(GROUPING(city) == 0, city, 'ALL') as city ,
COUNT(*) as count
FROM requests
GROUP BY os, device, city GROUPING SETS((os, device), (city), ());

SELECT
a,b,c ,COUNT(*),GROUPING__ID
FROM VALUES (1,2,3) as t(a,b,c)
GROUP BY a, b, c GROUPING SETS ((a,b,c), (a));
GROUPING__ID does not contain input parameters and parentheses (()). GROUPING_ID is equivalent to GROUPING_ID(a,b,c) in MaxCompute. The order of the parameters in this function is the same as that in GROUP BY.
We recommend that you use this function for Hive 2.3.0 and later.