COLLECT_SET is a MaxCompute V2.0 aggregate function that collects column values into an array of distinct elements, automatically removing duplicates.
Syntax
array collect_set(<colname>)Parameters
| Parameter | Required | Description |
|---|---|---|
colname | Yes | The name of the column to be aggregated. |
Return value
Returns an ARRAY of distinct values from the specified column. NULL values are excluded from the result.
Examples
The following examples use the emp table. To set up the table, run:
CREATE TABLE IF NOT EXISTS emp (
empno BIGINT,
ename STRING,
job STRING,
mgr BIGINT,
hiredate DATETIME,
sal BIGINT,
comm BIGINT,
deptno BIGINT
);
tunnel upload emp.txt emp; -- Replace emp.txt with the actual path to your data file.The emp.txt file contains:
7369,SMITH,CLERK,7902,1980-12-17 00:00:00,800,,20
7499,ALLEN,SALESMAN,7698,1981-02-20 00:00:00,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22 00:00:00,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02 00:00:00,2975,,20
7654,MARTIN,SALESMAN,7698,1981-09-28 00:00:00,1250,1400,30
7698,BLAKE,MANAGER,7839,1981-05-01 00:00:00,2850,,30
7782,CLARK,MANAGER,7839,1981-06-09 00:00:00,2450,,10
7788,SCOTT,ANALYST,7566,1987-04-19 00:00:00,3000,,20
7839,KING,PRESIDENT,,1981-11-17 00:00:00,5000,,10
7844,TURNER,SALESMAN,7698,1981-09-08 00:00:00,1500,0,30
7876,ADAMS,CLERK,7788,1987-05-23 00:00:00,1100,,20
7900,JAMES,CLERK,7698,1981-12-03 00:00:00,950,,30
7902,FORD,ANALYST,7566,1981-12-03 00:00:00,3000,,20
7934,MILLER,CLERK,7782,1982-01-23 00:00:00,1300,,10
7948,JACCKA,CLERK,7782,1981-04-12 00:00:00,5000,,10
7956,WELAN,CLERK,7649,1982-07-20 00:00:00,2450,,10
7956,TEBAGE,CLERK,7748,1982-12-30 00:00:00,1300,,10Example 1: Collect distinct salaries across all employees
SELECT collect_set(sal) FROM emp;Result:
+------+
| _c0 |
+------+
| [800, 950, 1100, 1250, 1300, 1500, 1600, 2450, 2850, 2975, 3000, 5000] |
+------+Example 2: Collect distinct salaries by department using GROUP BY
SELECT deptno, collect_set(sal) FROM emp GROUP BY deptno;Result:
+------------+------+
| deptno | _c1 |
+------------+------+
| 10 | [1300, 2450, 5000] |
| 20 | [800, 1100, 2975, 3000] |
| 30 | [950, 1250, 1500, 1600, 2850] |
+------------+------+Usage notes
MaxCompute V2.0 data type edition
If your query involves new data types (TINYINT, SMALLINT, INT, FLOAT, VARCHAR, TIMESTAMP, or BINARY), enable the MaxCompute V2.0 data type edition.
Session level — add the following before your SQL statement and submit both together:
set odps.sql.type.system.odps2=true;Project level — run the following command as the project owner. The change takes effect within 10 to 15 minutes.
setproject odps.sql.type.system.odps2=true;See Project operations for details on setproject. For precautions about enabling the V2.0 data type edition at the project level, see Data type editions.
Memory overflow with multiple aggregate functions
When an SQL statement contains multiple aggregate functions and project computing resources are insufficient, memory overflow may occur. Optimize the SQL statement or purchase additional computing resources as needed.
Related functions
COLLECT_SET is an aggregate function. For other aggregate functions, see Aggregate functions.