Calculate approximate percentiles for large datasets using PERCENTILE_APPROX - MaxCompute

The PERCENTILE_APPROX function is used to calculate approximate percentiles in a large dataset. You can use this function especially when exact calculation has an excessively high cost or is infeasible. The PERCENTILE_APPROX function sorts the values in a specified column in ascending order and returns the value of the p × 100th percentile of the column. This function is an additional function of MaxCompute V2.0. This topic describes the syntax of the PERCENTILE_APPROX function and provides examples on how to use this function.

Usage notes

MaxCompute V2.0 provides additional functions. If the functions that you use involve new data types, including TINYINT, SMALLINT, INT, FLOAT, VARCHAR, TIMESTAMP, and BINARY, you must execute a statement to enable the MaxCompute V2.0 data type edition:
- Session level: Add set odps.sql.type.system.odps2=true; before the SQL statement that you want to execute, and commit and execute them together.
- Project level: The project owner can enable the MaxCompute V2.0 data type edition for the project based on the project requirements. Statement:
```
setproject odps.sql.type.system.odps2=true;
```
  The configuration takes effect after 10 to 15 minutes.
  For more information about setproject, see Project operations. For more information about the precautions that you must take when you enable the MaxCompute V2.0 data type edition at the project level, see Data type editions.
If you use an SQL statement that includes multiple aggregate functions and the project resources are insufficient, memory overflow may occur. We recommend that you optimize the SQL statement or purchase computing resources based on your business requirements.

Syntax

double percentile_approx (double <colname>[, double <weight>], <p> [, <B>])
-- Return multiple approximate percentiles as an array. 
array<double> percentile_approx (double <colname>
                                 [, double <weight>],
                                 array(<p1> [, <p2>...])
                                 [, <B>])

Description

The values in the column specified for percentile_approx are numbered from 1. For example, the p × 100th percentile of a column that contains n rows of data is calculated. The PERCENTILE_APPROX function first sorts the values in the column in ascending order and then calculates the required percentile. In this section, to help you better understand the calculation logic, arr is used to represent the array of the sorted values in the column, and res is used to represent the return value of the function. The calculation formula of res varies based on the index that is calculated by using the following formula: index = n × p.

If the index ≤ 1 condition is met, res is calculated based on the following formula: res = arr[0].
If the index >= n - 1 condition is met, res is calculated based on the following formula: res = arr[n-1].
If the 1 < index < n - 1 condition is met, diff is first calculated based on the following formula: diff = index + 0.5 - ceil(index).
- If the abs(diff) < 0.5 condition is met, res is calculated based on the following formula: res = arr[ceil(index) - 1].
- If the abs(diff) = 0.5 condition is met, res is calculated based on the following formula: res = arr[index - 1] + (arr[index] - arr[index - 1]) × 0.5.
- The value of abs(diff) cannot be greater than 0.5.

For example, the col column contains values of 100, 200, 300, and 400, and the sequence numbers of the values are 1, 2, 3, and 4. The 25th, 50th, and 75th percentiles of this column are calculated based on the following formulas:

percentile_approx(col, 0.25) = 100 (index = 1)
percentile_approx(col, 0.5) = 200 + (300 - 200) × 0.5 = 250 (index = 2)
percentile_approx(col, 0.75) = 400 (index = 3)

Note

Differences between PERCENTILE_APPROX and PERCENTILE:

PERCENTILE_APPROX is used to calculate an approximate percentile, and PERCENTILE is used to calculate an exact percentile. If the amount of data is large, PERCENTILE may fail to run due to memory limits, but PERCENTILE_APPROX does not have this issue.

Parameters

colname: required. The name of a column, which is of the DOUBLE type.
weight: optional. The weight value of data in each row. The value is of the DOUBLE type.
p: required. The value of which you want to obtain the approximate percentile. Valid values: [0.0,1.0].
B: the accuracy of the return value. A higher accuracy indicates a more accurate value. If you do not specify this parameter, 10000 is used by default.

Return value

A value of the DOUBLE or ARRAY type is returned. The return value varies based on the following rules:

If the value of colname is null, the row that contains this value is not used for calculation.
If the value of p or B is null, an error is returned.

Sample data

This section provides sample source data and examples for you to understand how to use the functions. Create a table named emp and insert the sample data into the table. Sample statement:

create table if not exists emp
   (empno bigint,
    ename string,
    job string,
    mgr bigint,
    hiredate datetime,
    sal bigint,
    comm bigint,
    deptno bigint);
tunnel upload emp.txt emp; -- Replace emp.txt with the actual path (path and name) to which you upload the data file.

The emp.txt file contains the following sample data:

7369,SMITH,CLERK,7902,1980-12-17 00:00:00,800,,20
7499,ALLEN,SALESMAN,7698,1981-02-20 00:00:00,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22 00:00:00,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02 00:00:00,2975,,20
7654,MARTIN,SALESMAN,7698,1981-09-28 00:00:00,1250,1400,30
7698,BLAKE,MANAGER,7839,1981-05-01 00:00:00,2850,,30
7782,CLARK,MANAGER,7839,1981-06-09 00:00:00,2450,,10
7788,SCOTT,ANALYST,7566,1987-04-19 00:00:00,3000,,20
7839,KING,PRESIDENT,,1981-11-17 00:00:00,5000,,10
7844,TURNER,SALESMAN,7698,1981-09-08 00:00:00,1500,0,30
7876,ADAMS,CLERK,7788,1987-05-23 00:00:00,1100,,20
7900,JAMES,CLERK,7698,1981-12-03 00:00:00,950,,30
7902,FORD,ANALYST,7566,1981-12-03 00:00:00,3000,,20
7934,MILLER,CLERK,7782,1982-01-23 00:00:00,1300,,10
7948,JACCKA,CLERK,7782,1981-04-12 00:00:00,5000,,10
7956,WELAN,CLERK,7649,1982-07-20 00:00:00,2450,,10
7956,TEBAGE,CLERK,7748,1982-12-30 00:00:00,1300,,10

Examples

Example 1: Calculate the 30th percentile in the sal column. Sample statement:

SELECT percentile_approx(sal, 0.3) FROM emp;

The following result is returned:

+------------+
| _c0        |
+------------+
| 1300.0     |
+------------+

Example 2: Use this function with GROUP BY to group all employees based on the deptno column and calculate the 30th percentile of employees in each group in the sal column. Sample statement:

SELECT deptno, percentile_approx(sal, 0.3) FROM emp GROUP BY deptno;

The following result is returned:

+------------+------------+
| deptno     | _c1        |
+------------+------------+
| 10         | 1300.0     |
| 20         | 1100.0     |
| 30         | 1250.0     |
+------------+------------+

Example 3: Use this function with GROUP BY to group all employees based on the deptno column and calculate the 30th, 50th, and 80th percentiles of employees in each group in the sal column. Sample statement:

SET odps.sql.type.system.odps2=true;
SELECT deptno, percentile_approx(sal, array(0.3, 0.5, 0.8), 1000) FROM emp GROUP BY deptno;

The following result is returned:

+------------+------+
| deptno     | _c1  |
+------------+------+
| 10         | [1300, 2450, 5000] |
| 20         | [1100, 2975, 3000] |
| 30         | [1250, 1375, 1600] |
+------------+------+

Example 4 (example with weight specified): Use this function with GROUP BY to group all employees based on the deptno column and calculate the 30th, 50th, and 80th percentiles of employees in each group in the sal column. Sample statement:

SELECT deptno, percentile_approx(sal, deptno, array(0.3, 0.5, 0.8), 1000)
 FROM emp GROUP BY deptno;

The following result is returned:

+------------+------+
| deptno     | _c1  |
+------------+------+
| 10         | [1645, 2450, 5000] |
| 20         | [1100, 2975, 3000] |
| 30         | [1250, 1375, 1975] |
+------------+------+

Related functions

PERCENTILE_APPROX is an aggregate function. For more information about the functions that are used to calculate the average value of multiple input records and to aggregate parameters, see Aggregate functions.