A percentile is a statistical measure that indicates the value below which a given percentage of the data falls. When a data set is sorted in ascending order and divided into 100 equal groups, each percentile marks the boundary value of one group. The Percentile component calculates the percentiles of the columns that you select.
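The definition above can be illustrated with a minimal sketch. It uses the nearest-rank method, which is one common way to define a percentile; the source does not state which method the Percentile component uses, so the function name `nearest_rank_percentile` and the choice of method are assumptions made for illustration only.

```python
def nearest_rank_percentile(values, p):
    """Return the p-th percentile of values (p is an integer from 0 to 100).

    Nearest-rank method: sort the data, then take the value at the
    1-based rank ceil(p / 100 * n). This is an illustrative assumption,
    not the documented behavior of the PAI component.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not isinstance(p, int) or not 0 <= p <= 100:
        raise ValueError("p must be an integer between 0 and 100")
    ordered = sorted(values)
    # Integer ceiling division avoids floating-point rounding surprises.
    rank = (p * len(ordered) + 99) // 100
    # p = 0 would give rank 0, so map it to the smallest value.
    rank = max(rank, 1)
    return ordered[rank - 1]


scores = [40, 15, 50, 20, 35]
for p in (0, 30, 50, 100):
    print(p, nearest_rank_percentile(scores, p))
```

Other definitions, such as those that interpolate between neighboring values, can return different results for the same data.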
Background information
The component can calculate percentiles only for columns of the BIGINT, DOUBLE, or DATETIME type.
Empty columns are skipped when the percentile is calculated. If all of the columns are empty, an error is returned.
You can specify multiple columns of data in the colName parameter.
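The empty-column rule above can be sketched as follows. This is a hypothetical illustration, not the component's implementation: the table is modeled as a dictionary of column name to values, `None` stands for a missing value, and the helper name `select_non_empty_columns` is invented here. The source does not say how individual missing values inside a non-empty column are handled, so the sketch leaves them untouched.

```python
def select_non_empty_columns(table):
    """Keep only columns that contain at least one non-missing value.

    table: dict mapping a column name to a list of values (None = missing).
    Raises ValueError when every column is empty, mirroring the documented
    rule that an error is returned in that case.
    """
    non_empty = {
        name: values
        for name, values in table.items()
        if any(value is not None for value in values)
    }
    if not non_empty:
        raise ValueError("all selected columns are empty")
    return non_empty


table = {
    "col0": [1.5, None, 3.0],
    "col1": [None, None, None],  # empty column, skipped
}
print(list(select_non_empty_columns(table)))
```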
Configure the component
You can use one of the following methods to configure the Percentile component.
Method 1: Configure the component on the pipeline page
You can configure the parameters of the Percentile component on the pipeline page of Machine Learning Designer in Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.
Tab | Parameter | Description |
Parameters Setting | Input Columns | Click Select Column to select input columns. |
Tuning | Number of Cores | The number of cores. |
Tuning | Memory Size per Core | The memory size of each core. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name Percentile
-project algo_public
-DinputTableName=maple_test_percentile_3col_input
-DcolName=col0,col1,col2
-DoutputTableName=maple_test_percentile_3col_output;
Parameter | Description | Required |
inputTableName | The name of the input table. | Yes |
outputTableName | The name of the output table. | Yes |
colName | The names of the columns to be calculated. By default, all columns are selected. Note: Separate the names of multiple columns with commas (,). | No |
inputPartitions | The partitions in the input table. By default, all partitions are selected. | No |
predictInputTableName | The name of the prediction table. After you set this parameter, the prediction result can be generated. | No |
predictInputTablePartitions | The partitions in the input prediction table. | No |
predictSelectedColNames | The names of the columns selected from the prediction table. By default, all the columns in the prediction table are selected. The column names must be the same as the column names in a training table. | No |
predictSelectedOriginalColNames | The names of the columns whose data you want to retain. By default, all columns are selected. Separate the names of multiple columns with commas (,). | No |
predictOutputTableName | The name of the output prediction table. This parameter is used with the predictInputTableName parameter. | No |
lifecycle | The lifecycle of the output table. By default, the output table has no lifecycle. Note: The value must be a positive integer. | No |
coreNum | The number of cores. Valid values: [1,9999]. This parameter is used with the memSizePerCore parameter. Note: The value must be a positive integer. | No |
memSizePerCore | The memory size of each core. Unit: MB. Valid values: [1024,64 × 1024]. Note: The value must be a positive integer. | No |
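As a rough sketch of how the parameters in the table fit together, the following helper assembles a PAI command string and checks the documented value ranges before the command is submitted. The helper name `build_percentile_command` and its argument names are hypothetical. The sketch assumes that optional parameters are passed with the same `-D<name>=<value>` form used in the command above, and it covers only some of the parameters in the table.

```python
def build_percentile_command(input_table, output_table, col_names=None,
                             lifecycle=None, core_num=None,
                             mem_size_per_core=None):
    """Build a PAI Percentile command string (illustrative helper only)."""
    parts = [
        "PAI -name Percentile",
        "-project algo_public",
        f"-DinputTableName={input_table}",
    ]
    if col_names:
        # Multiple column names are separated with commas.
        parts.append("-DcolName=" + ",".join(col_names))
    parts.append(f"-DoutputTableName={output_table}")
    if lifecycle is not None:
        if lifecycle < 1:
            raise ValueError("lifecycle must be a positive integer")
        parts.append(f"-Dlifecycle={lifecycle}")
    if core_num is not None:
        if not 1 <= core_num <= 9999:
            raise ValueError("coreNum must be in [1, 9999]")
        parts.append(f"-DcoreNum={core_num}")
    if mem_size_per_core is not None:
        # Unit: MB. The documented range is [1024, 64 * 1024].
        if not 1024 <= mem_size_per_core <= 64 * 1024:
            raise ValueError("memSizePerCore must be in [1024, 65536]")
        parts.append(f"-DmemSizePerCore={mem_size_per_core}")
    return " ".join(parts) + ";"


print(build_percentile_command(
    "maple_test_percentile_3col_input",
    "maple_test_percentile_3col_output",
    col_names=["col0", "col1", "col2"],
    core_num=4,
    mem_size_per_core=2048,
))
```

The table states that coreNum is used with memSizePerCore. The sketch does not enforce that pairing, because the source does not say whether setting only one of them is rejected.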
Example
Input table
col0:double (1000 rows) | col1:bigint (100 rows) | col2:datetime (300 rows) |
962 | 88 | Tue Oct 15 00:26:40 CST 1974 |
218 | 99 | Thu Jan 04 20:53:20 CST 1973 |
565 | 44 | Sat Mar 09 02:40:00 CST 1974 |
314 | 68 | Mon Aug 11 22:40:00 CST 1975 |
583 | 13 | Sat Aug 23 12:26:40 CST 1975 |
615 | 87 | Tue May 25 14:13:20 CST 1971 |
70 | 53 | Fri Mar 23 09:20:00 CST 1979 |
929 | 63 | Mon Jul 03 16:26:40 CST 1972 |
249 | 48 | Thu Mar 15 07:33:20 CST 1973 |
428 | 62 | Wed Mar 17 03:33:20 CST 1971 |
119 | 1 | Thu Jun 26 15:33:20 CST 1975 |
756 | 27 | Mon Jan 30 17:20:00 CST 1978 |
490 | 75 | Wed Dec 11 21:20:00 CST 1974 |
957 | 12 | Sun Jul 05 12:26:40 CST 1970 |
80 | 22 | Wed Oct 04 06:40:00 CST 1972 |
681 | 57 | Wed Nov 03 15:06:40 CST 1971 |
13 | 95 | Sat Sep 12 23:06:40 CST 1970 |
PAI command
PAI -name Percentile -project algo_public -DinputTableName=maple_test_percentile_3col_input -DcolName=col0,col1,col2 -DoutputTableName=maple_test_percentile_3col_output;
Output table
quantile:bigint | col0:double | col1:bigint | col2:datetime |
0 | 0.0 | 0 | Thu Jan 01 08:00:00 CST 1970 |
1 | 9.0 | 0 | Sat Jan 24 11:33:20 CST 1970 |
2 | 19.0 | 1 | Sat Feb 28 04:53:20 CST 1970 |
3 | 29.0 | 2 | Fri Apr 03 22:13:20 CST 1970 |
4 | 39.0 | 3 | Fri May 08 15:33:20 CST 1970 |
5 | 49.0 | 4 | Fri Jun 12 08:53:20 CST 1970 |
6 | 59.0 | 5 | Fri Jul 17 02:13:20 CST 1970 |
7 | 69.0 | 6 | Thu Aug 20 19:33:20 CST 1970 |
8 | 79.0 | 7 | Thu Sep 24 12:53:20 CST 1970 |
9 | 89.0 | 8 | Thu Oct 29 06:13:20 CST 1970 |
10 | 99.0 | 9 | Wed Dec 02 23:33:20 CST 1970 |
11 | 109.0 | 10 | Wed Jan 06 16:53:20 CST 1971 |
12 | 119.0 | 11 | Wed Feb 10 10:13:20 CST 1971 |
13 | 129.0 | 12 | Wed Mar 17 03:33:20 CST 1971 |
14 | 139.0 | 13 | Tue Apr 20 20:53:20 CST 1971 |
15 | 149.0 | 14 | Tue May 25 14:13:20 CST 1971 |
16 | 159.0 | 15 | Tue Jun 29 07:33:20 CST 1971 |
... | ... | ... | ... |
84 | 839.0 | 83 | Thu Dec 15 10:13:20 CST 1977 |
85 | 849.0 | 84 | Thu Jan 19 03:33:20 CST 1978 |
86 | 859.0 | 85 | Wed Feb 22 20:53:20 CST 1978 |
87 | 869.0 | 86 | Wed Mar 29 14:13:20 CST 1978 |
88 | 879.0 | 87 | Wed May 03 07:33:20 CST 1978 |
89 | 889.0 | 88 | Wed Jun 07 00:53:20 CST 1978 |
90 | 899.0 | 89 | Tue Jul 11 18:13:20 CST 1978 |
91 | 909.0 | 90 | Tue Aug 15 11:33:20 CST 1978 |
92 | 919.0 | 91 | Tue Sep 19 04:53:20 CST 1978 |
93 | 929.0 | 92 | Mon Oct 23 22:13:20 CST 1978 |
94 | 939.0 | 93 | Mon Nov 27 15:33:20 CST 1978 |
95 | 949.0 | 94 | Mon Jan 01 08:53:20 CST 1979 |
96 | 959.0 | 95 | Mon Feb 05 02:13:20 CST 1979 |
97 | 969.0 | 96 | Sun Mar 11 19:33:20 CST 1979 |
98 | 979.0 | 97 | Sun Apr 15 12:53:20 CST 1979 |
99 | 989.0 | 98 | Sun May 20 06:13:20 CST 1979 |
100 | 999.0 | 99 | Sat Jun 23 23:33:20 CST 1979 |
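The output rows above can be reproduced locally with a short sketch, under assumptions that the source does not confirm: that col0 holds the values 0.0 to 999.0 once each, that col1 holds 0 to 99 once each, that col2 holds 300 timestamps spaced 1,000,000 seconds apart starting at the Unix epoch, that CST means UTC+8, and that the percentile is taken with the nearest-rank method. The listed output rows are consistent with these assumptions, but the component's actual algorithm and the full input data are not shown in the source.

```python
from datetime import datetime, timedelta, timezone


def nearest_rank(ordered, p):
    """Value at the 1-based rank ceil(p / 100 * n) of an already sorted list."""
    rank = max((p * len(ordered) + 99) // 100, 1)
    return ordered[rank - 1]


# Assumed input data (hypothetical reconstruction of the example table).
CST = timezone(timedelta(hours=8))
col0 = [float(i) for i in range(1000)]
col1 = list(range(100))
col2 = [datetime.fromtimestamp(i * 1_000_000, tz=CST) for i in range(300)]

# One output row per percentile from 0 to 100, as in the output table.
rows = [
    (p, nearest_rank(col0, p), nearest_rank(col1, p), nearest_rank(col2, p))
    for p in range(101)
]

for quantile, v0, v1, v2 in rows[:3] + rows[-2:]:
    print(quantile, v0, v1, v2.strftime("%a %b %d %H:%M:%S CST %Y"))
```

Because each column has a different number of rows, the same percentile maps to a different rank in each column: percentile 1 is the 10th value of col0, the 1st value of col1, and the 3rd value of col2.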