The Percentile component calculates the percentiles of the data in the columns of a data table. When a set of data is sorted in ascending order and divided into 100 equal groups, a percentile is the value below which a given percentage of the data falls.
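
The definition above can be illustrated with a minimal Python sketch. It uses the nearest-rank rule, which is one common way to compute percentiles; the source does not state which rule the component applies, so treat the function name and the rule itself as assumptions.

```python
from typing import Sequence


def nearest_rank_percentile(values: Sequence[float], q: int) -> float:
    """Return the q-th percentile (0 <= q <= 100) using the nearest-rank rule."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)
    # Smallest 1-based rank r such that r / n >= q / 100.
    # -(-a // b) is an integer ceiling, which avoids floating-point rounding.
    rank = max(1, -(-q * len(ordered) // 100))
    return ordered[rank - 1]


# Assumed data: 1,000 distinct values from 0.0 to 999.0.
data = [float(v) for v in range(1000)]
print([nearest_rank_percentile(data, q) for q in (0, 1, 2, 100)])
# prints [0.0, 9.0, 19.0, 999.0]
```

With this assumed data, percentile 0 is the minimum and percentile 100 is the maximum of the column.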

Background information

  • The system can calculate only the percentiles of data of the BIGINT, DOUBLE, or DATETIME type.
  • Empty columns are skipped when the percentile is calculated. If all of the columns are empty, an error is returned.
  • You can specify multiple columns of data in the colName parameter.
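
The rules above can be sketched as a small pre-check in Python. The function name and the data layout are hypothetical, "empty" is assumed to mean a column with no non-null values, and raising an error for an unsupported type is also an assumption, because the source does not describe that case.

```python
from typing import Dict, List, Optional, Sequence

# Types the component is documented to support.
SUPPORTED_TYPES = {"BIGINT", "DOUBLE", "DATETIME"}


def columns_to_process(
    schema: Dict[str, str],
    rows_by_column: Dict[str, Sequence[Optional[object]]],
    col_names: Optional[List[str]] = None,
) -> List[str]:
    """Return the selected columns that are not empty."""
    # By default, all columns are selected (mirrors the colName default).
    selected = col_names if col_names is not None else list(schema)
    usable = []
    for name in selected:
        if schema[name].upper() not in SUPPORTED_TYPES:
            # Assumption: the source does not say how unsupported types are handled.
            raise TypeError(f"{name}: unsupported type {schema[name]}")
        # Assumption: "empty" means the column has no non-null values.
        if any(v is not None for v in rows_by_column[name]):
            usable.append(name)
    if not usable:
        # Documented behavior: an error is returned if all columns are empty.
        raise ValueError("all selected columns are empty")
    return usable


schema = {"col0": "double", "col1": "bigint"}
data = {"col0": [1.0, None, 3.5], "col1": [None, None, None]}
print(columns_to_process(schema, data))  # prints ['col0']
```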

Configure the component

  • Machine Learning Platform for AI console
    Tab | Parameter | Description
    Parameters Setting | Input Columns | Click Select Column to select input columns.
    Tuning | Number of Cores | The number of cores.
    Tuning | Memory Size per Core | The memory size of each core.
  • PAI command
    PAI -name Percentile
         -project algo_public
         -DinputTableName=maple_test_percentile_3col_input
         -DcolName=col0,col1,col2
         -DoutputTableName=maple_test_percentile_3col_output;
    Parameter | Required | Description
    inputTableName | Yes | The name of the input table.
    outputTableName | Yes | The name of the output table.
    colName | No | The names of the columns for which percentiles are calculated. By default, all columns are selected. Note: Separate the names of multiple columns with commas (,).
    inputPartitions | No | The partitions in the input table. By default, all partitions are selected. Specify a single partition in the format of partition_name=value. Specify multiple partitions in the format of name1=value1,name2=value2, separated with commas (,). Specify multi-level partitions in the format of name1=value1/name2=value2.
    predictInputTableName | No | The name of the prediction table. After you configure this parameter, the prediction result can be generated.
    predictInputTablePartitions | No | The partitions in the input prediction table.
    predictSelectedColNames | No | The names of the columns selected from the prediction table. By default, all columns in the prediction table are selected. The column names must be the same as the column names in the training table.
    predictSelectedOriginalColNames | No | The names of the columns whose data you want to retain. By default, all columns are selected. Separate the names of multiple columns with commas (,).
    predictOutputTableName | No | The name of the output prediction table. This parameter is used with the predictInputTableName parameter.
    lifecycle | No | The lifecycle of the output table. By default, the output table has no lifecycle. Note: The parameter value must be a positive integer.
    coreNum | No | The number of cores. Valid values: [1,9999]. This parameter is used with the memSizePerCore parameter. Note: The parameter value must be a positive integer.
    memSizePerCore | No | The memory size of each core, in MB. Valid values: [1024,65536], where 65536 = 64 × 1024. Note: The parameter value must be a positive integer.
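
As a rough illustration of how the parameters combine, the following Python sketch assembles a command string and checks the documented value ranges before the command is submitted. The helper `build_percentile_command` is hypothetical and is not part of any PAI SDK; it only formats text.

```python
from typing import List, Optional


def build_percentile_command(
    input_table: str,
    output_table: str,
    col_names: Optional[List[str]] = None,
    input_partitions: Optional[List[str]] = None,
    lifecycle: Optional[int] = None,
    core_num: Optional[int] = None,
    mem_size_per_core: Optional[int] = None,
) -> str:
    """Assemble a PAI Percentile command string (hypothetical helper)."""
    # Range checks follow the parameter descriptions above.
    if lifecycle is not None and lifecycle < 1:
        raise ValueError("lifecycle must be a positive integer")
    if core_num is not None and not 1 <= core_num <= 9999:
        raise ValueError("coreNum must be in [1, 9999]")
    if mem_size_per_core is not None and not 1024 <= mem_size_per_core <= 64 * 1024:
        raise ValueError("memSizePerCore must be in [1024, 65536] MB")

    parts = [
        "PAI -name Percentile",
        "-project algo_public",
        f"-DinputTableName={input_table}",
    ]
    if col_names:
        # Multiple column names are separated with commas.
        parts.append("-DcolName=" + ",".join(col_names))
    if input_partitions:
        # Multiple partitions are separated with commas.
        parts.append("-DinputPartitions=" + ",".join(input_partitions))
    parts.append(f"-DoutputTableName={output_table}")
    optional = {
        "lifecycle": lifecycle,
        "coreNum": core_num,
        "memSizePerCore": mem_size_per_core,
    }
    parts += [f"-D{key}={value}" for key, value in optional.items() if value is not None]
    return " ".join(parts) + ";"


print(build_percentile_command(
    "maple_test_percentile_3col_input",
    "maple_test_percentile_3col_output",
    col_names=["col0", "col1", "col2"],
))
```

The call shown produces a single-line form of the sample command in this topic. Optional parameters are emitted only when they are set, so the component defaults apply otherwise.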

Example

  • Input table
    col0:double (1000 rows) | col1:bigint (100 rows) | col2:datetime (300 rows)
    962 | 88 | Tue Oct 15 00:26:40 CST 1974
    218 | 99 | Thu Jan 04 20:53:20 CST 1973
    565 | 44 | Sat Mar 09 02:40:00 CST 1974
    314 | 68 | Mon Aug 11 22:40:00 CST 1975
    583 | 13 | Sat Aug 23 12:26:40 CST 1975
    615 | 87 | Tue May 25 14:13:20 CST 1971
    70 | 53 | Fri Mar 23 09:20:00 CST 1979
    929 | 63 | Mon Jul 03 16:26:40 CST 1972
    249 | 48 | Thu Mar 15 07:33:20 CST 1973
    428 | 62 | Wed Mar 17 03:33:20 CST 1971
    119 | 1 | Thu Jun 26 15:33:20 CST 1975
    756 | 27 | Mon Jan 30 17:20:00 CST 1978
    490 | 75 | Wed Dec 11 21:20:00 CST 1974
    957 | 12 | Sun Jul 05 12:26:40 CST 1970
    80 | 22 | Wed Oct 04 06:40:00 CST 1972
    681 | 57 | Wed Nov 03 15:06:40 CST 1971
    13 | 95 | Sat Sep 12 23:06:40 CST 1970
  • PAI command
    PAI -name Percentile
         -project algo_public
         -DinputTableName=maple_test_percentile_3col_input
         -DcolName=col0,col1,col2
         -DoutputTableName=maple_test_percentile_3col_output;
  • Output table
    quantile:bigint | col0:double | col1:bigint | col2:datetime
    0 | 0.0 | 0 | Thu Jan 01 08:00:00 CST 1970
    1 | 9.0 | 0 | Sat Jan 24 11:33:20 CST 1970
    2 | 19.0 | 1 | Sat Feb 28 04:53:20 CST 1970
    3 | 29.0 | 2 | Fri Apr 03 22:13:20 CST 1970
    4 | 39.0 | 3 | Fri May 08 15:33:20 CST 1970
    5 | 49.0 | 4 | Fri Jun 12 08:53:20 CST 1970
    6 | 59.0 | 5 | Fri Jul 17 02:13:20 CST 1970
    7 | 69.0 | 6 | Thu Aug 20 19:33:20 CST 1970
    8 | 79.0 | 7 | Thu Sep 24 12:53:20 CST 1970
    9 | 89.0 | 8 | Thu Oct 29 06:13:20 CST 1970
    10 | 99.0 | 9 | Wed Dec 02 23:33:20 CST 1970
    11 | 109.0 | 10 | Wed Jan 06 16:53:20 CST 1971
    12 | 119.0 | 11 | Wed Feb 10 10:13:20 CST 1971
    13 | 129.0 | 12 | Wed Mar 17 03:33:20 CST 1971
    14 | 139.0 | 13 | Tue Apr 20 20:53:20 CST 1971
    15 | 149.0 | 14 | Tue May 25 14:13:20 CST 1971
    16 | 159.0 | 15 | Tue Jun 29 07:33:20 CST 1971
    ... | ... | ... | ...
    84 | 839.0 | 83 | Thu Dec 15 10:13:20 CST 1977
    85 | 849.0 | 84 | Thu Jan 19 03:33:20 CST 1978
    86 | 859.0 | 85 | Wed Feb 22 20:53:20 CST 1978
    87 | 869.0 | 86 | Wed Mar 29 14:13:20 CST 1978
    88 | 879.0 | 87 | Wed May 03 07:33:20 CST 1978
    89 | 889.0 | 88 | Wed Jun 07 00:53:20 CST 1978
    90 | 899.0 | 89 | Tue Jul 11 18:13:20 CST 1978
    91 | 909.0 | 90 | Tue Aug 15 11:33:20 CST 1978
    92 | 919.0 | 91 | Tue Sep 19 04:53:20 CST 1978
    93 | 929.0 | 92 | Mon Oct 23 22:13:20 CST 1978
    94 | 939.0 | 93 | Mon Nov 27 15:33:20 CST 1978
    95 | 949.0 | 94 | Mon Jan 01 08:53:20 CST 1979
    96 | 959.0 | 95 | Mon Feb 05 02:13:20 CST 1979
    97 | 969.0 | 96 | Sun Mar 11 19:33:20 CST 1979
    98 | 979.0 | 97 | Sun Apr 15 12:53:20 CST 1979
    99 | 989.0 | 98 | Sun May 20 06:13:20 CST 1979
    100 | 999.0 | 99 | Sat Jun 23 23:33:20 CST 1979
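
The output table has one row for each quantile from 0 to 100, and DATETIME columns are ordered chronologically. The Python sketch below reproduces several col2 rows under two assumptions that the source does not confirm: col2 holds 300 timestamps spaced 1,000,000 seconds apart starting at the Unix epoch, and percentiles follow the nearest-rank rule. CST is taken to be UTC+8, which matches the 08:00:00 value in row 0.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CST in the tables is China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8), "CST")


def nearest_rank(values, q):
    """Return the q-th percentile (0 <= q <= 100) by the nearest-rank rule."""
    ordered = sorted(values)
    # Integer ceiling of q * n / 100, clamped so that q = 0 maps to the minimum.
    rank = max(1, -(-q * len(ordered) // 100))
    return ordered[rank - 1]


# Assumed input: 300 timestamps, 1,000,000 seconds apart, starting at the epoch.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
col2 = [(epoch + timedelta(seconds=1_000_000 * i)).astimezone(CST) for i in range(300)]

for q in (0, 1, 2, 100):
    print(q, nearest_rank(col2, q).strftime("%Y-%m-%d %H:%M:%S"))
# prints:
# 0 1970-01-01 08:00:00
# 1 1970-01-24 11:33:20
# 2 1970-02-28 04:53:20
# 100 1979-06-23 23:33:20
```

These values match rows 0, 1, 2, and 100 of the col2 column, which suggests that the sample output is consistent with a nearest-rank calculation, although the source does not name the method.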