Platform For AI: Empirical Probability Density Chart

Last Updated: Nov 28, 2024

Empirical Probability Density Chart is a non-parametric method for estimating and visualizing the probability density distribution of data. The component smooths sample data to give an intuitive view of the shape and trends of the distribution, which makes it useful for exploratory data analysis and for checking distribution hypotheses.

Algorithm description

The Empirical Probability Density Chart algorithm uses kernel density estimation (KDE) to estimate the probability density of sample data. Like a histogram, KDE describes how data is distributed, but it produces a continuous, smooth curve instead of discrete bins. It does so by centering a kernel function on each sample point and superimposing the kernels. Specifically, the density at any point, including points that are not in the sample, is the weighted sum of the Gaussian kernels of all sample points evaluated at that point, which yields a smooth curve.
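As a rough illustration of this idea, the following Python sketch evaluates a Gaussian kernel density estimate at arbitrary points. The function name `gaussian_kde` and the fixed `bandwidth` value are assumptions made for this example. How the component chooses its bandwidth is not described here, so the numbers produced by this sketch will generally differ from the component's output.

```python
import math

def gaussian_kde(samples, x, bandwidth):
    """Estimate the density at x as the average of Gaussian kernels centered on each sample."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    n = len(samples)
    # Normalization so that the estimate integrates to 1 over the real line.
    scale = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return scale * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

samples = [1.0, 2.0, 3.0, 4.0, 5.0]
bandwidth = 1.0  # assumed value, chosen only for illustration

# The density can be evaluated at points that are not in the sample, such as 2.5.
for x in (1.0, 2.5, 3.0):
    print(x, gaussian_kde(samples, x, bandwidth))
```

A smaller bandwidth makes the curve follow individual samples more closely, and a larger one smooths them together.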

Configure the component

Method 1: Configure the component on the pipeline page

Add an Empirical Probability Density Chart component on the pipeline page and configure the following parameters:

| Category | Parameter | Description |
| -------- | --------- | ----------- |
| Fields Setting | Input Columns | The input columns. You can select only columns of the BIGINT or DOUBLE data type. |
| Fields Setting | Label Column | The label column. If you configure this parameter, the input columns are aggregated based on the values of the label column. For example, if a label column has two values (0 and 1), two results are returned. |
| Parameter Settings | Number of Calculation Intervals | The number of calculation intervals. A greater value indicates higher accuracy. The value of this parameter is calculated based on the range of values in each column. |
| Tuning | Core Number | The number of cores that you want to use. The value must be a positive integer. |
| Tuning | Memory Size | The memory size of each core. Valid values: 1 to 65536. Unit: MB. |
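The following Python sketch shows one way to picture how the Label Column and Number of Calculation Intervals settings interact: values are split by label, and each group gets its own evenly spaced set of x positions. The helper names `group_by_label` and `evaluation_grid`, the per-group range, and the exact placement of the grid points are all assumptions for illustration, not the component's documented behavior.

```python
from collections import defaultdict

def evaluation_grid(values, interval_num):
    """Return interval_num evenly spaced x positions spanning the range of values."""
    if interval_num < 1:
        raise ValueError("interval_num must be at least 1")
    lo, hi = min(values), max(values)
    if interval_num == 1 or lo == hi:
        return [lo]
    step = (hi - lo) / (interval_num - 1)
    return [lo + i * step for i in range(interval_num)]

def group_by_label(rows):
    """Split (value, label) pairs into one list of values per distinct label."""
    groups = defaultdict(list)
    for value, label in rows:
        groups[label].append(value)
    return dict(groups)

# Hypothetical input: one numeric column plus a label column with values 0 and 1.
rows = [(1.0, 0), (2.0, 0), (3.0, 1), (5.0, 1)]
for label, values in sorted(group_by_label(rows).items()):
    print(label, evaluation_grid(values, 5))
```

With two distinct labels, the loop prints two grids, which mirrors the statement that two results are returned.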

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. The parameters are described after the following sample command. You can use SQL scripts to call PAI commands. For more information, see SQL Script.

PAI -name empirical_pdf
    -project algo_public
    -DinputTableName="test_data"
    -DoutputTableName="test_epdf_out"
    -DfeatureColNames="col0,col1,col2"
    -DinputTablePartitions="ds='20160101'"
    -Dlifecycle=1
    -DintervalNum=100

| Parameter | Required | Default Value | Description |
| --------- | -------- | ------------- | ----------- |
| inputTableName | Yes | None | The name of the input table. |
| outputTableName | Yes | None | The name of the output table. |
| featureColNames | Yes | None | The feature columns that are selected from the input table for training. |
| labelColName | No | None | The name of the label column in the input table. |
| inputTablePartitions | No | None | The partitions of the input table to be used in training. Supported formats: partition_name=value, and name1=value1/name2=value2 for multi-level partitions. Note: If you specify multiple partitions, separate the partitions with commas (,). For example, name1=value1,name2=value2. |
| intervalNum | No | None | The number of calculation intervals. A greater value indicates higher accuracy. Valid values: [1,1E14). |
| lifecycle | No | None | The lifecycle of the table. |
| coreNum | No | Determined by the system | The number of cores that you want to use. The value must be a positive integer. |
| memSizePerCore | No | Determined by the system | The memory size of each core. Valid values: 1 to 65536. Unit: MB. |
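If you generate PAI commands from a script, a small helper can check the documented value ranges before the command is submitted. The following Python sketch is only an illustration: `build_empirical_pdf_command` is a hypothetical function, not part of any PAI SDK, and it only assembles a command string without running it.

```python
def build_empirical_pdf_command(input_table, output_table, feature_cols,
                                label_col=None, partitions=None,
                                interval_num=None, lifecycle=None,
                                core_num=None, mem_size_per_core=None):
    """Assemble a PAI empirical_pdf command string from the documented parameters."""
    if not feature_cols:
        raise ValueError("featureColNames requires at least one column")
    if interval_num is not None and not 1 <= interval_num < 1e14:
        raise ValueError("intervalNum must be in [1, 1E14)")
    if core_num is not None and core_num < 1:
        raise ValueError("coreNum must be a positive integer")
    if mem_size_per_core is not None and not 1 <= mem_size_per_core <= 65536:
        raise ValueError("memSizePerCore must be between 1 and 65536 (MB)")

    optional = {
        "labelColName": label_col,
        # Multiple partitions are separated with commas.
        "inputTablePartitions": ",".join(partitions) if partitions else None,
        "intervalNum": interval_num,
        "lifecycle": lifecycle,
        "coreNum": core_num,
        "memSizePerCore": mem_size_per_core,
    }
    args = [
        '-DinputTableName="%s"' % input_table,
        '-DoutputTableName="%s"' % output_table,
        '-DfeatureColNames="%s"' % ",".join(feature_cols),
    ]
    for name, value in optional.items():
        if value is None:
            continue  # Omit parameters that were not set.
        if isinstance(value, str):
            args.append('-D%s="%s"' % (name, value))
        else:
            args.append("-D%s=%s" % (name, value))

    lines = ["PAI -name empirical_pdf", "    -project algo_public"]
    lines += ["    " + arg for arg in args]
    return "\n".join(lines) + ";"

print(build_empirical_pdf_command(
    "test_data", "test_epdf_out", ["col0", "col1", "col2"],
    partitions=["ds='20160101'"], interval_num=100, lifecycle=1))
```

Parameters left as `None` are omitted from the generated command, so the system defaults apply.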

Examples

  1. Add an SQL Script component, and deselect Use Script Mode and Whether the system adds a create table statement. Then enter the following SQL statements.

        drop table if exists epdf_test;
        create table epdf_test as
        select
          *
        from
        (
          select 1.0 as col1
            union all
          select 2.0 as col1
            union all
          select 3.0 as col1
            union all
          select 4.0 as col1
            union all
          select 5.0 as col1
        ) tmp;
  2. Add another SQL Script component, and deselect Use Script Mode and Whether the system adds a create table statement. Enter the following PAI commands, and then connect the components from Steps 1 and 2.

    drop table if exists ${o1};
    PAI -name empirical_pdf
        -project algo_public
        -DinputTableName=epdf_test
        -DoutputTableName=${o1}
        -DfeatureColNames=col1;
  3. Click the run icon in the upper left corner to run the pipeline.

  4. Right-click the SQL Script component created in Step 2 and choose View Data > SQL Script Output to view the training results.

    | colname | label | x                  | pdf                 |
    | ------- | ----- | ------------------ | ------------------- |
    | col1    |       | 1.0                | 0.12775155176809325 |
    | col1    |       | 1.0404050505050506 | 0.1304256933829622  |
    | col1    |       | 1.0808101010101012 | 0.13306325897429525 |
    | col1    |       | 1.1212151515151518 | 0.1356613897616418  |
    | col1    |       | 1.1616202020202024 | 0.1382173796574596  |
    | col1    |       | 1.202025252525253  | 0.1407286844875733  |
    | col1    |       | 1.2424303030303037 | 0.14319293014274642 |
    | col1    |       | 1.2828353535353543 | 0.14560791960033242 |
    | col1    |       | 1.3232404040404049 | 0.14797163876379316 |
    | col1    |       | 1.3636454545454555 | 0.1502822610772349  |
    | col1    |       | 1.404050505050506  | 0.1525381508819247  |
    | col1    |       | 1.4444555555555567 | 0.1547378654919243  |
    | col1    |       | 1.4848606060606073 | 0.1568801559764068  |
    | col1    |       | 1.525265656565658  | 0.15896396664681753 |
    | col1    |       | 1.5656707070707085 | 0.16098843325768245 |
    | col1    |       | 1.6060757575757592 | 0.1629528799404685  |
    | col1    |       | 1.6464808080808098 | 0.16485681490034038 |
    | col1    |       | 1.6868858585858604 | 0.16669992491584543 |
    | col1    |       | 1.727290909090911  | 0.16848206869138338 |
    | col1    |       | 1.7676959595959616 | 0.17020326912168932 |
    | col1    |       | 1.8081010101010122 | 0.17186370453638117 |
    | col1    |       | 1.8485060606060628 | 0.17346369900080946 |
    | col1    |       | 1.8889111111111134 | 0.17500371175692428 |
    | col1    |       | 1.929316161616164  | 0.17648432589456017 |
    | col1    |       | 1.9697212121212146 | 0.17790623634938396 |
    | col1    |       | 2.0101262626262653 | 0.1792702373286898  |
    | col1    |       | 2.050531313131316  | 0.18057720927022053 |
    | col1    |       | 2.0909363636363665 | 0.18182810544221673 |
    | col1    |       | 2.131341414141417  | 0.18302393829491406 |
    | col1    |       | 2.1717464646464677 | 0.18416576567472337 |
    | col1    |       | 2.2121515151515183 | 0.1852546770123305  |
    | col1    |       | 2.252556565656569  | 0.18629177959496213 |
    | col1    |       | 2.2929616161616195 | 0.18727818503109434 |
    | col1    |       | 2.33336666666667   | 0.18821499601297229 |
    | col1    |       | 2.3737717171717208 | 0.18910329347850022 |
    | col1    |       | 2.4141767676767714 | 0.18994412426940221 |
    | col1    |       | 2.454581818181822  | 0.19073848937711185 |
    | col1    |       | 2.4949868686868726 | 0.19148733286168018 |
    | col1    |       | 2.535391919191923  | 0.1921915315221827  |
    | col1    |       | 2.575796969696974  | 0.19285188538972659 |
    | col1    |       | 2.6162020202020244 | 0.19346910910630113 |
    | col1    |       | 2.656607070707075  | 0.19404382424446043 |
    | col1    |       | 2.6970121212121256 | 0.1945765526142701  |
    | col1    |       | 2.7374171717171762 | 0.19506771059517916 |
    | col1    |       | 2.777822222222227  | 0.19551760452158667 |
    | col1    |       | 2.8182272727272775 | 0.19592642714194602 |
    | col1    |       | 2.858632323232328  | 0.1962942551623821  |
    | col1    |       | 2.8990373737373787 | 0.1966210478770638  |
    | col1    |       | 2.9394424242424293 | 0.1969066468790639  |
    | col1    |       | 2.97984747474748   | 0.19715077683721793 |
    | col1    |       | 3.0202525252525305 | 0.19735304731663747 |
    | col1    |       | 3.060657575757581  | 0.19751295561309964 |
    | col1    |       | 3.1010626262626317 | 0.19762989056457925 |
    | col1    |       | 3.1414676767676823 | 0.19770313729675995 |
    | col1    |       | 3.181872727272733  | 0.19773188285349683 |
    | col1    |       | 3.2222777777777836 | 0.19771522265793107 |
    | col1    |       | 3.262682828282834  | 0.19765216774530828 |
    | col1    |       | 3.303087878787885  | 0.19754165270453194 |
    | col1    |       | 3.3434929292929354 | 0.19738254426210697 |
    | col1    |       | 3.383897979797986  | 0.19717365043938664 |
    | col1    |       | 3.4243030303030366 | 0.19691373021193162 |
    | col1    |       | 3.4647080808080872 | 0.1966015035982942  |
    | col1    |       | 3.505113131313138  | 0.19623566210464843 |
    | col1    |       | 3.5455181818181885 | 0.19581487945135703 |
    | col1    |       | 3.585923232323239  | 0.19533782250778076 |
    | col1    |       | 3.6263282828282897 | 0.1948031623623475  |
    | col1    |       | 3.6667333333333403 | 0.1942095854560816  |
    | col1    |       | 3.707138383838391  | 0.19355580470939734 |
    | col1    |       | 3.7475434343434415 | 0.19284057057394655 |
    | col1    |       | 3.787948484848492  | 0.19206268194364004 |
    | col1    |       | 3.8283535353535427 | 0.19122099686158253 |
    | col1    |       | 3.8687585858585933 | 0.19031444296253852 |
    | col1    |       | 3.909163636363644  | 0.1893420275936375  |
    | col1    |       | 3.9495686868686946 | 0.18830284755928747 |
    | col1    |       | 3.989973737373745  | 0.1871960984396676  |
    | col1    |       | 4.030378787878796  | 0.18602108343567092 |
    | col1    |       | 4.070783838383846  | 0.18477722169674377 |
    | col1    |       | 4.111188888888897  | 0.1834640560916829  |
    | col1    |       | 4.151593939393948  | 0.1820812603860928  |
    | col1    |       | 4.191998989898998  | 0.18062864579383914 |
    | col1    |       | 4.232404040404049  | 0.179106166873458   |
    | col1    |       | 4.272809090909099  | 0.17751392674406796 |
    | col1    |       | 4.31321414141415   | 0.17585218159888508 |
    | col1    |       | 4.353619191919201  | 0.17412134449794325 |
    | col1    |       | 4.394024242424251  | 0.1723219884250765  |
    | col1    |       | 4.434429292929302  | 0.17045484859762067 |
    | col1    |       | 4.4748343434343525 | 0.16852082402064342 |
    | col1    |       | 4.515239393939403  | 0.1665209782808102  |
    | col1    |       | 4.555644444444454  | 0.16445653957824907 |
    | col1    |       | 4.596049494949504  | 0.16232889999798905 |
    | col1    |       | 4.636454545454555  | 0.16013961402571825 |
    | col1    |       | 4.6768595959596055 | 0.1578903963157465  |
    | col1    |       | 4.717264646464656  | 0.15558311872216193 |
    | col1    |       | 4.757669696969707  | 0.1532198066072439  |
    | col1    |       | 4.798074747474757  | 0.1508026344442397  |
    | col1    |       | 4.838479797979808  | 0.14833392073462115 |
    | col1    |       | 4.878884848484859  | 0.14581612226291346 |
    | col1    |       | 4.919289898989909  | 0.1432518277151203  |
    | col1    |       | 4.95969494949496   | 0.1406437506896507  |
    | col1    |       | 5.00010000000001   | 0.13799472213247665 |

    | Column name | Type   | Description |
    | ----------- | ------ | ----------- |
    | colName     | string | The input column name. |
    | label       | string | The value of the label column. If no label column is specified, this field is empty in the output. |
    | x           | double | The x-axis value in the chart. This is an interpolated value rather than an actual data point. |
    | pdf         | double | The value of the probability density function at x. |
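Because `pdf` is a density rather than a probability, the area under the curve between two `x` values approximates the probability that a value falls in that range. The following Python sketch applies the trapezoidal rule to the first three rows of the sample output above. The function name `trapezoid_area` is an assumption for this example.

```python
def trapezoid_area(points):
    """Approximate the area under the curve through (x, pdf) points sorted by x."""
    area = 0.0
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        area += 0.5 * (p0 + p1) * (x1 - x0)
    return area

# First three rows of the sample output, as (x, pdf) pairs.
points = [
    (1.0, 0.12775155176809325),
    (1.0404050505050506, 0.1304256933829622),
    (1.0808101010101012, 0.13306325897429525),
]

# Estimated probability mass between x = 1.0 and x = 1.0808...
print(trapezoid_area(points))
```

Applying the same function to all output rows gives the mass that the curve assigns to the plotted range. This can be less than 1, because a Gaussian kernel also places some density outside the range of the data.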