The Empirical Probability Density Chart component is a non-parametric method for estimating and visualizing the probability density of data. It smooths sample data to give an intuitive view of the distribution's shape and trends, which makes it useful for exploratory data analysis and for testing hypotheses about the underlying distribution.
Algorithm description
The algorithm uses kernel density estimation (KDE) to estimate the probability density of sample data. Like a histogram, KDE describes how data is distributed, but it produces a continuous, smooth curve instead of discrete bins. It does so by placing a kernel function on each sample point and summing the results. Specifically, the density at a point that is not in the sample is computed as a weighted superposition of the Gaussian kernels centered on the sample points, which yields a smooth curve.
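The weighted superposition can be sketched in a few lines of Python. This is a minimal illustration of Gaussian KDE, not the component's implementation. The function names and the normal-reference rule-of-thumb bandwidth are assumptions, because the bandwidth that the component uses is not documented here.

```python
import math
import statistics


def rule_of_thumb_bandwidth(samples):
    """Normal-reference bandwidth 1.06 * sigma * n^(-1/5) (an assumed default)."""
    n = len(samples)
    return 1.06 * statistics.stdev(samples) * n ** (-1 / 5)


def gaussian_kde_pdf(x, samples, bandwidth=None):
    """Estimate the density at x as the average of Gaussian kernels centered on the samples."""
    h = bandwidth if bandwidth is not None else rule_of_thumb_bandwidth(samples)
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


samples = [1.0, 2.0, 3.0, 4.0, 5.0]
# Evaluate the estimate at a point that is not itself a sample.
density = gaussian_kde_pdf(2.5, samples, bandwidth=1.0)
```

Every sample contributes to the density at `x`, with a weight that decays as the distance between `x` and the sample grows. A larger bandwidth gives a smoother curve.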
Configure the component
Method 1: Configure the component on the pipeline page
Add an Empirical Probability Density Chart component on the pipeline page and configure the following parameters:
Category | Parameter | Description |
Fields Setting | Input Columns | The input columns. You can select only columns of the BIGINT or DOUBLE data type. |
Fields Setting | Label Column | The label column. If you configure this parameter, the input columns are aggregated based on the values of the label column. For example, if a label column has two values (0 and 1), two results are returned. |
Parameter Settings | Number of Calculation Intervals | The number of calculation intervals. A greater value indicates higher accuracy. The value of this parameter is calculated based on the range of values in each column. |
Tuning | Core Number | The number of cores that you want to use. The value must be a positive integer. |
Tuning | Memory Size | The memory size of each core. Valid values: 1 to 65536. Unit: MB. |
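To illustrate how the Label Column and Number of Calculation Intervals settings interact, the sketch below groups values by label and builds an evenly spaced evaluation grid over the range of each group. The helper names `group_by_label` and `evaluation_grid` are hypothetical, and the even spacing from minimum to maximum is an assumption about how the intervals are derived from the value range.

```python
from collections import defaultdict


def group_by_label(values, labels):
    """Split the values into one list per distinct label value."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return dict(groups)


def evaluation_grid(values, interval_num):
    """Return interval_num evenly spaced x positions spanning the observed range (assumed)."""
    if interval_num < 1:
        raise ValueError("interval_num must be at least 1")
    low, high = min(values), max(values)
    if interval_num == 1 or low == high:
        return [low]
    step = (high - low) / (interval_num - 1)
    return [low + i * step for i in range(interval_num)]


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 0, 1, 1, 1]
# One grid per label value: two groups here, each with 5 x positions.
grids = {label: evaluation_grid(group, 5)
         for label, group in group_by_label(values, labels).items()}
```

With two label values, two independent grids are produced, which mirrors the "two results are returned" behavior described for the label column.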
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. The following section describes the parameters. You can use SQL scripts to call PAI commands. For more information, see SQL Script.
PAI -name empirical_pdf
-project algo_public
-DinputTableName="test_data"
-DoutputTableName="test_epdf_out"
-DfeatureColNames="col0,col1,col2"
-DinputTablePartitions="ds='20160101'"
-Dlifecycle=1
-DintervalNum=100
Parameter | Required | Default Value | Description |
inputTableName | Yes | None | The name of the input table. |
outputTableName | Yes | None | The name of the output table. |
featureColNames | Yes | None | The feature columns that are selected from the input table for training. |
labelColName | No | None | The name of the label column in the input table. |
inputTablePartitions | No | None | The partitions of the input table to be used in training, specified in the name=value format. If you specify multiple partitions, separate the partitions with commas (,). For example, name1=value1,name2=value2. |
intervalNum | No | None | The number of calculation intervals. A greater value indicates higher accuracy. Valid values: [1,1E14). |
lifecycle | No | None | The lifecycle of the table. |
coreNum | No | Determined by the system | The number of cores that you want to use. The value must be a positive integer. |
memSizePerCore | No | Determined by the system | The memory size of each core. Valid values: 1 to 65536. Unit: MB. |
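When the command is generated from a script, it can help to check the parameters against the constraints in the table before submitting. The following is a rough sketch. `build_empirical_pdf_command` is a hypothetical helper, not part of any PAI SDK, and it only assembles the command text.

```python
def build_empirical_pdf_command(input_table, output_table, feature_cols,
                                label_col=None, interval_num=None,
                                core_num=None, mem_size_per_core=None):
    """Assemble a PAI empirical_pdf command string (hypothetical helper)."""
    if not input_table or not output_table or not feature_cols:
        raise ValueError("inputTableName, outputTableName, and featureColNames are required")
    if interval_num is not None and not (1 <= interval_num < 1e14):
        raise ValueError("intervalNum must be in [1, 1E14)")
    if core_num is not None and (not isinstance(core_num, int) or core_num < 1):
        raise ValueError("coreNum must be a positive integer")
    if mem_size_per_core is not None and not (1 <= mem_size_per_core <= 65536):
        raise ValueError("memSizePerCore must be between 1 and 65536 (MB)")

    cols = ",".join(feature_cols)
    parts = [
        "PAI -name empirical_pdf",
        "-project algo_public",
        f'-DinputTableName="{input_table}"',
        f'-DoutputTableName="{output_table}"',
        f'-DfeatureColNames="{cols}"',
    ]
    # Optional parameters are emitted only when the caller sets them,
    # so the system defaults apply otherwise.
    if label_col is not None:
        parts.append(f'-DlabelColName="{label_col}"')
    if interval_num is not None:
        parts.append(f"-DintervalNum={interval_num}")
    if core_num is not None:
        parts.append(f"-DcoreNum={core_num}")
    if mem_size_per_core is not None:
        parts.append(f"-DmemSizePerCore={mem_size_per_core}")
    return " ".join(parts) + ";"


command = build_empirical_pdf_command(
    "test_data", "test_epdf_out", ["col0", "col1", "col2"], interval_num=100)
```

The resulting string can be pasted into an SQL script component in the same way as a handwritten command.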
Examples
1. Add an SQL Script component, deselect Use Script Mode and Whether the system adds a create table statement, and enter the following SQL statements.
   drop table if exists epdf_test;
   create table epdf_test as
   select * from (
     select 1.0 as col1
     union all select 2.0 as col1
     union all select 3.0 as col1
     union all select 4.0 as col1
     union all select 5.0 as col1
   ) tmp;
2. Add another SQL Script component, deselect Use Script Mode and Whether the system adds a create table statement, enter the following PAI commands, and connect the components from Steps 1 and 2.
   drop table if exists ${o1};
   PAI -name empirical_pdf -project algo_public -DinputTableName=epdf_test -DoutputTableName=${o1} -DfeatureColNames=col1;
3. Click the icon in the upper-left corner to run the pipeline.
4. Right-click the SQL Script component created in Step 2 and choose View Data > SQL Script Output to view the training results.
| colname | label | x | pdf |
| ------- | ----- | ------------------ | ------------------- |
| col1 | | 1.0 | 0.12775155176809325 |
| col1 | | 1.0404050505050506 | 0.1304256933829622 |
| col1 | | 1.0808101010101012 | 0.13306325897429525 |
| col1 | | 1.1212151515151518 | 0.1356613897616418 |
| col1 | | 1.1616202020202024 | 0.1382173796574596 |
| col1 | | 1.202025252525253 | 0.1407286844875733 |
| col1 | | 1.2424303030303037 | 0.14319293014274642 |
| col1 | | 1.2828353535353543 | 0.14560791960033242 |
| col1 | | 1.3232404040404049 | 0.14797163876379316 |
| col1 | | 1.3636454545454555 | 0.1502822610772349 |
| col1 | | 1.404050505050506 | 0.1525381508819247 |
| col1 | | 1.4444555555555567 | 0.1547378654919243 |
| col1 | | 1.4848606060606073 | 0.1568801559764068 |
| col1 | | 1.525265656565658 | 0.15896396664681753 |
| col1 | | 1.5656707070707085 | 0.16098843325768245 |
| col1 | | 1.6060757575757592 | 0.1629528799404685 |
| col1 | | 1.6464808080808098 | 0.16485681490034038 |
| col1 | | 1.6868858585858604 | 0.16669992491584543 |
| col1 | | 1.727290909090911 | 0.16848206869138338 |
| col1 | | 1.7676959595959616 | 0.17020326912168932 |
| col1 | | 1.8081010101010122 | 0.17186370453638117 |
| col1 | | 1.8485060606060628 | 0.17346369900080946 |
| col1 | | 1.8889111111111134 | 0.17500371175692428 |
| col1 | | 1.929316161616164 | 0.17648432589456017 |
| col1 | | 1.9697212121212146 | 0.17790623634938396 |
| col1 | | 2.0101262626262653 | 0.1792702373286898 |
| col1 | | 2.050531313131316 | 0.18057720927022053 |
| col1 | | 2.0909363636363665 | 0.18182810544221673 |
| col1 | | 2.131341414141417 | 0.18302393829491406 |
| col1 | | 2.1717464646464677 | 0.18416576567472337 |
| col1 | | 2.2121515151515183 | 0.1852546770123305 |
| col1 | | 2.252556565656569 | 0.18629177959496213 |
| col1 | | 2.2929616161616195 | 0.18727818503109434 |
| col1 | | 2.33336666666667 | 0.18821499601297229 |
| col1 | | 2.3737717171717208 | 0.18910329347850022 |
| col1 | | 2.4141767676767714 | 0.18994412426940221 |
| col1 | | 2.454581818181822 | 0.19073848937711185 |
| col1 | | 2.4949868686868726 | 0.19148733286168018 |
| col1 | | 2.535391919191923 | 0.1921915315221827 |
| col1 | | 2.575796969696974 | 0.19285188538972659 |
| col1 | | 2.6162020202020244 | 0.19346910910630113 |
| col1 | | 2.656607070707075 | 0.19404382424446043 |
| col1 | | 2.6970121212121256 | 0.1945765526142701 |
| col1 | | 2.7374171717171762 | 0.19506771059517916 |
| col1 | | 2.777822222222227 | 0.19551760452158667 |
| col1 | | 2.8182272727272775 | 0.19592642714194602 |
| col1 | | 2.858632323232328 | 0.1962942551623821 |
| col1 | | 2.8990373737373787 | 0.1966210478770638 |
| col1 | | 2.9394424242424293 | 0.1969066468790639 |
| col1 | | 2.97984747474748 | 0.19715077683721793 |
| col1 | | 3.0202525252525305 | 0.19735304731663747 |
| col1 | | 3.060657575757581 | 0.19751295561309964 |
| col1 | | 3.1010626262626317 | 0.19762989056457925 |
| col1 | | 3.1414676767676823 | 0.19770313729675995 |
| col1 | | 3.181872727272733 | 0.19773188285349683 |
| col1 | | 3.2222777777777836 | 0.19771522265793107 |
| col1 | | 3.262682828282834 | 0.19765216774530828 |
| col1 | | 3.303087878787885 | 0.19754165270453194 |
| col1 | | 3.3434929292929354 | 0.19738254426210697 |
| col1 | | 3.383897979797986 | 0.19717365043938664 |
| col1 | | 3.4243030303030366 | 0.19691373021193162 |
| col1 | | 3.4647080808080872 | 0.1966015035982942 |
| col1 | | 3.505113131313138 | 0.19623566210464843 |
| col1 | | 3.5455181818181885 | 0.19581487945135703 |
| col1 | | 3.585923232323239 | 0.19533782250778076 |
| col1 | | 3.6263282828282897 | 0.1948031623623475 |
| col1 | | 3.6667333333333403 | 0.1942095854560816 |
| col1 | | 3.707138383838391 | 0.19355580470939734 |
| col1 | | 3.7475434343434415 | 0.19284057057394655 |
| col1 | | 3.787948484848492 | 0.19206268194364004 |
| col1 | | 3.8283535353535427 | 0.19122099686158253 |
| col1 | | 3.8687585858585933 | 0.19031444296253852 |
| col1 | | 3.909163636363644 | 0.1893420275936375 |
| col1 | | 3.9495686868686946 | 0.18830284755928747 |
| col1 | | 3.989973737373745 | 0.1871960984396676 |
| col1 | | 4.030378787878796 | 0.18602108343567092 |
| col1 | | 4.070783838383846 | 0.18477722169674377 |
| col1 | | 4.111188888888897 | 0.1834640560916829 |
| col1 | | 4.151593939393948 | 0.1820812603860928 |
| col1 | | 4.191998989898998 | 0.18062864579383914 |
| col1 | | 4.232404040404049 | 0.179106166873458 |
| col1 | | 4.272809090909099 | 0.17751392674406796 |
| col1 | | 4.31321414141415 | 0.17585218159888508 |
| col1 | | 4.353619191919201 | 0.17412134449794325 |
| col1 | | 4.394024242424251 | 0.1723219884250765 |
| col1 | | 4.434429292929302 | 0.17045484859762067 |
| col1 | | 4.4748343434343525 | 0.16852082402064342 |
| col1 | | 4.515239393939403 | 0.1665209782808102 |
| col1 | | 4.555644444444454 | 0.16445653957824907 |
| col1 | | 4.596049494949504 | 0.16232889999798905 |
| col1 | | 4.636454545454555 | 0.16013961402571825 |
| col1 | | 4.6768595959596055 | 0.1578903963157465 |
| col1 | | 4.717264646464656 | 0.15558311872216193 |
| col1 | | 4.757669696969707 | 0.1532198066072439 |
| col1 | | 4.798074747474757 | 0.1508026344442397 |
| col1 | | 4.838479797979808 | 0.14833392073462115 |
| col1 | | 4.878884848484859 | 0.14581612226291346 |
| col1 | | 4.919289898989909 | 0.1432518277151203 |
| col1 | | 4.95969494949496 | 0.1406437506896507 |
| col1 | | 5.00010000000001 | 0.13799472213247665 |

Column name | Type | Description |
colName | string | The input column name. |
label | string | Indicates the label column. If not specified, the label is empty in the output. |
x | double | Represents the x-axis value in the graph, which is an interpolated value rather than an actual data point. |
pdf | double | Represents the probability density function value. |
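As an illustration of how the output columns can be consumed, the sketch below takes a handful of rows copied from the sample output and finds the x position with the highest estimated density for each (colname, label) pair. The `find_peaks` helper is hypothetical. In practice, the rows would be read from the output table.

```python
# A few (colname, label, x, pdf) rows copied from the sample output.
rows = [
    ("col1", "", 1.0, 0.12775155176809325),
    ("col1", "", 2.0101262626262653, 0.1792702373286898),
    ("col1", "", 3.0202525252525305, 0.19735304731663747),
    ("col1", "", 3.181872727272733, 0.19773188285349683),
    ("col1", "", 4.030378787878796, 0.18602108343567092),
    ("col1", "", 5.00010000000001, 0.13799472213247665),
]


def find_peaks(rows):
    """Return {(colname, label): (x, pdf)} for the row with the largest pdf in each group."""
    peaks = {}
    for colname, label, x, pdf in rows:
        key = (colname, label)
        if key not in peaks or pdf > peaks[key][1]:
            peaks[key] = (x, pdf)
    return peaks


peaks = find_peaks(rows)
peak_x, peak_pdf = peaks[("col1", "")]
```

Because x is an interpolated grid position, the peak location is accurate only to within one grid step. A larger number of calculation intervals narrows that step.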