This topic describes the Empirical Probability Density Chart component provided by Machine Learning Studio.
- Empirical distribution function
If an accurate parametric distribution cannot be obtained, the empirical distribution function estimates the probability distribution directly from the data and produces a non-parametric distribution. For more information, see Empirical distribution function.
- Kernel density estimation function
The kernel density estimation function estimates the probability density of sample data. Like a histogram, a kernel density estimate shows how the sample data is distributed. The difference is that a kernel density estimate is a smooth, continuous curve, whereas a histogram shows a discrete distribution. With kernel density estimation, the probability density at a point that is not in the sample is not 0. Instead, it is the weighted sum of the kernel densities centered on all the sample points. The Empirical Probability Density Chart component uses the Gaussian distribution as the kernel. For more information, see Kernel density estimation function.
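The two estimators described above can be sketched in a few lines of Python. This is a minimal illustration, not the component's implementation: the function names ecdf and gaussian_kde are hypothetical, and the bandwidth is passed in by hand because this topic does not state how the component chooses it.

```python
import math
from bisect import bisect_right


def ecdf(samples):
    """Return F, where F(x) is the fraction of samples that are <= x."""
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("samples must not be empty")

    def F(x):
        # bisect_right counts how many sorted samples are <= x.
        return bisect_right(ordered, x) / n

    return F


def gaussian_kde(samples, bandwidth):
    """Return pdf, a Gaussian kernel density estimate over the samples.

    The bandwidth is an assumption supplied by the caller.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        # Sum one Gaussian bump per sample, each with equal weight 1/n.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return pdf


samples = [1.0, 2.0, 3.0, 4.0, 5.0]
F = ecdf(samples)
pdf = gaussian_kde(samples, bandwidth=1.0)
print(F(3.0))    # 3 of the 5 samples are <= 3.0
print(pdf(2.5))  # positive even though 2.5 is not a sample point
```

The empirical distribution function is a step function that jumps at each sample, whereas the kernel estimate is smooth and stays positive between and beyond the sample points.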
Configure the component
- Machine Learning Platform for AI (PAI) console
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Input Columns | The input columns. You can select only BIGINT- or DOUBLE-type columns. |
| Fields Setting | Label Column | The label column. |
| Parameters Setting | Number of Calculation Intervals | A large value indicates high accuracy. The value of this parameter is calculated based on the range of values in each column. |
| Tuning | Cores | The number of cores that you want to use for computing. The value of this parameter must be a positive integer. |
| Tuning | Memory Size | The memory size of each core. Valid values: 1 to 65536. Unit: MB. |
- PAI command
PAI -name empirical_pdf
-project algo_public
-DinputTableName="test_data"
-DoutputTableName="test_epdf_out"
-DfeatureColNames="col0,col1,col2"
-DinputTablePartitions="ds='20160101'"
-Dlifecycle=1
-DintervalNum=100
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | No default value |
| outputTableName | Yes | The name of the output table. | No default value |
| featureColNames | Yes | The names of the feature columns that you want to select from the input table for training. | No default value |
| labelColName | No | The name of the label column in the input table. | No default value |
| inputTablePartitions | No | The partitions that you want to select from the input table for training. Specify this parameter in one of the following formats: Partition_name=value, or, for a multi-level partition, name1=value1/name2=value2. Note: If you specify multiple partitions, separate them with commas (,). | No default value |
| intervalNum | No | The number of calculation intervals. A large number indicates a high accuracy. Valid values: [1,1E14). | No default value |
| lifecycle | No | The lifecycle of the output table. | No default value |
| coreNum | No | The number of cores that you want to use for computing. The value of this parameter must be a positive integer. | Automatically allocated |
| memSizePerCore | No | The memory size of each core. Valid values: 1 to 65536. Unit: MB. | Automatically allocated |
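If you generate the PAI command from a script, the parameter constraints listed above can be checked before the command is submitted. The following Python sketch shows one way to do this. The helper build_empirical_pdf_command is hypothetical and is not part of any PAI SDK. It only assembles a command string from the documented parameters and applies the documented value ranges.

```python
def build_empirical_pdf_command(input_table, output_table, feature_cols,
                                label_col=None, partitions=None,
                                interval_num=None, lifecycle=None,
                                core_num=None, mem_size_per_core=None):
    """Assemble a PAI empirical_pdf command string (hypothetical helper)."""
    if not input_table or not output_table or not feature_cols:
        raise ValueError("inputTableName, outputTableName, and "
                         "featureColNames are required")
    # Documented range for intervalNum: [1, 1E14).
    if interval_num is not None and not (1 <= interval_num < 1e14):
        raise ValueError("intervalNum must be in [1, 1E14)")
    # coreNum must be a positive integer.
    if core_num is not None and (not isinstance(core_num, int) or core_num <= 0):
        raise ValueError("coreNum must be a positive integer")
    # memSizePerCore: 1 to 65536 (MB).
    if mem_size_per_core is not None and not (1 <= mem_size_per_core <= 65536):
        raise ValueError("memSizePerCore must be in the range 1 to 65536")

    args = [
        ("inputTableName", input_table),
        ("outputTableName", output_table),
        ("featureColNames", ",".join(feature_cols)),
        ("labelColName", label_col),
        # Multiple partitions are separated with commas.
        ("inputTablePartitions", ",".join(partitions) if partitions else None),
        ("lifecycle", lifecycle),
        ("intervalNum", interval_num),
        ("coreNum", core_num),
        ("memSizePerCore", mem_size_per_core),
    ]
    parts = ["PAI -name empirical_pdf", "-project algo_public"]
    for name, value in args:
        if value is None:
            continue  # Optional parameter not set: leave it out.
        if isinstance(value, str):
            parts.append(f'-D{name}="{value}"')
        else:
            parts.append(f"-D{name}={value}")
    return "\n".join(parts) + ";"


cmd = build_empirical_pdf_command(
    "test_data", "test_epdf_out", ["col0", "col1", "col2"],
    partitions=["ds='20160101'"], interval_num=100, lifecycle=1,
)
print(cmd)
```

Optional parameters that are left as None are omitted from the command, so the component falls back to its own defaults or automatic allocation.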
Example
drop table if exists epdf_test;
create table epdf_test as
select
*
from
(
select 1.0 as col1 from dual
union all
select 2.0 as col1 from dual
union all
select 3.0 as col1 from dual
union all
select 4.0 as col1 from dual
union all
select 5.0 as col1 from dual
) tmp;
PAI -name empirical_pdf
-project algo_public
-DinputTableName=epdf_test
-DoutputTableName=epdf_test_out
-DfeatureColNames=col1;
- Input
You can select multiple columns to calculate. You can also select label columns and group the data by label value. For example, if the label column contains the values 0 and 1, the data is divided into two groups, label=0 and label=1, and the probability density of each group is calculated.
Note: A maximum of 100 label columns can be specified.
- Output
A diagram and a result table are generated. The following table lists the columns that are contained in the result table. If no label columns are specified, NULL is generated for the label column in the output table.
| Column name | Data type |
| --- | --- |
| colName | STRING |
| label | STRING |
| x | DOUBLE |
| pdf | DOUBLE |

Output table
+------------+------------+------------+------------+
| colname    | label      | x          | pdf        |
+------------+------------+------------+------------+
| col1       | NULL       | 1.0        | 0.12775155176809325 |
| col1       | NULL       | 1.0404050505050506 | 0.1304256933829622 |
| col1       | NULL       | 1.0808101010101012 | 0.13306325897429525 |
| col1       | NULL       | 1.1212151515151518 | 0.1356613897616418 |
| col1       | NULL       | 1.1616202020202024 | 0.1382173796574596 |
| col1       | NULL       | 1.202025252525253 | 0.1407286844875733 |
| col1       | NULL       | 1.2424303030303037 | 0.14319293014274642 |
| col1       | NULL       | 1.2828353535353543 | 0.14560791960033242 |
| col1       | NULL       | 1.3232404040404049 | 0.14797163876379316 |
| col1       | NULL       | 1.3636454545454555 | 0.1502822610772349 |
| col1       | NULL       | 1.404050505050506 | 0.1525381508819247 |
| col1       | NULL       | 1.4444555555555567 | 0.1547378654919243 |
| col1       | NULL       | 1.4848606060606073 | 0.1568801559764068 |
| col1       | NULL       | 1.525265656565658 | 0.15896396664681753 |
| col1       | NULL       | 1.5656707070707085 | 0.16098843325768245 |
| col1       | NULL       | 1.6060757575757592 | 0.1629528799404685 |
| col1       | NULL       | 1.6464808080808098 | 0.16485681490034038 |
| col1       | NULL       | 1.6868858585858604 | 0.16669992491584543 |
| col1       | NULL       | 1.727290909090911 | 0.16848206869138338 |
| col1       | NULL       | 1.7676959595959616 | 0.17020326912168932 |
| col1       | NULL       | 1.8081010101010122 | 0.17186370453638117 |
| col1       | NULL       | 1.8485060606060628 | 0.17346369900080946 |
| col1       | NULL       | 1.8889111111111134 | 0.17500371175692428 |
| col1       | NULL       | 1.929316161616164 | 0.17648432589456017 |
| col1       | NULL       | 1.9697212121212146 | 0.17790623634938396 |
| col1       | NULL       | 2.0101262626262653 | 0.1792702373286898 |
| col1       | NULL       | 2.050531313131316 | 0.18057720927022053 |
| col1       | NULL       | 2.0909363636363665 | 0.18182810544221673 |
| col1       | NULL       | 2.131341414141417 | 0.18302393829491406 |
| col1       | NULL       | 2.1717464646464677 | 0.18416576567472337 |
| col1       | NULL       | 2.2121515151515183 | 0.1852546770123305 |
| col1       | NULL       | 2.252556565656569 | 0.18629177959496213 |
| col1       | NULL       | 2.2929616161616195 | 0.18727818503109434 |
| col1       | NULL       | 2.33336666666667 | 0.18821499601297229 |
| col1       | NULL       | 2.3737717171717208 | 0.18910329347850022 |
| col1       | NULL       | 2.4141767676767714 | 0.18994412426940221 |
| col1       | NULL       | 2.454581818181822 | 0.19073848937711185 |
| col1       | NULL       | 2.4949868686868726 | 0.19148733286168018 |
| col1       | NULL       | 2.535391919191923 | 0.1921915315221827 |
| col1       | NULL       | 2.575796969696974 | 0.19285188538972659 |
| col1       | NULL       | 2.6162020202020244 | 0.19346910910630113 |
| col1       | NULL       | 2.656607070707075 | 0.19404382424446043 |
| col1       | NULL       | 2.6970121212121256 | 0.1945765526142701 |
| col1       | NULL       | 2.7374171717171762 | 0.19506771059517916 |
| col1       | NULL       | 2.777822222222227 | 0.19551760452158667 |
| col1       | NULL       | 2.8182272727272775 | 0.19592642714194602 |
| col1       | NULL       | 2.858632323232328 | 0.1962942551623821 |
| col1       | NULL       | 2.8990373737373787 | 0.1966210478770638 |
| col1       | NULL       | 2.9394424242424293 | 0.1969066468790639 |
| col1       | NULL       | 2.97984747474748 | 0.19715077683721793 |
| col1       | NULL       | 3.0202525252525305 | 0.19735304731663747 |
| col1       | NULL       | 3.060657575757581 | 0.19751295561309964 |
| col1       | NULL       | 3.1010626262626317 | 0.19762989056457925 |
| col1       | NULL       | 3.1414676767676823 | 0.19770313729675995 |
| col1       | NULL       | 3.181872727272733 | 0.19773188285349683 |
| col1       | NULL       | 3.2222777777777836 | 0.19771522265793107 |
| col1       | NULL       | 3.262682828282834 | 0.19765216774530828 |
| col1       | NULL       | 3.303087878787885 | 0.19754165270453194 |
| col1       | NULL       | 3.3434929292929354 | 0.19738254426210697 |
| col1       | NULL       | 3.383897979797986 | 0.19717365043938664 |
| col1       | NULL       | 3.4243030303030366 | 0.19691373021193162 |
| col1       | NULL       | 3.4647080808080872 | 0.1966015035982942 |
| col1       | NULL       | 3.505113131313138 | 0.19623566210464843 |
| col1       | NULL       | 3.5455181818181885 | 0.19581487945135703 |
| col1       | NULL       | 3.585923232323239 | 0.19533782250778076 |
| col1       | NULL       | 3.6263282828282897 | 0.1948031623623475 |
| col1       | NULL       | 3.6667333333333403 | 0.1942095854560816 |
| col1       | NULL       | 3.707138383838391 | 0.19355580470939734 |
| col1       | NULL       | 3.7475434343434415 | 0.19284057057394655 |
| col1       | NULL       | 3.787948484848492 | 0.19206268194364004 |
| col1       | NULL       | 3.8283535353535427 | 0.19122099686158253 |
| col1       | NULL       | 3.8687585858585933 | 0.19031444296253852 |
| col1       | NULL       | 3.909163636363644 | 0.1893420275936375 |
| col1       | NULL       | 3.9495686868686946 | 0.18830284755928747 |
| col1       | NULL       | 3.989973737373745 | 0.1871960984396676 |
| col1       | NULL       | 4.030378787878796 | 0.18602108343567092 |
| col1       | NULL       | 4.070783838383846 | 0.18477722169674377 |
| col1       | NULL       | 4.111188888888897 | 0.1834640560916829 |
| col1       | NULL       | 4.151593939393948 | 0.1820812603860928 |
| col1       | NULL       | 4.191998989898998 | 0.18062864579383914 |
| col1       | NULL       | 4.232404040404049 | 0.179106166873458 |
| col1       | NULL       | 4.272809090909099 | 0.17751392674406796 |
| col1       | NULL       | 4.31321414141415 | 0.17585218159888508 |
| col1       | NULL       | 4.353619191919201 | 0.17412134449794325 |
| col1       | NULL       | 4.394024242424251 | 0.1723219884250765 |
| col1       | NULL       | 4.434429292929302 | 0.17045484859762067 |
| col1       | NULL       | 4.4748343434343525 | 0.16852082402064342 |
| col1       | NULL       | 4.515239393939403 | 0.1665209782808102 |
| col1       | NULL       | 4.555644444444454 | 0.16445653957824907 |
| col1       | NULL       | 4.596049494949504 | 0.16232889999798905 |
| col1       | NULL       | 4.636454545454555 | 0.16013961402571825 |
| col1       | NULL       | 4.6768595959596055 | 0.1578903963157465 |
| col1       | NULL       | 4.717264646464656 | 0.15558311872216193 |
| col1       | NULL       | 4.757669696969707 | 0.1532198066072439 |
| col1       | NULL       | 4.798074747474757 | 0.1508026344442397 |
| col1       | NULL       | 4.838479797979808 | 0.14833392073462115 |
| col1       | NULL       | 4.878884848484859 | 0.14581612226291346 |
| col1       | NULL       | 4.919289898989909 | 0.1432518277151203 |
| col1       | NULL       | 4.95969494949496 | 0.1406437506896507 |
| col1       | NULL       | 5.00010000000001 | 0.13799472213247665 |
+------------+------------+------------+------------+
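To make the shape of the result table concrete, the following Python sketch produces rows of the form (colName, label, x, pdf) from a column of values and an optional list of labels. It illustrates the grouping and the output layout only. The function density_rows is hypothetical, and the evenly spaced grid over [min, max] and the fixed bandwidth are assumptions, so the sketch is not expected to reproduce the pdf values shown in the output table.

```python
import math
from collections import defaultdict


def density_rows(col_name, values, labels=None, interval_num=100, bandwidth=1.0):
    """Return (colName, label, x, pdf) tuples, one group per label value.

    Assumptions (not taken from the component): each group is evaluated on
    interval_num evenly spaced points over [min, max] of that group, with a
    Gaussian kernel of fixed bandwidth.
    """
    if interval_num < 1:
        raise ValueError("interval_num must be at least 1")

    # Group the values by label. Without labels, everything goes to None,
    # which corresponds to NULL in the result table.
    groups = defaultdict(list)
    for i, v in enumerate(values):
        key = None if labels is None else str(labels[i])
        groups[key].append(v)

    rows = []
    for label, group in groups.items():
        lo, hi = min(group), max(group)
        step = (hi - lo) / (interval_num - 1) if interval_num > 1 else 0.0
        norm = 1.0 / (len(group) * bandwidth * math.sqrt(2.0 * math.pi))
        for k in range(interval_num):
            x = lo + k * step
            pdf = norm * sum(
                math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in group
            )
            rows.append((col_name, label, x, pdf))
    return rows


# No label column: a single group whose label is None.
rows = density_rows("col1", [1.0, 2.0, 3.0, 4.0, 5.0])
print(len(rows), rows[0][:3])

# With labels 0 and 1: two groups, each with its own density curve.
grouped = density_rows("col1", [1.0, 2.0, 3.0, 4.0, 5.0],
                       labels=[0, 0, 1, 1, 1], interval_num=10)
print(sorted({r[1] for r in grouped}))
```

With labels, each label value yields its own block of rows, which matches the description of grouping by label=0 and label=1.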