edit-icon download-icon

Statistical analysis

Last Updated: Aug 13, 2018

Contents

Percentile

Calculate the percentile for a single column of data in an existing table.

Parameter settings

Select the fields to be analyzed. Only the double and bigint types are supported.

The execution result is as follows:

image

PAI command

  1. PAI -name Percentile
  2. -project algo_public
  3. -DoutputTableName="pai_temp_666_6014_1"\
  4. -DcolName="euribor3m"
  5. -DinputTableName="bank_data";
  • name: Component name.

  • project: Project name, used to specify the space of an algorithmThe default value is algo_public. If you change the name, the system reports an error.

  • outputTableName: Result table automatically allocated after the system executes the percentile operation.

  • colName: Column for percentile calculation. Only the number type is supported.

  • inputTableName: Name of the input table.

Full table statistics

For an existing table, conduct basic statistics for the full table or conduct statistics only for selected columns.

PAI command

  1. PAI -name stat_summary
  2. -project algo_public
  3. -DinputTableName=test_data
  4. -DoutputTableName=test_summary_out
  5. -DinputTablePartitions="ds='20160101'"
  6. -DselectColNames=col0,col1,col2
  7. -Dlifecycle=1

Parameter description

Parameter Description Option Default value
inputTableName (Required) Name of the input table NA NA
outputTableName (Required) Name of the recommendation result output table NA NA
inputTablePartitions (Optional) Partitions in the input table NA “”
selectColNames (Optional) Names of columns for statistics NA “”
lifecycle (Optional) Life cycle of the output table Positive integer No life cycle
coreNum (Optional)Number of nodes, Used together with the memSizePerCore parameter Positive integer Automatically calculated by default
memSizePerCore (Optional)Memory size of each node, in MB Positive integer in the range of [100, 64*1024] Automatically calculated by default

Input format

In the input column box, select the columns that require statistics. By default, the statistics are conducted for all columns.

Output format

The fields of statistics output are as follows:

Column name Description
colname Column name
datatype Type
totalcount Total number
count Number of Non-NULL values
missingcount Number of NULL values
nancount Number of NAN values
positiveinfinitycount Number of positive infinity values
negativeinfinitycount Number of negative infinity values
min Minimum value
max Maximum value
mean Mean value
variance Variance
standarddeviation Standard deviation
standarderror Standard error
skewness Skewness
kurtosis Kurtosis
moment2 Second moment
moment3 Third moment
moment4 Fourth moment
centralmoment2 Second central moment
centralmoment3 Third central moment
centralmoment4 Fourth central moment
sum Sum
sum2 Sum of squares
sum3 Sum of cubes
sum4 Sum of biquadrates

Example

Test data

SQL statement for data generation:

  1. drop table if exists summary_test_input;
  2. create table summary_test_input as
  3. select
  4. *
  5. from
  6. (
  7. select 'a' as col1, 1 as col2, 0.001 as col3 from dual
  8. union all
  9. select 'b' as col1, 2 as col2, 100.01 as col3 from dual
  10. ) tmp;

Running command

  1. PAI -name stat_summary
  2. -project algo_public
  3. -DinputTableName=summary_test_input
  4. -DoutputTableName=summary_test_input_out
  5. -DselectColNames=col1,col2,col3
  6. -Dlifecycle=1;

Running result

  1. | colname | datatype | totalcount | count | missingcount | nancount | positiveinfinitycount | negativeinfinitycount | min | max | mean | variance | standarddeviation | standarderror | skewness | kurtosis | moment2 | moment3 | moment4 | centralmoment2 | centralmoment3 | centralmoment4 | sum | sum2 | sum3 | sum4 |
  2. | col1 | string | 2 | 2 | 0 | 0 | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
  3. | col2 | bigint | 2 | 2 | 0 | 0 | 0 | 0 | 1 | 2 | 1.5 | 0.5 | 0.7071067811865476 | 0.5 | 0 | -2 | 2.5 | 4.5 | 8.5 | 0.25 | 0 | 0.0625 | 3 | 5 | 9 | 17 |
  4. | col3 | double | 2 | 2 | 0 | 0 | 0 | 0 | 0.001 | 100.01 | 50.0055 | 5000.900040500001 | 70.71704207968544 | 50.00450000000001 | 2.327677906939552e-16 | -1.999999999999999 | 5001.000050500001 | 500150.0150005006 | 50020003.00020002 | 2500.45002025 | 2.91038304567337e-11 | 6252250.303768232 | 100.011 | 10002.000101 | 1000300.030001001 | 100040006.0004 |

Pearson coefficient

Calculate the Pearson correlation coefficient of two numeric columns in an input table or a partition and save the result to the output table.

  • The component has only two parameters: input column 1 and input column 2. Enter the names of two columns for which the Pearson correlation coefficient is calculated.

  • After running, right-click the component and choose Menu > View Analysis Report. The result is as follows:

    image

The Pearson correlation coefficient is listed in the last column.

PAI command

  1. PAI -name pearson
  2. -project algo_test
  3. -DinputTableName=wpbc
  4. -Dcol1Name=f1
  5. -Dcol2Name=f2
  6. -DoutputTableName=wpbc_pear;

Parameter description

Parameter key Description Value range Required/Optional, default value/act
inputTableName Name of the input table Table name Required
inputTablePartitions Partitions used for calculation in the input table Format: partition_name=value. The multilevel format is name1=value1/name2=value2. Multiple partitions are separated by commas (,). All partitions in the input table
col1Name Input column 1 Column name Required
col2Name Input column 2 Column name Required
outputTableName Name of the output table Table name Required

Histogram

Calculate the histogram for a single column of data in an existing table.

Parameter settings

Select the fields to be analyzed. Only the double and bigint types are supported.

The analysis report is as follows:
image
You can adjust the step and move the slider to view the histogram.

Discrete value feature analysis

Discrete value feature analysis shows the indicators such as discrete feature distribution, gini, entropy, gini gain, information gain, and information gain ratio. Calculate gini and entropy for each discrete value and calculate gini gain, information gain, and information gain ratio for each column.

  • gini index: image

  • entropy: image

PAI command

  1. PAI
  2. -name enum_feature_selection
  3. -project algo_public
  4. -DinputTableName=enumfeautreselection_input
  5. -DlabelColName=label
  6. -DfeatureColNames=col0,col1
  7. -DenableSparse=false
  8. -DoutputCntTableName=enumfeautreselection_output_cntTable
  9. -DoutputValueTableName=enumfeautreselection_output_valuetable
  10. -DoutputEnumValueTableName=enumfeautreselection_output_enumvaluetable;

Parameter description

Parameter key Description Value range Default value
inputTableName (Required) Name of the input table NA NA
inputTablePartitions (Optional) Names of the selected partitions in the input table NA All partitions are selected by default.
featureColNames (Optional) Selected column names in the input table NA By default, all columns are selected except the label column. If the input table format is KV, all the string-type columns are selected by default.
labelColName (Required) Name of the label column NA NA
enableSparse (Optional) Indicates whether the input table is in the KV format. NA Table by default
kvFeatureColNames (Optional) Features in the KV format NA The entire table is selected by default.
kvDelimiter (Optional) Delimiter used between key-value pairs NA Default value: “:”
itemDelimiter (Optional) Delimiter used between keys and values NA Default value: “,”
outputCntTableName (Required) Output table of enumeration value distribution number of discrete values NA NA
outputValueTableName (Required) Output table of gini and entropy of discrete features NA NA
outputEnumValueTableName (Required) Output table of gini and entropy of discrete feature enumeration values NA NA
lifecycle (Optional) Life cycle of the output table Positive integer No life cycle
coreNum (Optional)Number of nodes, Used together with the memSizePerCore parameter Positive integer Automatically calculated by default
memSizePerCore (Optional)Memory size of each node, in MB Positive integer Automatically calculated by default

Example

Test data

SQL statement for data generation

  1. drop table if exists enum_feature_selection_test_input;
  2. create table enum_feature_selection_test_input
  3. as
  4. select
  5. *
  6. from
  7. (
  8. select
  9. '00' as col_string,
  10. 1 as col_bigint,
  11. 0.0 as col_double
  12. from dual
  13. union all
  14. select
  15. cast(null as string) as col_string,
  16. 0 as col_bigint,
  17. 0.0 as col_double
  18. from dual
  19. union all
  20. select
  21. '01' as col_string,
  22. 0 as col_bigint,
  23. 1.0 as col_double
  24. from dual
  25. union all
  26. select
  27. '01' as col_string,
  28. 1 as col_bigint,
  29. cast(null as double) as col_double
  30. from dual
  31. union all
  32. select
  33. '01' as col_string,
  34. 1 as col_bigint,
  35. 1.0 as col_double
  36. from dual
  37. union all
  38. select
  39. '00' as col_string,
  40. 0 as col_bigint,
  41. 0.0 as col_double
  42. from dual
  43. ) tmp;

Input data description

  1. +------------+------------+------------+
  2. | col_string | col_bigint | col_double |
  3. +------------+------------+------------+
  4. | 01 | 1 | 1.0 |
  5. | 01 | 0 | 1.0 |
  6. | 01 | 1 | NULL |
  7. | NULL | 0 | 0.0 |
  8. | 00 | 1 | 0.0 |
  9. | 00 | 0 | 0.0 |
  10. +------------+------------+------------+

Running command

  1. drop table if exists enum_feature_selection_test_input_enum_value_output;
  2. drop table if exists enum_feature_selection_test_input_cnt_output;
  3. drop table if exists enum_feature_selection_test_input_value_output;
  4. PAI -name enum_feature_selection -project algo_public -DitemDelimiter=":" -Dlifecycle="28" -DoutputValueTableName="enum_feature_selection_test_input_value_output" -DkvDelimiter="," -DlabelColName="col_bigint" -DfeatureColNames="col_double,col_string" -DoutputEnumValueTableName="enum_feature_selection_test_input_enum_value_output" -DenableSparse="false" -DinputTableName="enum_feature_selection_test_input" -DoutputCntTableName="enum_feature_selection_test_input_cnt_output";

UI

image

Parameter UI

image

Running results

image

enum_feature_selection_test_input_cnt_output

  1. +------------+------------+------------+------------+
  2. | colname | colvalue | labelvalue | cnt |
  3. +------------+------------+------------+------------+
  4. | col_double | NULL | 1 | 1 |
  5. | col_double | 0 | 0 | 2 |
  6. | col_double | 0 | 1 | 1 |
  7. | col_double | 1 | 0 | 1 |
  8. | col_double | 1 | 1 | 1 |
  9. | col_string | NULL | 0 | 1 |
  10. | col_string | 00 | 0 | 1 |
  11. | col_string | 00 | 1 | 1 |
  12. | col_string | 01 | 0 | 1 |
  13. | col_string | 01 | 1 | 2 |
  14. +------------+------------+------------+------------+

enum_feature_selection_test_input_value_output

  1. +------------+------------+------------+------------+------------+---------------+
  2. | colname | gini | entropy | infogain | ginigain | infogainratio |
  3. +------------+------------+------------+------------+------------+---------------+
  4. | col_double | 0.3888888888888889 | 0.792481250360578 | 0.20751874963942196 | 0.1111111111111111 | 0.14221913160264427 |
  5. | col_string | 0.38888888888888884 | 0.792481250360578 | 0.20751874963942196 | 0.11111111111111116 | 0.14221913160264427 |
  6. +------------+------------+------------+------------+------------+---------------+

enum_feature_selection_test_input_enum_value_output

  1. +------------+------------+------------+------------+
  2. | colname | colvalue | gini | entropy |
  3. +------------+------------+------------+------------+
  4. | col_double | NULL | 0.0 | 0.0 |
  5. | col_double | 0 | 0.22222222222222224 | 0.4591479170272448 |
  6. | col_double | 1 | 0.16666666666666666 | 0.3333333333333333 |
  7. | col_string | NULL | 0.0 | 0.0 |
  8. | col_string | 00 | 0.16666666666666666 | 0.3333333333333333 |
  9. | col_string | 01 | 0.2222222222222222 | 0.4591479170272448 |
  10. +------------+------------+------------+------------+

T-test

A one-sample T-test tests whether the mean of a normally distributed population differs significantly from a target value. The premise of a T-test is that the sample population follows normal distribution.

PAI command

  1. pai -name t_test -project algo_public
  2. -DxTableName=pai_t_test_all_type
  3. -DxColName=col1_double
  4. -DoutputTableName=pai_t_test_out
  5. -DxTablePartitions=ds=2010/dt=1
  6. -Dalternative=less
  7. -Dmu=47
  8. -DconfidenceLevel=0.95

Parameter description

Parameter Description Value range Required/Optional, default value/act
xTableName Input table x Table name Required
xColName Column that requires a T-test Column name. The type must be double or bigint. Required
outputTableName Output table Non-existent table name Required
xTablePartitions List of partitions in input table x Partition list (Optional) Default value: “”
alternative Alternative hypothesis two.sided, less, greater” (Optional) Default value: two.sided
mu Hypothesized mean double (Optional) Default value: 0
confidenceLevel Confidence level 0.8,0.9,0.95,0.99,0.995,0.999 (Optional) Default value: 0.95

Output description

The output is a table with only one row and one column in the JSON format.

  1. {
  2. "AlternativeHypthesis": "mean not equals to 0",
  3. "ConfidenceInterval": "(44.72234194006504, 46.27765805993496)",
  4. "ConfidenceLevel": 0.95,
  5. "alpha": 0.05,
  6. "df": 99,
  7. "mean": 45.5,
  8. "p": 0,
  9. "stdDeviation": 3.919647479510927,
  10. "t": 116.081867662439
  11. }

Chi-square test

Chi-square Goodness of Fit Test is used to determine the differences between the observed frequencies and the expected frequencies for each classification of a single multiclass classification nominal variable. The null hypothesis assumes that the observed frequencies and the expected frequencies are consistent.

PAI command

  1. PAI -name chisq_test
  2. -project algo_public
  3. -DinputTableName=pai_chisq_test_input
  4. -DcolName=f0
  5. -DprobConfig=0:0.3,1:0.7
  6. -DoutputTableName=pai_chisq_test_output0
  7. -DoutputDetailTableName=pai_chisq_test_output0_detail

Parameter description

Parameter Description Value range Required/Optional, default value/act
inputTableName Input table Table name Required
colName Column requiring a chi-square test Column name Required
outputTableName Output table Non-existent table name Required
outputDetailTableName Output detail table Non-existent table name Required
inputTablePartitions List of partitions in the input table Partition list (Optional) Default value: “”
probConfig Class probability configurations key-value pair; format: class:probability,class:probability…The sum of all probabilities is 1. (Optional) All classes have the same probability by default.

Example

Test data

  1. create table pai_chisq_test_input as
  2. select * from
  3. (
  4. select '1' as f0,'2' as f1 from dual
  5. union all
  6. select '1' as f0,'3' as f1 from dual
  7. union all
  8. select '1' as f0,'4' as f1 from dual
  9. union all
  10. select '0' as f0,'3' as f1 from dual
  11. union all
  12. select '0' as f0,'4' as f1 from dual
  13. )tmp;

PAI command

  1. PAI -name chisq_test
  2. -project algo_public
  3. -DinputTableName=pai_chisq_test_input
  4. -DcolName=f0
  5. -DprobConfig=0:0.3,1:0.7
  6. -DoutputTableName=pai_chisq_test_output0
  7. -DoutputDetailTableName=pai_chisq_test_output0_detail

Output description

Output table outputTableName has only one row and one column in the JSON format.

  1. {
  2. "Chi-Square": {
  3. "comment": "Pearson's chi-square test".
  4. "df": 1,
  5. "p-value": 0.75,
  6. "value": 0.2380952380952381
  7. }
  8. }

Output table outputDetailTableName: Corresponding columns are class, observed frequency (observed), expected frequency (expected), standard error (residuals = (observed-expected) / sqrt(expected)).

chisq_p1

Data view

The data view component provides the data pivot function. It visualizes feature value distribution and feature and label column distribution for users. Facilitates data analysis after the features and characteristics are clear and supports dense or sparse numeric features.

PAI command

  1. PAI -name fe_meta_runner -project algo_public
  2. -DinputTable="pai_dense_10_10"
  3. -DoutputTable="pai_temp_2263_20384_1"
  4. -DmapTable="pai_temp_2263_20384_2"
  5. -DselectedCols="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign,poutcome"
  6. -DlabelCol="y"
  7. -DcategoryCols="previous"
  8. -Dlifecycle="28"-DmaxBins="5" ;

Parameter description

Parameter key Description Required/Optional Default value
inputTable Name of the input data table Required NA
inputTablePartitions Partitions in the input table Optional NA
outputTable Name of the output table Required NA
mapTable Output mapping table. The data view maps string-type character strings into numbers (converts to Int to allow machine learning, recognition, and training) Required NA
selectedCols Type of the selected column names in the input table Required NA
categoryCols (Optional) Process Int- or double-type fields as enumeration features NA “”
maxBins (Optional) Maximum number of intervals for same-distance division of consecutive features NA 100
isSparse (Optional) Indicates whether it is a sparse feature in the key:value format. Default: dense data NA 100
itemSpliter (Optional) Delimiter between sparse feature items. Default value: “,” NA “,”
kvSpliter (Optional) Delimiter between sparse feature items. Default value: “:” NA “:”
lifecycle Life cycle of the output table, in days Optional 28

Example

Input data

age workclass fwlght edu edu_num married c family race sex gail loss work_year country income
39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174.0 0.0 40.0 United-States <=50K
50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0.0 0.0 13.0 United-States <=50K
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0.0 0.0 40.0 United-States <=50K
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0.0 0.0 40.0 United-States <=50K
28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0.0 0.0 40.0 Cuba <=50K
37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0.0 0.0 40.0 United-States <=50K
49 Private 160187 9th 5 Married-spouse-absent Other-service Not-in-family Black Female 0.0 0.0 16.0 Jamaica <=50K
52 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0.0 0.0 45.0 United-States >50K
31 Private 45781 Masters 14 Never-married Prof-specialty Not-in-family White Female 14084.0 0.0 50.0 United-States >50K
42 Private 159449 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 5178.0 0.0 40.0 United-States >50K

Modeling DAG

Modeling DAG

image

Data view configuration: select income as the target column and the other 14 fields as the feature columns. Perform enumeration value processing for the edu_num field of the bigint type.

image

Modeling effect

  • Enter data on the left. The family, race, sex, and income fields previously in the format of string are converted into numeric values. This allows data to be learnt by machines for algorithm training (provides the data format conversion function in some extent).
gail loss work_year age fwlght edu_num workclass edu married c family race sex country income
2174.0 0.0 40.0 39.0 77516.0 1.0 3.0 3.0 4.0 1.0 2.0 2.0 2.0 3.0 0.0
0.0 0.0 13.0 50.0 83311.0 1.0 2.0 3.0 2.0 2.0 1.0 2.0 2.0 3.0 0.0
0.0 0.0 40.0 38.0 215646.0 5.0 1.0 4.0 1.0 3.0 2.0 2.0 2.0 3.0 0.0
0.0 0.0 40.0 53.0 234721.0 4.0 1.0 1.0 2.0 3.0 1.0 1.0 2.0 3.0 0.0
0.0 0.0 40.0 28.0 338409.0 1.0 1.0 3.0 2.0 5.0 3.0 1.0 1.0 1.0 0.0
0.0 0.0 40.0 37.0 284582.0 2.0 1.0 5.0 2.0 2.0 3.0 2.0 1.0 3.0 0.0
0.0 0.0 16.0 49.0 160187.0 3.0 1.0 2.0 3.0 4.0 2.0 1.0 1.0 2.0 0.0
0.0 0.0 45.0 52.0 209642.0 5.0 2.0 4.0 2.0 2.0 1.0 2.0 2.0 3.0 1.0
14084.0 0.0 50.0 31.0 45781.0 2.0 1.0 5.0 4.0 5.0 2.0 2.0 1.0 3.0 1.0
5178.0 0.0 40.0 42.0 159449.0 1.0 1.0 3.0 2.0 2.0 1.0 2.0 2.0 3.0 1.0
  • The mapping table is generated on the right.
feature_name feature_value map_id
income <=50 0
income >50K 1
edu_num 13 1
edu_num 14 2
edu_num 5 3
edu_num 7 4
edu_num 9 5
workclass Private 1
workclass Self-emp-not-inc 2
workclass State-gov 3
edu 11th 1
edu 9th 2
edu Bachelors 3
edu HS-grad 4
edu Masters 5
married Divorced 1
married Married-civ-spouse 2
married Married-spouse-absent 3
married Never-married 4
c Adm-clerical 1
c Exec-managerial 2
c Handlers-cleaners 3
c Other-service 4
c Prof-specialty 5
family Husband 1
family Not-in-family 2
family Wife 3
race Black 1
race White 2
sex Female 1
sex Male 2
country Cuba 1
country Jamaica 2
country United-States 3

Covariance

In the probability theory and statistics, covariance reflects the overall error between two variables. Variance is a special covariance when the two variables are the same.
Definition of the covariance between real-number random variable X and Y whose expected values are E(X) = μ and E(Y) = ν respectively:
cov(X, Y) = E((X - μ) (Y - ν))

PAI command

  1. PAI -name cov
  2. -project algo_public
  3. -DinputTableName=maple_test_cov_basic12x10_input
  4. -DoutputTableName=maple_test_cov_basic12x10_output
  5. -DcoreNum=6
  6. -DmemSizePerCore=110;

Parameter description

Parameter Description Value range Required/Optional, default value
inputTableName Input table Table name Required
inputTablePartitions Partitions used for training in the input table, in the format of partition_name=value. The multilevel partition name format is name1=value1/name2=value2. If you specify multiple partitions, separate them with a comma (,). (Optional) Default: all partitions
outputTableName List of the output table names Table name Required
selectedColNames Selected column name type in the input table Column name (Optional) All columns are selected by default
lifecycle (Optional) Life cycle of the output table Positive integer No life cycle
coreNum Number of nodes Used together with the memSizePerCore parameter, positive integer in the range of [1, 9999] (Optional) Automatically calculated by default
memSizePerCore Memory size of each node, in MB Positive integer in the range of [1024, 64*1024] (Optional) Automatically calculated by default

Empirical probability density chart

An empirical distribution indicates the nonparametric distribution obtained based on probability distribution estimation, when accurate parametric distribution is unavailable. The algorithm uses kernel distribution to estimate the probability density of sample data. Similar to the histogram, the generating function describes sample data distribution. The difference is that kernel distribution overlays contribution of all parts to generate a continuous smooth distribution curve while the histogram provides discrete description. When kernel distribution is used, the probability density of non-sample data points is not 0 but weighted overlay of probability density of sampling points in certain kernel distribution. In this implementation, kernel distribution uses Gaussian distribution.

  • For details about kernel distribution, see wiki.

  • For details about empirical distribution, see wiki.

PAI command

  1. PAI -name empirical_pdf
  2. -project algo_public
  3. -DinputTableName="test_data"
  4. -DoutputTableName="test_epdf_out"
  5. -DfeatureColNames="col0,col1,col2"
  6. -DinputTablePartitions="ds='20160101'"
  7. -Dlifecycle=1
  8. -DintervalNum=100

Parameter description

Parameter Description Value range Required/Optional, default value
inputTableName Name of the input table NA Required
outputTableName Name of the output table NA Required
featureColNames Input column Multiple double or bigint columns can be selected Required
labelColName Input label column. Calculate the feature column by group based all the label values in this column Select one bigint or string column. The number of label values cannot exceed 100 Optional, “”
inputTablePartitions Partitions in the input table NA Optional, “”
intervalNum Number of calculation frequency intervals. The larger number, the higher accuracy [1,1E14) (Optional) Default value: -1, which indicates that intervals are divided based on the value range of data in each column and the number of intervals is automatically calculated.
lifecycle Life cycle of the output table Positive integer (Optional) Default value: -1, which indicates that life cycle is not set.
coreNum Number of cores, used together with the memSizePerCore parameter Positive integer (Optional) Default value: -1, which indicates that the number of started instances is determined based on the input data amount.
memSizePerCore Memory size of each node, in MB Positive integer in the range of [1024, 64*1024] (Optional) Default value: -1, which indicates that the memory size is determined based on the input data amount.

Example

Data generation

  1. drop table if exists epdf_test;
  2. create table epdf_test as
  3. select
  4. *
  5. from
  6. (
  7. select 1.0 as col1 from dual
  8. union all
  9. select 2.0 as col1 from dual
  10. union all
  11. select 3.0 as col1 from dual
  12. union all
  13. select 4.0 as col1 from dual
  14. union all
  15. select 5.0 as col1 from dual
  16. ) tmp;

PAI command

  1. PAI -name empirical_pdf
  2. -project algo_public
  3. -DinputTableName=epdf_test
  4. -DoutputTableName=epdf_test_out
  5. -DfeatureColNames=col1;

Input description

You can select multiple columns that need to be calculated. You can select the label column and divide the selected columns into groups based on each label value. For example, if the label column contains value 0 and 1, the columns that need to be calculated are divided into a group where the label is 0 and another group where the label is 1. Then, the probability density can be drawn separately for the two groups. If the label column is not selected, the feature column is calculated as a whole.

Output description

Diagram and result table. The result table fields are as follows. If the label column is not selected, the label output is NULL.

Column name Data type
colName string
label string
x double
pdf double

Output table

  1. +------------+------------+------------+------------+
  2. | colname | label | x | pdf |
  3. +------------+------------+------------+------------+
  4. | col1 | NULL | 1.0 | 0.12775155176809325 |
  5. | col1 | NULL | 1.0404050505050506 | 0.1304256933829622 |
  6. | col1 | NULL | 1.0808101010101012 | 0.13306325897429525 |
  7. | col1 | NULL | 1.1212151515151518 | 0.1356613897616418 |
  8. | col1 | NULL | 1.1616202020202024 | 0.1382173796574596 |
  9. | col1 | NULL | 1.202025252525253 | 0.1407286844875733 |
  10. | col1 | NULL | 1.2424303030303037 | 0.14319293014274642 |
  11. | col1 | NULL | 1.2828353535353543 | 0.14560791960033242 |
  12. | col1 | NULL | 1.3232404040404049 | 0.14797163876379316 |
  13. | col1 | NULL | 1.3636454545454555 | 0.1502822610772349 |
  14. | col1 | NULL | 1.404050505050506 | 0.1525381508819247 |
  15. | col1 | NULL | 1.4444555555555567 | 0.1547378654919243 |
  16. | col1 | NULL | 1.4848606060606073 | 0.1568801559764068 |
  17. | col1 | NULL | 1.525265656565658 | 0.15896396664681753 |
  18. | col1 | NULL | 1.5656707070707085 | 0.16098843325768245 |
  19. | col1 | NULL | 1.6060757575757592 | 0.1629528799404685 |
  20. | col1 | NULL | 1.6464808080808098 | 0.16485681490034038 |
  21. | col1 | NULL | 1.6868858585858604 | 0.16669992491584543 |
  22. | col1 | NULL | 1.727290909090911 | 0.16848206869138338 |
  23. | col1 | NULL | 1.7676959595959616 | 0.17020326912168932 |
  24. | col1 | NULL | 1.8081010101010122 | 0.17186370453638117 |
  25. | col1 | NULL | 1.8485060606060628 | 0.17346369900080946 |
  26. | col1 | NULL | 1.8889111111111134 | 0.17500371175692428 |
  27. | col1 | NULL | 1.929316161616164 | 0.17648432589456017 |
  28. | col1 | NULL | 1.9697212121212146 | 0.17790623634938396 |
  29. | col1 | NULL | 2.0101262626262653 | 0.1792702373286898 |
  30. | col1 | NULL | 2.050531313131316 | 0.18057720927022053 |
  31. | col1 | NULL | 2.0909363636363665 | 0.18182810544221673 |
  32. | col1 | NULL | 2.131341414141417 | 0.18302393829491406 |
  33. | col1 | NULL | 2.1717464646464677 | 0.18416576567472337 |
  34. | col1 | NULL | 2.2121515151515183 | 0.1852546770123305 |
  35. | col1 | NULL | 2.252556565656569 | 0.18629177959496213 |
  36. | col1 | NULL | 2.2929616161616195 | 0.18727818503109434 |
  37. | col1 | NULL | 2.33336666666667 | 0.18821499601297229 |
  38. | col1 | NULL | 2.3737717171717208 | 0.18910329347850022 |
  39. | col1 | NULL | 2.4141767676767714 | 0.18994412426940221 |
  40. | col1 | NULL | 2.454581818181822 | 0.19073848937711185 |
  41. | col1 | NULL | 2.4949868686868726 | 0.19148733286168018 |
  42. | col1 | NULL | 2.535391919191923 | 0.1921915315221827 |
  43. | col1 | NULL | 2.575796969696974 | 0.19285188538972659 |
  44. | col1 | NULL | 2.6162020202020244 | 0.19346910910630113 |
  45. | col1 | NULL | 2.656607070707075 | 0.19404382424446043 |
  46. | col1 | NULL | 2.6970121212121256 | 0.1945765526142701 |
  47. | col1 | NULL | 2.7374171717171762 | 0.19506771059517916 |
  48. | col1 | NULL | 2.777822222222227 | 0.19551760452158667 |
  49. | col1 | NULL | 2.8182272727272775 | 0.19592642714194602 |
  50. | col1 | NULL | 2.858632323232328 | 0.1962942551623821 |
  51. | col1 | NULL | 2.8990373737373787 | 0.1966210478770638 |
  52. | col1 | NULL | 2.9394424242424293 | 0.1969066468790639 |
  53. | col1 | NULL | 2.97984747474748 | 0.19715077683721793 |
  54. | col1 | NULL | 3.0202525252525305 | 0.19735304731663747 |
  55. | col1 | NULL | 3.060657575757581 | 0.19751295561309964 |
  56. | col1 | NULL | 3.1010626262626317 | 0.19762989056457925 |
  57. | col1 | NULL | 3.1414676767676823 | 0.19770313729675995 |
  58. | col1 | NULL | 3.181872727272733 | 0.19773188285349683 |
  59. | col1 | NULL | 3.2222777777777836 | 0.19771522265793107 |
  60. | col1 | NULL | 3.262682828282834 | 0.19765216774530828 |
  61. | col1 | NULL | 3.303087878787885 | 0.19754165270453194 |
  62. | col1 | NULL | 3.3434929292929354 | 0.19738254426210697 |
  63. | col1 | NULL | 3.383897979797986 | 0.19717365043938664 |
  64. | col1 | NULL | 3.4243030303030366 | 0.19691373021193162 |
  65. | col1 | NULL | 3.4647080808080872 | 0.1966015035982942 |
  66. | col1 | NULL | 3.505113131313138 | 0.19623566210464843 |
  67. | col1 | NULL | 3.5455181818181885 | 0.19581487945135703 |
  68. | col1 | NULL | 3.585923232323239 | 0.19533782250778076 |
  69. | col1 | NULL | 3.6263282828282897 | 0.1948031623623475 |
  70. | col1 | NULL | 3.6667333333333403 | 0.1942095854560816 |
  71. | col1 | NULL | 3.707138383838391 | 0.19355580470939734 |
  72. | col1 | NULL | 3.7475434343434415 | 0.19284057057394655 |
  73. | col1 | NULL | 3.787948484848492 | 0.19206268194364004 |
  74. | col1 | NULL | 3.8283535353535427 | 0.19122099686158253 |
  75. | col1 | NULL | 3.8687585858585933 | 0.19031444296253852 |
  76. | col1 | NULL | 3.909163636363644 | 0.1893420275936375 |
  77. | col1 | NULL | 3.9495686868686946 | 0.18830284755928747 |
  78. | col1 | NULL | 3.989973737373745 | 0.1871960984396676 |
  79. | col1 | NULL | 4.030378787878796 | 0.18602108343567092 |
  80. | col1 | NULL | 4.070783838383846 | 0.18477722169674377 |
  81. | col1 | NULL | 4.111188888888897 | 0.1834640560916829 |
  82. | col1 | NULL | 4.151593939393948 | 0.1820812603860928 |
  83. | col1 | NULL | 4.191998989898998 | 0.18062864579383914 |
  84. | col1 | NULL | 4.232404040404049 | 0.179106166873458 |
  85. | col1 | NULL | 4.272809090909099 | 0.17751392674406796 |
  86. | col1 | NULL | 4.31321414141415 | 0.17585218159888508 |
  87. | col1 | NULL | 4.353619191919201 | 0.17412134449794325 |
  88. | col1 | NULL | 4.394024242424251 | 0.1723219884250765 |
  89. | col1 | NULL | 4.434429292929302 | 0.17045484859762067 |
  90. | col1 | NULL | 4.4748343434343525 | 0.16852082402064342 |
  91. | col1 | NULL | 4.515239393939403 | 0.1665209782808102 |
  92. | col1 | NULL | 4.555644444444454 | 0.16445653957824907 |
  93. | col1 | NULL | 4.596049494949504 | 0.16232889999798905 |
  94. | col1 | NULL | 4.636454545454555 | 0.16013961402571825 |
  95. | col1 | NULL | 4.6768595959596055 | 0.1578903963157465 |
  96. | col1 | NULL | 4.717264646464656 | 0.15558311872216193 |
  97. | col1 | NULL | 4.757669696969707 | 0.1532198066072439 |
  98. | col1 | NULL | 4.798074747474757 | 0.1508026344442397 |
  99. | col1 | NULL | 4.838479797979808 | 0.14833392073462115 |
  100. | col1 | NULL | 4.878884848484859 | 0.14581612226291346 |
  101. | col1 | NULL | 4.919289898989909 | 0.1432518277151203 |
  102. | col1 | NULL | 4.95969494949496 | 0.1406437506896507 |
  103. | col1 | NULL | 5.00010000000001 | 0.13799472213247665 |
  104. +------------+------------+------------+------------+

Algorithm scale

If the label column is selected, the number of labels cannot exceed 100.

Box plot

Visualize the box plot and disturbance of a number of continuous features and a specific enumeration feature.

PAI command

  1. PAI -name box_plot -project algo_public
  2. -DinputTable="boxplot"
  3. -DcontinueCols="age"
  4. -DcategoryCol="y"
  5. -DoutputTable="pai_temp_6075_97181_1"
  6. -DsampleSize="1000"
  7. -Dlifecycle="7";

Parameter description

Parameter key Description Required/Optional Default value
inputTable Name of the input data table Required NA
inputTablePartitions Partitions in the input table Optional NA
outputTable Name of the output table that contains the box plot and samples Required NA
continueCols (Required) Continuous value feature. You can select multiple columns. Required NA
categoryCol (Required) Enumeration feature column. Select one column. Required NA
sampleSize Number of samples used for drawing disturbance of each feature NA 1000
lifecycle Life cycle of the output table, in days Optional 28

Example

Input data

  1. create table boxplot as select age, y from bank_data limit 100;
age y
50 0
53 0
28 1
39 0
55 1
30 0
37 0
39 0
36 1
27 0
34 0
41 0
55 1
33 0
26 0
52 0
35 1
27 1
28 0
26 0
41 0
35 0
40 0
32 0
41 0
34 0
49 0
37 0
35 0
38 0
47 0
46 0
27 0
29 1
32 0
36 0
29 0
47 0
44 0
54 0
36 0
42 0
44 0
72 1
48 0
36 0
35 0
43 0
56 0
42 0
31 0
32 0
33 0
31 0
39 0
30 0
24 0
24 0
38 0
26 0
41 0
34 0
30 1
37 0
68 0
31 0
48 0
33 0
59 0
44 0
28 0
50 0
33 0
45 0
40 0
45 0
43 0
54 0
53 0
35 0
30 0
25 0
35 0
54 1
30 0
38 0
35 0
47 0
32 0
27 0
40 1
31 0
42 0
40 0
31 0
57 0
38 1
39 0
37 0
44 0

Parameter settings

Select age as the continuous feature and y as the enumeration feature and use the default value for other parameters.

Running effect

Box plot distribution:

image

Disturbance point:

image

Scatter plot

The Scatter plot component represents a chart where data points are distributed on the Cartesian coordinate plane in regression analysis.

PAI command

  1. PAI -name scatter_diagram -project algo_public
  2. -DselectedCols=emp_var_rate,cons_price_rate,cons_conf_idx,euribor3m
  3. -DsampleSize=1000
  4. -DlabelCol=y
  5. -DmapTable=pai_temp_2447_22859_2
  6. -DinputTable=scatter_diagram
  7. -DoutputTable=pai_temp_2447_22859_1;

Parameter description

Parameter key Description Required/Optional Default value
inputTable Name of the input data table Required NA
inputTablePartitions Partitions in the input table Optional NA
outputTable Name of the output table that contains the samples Required NA
mapTable Output table that contains the maximum value, minimum value, and enumeration values of each feature Required NA
selectedCols Columns selected in the input table, used to draw the scatter chart between every two features. A maximum of five features can be selected. Required NA
labelCol (Optional) Use the Int or String field as the enumeration label column NA “”
sampleSize (Optional) Perform sampling for input data NA 1000
lifecycle Life cycle of the output table, in days Optional 28

Example

Input data

  1. create table scatter_diagram as select emp_var_rate,cons_price_rate, cons_conf_idx,euribor3m,y from pai_bank_data limit 10
emp_var_rate cons_price_rate cons_conf_idx euribor3m y
1.4 93.918 -42.7 4.962 0
-0.1 93.2 -42.0 4.021 0
-1.7 94.055 -39.8 0.729 1
-1.8 93.075 -47.1 1.405 0
-2.9 92.201 -31.4 0.869 1
1.4 93.918 -42.7 4.961 0
-1.8 92.893 -46.2 1.327 0
-1.8 92.893 -46.2 1.313 0
-2.9 92.963 -40.8 1.266 1
-1.8 93.075 -47.1 1.41 0
1.1 93.994 -36.4 4.864 0
1.4 93.444 -36.1 4.964 0
1.4 93.444 -36.1 4.965 1
-1.8 92.893 -46.2 1.291 0
1.4 94.465 -41.8 4.96 0
1.4 93.918 -42.7 4.962 0
-1.8 93.075 -47.1 1.365 1
-0.1 93.798 -40.4 4.86 1
1.1 93.994 -36.4 4.86 0
1.4 93.918 -42.7 4.96 0
-1.8 93.075 -47.1 1.405 0
1.4 94.465 -41.8 4.967 0
1.4 93.918 -42.7 4.963 0
1.4 93.918 -42.7 4.968 0
1.4 93.918 -42.7 4.962 0
-1.8 92.893 -46.2 1.344 0
-3.4 92.431 -26.9 0.754 0
-1.8 93.075 -47.1 1.365 0
-1.8 92.893 -46.2 1.313 0
1.4 93.918 -42.7 4.961 0
1.4 94.465 -41.8 4.961 0
-1.8 92.893 -46.2 1.327 0
-1.8 92.893 -46.2 1.299 0
-2.9 92.963 -40.8 1.268 1
1.4 93.918 -42.7 4.963 0
-1.8 92.893 -46.2 1.334 0
1.4 93.918 -42.7 4.96 0
-1.8 93.075 -47.1 1.405 0
1.4 94.465 -41.8 4.96 0
1.4 93.444 -36.1 4.962 0
1.1 93.994 -36.4 4.86 0
1.1 93.994 -36.4 4.857 0
1.4 93.918 -42.7 4.961 0
-3.4 92.649 -30.1 0.715 1
1.4 93.444 -36.1 4.966 0
-0.1 93.2 -42.0 4.076 0
1.4 93.444 -36.1 4.965 0
-1.8 92.893 -46.2 1.354 0
1.4 93.444 -36.1 4.967 0
1.4 94.465 -41.8 4.959 0
-1.8 92.893 -46.2 1.354 0
1.4 94.465 -41.8 4.958 0
-1.8 92.893 -46.2 1.354 0
1.4 94.465 -41.8 4.864 0
1.1 93.994 -36.4 4.859 0
1.1 93.994 -36.4 4.857 0
-1.8 92.893 -46.2 1.27 0
1.1 93.994 -36.4 4.857 0
1.1 93.994 -36.4 4.859 0
1.4 94.465 -41.8 4.959 0
1.1 93.994 -36.4 4.856 0
-1.8 93.075 -47.1 1.405 0
-1.8 92.843 -50.0 1.811 1
-0.1 93.2 -42.0 4.021 0
-2.9 92.469 -33.6 1.029 0
1.4 93.918 -42.7 4.962 0
-1.8 93.075 -47.1 1.365 0
1.1 93.994 -36.4 4.857 0
-1.8 92.893 -46.2 1.259 0
1.1 93.994 -36.4 4.857 0
1.4 94.465 -41.8 4.866 0
-2.9 92.201 -31.4 0.883 0
-0.1 93.2 -42.0 4.076 0
1.1 93.994 -36.4 4.857 0
1.4 93.918 -42.7 4.96 0
1.4 93.444 -36.1 4.962 0
1.1 93.994 -36.4 4.858 0
1.1 93.994 -36.4 4.857 0
1.1 93.994 -36.4 4.856 0
1.4 93.918 -42.7 4.968 0
1.4 93.444 -36.1 4.966 0
1.4 94.465 -41.8 4.962 0
1.4 93.444 -36.1 4.963 0
-1.8 92.843 -50.0 1.56 1
1.4 93.918 -42.7 4.96 0
1.4 93.444 -36.1 4.963 0
-3.4 92.431 -26.9 0.74 0
1.1 93.994 -36.4 4.856 0
1.4 93.918 -42.7 4.962 0
1.1 93.994 -36.4 4.856 0
-0.1 93.2 -42.0 4.245 1
1.1 93.994 -36.4 4.857 0
-1.8 93.075 -47.1 1.405 0
-1.8 92.893 -46.2 1.327 0
-0.1 93.2 -42.0 4.12 0
1.4 94.465 -41.8 4.958 0
-1.8 93.749 -34.6 0.659 1
1.1 93.994 -36.4 4.858 0
1.1 93.994 -36.4 4.858 0
1.4 93.444 -36.1 4.963 0

Parameter settings

Select select emp_var_rate, cons_price_rate, cons_conf_idx, and euribor3m as the feature columns, and select y for the optional label columns for the scatter plot.

Running effect

Distribution of classification tags between features is directly displayed.

image

Correlation matrix

The correlation coefficient is used to measure the correlation between columns in a matrix. The value range of the correlation coefficient is [-1,1]. The value of count is the number of non-zero elements in both successive columns. This value may vary with columns.

PAI command

  1. PAI -name corrcoef
  2. -project algo_public
  3. -DinputTableName=maple_test_corrcoef_basic12x10_input
  4. -DoutputTableName=maple_test_corrcoef_basic12x10_output
  5. -DcoreNum=1
  6. -DmemSizePerCore=110;

Parameter description

Parameter Description Value range Required/Optional, default value
inputTableName Input table Table name Required
inputTablePartitions Partitions used for training in the input table, in the format of partition_name=value. The multilevel partition name format is name1=value1/name2=value2. If you specify multiple partitions, separate them with a comma (,). (Optional) Default: all partitions
outputTableName List of the output table names Table name Required
selectedColNames Selected column name type in the input table Column name (Optional) All columns are selected by default
lifecycle (Optional) Life cycle of the output table Positive integer No life cycle
coreNum Number of nodes Used together with the memSizePerCore parameter. Positive integer in the range of [1, 9999]. Detailed description (Optional) Automatically calculated by default
memSizePerCore Memory size of each node, in MB Positive integer in the range of [1024, 64*1024] Detailed description (Optional) Automatically calculated by default

Example

Data generation

col0:double col1:bigint col2:double col3:bigint col4:double col5:bigint col6:double col7:bigint col8:double col9:double
19 95 33 52 115 43 32 98 76 40
114 26 101 69 56 59 116 23 109 105
103 89 7 9 65 118 73 50 55 81
79 20 63 71 5 24 77 31 21 75
87 16 66 47 25 14 42 99 108 57
11 104 38 37 106 51 3 91 80 97
84 30 70 46 8 6 94 22 45 48
35 17 107 64 10 78 53 34 90 96
13 61 39 1 29 117 112 2 82 28
62 4 102 88 100 36 67 54 12 85
49 27 44 93 68 110 60 72 86 58
92 119 0 113 41 15 74 83 18 111

PAI command

  1. PAI -name corrcoef
  2. -project algo_public
  3. -DinputTableName=maple_test_corrcoef_basic12x10_input
  4. -DoutputTableName=maple_test_corrcoef_basic12x10_output
  5. -DcoreNum=1
  6. -DmemSizePerCore=110;

Output table

columnsnames col0 col1 col2 col3 col4 col5 col6 col7 col8 col9
col0 1 -0.2115657251820724 0.0598306259706561 0.2599903570684693 -0.3483249188225586 -0.28716254396809926 0.47880162127435116 -0.13646519484213326 -0.19500158764680092 0.3897390240949085
col1 -0.2115657251820724 1 -0.8444477377898585 -0.17507636221594533 0.40943384150571377 0.09135976026101403 -0.3018506374626574 0.40733726912808044 -0.11827739124590071 0.12433851389455183
col2 0.0598306259706561 -0.8444477377898585 1 0.18518346647293102 -0.20934839228057014 -0.1896417512389659 0.1799377498863213 -0.3858885676469948 0.20254569203773892 0.13476160753756655
col3 0.2599903570684693 -0.17507636221594533 0.18518346647293102 1 0.03988018649854009 -0.43737887418329147 -0.053818296425267184 0.2900856441586986 -0.3607547910075688 0.4912019074930449
col4 -0.3483249188225586 0.40943384150571377 -0.20934839228057014 0.03988018649854009 1 0.1465605209246875 -0.5016030364347955 0.5496024325711117 0.013743256115394122 0.07497231559184887
col5 -0.28716254396809926 0.09135976026101403 -0.1896417512389659 -0.43737887418329147 0.1465605209246875 1 0.16729809310873522 -0.29890655828796964 0.3618518101014617 -0.1713960957286885
col6 0.47880162127435116 -0.3018506374626574 0.1799377498863213 -0.053818296425267184 -0.5016030364347955 0.16729809310873522 1 -0.8165019880156462 -0.11173420918721436 -0.10363860378347944
col7 -0.13646519484213326 0.40733726912808044 -0.3858885676469948 0.2900856441586986 0.5496024325711117 -0.29890655828796964 -0.8165019880156462 1 0.07435907471544469 0.11711976051999162
col8 -0.19500158764680092 -0.11827739124590071 0.20254569203773892 -0.3607547910075688 0.013743256115394122 0.3618518101014617 -0.11173420918721436 0.07435907471544469 1 -0.18463012549540175
col9 0.3897390240949085 0.12433851389455183 0.13476160753756655 0.4912019074930449 0.07497231559184887 -0.1713960957286885 -0.10363860378347944 0.11711976051999162 -0.18463012549540175 1

Normality test

Normality tests are used to determine whether a data set is well-modeled by a normal distribution. This component consists of three test methods: Anderson-Darling Test (for details, see wiki), Kolmogorov-Smirnov Test (for details, see wiki), and QQ Plot (for details, see wiki). Use one or more methods based on requirements.

Algorithm description

  • Original hypothesis H0: The observed values are in a normal distribution; H1: The observed values are not in a normal distribution.

  • KS p-value calculation method progressively calculates the CDF of KS distribution regardless of the sample size. For details, see wiki.

  • If the sample size is greater than 1,000, the QQ Plot method samples to calculate and output plots, which means the data points in the plot do not necessarily cover all the samples.

PAI command

  1. PAI -name normality_test
  2. -project algo_public
  3. -DinputTableName=test
  4. -DoutputTableName=test_out
  5. -DselectedColNames=col1,col2
  6. -Dlifecycle=1;

Parameter description

Parameter Description Value range Required/Optional, default value
inputTableName Input table NA Required
outputTableName Name of the output table NA Required
selectedColNames Selected field columns You can select multiple double or bigint columns. Optional
inputTablePartitions Input table partitions NA Optional, “”
enableQQplot Use the QQ plot NA (Optional) Default value: true
enableADtest Use the Anderson-Darling test NA (Optional) Default value: true
enableKStest Use the Kolmogorov-Smirnov test NA (Optional) Default value: true
lifecycle Life cycle An integer equal to or more than -1 (Optional) Default value: -1, which indicates that life cycle is not set.
coreNum Number of cores An integer more than 0 (Optional) Default value: -1, which indicates that the number of started instances is determined based on the input data volume.
memSizePerCore Memory size of each node, in MB [100,64*1024] (Optional) Default value: -1, which indicates that the memory size is determined based on the input data volume.

Example

Data generation

  1. drop table if exists normality_test_input;
  2. create table normality_test_input as
  3. select
  4. *
  5. from
  6. (
  7. select 1 as x from dual
  8. union all
  9. select 2 as x from dual
  10. union all
  11. select 3 as x from dual
  12. union all
  13. select 4 as x from dual
  14. union all
  15. select 5 as x from dual
  16. union all
  17. select 6 as x from dual
  18. union all
  19. select 7 as x from dual
  20. union all
  21. select 8 as x from dual
  22. union all
  23. select 9 as x from dual
  24. union all
  25. select 10 as x from dual
  26. ) tmp;

PAI command

  1. PAI -name normality_test
  2. -project projectxlib4
  3. -DinputTableName=normality_test_input
  4. -DoutputTableName=normality_test_output
  5. -DselectedColNames=x
  6. -Dlifecycle=1;

Input description

Input format: Select the columns that need to be calculated. You can select multiple double or bigint columns.

Output description

  • Output format: Diagram and result table. The result table fields are as follows: If the table has two partitions, the p=’test’ partition provides the results of the AD or KS test only when the enableADtest or enableKStest parameter is set to true.

  • The p=’plot’ partition provides the QQ plot data only when the enableQQplot parameter is set to true, and the p=’test’ column is reused. That is, when p=’plot’, the testvalue column records the original observed data (X-axis of the QQ plot), and the pvalue column records the expected data (Y-axis of the QQ plot) when the data follows normal distribution.

Column name Data type Description
colName String Column name
testname String Test name
testvalue double Test value/X-axis of the QQ plot
pvalue double Test p value/Y-axis of the QQ plot
p double Partition name

Output table

  1. +------------+------------+------------+------------+------------+
  2. | colname | testname | testvalue | pvalue | p |
  3. +------------+------------+------------+------------+------------+
  4. | x | NULL | 1.0 | 0.8173291742279805 | plot |
  5. | x | NULL | 2.0 | 2.470864450785345 | plot |
  6. | x | NULL | 3.0 | 3.5156067948020056 | plot |
  7. | x | NULL | 4.0 | 4.3632330349313095 | plot |
  8. | x | NULL | 5.0 | 5.128868067945126 | plot |
  9. | x | NULL | 6.0 | 5.871131932054874 | plot |
  10. | x | NULL | 7.0 | 6.6367669650686905 | plot |
  11. | x | NULL | 8.0 | 7.4843932051979944 | plot |
  12. | x | NULL | 9.0 | 8.529135549214654 | plot |
  13. | x | NULL | 10.0 | 10.182670825772018 | plot |
  14. | x | Anderson_Darling_Test | 0.1411092332197832 | 0.9566579606430077 | test |
  15. | x | Kolmogorov_Smirnov_Test | 0.09551932503797644 | 0.9999888659426232 | test |
  16. +------------+------------+------------+------------+------------+

Lorenz curve

The Lorenz curve studies how income is distributed among a population. To study the distribution of income among a population, the American statistician (or Austrian statistician) M.O.Lorenz (Max Otto Lorenz, 1903 - ) proposed the famous Lorenz curve in 1907 (or 1905). Later, the Italian economist Gini defined the Gini Coefficient based on Lorenz curve.

The Gini Coefficient is represented by a rectangle. The height of the rectangle measures the percentage of social wealth and is divided into N equal fractions, each of which equals to 1/N of total social wealth. On the length of the rectangle, all families are arranged from left to right in the poorest to the richest order and are also divided into N equal fractions. The first fraction represents the 1/N of the families with lowest income. In this rectangle, cumulating the proportions of the wealth owned by every 1/N families and illustrating them as points in the plot produces a curve, namely, the Lorenz curve.

PAI command

  1. PAI -name LorenzCurve
  2. -project algo_public
  3. -DinputTableName=maple_test_lorenz_basic10_input
  4. -DcolName=col0
  5. -DoutputTableName=maple_test_lorenz_basic10_output -DcoreNum=20
  6. -DmemSizePerCore=110;

Parameter description

Parameter Description Value range Required/Optional, default value
inputTableName Input table Table name Required
outputTableName Name of the output table Required NA
colName Column names separated by commas (,) NA (Optional) By default, the entire table is selected.
N Quantile count (Optional) Default value: 100 (Optional) Default value: 100
inputPartitions Partitions used for training in the input table, in the format of Partition_name=value The multilevel partition name format is name1=value1/name2=value2. If you specify multiple partitions, separate them with a comma (,). NA (Optional) Default: all partitions
lifecycle (Optional) Life cycle of the output table Positive integer No life cycle
coreNum Number of nodes Used together with the memSizePerCore parameter, positive integer in the range of [1, 9999]. [Detailed description] (Optional) Automatically calculated by default
memSizePerCore Memory size of each node, in MB Positive integer in the range of [1024, 64*1024] (Optional) Automatically calculated by default

Example

Data generation

col0:double
4
7
2
8
6
3
9
5
0
1
10

PAI command

  1. PAI -name LorenzCurve
  2. -project algo_public
  3. -DinputTableName=maple_test_lorenz_basic10_input
  4. -DcolName=col0
  5. -DoutputTableName=maple_test_lorenz_basic10_output
  6. -DcoreNum=20
  7. -DmemSizePerCore=110;

Output description

Output table

quantile col0
0 0
1 0.01818181818181818
2 0.01818181818181818
3 0.01818181818181818
4 0.01818181818181818
5 0.01818181818181818
6 0.01818181818181818
7 0.01818181818181818
8 0.01818181818181818
9 0.01818181818181818
10 0.01818181818181818
11 0.05454545454545454
12 0.05454545454545454
13 0.05454545454545454
14 0.05454545454545454
85 0.8181818181818182
86 0.8181818181818182
87 0.8181818181818182
88 0.8181818181818182
89 0.8181818181818182
90 1
91 1
92 1
93 1
94 1
95 1
96 1
97 1
98 1
99 1
100 1
Thank you! We've received your feedback.