The Normality Test component checks whether a dataset is consistent with being drawn from a normally distributed population. It provides the Anderson-Darling test, the Kolmogorov-Smirnov test, and the QQ plot test, which assess the distribution characteristics of a dataset to support further statistical analysis and modeling.
Algorithm description
The Normality Test component offers the Anderson-Darling test, Kolmogorov-Smirnov test, and QQ plot test methods. You can select one or more methods for testing.
- Anderson-Darling test: A goodness-of-fit test that gives extra weight to the tails of the distribution. It measures how well the sample data fits a theoretical distribution by using a weighted integral of the squared difference between the empirical and theoretical cumulative distribution functions.
- Kolmogorov-Smirnov test: A non-parametric test that compares a sample distribution with a reference distribution, or compares two sample distributions. It uses the maximum difference between the cumulative distribution functions as the measure of fit.
- QQ plot test: A graphical method for comparing a sample distribution with a theoretical distribution, or comparing two sample distributions. It reveals discrepancies by plotting the quantiles of one distribution against the quantiles of the other.
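The two test statistics described above can be sketched in a few lines of Python. This is a minimal illustration, not the component's implementation: it assumes that the reference normal distribution is fitted with the sample mean and the n-1 standard deviation, it computes only the statistics (no p-values), and the function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def fit_normal(sample):
    """Assumption: fit the reference normal with the sample mean and n-1 std."""
    return NormalDist(mean(sample), stdev(sample))


def anderson_darling_statistic(sample):
    """A^2 = -n - (1/n) * sum((2i-1) * [ln F(x_i) + ln(1 - F(x_(n+1-i)))])."""
    xs = sorted(sample)
    n = len(xs)
    dist = fit_normal(xs)
    cdf = [dist.cdf(x) for x in xs]
    total = 0.0
    for i in range(1, n + 1):
        # The log terms grow quickly near 0 and 1, which weights the tails.
        total += (2 * i - 1) * (math.log(cdf[i - 1]) + math.log(1.0 - cdf[n - i]))
    return -n - total / n


def kolmogorov_smirnov_statistic(sample):
    """D = largest gap between the empirical CDF and the fitted normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    dist = fit_normal(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = dist.cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x, so check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


data = list(range(1, 11))
print("Anderson-Darling A^2:", anderson_darling_statistic(data))
print("Kolmogorov-Smirnov D:", kolmogorov_smirnov_statistic(data))
```

For the integers 1 to 10, this sketch gives approximately A² ≈ 0.141 and D ≈ 0.0955, which is in line with the statistics shown in the Examples section. Extreme outliers whose fitted CDF value rounds to 0 or 1 would make the logarithm fail, and this sketch does not handle that case.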
Configure the component
Method 1: Configure the component on the pipeline page
Add a Normality Test component on the pipeline page and configure the following parameters:
| Category | Parameter | Description |
| -------- | --------- | ----------- |
| Fields Setting | Columns | The column to perform the normality test on. |
| Parameters Setting | Anderson-Darling Test | Whether to perform the Anderson-Darling test. |
| Parameters Setting | Kolmogorov-Smirnov Test | Whether to perform the Kolmogorov-Smirnov test. |
| Parameters Setting | Use QQ Plot | Whether to perform the QQ plot test. |
| Tuning | Computing Cores | The number of cores used in computing. The value must be a positive integer. |
| Tuning | Memory Size per Core (Unit: MB) | The memory size of each core. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL script.
PAI -name normality_test
-project algo_public
-DinputTableName=test
-DoutputTableName=test_out
-DselectedColNames=col1,col2
-Dlifecycle=1;
| Parameter | Required | Default Value | Description |
| --------- | -------- | ------------- | ----------- |
| inputTableName | Yes | None | The name of the input table to be tested. |
| outputTableName | Yes | None | The name of the output table. |
| selectedColNames | No | None | The columns selected from the input table. You can select multiple columns of the DOUBLE or BIGINT type. |
| inputTablePartitions | No | "" | The name of the partition of the input table. |
| enableQQplot | No | true | Whether to perform the QQ plot test. |
| enableADtest | No | true | Whether to perform the Anderson-Darling test. |
| enableKStest | No | true | Whether to perform the Kolmogorov-Smirnov test. |
| lifecycle | No | -1 | The lifecycle of the output table. The value is an integer that is greater than or equal to -1. The default value -1 indicates that the lifecycle of the output table is not set. |
| coreNum | No | -1 | The number of cores used in computing. This parameter is used with memSizePerCore. The value must be a positive integer. The default value -1 indicates that the number of instances is determined by the amount of input data. |
| memSizePerCore | No | -1 | The memory size of each core. Unit: MB. The value must be a positive integer. Valid values: (100, 64 × 1024). The default value -1 indicates that the memory size of each core is determined by the amount of input data. |
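When the command is generated from code, the optional parameters in the table can be assembled programmatically. The following Python sketch is one way to do that. The helper name and its arguments are hypothetical, the lowercase true/false spelling of the switches is assumed from the default values listed in the table, and the memory range is read as having exclusive bounds.

```python
def build_normality_test_command(input_table, output_table, columns=None,
                                 enable_qq=True, enable_ad=True, enable_ks=True,
                                 lifecycle=None, core_num=None,
                                 mem_size_per_core=None):
    """Hypothetical helper: assemble a PAI normality_test command string."""
    if core_num is not None and core_num <= 0:
        raise ValueError("coreNum must be a positive integer")
    # Assumption: the documented range (100, 64 * 1024) has exclusive bounds.
    if mem_size_per_core is not None and not 100 < mem_size_per_core < 64 * 1024:
        raise ValueError("memSizePerCore must be within (100, 64 * 1024) MB")

    args = [
        "-project algo_public",
        f"-DinputTableName={input_table}",
        f"-DoutputTableName={output_table}",
    ]
    if columns:
        args.append("-DselectedColNames=" + ",".join(columns))
    # Assumption: Boolean switches are written as lowercase true/false.
    for flag, enabled in (("enableQQplot", enable_qq),
                          ("enableADtest", enable_ad),
                          ("enableKStest", enable_ks)):
        if not enabled:  # all three default to true, so only emit overrides
            args.append(f"-D{flag}=false")
    if lifecycle is not None:
        args.append(f"-Dlifecycle={lifecycle}")
    if core_num is not None:
        args.append(f"-DcoreNum={core_num}")
    if mem_size_per_core is not None:
        args.append(f"-DmemSizePerCore={mem_size_per_core}")
    return "PAI -name normality_test\n    " + "\n    ".join(args) + ";"


# Example: run only the Anderson-Darling and Kolmogorov-Smirnov tests.
print(build_normality_test_command("test", "test_out", ["col1", "col2"],
                                   enable_qq=False, lifecycle=1))
```

Parameters left at their defaults are omitted from the generated command, so the output stays close to the hand-written command shown earlier.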
Examples
1. Add a SQL Script component and deselect Use Script Mode and Whether the system adds a create table statement. Then enter the following SQL statement:

    drop table if exists normality_test_input;
    create table normality_test_input as
    select * from (
      select 1 as x
      union all select 2 as x
      union all select 3 as x
      union all select 4 as x
      union all select 5 as x
      union all select 6 as x
      union all select 7 as x
      union all select 8 as x
      union all select 9 as x
      union all select 10 as x
    ) tmp;

2. Add another SQL Script component and deselect Use Script Mode and Whether the system adds a create table statement. Enter the following PAI command, and then connect the components from Step 1 and Step 2:

    drop table if exists ${o1};
    PAI -name normality_test
        -project algo_public
        -DinputTableName=normality_test_input
        -DoutputTableName=${o1}
        -DselectedColNames=x
        -Dlifecycle=1;

3. Click the icon in the upper left corner to run the pipeline.
4. Right-click the SQL Script component created in Step 2 and choose View Data > SQL Script Output to view the test results.

| colname | testname                | testvalue           | pvalue             |
| ------- | ----------------------- | ------------------- | ------------------ |
| x       |                         | 1.0                 | 0.8173291742279805 |
| x       |                         | 2.0                 | 2.470864450785345  |
| x       |                         | 3.0                 | 3.5156067948020056 |
| x       |                         | 4.0                 | 4.3632330349313095 |
| x       |                         | 5.0                 | 5.128868067945126  |
| x       |                         | 6.0                 | 5.871131932054874  |
| x       |                         | 7.0                 | 6.6367669650686905 |
| x       |                         | 8.0                 | 7.4843932051979944 |
| x       |                         | 9.0                 | 8.529135549214654  |
| x       |                         | 10.0                | 10.182670825772018 |
| x       | Anderson_Darling_Test   | 0.1411092332197832  | 0.9566579606430077 |
| x       | Kolmogorov_Smirnov_Test | 0.09551932503797644 | 0.9999888659426232 |
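The rows with an empty testname appear to be the QQ plot points: the testvalue column holds the sorted sample values, and the pvalue column holds what look like the matching theoretical normal quantiles. The following Python sketch reproduces those numbers under two assumptions that the documentation does not state: Blom plotting positions, (i - 0.375) / (n + 0.25), and a normal distribution fitted with the sample mean and the n-1 standard deviation. The function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Return (observed value, theoretical normal quantile) pairs for a QQ plot."""
    xs = sorted(sample)
    n = len(xs)
    # Assumption: the reference normal uses the sample mean and n-1 std.
    dist = NormalDist(mean(xs), stdev(xs))
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.375) / (n + 0.25)  # Assumption: Blom plotting position
        points.append((float(x), dist.inv_cdf(p)))
    return points


for observed, theoretical in qq_points(range(1, 11)):
    print(observed, theoretical)
```

If the sample is close to normal, the pairs fall near the line y = x, as they do here. Large deviations from that line, especially at either end, point to heavier or lighter tails than a normal distribution would have.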