A normality test is a statistical method used to determine whether a dataset is drawn from a normally distributed population. The Normality Test component provides the Anderson-Darling test, the Kolmogorov-Smirnov test, and the QQ plot test, which assess the distribution characteristics of a dataset to support further statistical analysis and modeling.
Algorithm description
The Normality Test component offers the Anderson-Darling Test, Kolmogorov-Smirnov Test, and QQ plot test methods. You can select one or multiple methods for testing.
Anderson-Darling test: This goodness-of-fit test gives extra weight to differences in the tails of a distribution. It measures how well sample data fits a particular theoretical distribution by evaluating the weighted squared differences between the empirical cumulative distribution function of the sample and the theoretical cumulative distribution function.
Kolmogorov-Smirnov test: This non-parametric test compares a sample distribution with a reference distribution, or compares two sample distributions with each other. It uses the maximum difference between the two cumulative distribution functions to assess the goodness of fit.
QQ plot test: This graphical tool visually compares a sample distribution with a theoretical distribution, or compares two sample distributions with each other. It reveals distribution discrepancies by plotting the quantiles of one distribution against the quantiles of the other.
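The two test statistics described above can be sketched in a few lines of standard-library Python. This is a minimal illustration and not the component's implementation: the function names are hypothetical, and the sketch assumes that the reference normal distribution is fitted with the sample mean and the sample standard deviation (n - 1 denominator). It computes only the statistics, not the p-values.

```python
import math
from statistics import NormalDist, mean, stdev


def fitted_normal_cdf(sample):
    """Sort the sample and evaluate a fitted normal CDF at each point.

    Assumption: the normal distribution is fitted with the sample mean
    and the sample standard deviation (n - 1 denominator).
    """
    xs = sorted(sample)
    dist = NormalDist(mean(xs), stdev(xs))
    return xs, [dist.cdf(x) for x in xs]


def ks_statistic(sample):
    """Kolmogorov-Smirnov statistic: the largest gap between the empirical
    CDF (a step function) and the fitted normal CDF."""
    xs, cdf = fitted_normal_cdf(sample)
    n = len(xs)
    d_plus = max((i + 1) / n - f for i, f in enumerate(cdf))  # step top vs. CDF
    d_minus = max(f - i / n for i, f in enumerate(cdf))       # CDF vs. step bottom
    return max(d_plus, d_minus)


def ad_statistic(sample):
    """Anderson-Darling statistic A^2. The log terms weight both tails
    more heavily than the KS statistic does.

    Note: math.log raises ValueError if a CDF value rounds to 0 or 1,
    which can happen for extreme outliers.
    """
    xs, cdf = fitted_normal_cdf(sample)
    n = len(xs)
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
        for i in range(n)
    )
    return -n - total / n


data = list(range(1, 11))  # the values 1..10 used in the Examples section
print(ks_statistic(data), ad_statistic(data))
```

For the values 1 through 10, this sketch yields approximately 0.0955 for the Kolmogorov-Smirnov statistic and 0.1411 for the Anderson-Darling statistic, which agrees with the testvalue column in the Examples section. That agreement holds for this one dataset and does not establish how the component computes the statistics internally.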
Configure the component
Method 1: Configure the component on the pipeline page
Add a Normality Test component on the pipeline page and configure the following parameters:
Category | Parameter | Description |
Fields Setting | Columns | The column to perform the normality test on. |
Parameters Setting | Anderson-Darling Test | Whether to perform the Anderson-Darling test. |
Parameters Setting | Kolmogorov-Smirnov Test | Whether to perform the Kolmogorov-Smirnov test. |
Parameters Setting | Use QQ Plot | Whether to perform the QQ plot test. |
Tuning | Computing Cores | The number of cores used in computing. The value must be a positive integer. |
Tuning | Memory Size per Core (Unit: MB) | The memory size of each core. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name normality_test
-project algo_public
-DinputTableName=test
-DoutputTableName=test_out
-DselectedColNames=col1,col2
-Dlifecycle=1;

Parameter | Required | Default Value | Description |
inputTableName | Yes | None | The name of the input table to be tested. |
outputTableName | Yes | None | The name of the output table. |
selectedColNames | No | None | The columns selected from the input table. You can select multiple columns of the DOUBLE or BIGINT type. |
inputTablePartitions | No | "" | The name of the partition of the input table. |
enableQQplot | No | true | Whether to perform the QQ plot test. |
enableADtest | No | true | Whether to perform the Anderson-Darling test. |
enableKStest | No | true | Whether to perform the Kolmogorov-Smirnov test. |
lifecycle | No | -1 | The lifecycle of the output table. The value is an integer that is greater than or equal to -1. Default value: -1. This value indicates that the lifecycle of the output table is not set. |
coreNum | No | -1 | The number of cores used in computing. This parameter is used with memSizePerCore. The value must be a positive integer. Default value: -1. This value indicates that the number of instances is determined by the amount of input data. |
memSizePerCore | No | -1 | The memory size of each core. Unit: MB. The value must be a positive integer. Valid values: (100,64 × 1024). Default value: -1. This value indicates that the memory size of each core is determined by the amount of input data. |
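The parameter rules in the table can be checked before a command is submitted. The following Python sketch assembles the command text from the parameters listed above. The helper name build_normality_test_command is hypothetical and is not part of PAI. The sketch also assumes that Boolean parameters are written as true or false, and that the valid range of memSizePerCore is an open interval.

```python
def build_normality_test_command(input_table, output_table, columns=None, **options):
    """Assemble the text of a normality_test PAI command.

    Hypothetical helper for illustration only. It produces a string and
    does not submit anything to PAI.
    """
    allowed = {
        "inputTablePartitions", "enableQQplot", "enableADtest",
        "enableKStest", "lifecycle", "coreNum", "memSizePerCore",
    }
    unknown = set(options) - allowed
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    # lifecycle: an integer greater than or equal to -1.
    if options.get("lifecycle", -1) < -1:
        raise ValueError("lifecycle must be an integer >= -1")

    # coreNum: a positive integer when set explicitly.
    core = options.get("coreNum")
    if core is not None and core <= 0:
        raise ValueError("coreNum must be a positive integer")

    # memSizePerCore: assumed here to be an open interval (100, 64 * 1024) MB.
    mem = options.get("memSizePerCore")
    if mem is not None and not 100 < mem < 64 * 1024:
        raise ValueError("memSizePerCore must be in (100, 64 * 1024)")

    parts = [
        "PAI -name normality_test",
        "-project algo_public",
        f"-DinputTableName={input_table}",
        f"-DoutputTableName={output_table}",
    ]
    if columns:
        parts.append("-DselectedColNames=" + ",".join(columns))
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # assumed Boolean spelling
        parts.append(f"-D{name}={value}")
    return "\n    ".join(parts) + ";"


print(build_normality_test_command(
    "test", "test_out", columns=["col1", "col2"],
    enableQQplot=False, lifecycle=1,
))
```

The printed text can be pasted into a SQL Script component. Parameters that are omitted keep the default values listed in the table.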
Examples
1. Add a SQL Script component, deselect Use Script Mode and Whether the system adds a create table statement, and enter the following SQL statement.

drop table if exists normality_test_input;
create table normality_test_input as
select * from (
  select 1 as x union all
  select 2 as x union all
  select 3 as x union all
  select 4 as x union all
  select 5 as x union all
  select 6 as x union all
  select 7 as x union all
  select 8 as x union all
  select 9 as x union all
  select 10 as x
) tmp;

2. Add another SQL Script component, deselect Use Script Mode and Whether the system adds a create table statement, and enter the following PAI command. Then connect the components from Step 1 and Step 2.

drop table if exists ${o1};
PAI -name normality_test
-project algo_public
-DinputTableName=normality_test_input
-DoutputTableName=${o1}
-DselectedColNames=x
-Dlifecycle=1;

3. Click the run icon in the upper left corner to run the pipeline.

4. Right-click the SQL Script component created in Step 2 and choose View Data > SQL Script Output to view the test results.
| colname | testname | testvalue | pvalue |
| ------- | ----------------------- | ------------------- | ------------------ |
| x | | 1.0 | 0.8173291742279805 |
| x | | 2.0 | 2.470864450785345 |
| x | | 3.0 | 3.5156067948020056 |
| x | | 4.0 | 4.3632330349313095 |
| x | | 5.0 | 5.128868067945126 |
| x | | 6.0 | 5.871131932054874 |
| x | | 7.0 | 6.6367669650686905 |
| x | | 8.0 | 7.4843932051979944 |
| x | | 9.0 | 8.529135549214654 |
| x | | 10.0 | 10.182670825772018 |
| x | Anderson_Darling_Test | 0.1411092332197832 | 0.9566579606430077 |
| x | Kolmogorov_Smirnov_Test | 0.09551932503797644 | 0.9999888659426232 |
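The ten rows with an empty testname look like QQ plot points: testvalue holds the sorted sample value and pvalue holds a matching theoretical normal quantile. The following Python sketch reproduces such points under stated assumptions. The function name qq_points is hypothetical. The sketch assumes Blom plotting positions, (i - 0.375) / (n + 0.25), and a normal distribution fitted with the sample mean and sample standard deviation. These choices are inferred from this one example and are not documented behavior of the component.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample, a=0.375):
    """Pair each sorted sample value with a theoretical normal quantile.

    Assumptions (inferred from the example output, not documented):
    - plotting positions (i - a) / (n + 1 - 2a) with a = 0.375 (Blom)
    - a normal distribution fitted with the sample mean and the sample
      standard deviation (n - 1 denominator)
    """
    xs = sorted(sample)
    n = len(xs)
    dist = NormalDist(mean(xs), stdev(xs))
    return [
        (x, dist.inv_cdf((i + 1 - a) / (n + 1 - 2 * a)))
        for i, x in enumerate(xs)
    ]


for value, quantile in qq_points(range(1, 11)):
    print(value, quantile)
```

Under these assumptions, the first and last theoretical quantiles for the values 1 through 10 are approximately 0.8173 and 10.1827, close to the values in the table. Points that lie near the line y = x, as they do here, suggest that the sample is consistent with a normal distribution. The large p-values of the Anderson-Darling and Kolmogorov-Smirnov rows (about 0.957 and 1.000) point the same way: at a conventional significance level such as 0.05, neither test rejects normality for this sample.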