A normality test is a goodness-of-fit hypothesis test in statistical inference. It uses observations to determine whether a population follows a normal distribution. This topic describes the Normality Test component provided by Machine Learning Designer (formerly known as Machine Learning Studio).

The Normality Test component consists of the Anderson-Darling Test, Kolmogorov-Smirnov Test, and Q-Q Plot methods. You can select one or more test methods based on your business requirements.
  • An Anderson-Darling test compares the empirical distribution function of the sample data with the expected normal distribution. If the difference is large, the test rejects the hypothesis that the population follows a normal distribution.
  • A Kolmogorov-Smirnov test compares two distributions. In this component, it compares the empirical distribution of the sample data with a normal distribution.
  • A Q-Q plot checks the distribution of data by comparing the quantiles of the sample data with the quantiles of a known distribution. If more than 1,000 samples are collected, the system samples the data for calculation and generates the Q-Q plot from the sampled data. Therefore, the data points in the plot do not necessarily cover all the samples.
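As a rough illustration of what the two tests measure, the following Python sketch computes a Kolmogorov-Smirnov statistic and an Anderson-Darling statistic against a fitted normal distribution by using only the standard library. The function names are hypothetical. Fitting the normal distribution with the sample mean and the n - 1 sample standard deviation is an assumption. The sketch does not compute p values and is not the implementation used by the component.

```python
import math
from statistics import NormalDist, mean, stdev


def fitted_normal_cdf(sample):
    """Return the sorted sample and the fitted normal CDF at each sorted point."""
    xs = sorted(sample)
    # Assumption: fit the normal distribution with the sample mean and the
    # (n - 1) sample standard deviation.
    dist = NormalDist(mean(xs), stdev(xs))
    return xs, [dist.cdf(x) for x in xs]


def ks_statistic(sample):
    """Largest gap between the empirical CDF and the fitted normal CDF."""
    xs, cdf = fitted_normal_cdf(sample)
    n = len(xs)
    # At the i-th sorted point (0-based) the empirical CDF jumps from i/n to
    # (i + 1)/n, so the gap is checked on both sides of the jump.
    return max(max((i + 1) / n - f, f - i / n) for i, f in enumerate(cdf))


def anderson_darling_statistic(sample):
    """A^2 statistic, without any small-sample correction factor."""
    xs, cdf = fitted_normal_cdf(sample)
    n = len(xs)
    # The weights (2i + 1) for 0-based i pair the i-th smallest point with
    # the i-th largest point, which emphasizes the tails.
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
        for i in range(n)
    )
    return -n - total / n


# The values 1 to 10, as in the input table of the Examples section.
sample = list(range(1, 11))
ks = ks_statistic(sample)
a2 = anderson_darling_statistic(sample)
print(f"Kolmogorov-Smirnov statistic: {ks:.4f}")
print(f"Anderson-Darling statistic:   {a2:.4f}")
```

For both statistics, a smaller value indicates a closer match between the sample and the fitted normal distribution.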

Configure the component

You can use one of the following methods to configure the Normality Test component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Normality Test component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). The following table describes the parameters.
| Tab | Parameter | Description |
|---|---|---|
| Fields Setting | Columns | N/A |
| Parameters Setting | Anderson-Darling Test | Valid values: Yes and No. Default value: Yes. |
| Parameters Setting | Kolmogorov-Smirnov Test | Valid values: Yes and No. Default value: Yes. |
| Parameters Setting | Use Q-Q Plot | Valid values: Yes and No. Default value: Yes. |
| Tuning | Computing Cores | The number of cores used in computing. The value must be a positive integer. |
| Tuning | Memory Size per Core (Unit: MB) | The memory size of each core. |

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name normality_test
    -project algo_public
    -DinputTableName=test
    -DoutputTableName=test_out
    -DselectedColNames=col1,col2
    -Dlifecycle=1;
| Parameter | Required | Description | Default value |
|---|---|---|---|
| inputTableName | Yes | The name of the input table. | N/A |
| outputTableName | Yes | The name of the output table. | N/A |
| selectedColNames | No | The columns selected from the input table. You can select multiple columns of the DOUBLE or BIGINT type. | N/A |
| inputTablePartitions | No | The name of the partition in the input table. | "" |
| enableQQplot | No | Specifies whether to use a Q-Q plot. Valid values: true and false. | true |
| enableADtest | No | Specifies whether to perform an Anderson-Darling test. Valid values: true and false. | true |
| enableKStest | No | Specifies whether to perform a Kolmogorov-Smirnov test. Valid values: true and false. | true |
| lifecycle | No | The lifecycle of the output table. The value is an integer that is greater than or equal to -1. The default value -1 indicates that the lifecycle of the output table is not set. | -1 |
| coreNum | No | The number of cores used in computing. This parameter is used with memSizePerCore. The value must be a positive integer. The default value -1 indicates that the number of instances is determined by the amount of input data. | -1 |
| memSizePerCore | No | The memory size of each core. Unit: MB. The value must be a positive integer. Valid values: (100,64 × 1024). The default value -1 indicates that the memory size of each core is determined by the amount of input data. | -1 |
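
If you generate PAI commands from a script, the constraints in the preceding table can be checked before the command is submitted. The following Python sketch shows one way to do this. The helper build_normality_test_command is hypothetical and is not part of PAI. Treating (100,64 × 1024) as an open interval is an assumption.

```python
def _is_int(value):
    """True for plain integers; bool is excluded even though it subclasses int."""
    return isinstance(value, int) and not isinstance(value, bool)


def build_normality_test_command(input_table, output_table, **options):
    """Assemble a PAI normality_test command string (hypothetical helper)."""
    lifecycle = options.get("lifecycle", -1)
    if not _is_int(lifecycle) or lifecycle < -1:
        raise ValueError("lifecycle must be an integer >= -1")

    core_num = options.get("coreNum", -1)
    if core_num != -1 and (not _is_int(core_num) or core_num <= 0):
        raise ValueError("coreNum must be a positive integer")

    mem = options.get("memSizePerCore", -1)
    # Assumption: (100,64 x 1024) is read as an open interval in MB.
    if mem != -1 and (not _is_int(mem) or not 100 < mem < 64 * 1024):
        raise ValueError("memSizePerCore must be an integer in (100, 65536)")

    for flag in ("enableQQplot", "enableADtest", "enableKStest"):
        if flag in options and not isinstance(options[flag], bool):
            raise ValueError(f"{flag} must be True or False")

    params = {"inputTableName": input_table, "outputTableName": output_table}
    params.update(options)

    lines = ["PAI -name normality_test", "    -project algo_public"]
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            value = ",".join(value)  # for example, selectedColNames=col1,col2
        lines.append(f"    -D{name}={value}")
    return "\n".join(lines) + ";"


command = build_normality_test_command(
    "normality_test_input",
    "normality_test_output",
    selectedColNames=["x"],
    lifecycle=1,
)
print(command)
```

Parameters that are not passed to the helper are omitted from the command, so the component falls back to the default values listed in the table.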

Examples

  • Input data
        drop table if exists normality_test_input;
        create table normality_test_input as
        select
          *
        from
        (
          select 1 as x from dual
            union all
          select 2 as x from dual
            union all
          select 3 as x from dual
            union all
          select 4 as x from dual
            union all
          select 5 as x from dual
            union all
          select 6 as x from dual
            union all
          select 7 as x from dual
            union all
          select 8 as x from dual
            union all
          select 9 as x from dual
            union all
          select 10 as x from dual
        ) tmp;
  • PAI command
    PAI -name normality_test
        -project algo_public
        -DinputTableName=normality_test_input
        -DoutputTableName=normality_test_output
        -DselectedColNames=x
        -Dlifecycle=1;
  • Input description

    Input format: Select columns required for calculation. You can select multiple columns. The data type is DOUBLE or BIGINT.

  • Output description
    Output format: A diagram and a result table are provided. The result table has two partitions:
    • The partition p=test lists the results of the Anderson-Darling test and the Kolmogorov-Smirnov test. Data is provided if the enableADtest or enableKStest parameter is set to true.
    • The partition p=plot lists the data of the Q-Q plot. Data is provided if the enableQQplot parameter is set to true. This partition reuses the columns of the partition p=test: the testvalue column records the original observation (x-axis of the Q-Q plot), and the pvalue column records the expected value under a normal distribution (y-axis of the Q-Q plot).
    The following table describes the fields in the result table.
    | Column | Data type | Description |
    |---|---|---|
    | colName | STRING | The column name. |
    | testname | STRING | The test name. |
    | testvalue | DOUBLE | The test value or the x-axis of the Q-Q plot. |
    | pvalue | DOUBLE | The p value or the y-axis of the Q-Q plot. |
    | p | STRING | The partition name. Valid values: test and plot. |
    Output table
    +------------+------------+------------+------------+------------+
    | colname    | testname   | testvalue  | pvalue     | p          |
    +------------+------------+------------+------------+------------+
    | x          | NULL       | 1.0        | 0.8173291742279805 | plot       |
    | x          | NULL       | 2.0        | 2.470864450785345  | plot       |
    | x          | NULL       | 3.0        | 3.5156067948020056 | plot       |
    | x          | NULL       | 4.0        | 4.3632330349313095 | plot       |
    | x          | NULL       | 5.0        | 5.128868067945126  | plot       |
    | x          | NULL       | 6.0        | 5.871131932054874  | plot       |
    | x          | NULL       | 7.0        | 6.6367669650686905 | plot       |
    | x          | NULL       | 8.0        | 7.4843932051979944 | plot       |
    | x          | NULL       | 9.0        | 8.529135549214654  | plot       |
    | x          | NULL       | 10.0       | 10.182670825772018 | plot       |
    | x          | Anderson_Darling_Test | 0.1411092332197832   | 0.9566579606430077 | test       |
    | x          | Kolmogorov_Smirnov_Test | 0.09551932503797644 | 0.9999888659426232 | test       |
    +------------+------------+------------+------------+------------+
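
To make the rows in the partition p=plot easier to read, the following Python sketch computes expected normal values for the same input values 1 to 10. It is only an approximation of the idea: the function qq_points is hypothetical, and the plotting position (i - 0.375)/(n + 0.25) and the use of the sample mean and the n - 1 sample standard deviation are assumptions, not documented behavior of the component. Under these assumptions, the computed values come out close to the pvalue column of the plot rows in the preceding output table.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Pair each sorted observation with its expected value under a fitted normal."""
    xs = sorted(sample)
    n = len(xs)
    # Assumption: the normal distribution is fitted with the sample mean and
    # the (n - 1) sample standard deviation.
    dist = NormalDist(mean(xs), stdev(xs))
    points = []
    for i, x in enumerate(xs, start=1):
        # Assumption: plotting position for the i-th smallest observation.
        p = (i - 0.375) / (n + 0.25)
        points.append((float(x), dist.inv_cdf(p)))
    return points


points = qq_points(range(1, 11))
for observed, expected in points:
    # observed corresponds to testvalue, expected corresponds to pvalue.
    print(f"{observed:5.1f}  {expected:.4f}")
```

If the data is close to normally distributed, the observed and expected values lie near the line y = x when plotted against each other.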