A normality test is a goodness-of-fit hypothesis test used in statistical inference. It uses observations to determine whether the population follows a normal distribution. This topic describes the Normality Test component provided by Machine Learning Studio.

The Normality Test component provides three methods: the Anderson-Darling test, the Kolmogorov-Smirnov test, and the Q-Q plot. You can select one or more methods based on your business requirements.
  • The Anderson-Darling test compares the empirical distribution function of the sample data with the cumulative distribution function of the expected normal distribution. If the difference is large, the test rejects the hypothesis that the population is normally distributed.
  • The Kolmogorov-Smirnov test compares the empirical distribution function of the sample data with a reference distribution, which is the normal distribution in this component. The test statistic is the largest absolute difference between the two.
  • A Q-Q plot checks the distribution of data by comparing the quantiles of the sample data with the quantiles of a known distribution. If the input contains more than 1,000 samples, the system draws a subset of them for calculation and plotting. Therefore, the data points in the plot do not necessarily cover all the samples.
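The three methods above can be sketched in a few lines of Python. This is a minimal illustration, not the component's confirmed implementation. It assumes that the reference normal distribution is fitted with the sample mean and the sample (n-1) standard deviation, that the Anderson-Darling statistic is reported without a small-sample correction, and that the Q-Q plot uses Blom plotting positions (i - 0.375) / (n + 0.25). The function names are hypothetical. Under these assumptions, the values computed for the input 1 to 10 appear to agree closely with the sample output in the Example section. The calculation of p values is not covered.

```python
import math
from statistics import NormalDist, mean, stdev


def fit_normal(xs):
    """Normal distribution with the sample mean and sample (n-1) standard deviation."""
    return NormalDist(mean(xs), stdev(xs))


def anderson_darling(xs):
    """A^2 statistic against the fitted normal distribution (no small-sample correction)."""
    xs = sorted(xs)
    n = len(xs)
    dist = fit_normal(xs)
    cdf = [dist.cdf(x) for x in xs]
    # Weighted sum over the ordered sample; i is 0-based, so the weight is 2i + 1.
    s = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
        for i in range(n)
    )
    return -n - s / n


def kolmogorov_smirnov(xs):
    """Largest gap between the empirical CDF and the fitted normal CDF."""
    xs = sorted(xs)
    n = len(xs)
    dist = fit_normal(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = dist.cdf(x)
        # Check the gap just after and just before the step at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def qq_points(xs, a=0.375):
    """(observation, expected normal value) pairs; a=0.375 is the assumed Blom constant."""
    xs = sorted(xs)
    n = len(xs)
    dist = fit_normal(xs)
    return [
        (x, dist.inv_cdf((i + 1 - a) / (n + 1 - 2 * a)))
        for i, x in enumerate(xs)
    ]


if __name__ == "__main__":
    data = [float(v) for v in range(1, 11)]
    print("Anderson-Darling A^2:", anderson_darling(data))
    print("Kolmogorov-Smirnov D:", kolmogorov_smirnov(data))
    for observed, expected in qq_points(data):
        print(observed, expected)
```

A small statistic means that the sample stays close to the fitted normal curve. In the Q-Q plot, points that lie near the line y = x indicate the same thing.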

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI console
    | Tab | Parameter | Description |
    | Fields Setting | Columns | N/A. |
    | Parameters Setting | Anderson-Darling Test | Valid values: Yes and No. Default value: Yes. |
    | Parameters Setting | Kolmogorov-Smirnov Test | Valid values: Yes and No. Default value: Yes. |
    | Parameters Setting | Use Q-Q Plot | Valid values: Yes and No. Default value: Yes. |
    | Tuning | Computing Cores | The number of cores used in computing. The value must be a positive integer. |
    | Tuning | Memory Size per Core (Unit: MB) | The memory of each core. |
  • PAI command
    PAI -name normality_test
        -project algo_public
        -DinputTableName=test
        -DoutputTableName=test_out
        -DselectedColNames=col1,col2
        -Dlifecycle=1;
    | Parameter | Required | Description | Default value |
    | inputTableName | Yes | The name of the input table. | No default value |
    | outputTableName | Yes | The name of the output table. | No default value |
    | selectedColNames | No | The columns selected from the input table. You can select multiple columns of the DOUBLE or BIGINT type. | No default value |
    | inputTablePartitions | No | The name of the partition of the input table. | "" |
    | enableQQplot | No | Specifies whether to generate the Q-Q plot. Valid values: true and false. | true |
    | enableADtest | No | Specifies whether to perform the Anderson-Darling test. Valid values: true and false. | true |
    | enableKStest | No | Specifies whether to perform the Kolmogorov-Smirnov test. Valid values: true and false. | true |
    | lifecycle | No | The lifecycle of the output table. The value is an integer that is greater than or equal to -1. | -1 |
    | coreNum | No | The number of cores. This parameter is used together with memSizePerCore. The value must be a positive integer. If the default value -1 is used, the system calculates the number of instances based on the amount of input data. | -1 |
    | memSizePerCore | No | The memory size of each core. Unit: MB. The value must be a positive integer in the range of (100, 64 × 1024). If the default value -1 is used, the system calculates the memory size based on the amount of input data. | -1 |
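To show how the optional parameters fit into the command, the following sketch assembles a PAI command string from keyword arguments. The helper build_pai_command is hypothetical and is not part of any PAI SDK. It only formats text and does not submit or validate anything. The parameter names are the ones listed in the table above.

```python
def build_pai_command(input_table, output_table, **options):
    """Assemble a normality_test command string (hypothetical helper, text only)."""
    params = {"inputTableName": input_table, "outputTableName": output_table}
    params.update(options)
    parts = ["PAI -name normality_test", "-project algo_public"]
    for name, value in params.items():
        if isinstance(value, bool):
            # The command expects lowercase true/false.
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Multiple columns are separated by commas.
            value = ",".join(value)
        parts.append(f"-D{name}={value}")
    return "\n    ".join(parts) + ";"


if __name__ == "__main__":
    # Run only the two statistical tests and skip the Q-Q plot.
    print(build_pai_command(
        "normality_test_input",
        "normality_test_output",
        selectedColNames=["x"],
        enableQQplot=False,
        lifecycle=1,
    ))
```

Parameters that are omitted keep their default values, so only the settings that differ from the defaults need to be passed.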

Example

  • Input
        drop table if exists normality_test_input;
        create table normality_test_input as
        select
          *
        from
        (
          select 1 as x from dual
            union all
          select 2 as x from dual
            union all
          select 3 as x from dual
            union all
          select 4 as x from dual
            union all
          select 5 as x from dual
            union all
          select 6 as x from dual
            union all
          select 7 as x from dual
            union all
          select 8 as x from dual
            union all
          select 9 as x from dual
            union all
          select 10 as x from dual
        ) tmp;
  • PAI command
    PAI -name normality_test
        -project algo_public
        -DinputTableName=normality_test_input
        -DoutputTableName=normality_test_output
        -DselectedColNames=x
        -Dlifecycle=1;
  • Input description

    Input format: Select columns required for calculation. You can select multiple columns. The data type is DOUBLE or BIGINT.

  • Output description
    Output format: A diagram and a result table are provided. The result table has two partitions:
    • The partition p=test lists the results of the Anderson-Darling and Kolmogorov-Smirnov tests. Data is provided if enableADtest or enableKStest is set to true.
    • The partition p=plot lists the data of the Q-Q plot. Data is provided if enableQQplot is set to true. This partition reuses the columns of the p=test partition: the testvalue column records the original observation (x-axis of the Q-Q plot), and the pvalue column records the expected value under a normal distribution (y-axis of the Q-Q plot).
    The following table lists the fields in the result table.
    | Column name | Data type | Description |
    | colName | STRING | Column name |
    | testname | STRING | Test name |
    | testvalue | DOUBLE | Test value or the x-axis of the Q-Q plot |
    | pvalue | DOUBLE | p value or the y-axis of the Q-Q plot |
    | p | STRING | The partition name (test or plot) |
    Output
    +---------+-------------------------+---------------------+--------------------+------+
    | colname | testname                | testvalue           | pvalue             | p    |
    +---------+-------------------------+---------------------+--------------------+------+
    | x       | NULL                    | 1.0                 | 0.8173291742279805 | plot |
    | x       | NULL                    | 2.0                 | 2.470864450785345  | plot |
    | x       | NULL                    | 3.0                 | 3.5156067948020056 | plot |
    | x       | NULL                    | 4.0                 | 4.3632330349313095 | plot |
    | x       | NULL                    | 5.0                 | 5.128868067945126  | plot |
    | x       | NULL                    | 6.0                 | 5.871131932054874  | plot |
    | x       | NULL                    | 7.0                 | 6.6367669650686905 | plot |
    | x       | NULL                    | 8.0                 | 7.4843932051979944 | plot |
    | x       | NULL                    | 9.0                 | 8.529135549214654  | plot |
    | x       | NULL                    | 10.0                | 10.182670825772018 | plot |
    | x       | Anderson_Darling_Test   | 0.1411092332197832  | 0.9566579606430077 | test |
    | x       | Kolmogorov_Smirnov_Test | 0.09551932503797644 | 0.9999888659426232 | test |
    +---------+-------------------------+---------------------+--------------------+------+
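Because both partitions share the same columns, a consumer of the result table has to branch on the p column before it interprets testvalue and pvalue. The following sketch shows one way to do this. It assumes a significance level of 0.05, which the component does not prescribe, and uses a hypothetical summarize function with a few rows copied from the output above.

```python
ALPHA = 0.05  # assumed significance level; not specified by the component

# (colname, testname, testvalue, pvalue, p) rows, abbreviated from the sample output
rows = [
    ("x", None, 1.0, 0.8173291742279805, "plot"),
    ("x", None, 10.0, 10.182670825772018, "plot"),
    ("x", "Anderson_Darling_Test", 0.1411092332197832, 0.9566579606430077, "test"),
    ("x", "Kolmogorov_Smirnov_Test", 0.09551932503797644, 0.9999888659426232, "test"),
]


def summarize(rows, alpha=ALPHA):
    """Split rows by partition: test verdicts and Q-Q plot coordinates."""
    rejected = {}   # (column, test name) -> True if normality is rejected
    qq_points = {}  # column -> list of (observed, expected) pairs
    for colname, testname, testvalue, pvalue, partition in rows:
        if partition == "test":
            # In this partition, pvalue is a real p value.
            rejected[(colname, testname)] = pvalue < alpha
        elif partition == "plot":
            # In this partition, testvalue and pvalue are plot coordinates.
            qq_points.setdefault(colname, []).append((testvalue, pvalue))
    return rejected, qq_points


if __name__ == "__main__":
    rejected, qq_points = summarize(rows)
    for (column, test), is_rejected in rejected.items():
        verdict = "rejected" if is_rejected else "not rejected"
        print(f"{column} / {test}: normality {verdict} at alpha={ALPHA}")
    print(qq_points)
```

For the sample data, both p values are far above 0.05, so neither test rejects the hypothesis that column x is normally distributed.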