This topic describes the Chi-square Goodness of Fit Test component provided by Machine Learning Studio.

Configure the component

The Chi-square Goodness of Fit Test component applies to categorical variables. It tests whether the observed frequency of each class of a single multiclass categorical variable differs from the expected frequency. The null hypothesis is that the observed and expected frequencies are the same. You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI (PAI) console
    Parameter | Description
    Input Column | The column on which you want to perform a chi-square test.
    Class Probability | The class probability configuration. Specify this parameter in the Class:Probability format. The sum of all probabilities is 1.
  • PAI command
    PAI -name chisq_test
        -project algo_public
        -DinputTableName=pai_chisq_test_input
        -DcolName=f0
        -DprobConfig=0:0.3,1:0.7
        -DoutputTableName=pai_chisq_test_output0
        -DoutputDetailTableName=pai_chisq_test_output0_detail
    Parameter | Required | Description | Default value
    inputTableName | Yes | The name of the input table. | No default value
    colName | Yes | The name of the column on which you want to perform the chi-square test. | No default value
    outputTableName | Yes | The name of the output table. | No default value
    outputDetailTableName | Yes | The name of the output detail table. | No default value
    inputTablePartitions | No | The partitions that you want to select from the input table. Specify this parameter in one of the following formats: Partition_name=value, or name1=value1/name2=value2 for multi-level partitions. If you specify multiple partitions, separate them with commas (,). | No default value
    probConfig | No | The class probability configuration. Specify this parameter in the Class:Probability format, for example 0:0.3,1:0.7. The sum of all probabilities must be 1. If this parameter is not specified, all classes are assigned the same probability. | No default value
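
The Class:Probability format of probConfig can be illustrated with a short Python sketch. This is not the PAI implementation: the helper name parse_prob_config and the tolerance used for the sum check are assumptions made for illustration. The sketch parses a string such as 0:0.3,1:0.7, checks that the probabilities sum to 1, and falls back to equal probabilities when no configuration is given.

```python
import math


def parse_prob_config(prob_config, classes):
    """Return {class: probability} for a string such as '0:0.3,1:0.7'.

    Hypothetical helper, not part of PAI. If prob_config is empty or None,
    every class in `classes` gets the same probability.
    """
    if not prob_config:
        return {c: 1.0 / len(classes) for c in classes}

    probs = {}
    for item in prob_config.split(","):
        name, value = item.split(":")
        probs[name.strip()] = float(value)

    # Assumed tolerance: allow for floating-point rounding in the sum.
    if not math.isclose(sum(probs.values()), 1.0, abs_tol=1e-9):
        raise ValueError("class probabilities must sum to 1")
    return probs


print(parse_prob_config("0:0.3,1:0.7", ["0", "1"]))
print(parse_prob_config(None, ["0", "1"]))
```

The first call mirrors -DprobConfig=0:0.3,1:0.7 from the command above; the second shows the behavior when probConfig is omitted.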

Example

  • Test data
    create table pai_chisq_test_input as
    select * from
    (
      select '1' as f0,'2' as f1 from dual
      union all
      select '1' as f0,'3' as f1 from dual
      union all
      select '1' as f0,'4' as f1 from dual
      union all
      select '0' as f0,'3' as f1 from dual
      union all
      select '0' as f0,'4' as f1 from dual
    )tmp;
  • PAI command
    PAI -name chisq_test
        -project algo_public
        -DinputTableName=pai_chisq_test_input
        -DcolName=f0
        -DprobConfig=0:0.3,1:0.7
        -DoutputTableName=pai_chisq_test_output0
        -DoutputDetailTableName=pai_chisq_test_output0_detail
  • Output
    • The output table that is specified by outputTableName contains a single row with a single column, whose value is a JSON string.
      {
          "Chi-Square": {
              "comment": "Pearson's chi-square test",
              "df": 1,
              "p-value": 0.75,
              "value": 0.2380952380952381
          }
      }
    • The following table lists the columns in the output detail table that is specified by outputDetailTableName.
      Column name | Comment
      colName | The data source class.
      observed | The observed frequency.
      expected | The expected frequency.
      residuals | The standard residuals, which are calculated by using the following expression: Standard residuals = (Observed frequency - Expected frequency)/sqrt(Expected frequency).
    • Generated data [Figure: Chi-square test]
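
The value and df fields of the example output, and the columns of the detail table, can be reproduced by hand. The following Python sketch is an independent illustration under stated assumptions, not the code that PAI runs: the function name goodness_of_fit is hypothetical, the degrees of freedom are taken as the number of classes minus 1, and the p-value is left out because it requires a chi-square distribution function. The input is the f0 column of pai_chisq_test_input with the probabilities 0:0.3,1:0.7.

```python
from collections import Counter
import math

# f0 column of the example table pai_chisq_test_input
f0 = ["1", "1", "1", "0", "0"]
# Same values as -DprobConfig=0:0.3,1:0.7
probs = {"0": 0.3, "1": 0.7}


def goodness_of_fit(values, probs):
    """Hypothetical helper: return (statistic, df, detail rows)."""
    counts = Counter(values)
    n = len(values)
    statistic = 0.0
    rows = []
    for cls in sorted(probs):
        observed = counts.get(cls, 0)
        expected = n * probs[cls]
        # Residual as defined for the detail table
        residual = (observed - expected) / math.sqrt(expected)
        # Pearson's chi-square statistic sums (O - E)^2 / E over the classes
        statistic += (observed - expected) ** 2 / expected
        rows.append((cls, observed, expected, residual))
    df = len(probs) - 1  # assumption: number of classes minus 1
    return statistic, df, rows


statistic, df, rows = goodness_of_fit(f0, probs)
print("value:", statistic, "df:", df)
for cls, observed, expected, residual in rows:
    print(cls, observed, expected, residual)
```

With 2 observations of class 0 and 3 of class 1 out of 5 rows, the expected frequencies are 1.5 and 3.5, so the statistic is 0.25/1.5 + 0.25/3.5, which is approximately 0.238 and agrees with the value field in the JSON output.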