Platform For AI: Two Sample T Test

Last Updated: Oct 26, 2023

The Two Sample T Test component is used to check whether the population means from two samples are significantly different from each other based on the principles of statistics. This topic describes how to configure parameters for the Two Sample T Test component provided by Machine Learning Designer (formerly known as Machine Learning Studio). This topic also provides an example on how to use the Two Sample T Test component.

Configure the component

You can use one of the following methods to configure the Two Sample T Test component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Two Sample T Test component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Sample 1 Column | The column that contains Sample 1. |
| | Sample 2 Column | The column that contains Sample 2. |
| Parameters Setting | T Test Type | The type of the T test that you want to perform. Valid values: Independent T Test (check whether the population means from two independent samples are significantly different from each other; the two samples tested must be independent of each other and generally have a normal distribution) and Paired T Test (check whether the population means from two paired samples are significantly different from each other). |
| | Alternative Hypothesis Type | The type of alternative hypothesis. Valid values: two.sided (check whether a population mean is either greater than or less than a hypothesized value), less (check whether a population mean is less than a hypothesized value), and greater (check whether a population mean is greater than a hypothesized value). |
| | Confidence Level | The confidence level of the test result. Valid values: 0.8, 0.9, 0.95, 0.99, 0.995, and 0.999. |
| | Hypothesized Mean | The hypothesized mean. Default value: 0. |
| | Variances of Two Populations Are Equal | Specifies whether the variances of two populations are equal. Valid values: true and false. |
| | Cores | The number of cores. The value must be a positive integer. This parameter must be used with the Memory Size Per Core parameter. Valid values: 1 to 9999. |
| | Memory Size Per Core | The memory size of each core. Unit: MB. The value must be a positive integer. Valid values: 1024 to 65536. |
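
The T Test Type, Hypothesized Mean, and Variances of Two Populations Are Equal parameters map onto the textbook forms of the t statistic. The following is a minimal sketch of those forms using only the Python standard library. It assumes that the hypothesized mean is the hypothesized difference between the two population means, and it is not taken from the component's implementation. The function names independent_t and paired_t and the sample values are hypothetical.

```python
import math
from statistics import mean, variance


def independent_t(x, y, mu=0.0, var_equal=False):
    """Return (t, df) for an independent two-sample t test of mean(x) - mean(y) against mu."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x), variance(y)  # sample variances (n - 1 denominator)
    diff = mean(x) - mean(y) - mu
    if var_equal:
        # Pooled-variance (Student) form.
        df = nx + ny - 2
        pooled = ((nx - 1) * vx + (ny - 1) * vy) / df
        se = math.sqrt(pooled * (1 / nx + 1 / ny))
    else:
        # Welch form with Welch-Satterthwaite degrees of freedom.
        a, b = vx / nx, vy / ny
        se = math.sqrt(a + b)
        df = (a + b) ** 2 / (a ** 2 / (nx - 1) + b ** 2 / (ny - 1))
    return diff / se, df


def paired_t(x, y, mu=0.0):
    """Return (t, df) for a paired t test on the per-pair differences."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    se = math.sqrt(variance(d) / n)
    return (mean(d) - mu) / se, n - 1


# Illustrative values only.
sample_1 = [1, 1, 1, 0, 0]
sample_2 = [2, 3, 4, 3, 4]
print(independent_t(sample_1, sample_2, var_equal=True))
print(independent_t(sample_1, sample_2, var_equal=False))
print(paired_t(sample_1, sample_2))
```

The paired form requires both samples to have the same length and order, because each value in Sample 1 is matched with the value at the same position in Sample 2. Turning t and df into a p-value requires the t distribution, which this sketch leaves out.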

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

pai -name t_test 
    -project algo_public 
    -DxTableName=pai_t_test_all_type
    -DxColName=col1_double
    -DxTablePartitions=ds=2010/dt=1
    -DyTableName=pai_t_test_all_type
    -DyColName=col1_double
    -DyTablePartitions=ds=2010/dt=1 
    -DoutputTableName=pai_t_test_out
    -Dalternative=less
    -Dmu=47
    -DconfidenceLevel=0.95
    -Dpaired=false
    -DvarEqual=true

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| xTableName | Yes | The name of Input Table x. | N/A |
| xTablePartitions | No | The one or more partitions in Input Table x that are used in the T test. The following formats are supported: Partition_name=value, and name1=value1/name2=value2 for multi-level partitions. Note: If you specify multiple partitions, separate them with commas (,). | All partitions |
| xColName | Yes | The column in Input Table x that is used in the T test. The value must be of the DOUBLE or INT type. | N/A |
| yTableName | Yes | The name of Input Table y. | N/A |
| yTablePartitions | No | The one or more partitions in Input Table y that are used in the T test. The following formats are supported: Partition_name=value, and name1=value1/name2=value2 for multi-level partitions. Note: If you specify multiple partitions, separate them with commas (,). | All partitions |
| yColName | Yes | The column in Input Table y that is used in the T test. The value must be of the DOUBLE or INT type. | N/A |
| paired | No | Valid values: true (paired T test) and false (independent T test). | false |
| alternative | No | The type of alternative hypothesis. Valid values: two.sided, less, and greater. | two.sided |
| mu | No | The hypothesized mean. The value must be of the DOUBLE type. | 0 |
| varEqual | No | Specifies whether the variances of two populations are equal. Valid values: true and false. | false |
| confidenceLevel | No | The confidence level of the test result. Valid values: 0.8, 0.9, 0.95, 0.99, 0.995, and 0.999. | 0.95 |
| coreNum | No | The number of cores. The value must be a positive integer. This parameter must be used with the memSizePerCore parameter. Valid values: 1 to 9999. | Determined by the system |
| memSizePerCore | No | The memory size of each core. Unit: MB. The value must be a positive integer. Valid values: 1024 to 65536. | Determined by the system |
| lifecycle | No | The lifecycle of the output table. | N/A |
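
The valid values in the preceding table can be checked before a command is submitted. The following is a minimal sketch of such a check. The helper name build_t_test_command is hypothetical and is not part of PAI. It validates only the constraints stated in the table and returns the command text without running it.

```python
ALTERNATIVES = {"two.sided", "less", "greater"}
CONFIDENCE_LEVELS = {0.8, 0.9, 0.95, 0.99, 0.995, 0.999}
# Parameters marked as required in the table above.
REQUIRED = ("xTableName", "xColName", "yTableName", "yColName")


def build_t_test_command(params):
    """Check params against the documented valid values and return the command text."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    if params.get("alternative", "two.sided") not in ALTERNATIVES:
        raise ValueError("alternative must be two.sided, less, or greater")
    if float(params.get("confidenceLevel", 0.95)) not in CONFIDENCE_LEVELS:
        raise ValueError("unsupported confidenceLevel")
    # coreNum must be used together with memSizePerCore.
    if ("coreNum" in params) != ("memSizePerCore" in params):
        raise ValueError("set coreNum and memSizePerCore together")
    if "coreNum" in params:
        if not 1 <= int(params["coreNum"]) <= 9999:
            raise ValueError("coreNum must be in the range 1 to 9999")
        if not 1024 <= int(params["memSizePerCore"]) <= 65536:
            raise ValueError("memSizePerCore must be in the range 1024 to 65536")
    lines = ["pai -name t_test", "    -project algo_public"]
    for name, value in params.items():
        if isinstance(value, bool):
            value = str(value).lower()  # the command uses true/false
        lines.append(f"    -D{name}={value}")
    return "\n".join(lines)


print(build_t_test_command({
    "xTableName": "pai_t_test_all_type",
    "xColName": "col1_double",
    "yTableName": "pai_t_test_all_type",
    "yColName": "col1_double",
    "outputTableName": "pai_t_test_out",
    "alternative": "less",
    "paired": False,
}))
```

Parameters that are omitted from the dictionary are left out of the command, so the defaults listed in the table apply.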

If the input tables are non-partitioned tables, we recommend that you do not set the coreNum and memSizePerCore parameters and instead use the default values determined by the system. If your computing resources are limited, use the following code to estimate the number of cores and the memory size of each core:

def CalcCoreNumAndMem(row, centerCount, kOneCoreDataSize=1024):
    """Calculate the number of cores and the memory size of each core.

    Args:
        row: the number of rows in an input table.
        centerCount: the number of columns in an input table. Not used in the calculation below.
        kOneCoreDataSize: the amount of data that can be computed by each core. Unit: MB. The value must be a positive integer. Default value: 1024.

    Returns:
        coreNum, memSizePerCore

    Example:
        coreNum, memSizePerCore = CalcCoreNumAndMem(1000, 99, kOneCoreDataSize=2048)
    """
    kMBytes = 1024.0 * 1024.0
    # Number of cores: data size in MB (row * 2 * 8 bytes) divided by the data size per core, at least 1.
    coreNum = max(1, int(row * 2 * 8 / kMBytes / kOneCoreDataSize))
    # Memory size per core: twice the data size per core, at least 1024 MB.
    memSizePerCore = max(1024, int(kOneCoreDataSize * 2))
    return coreNum, memSizePerCore
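
As a worked example of the arithmetic above, the following sketch restates the same formula so that the block runs on its own, and then formats the result as command parameters. The name estimate_resources is hypothetical. Clamping the results to the documented valid ranges (1 to 9999 cores, 1024 to 65536 MB) is an addition that the original function does not perform. Reading row * 2 * 8 as two 8-byte values per row is an interpretation, not something the source states.

```python
def estimate_resources(row, k_one_core_data_size=1024):
    """Estimate (coreNum, memSizePerCore) and clamp both to the documented ranges."""
    mbytes = 1024.0 * 1024.0
    # Presumably two 8-byte values per row, converted from bytes to MB.
    data_mb = row * 2 * 8 / mbytes
    core_num = min(9999, max(1, int(data_mb / k_one_core_data_size)))
    mem_size_per_core = min(65536, max(1024, int(k_one_core_data_size * 2)))
    return core_num, mem_size_per_core


for rows in (1_000, 10**10):
    core_num, mem = estimate_resources(rows)
    print(f"rows={rows}: -DcoreNum={core_num} -DmemSizePerCore={mem}")
# rows=1000: -DcoreNum=1 -DmemSizePerCore=2048
# rows=10000000000: -DcoreNum=149 -DmemSizePerCore=2048
```

With the default data size per core of 1024 MB, the estimate stays at one core until the input reaches roughly 67 million rows, which is consistent with the recommendation to rely on the system defaults for small tables.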

Example

  • Test data

    create table pai_test_input as
    select * from
    (
      select 1 as f0,2 as f1 from dual
      union all
      select 1 as f0,3 as f1 from dual
      union all
      select 1 as f0,4 as f1 from dual
      union all
      select 0 as f0,3 as f1 from dual
      union all
      select 0 as f0,4 as f1 from dual
    ) tmp;

  • PAI command

    pai -name t_test 
        -project algo_public 
        -DxTableName=pai_test_input
        -DxColName=f0
        -DyTableName=pai_test_input
        -DyColName=f1
        -DoutputTableName=pai_t_test_out
        -Dalternative=less
        -Dmu=47
        -DconfidenceLevel=0.95
        -Dpaired=false
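        -DvarEqual=true

  • Output

    The output table contains only one row and one column, and the value is in the JSON format. The following sample shows the output format. The values in the sample are illustrative and do not correspond to the preceding test data and command.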

    {
        "AlternativeHypthesis": "difference in means not equals to 0",
        "ConfidenceInterval": "(-2.5465, -0.4535)",
        "ConfidenceLevel": 0.95,
        "alpha": 0.05000000000000004,
        "df": 19,
        "mean of the differences": -1.5,
        "p": 0.008000000000000007,
        "t": -3
    }
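
Because the result is a single JSON value, it can be parsed with a standard JSON library. The following is a minimal sketch that reads the sample output above and compares p with alpha. It assumes that the cell has already been fetched from the output table as a string. The function name summarize_t_test is hypothetical, and the key names are copied from the sample as is.

```python
import json

SAMPLE_OUTPUT = """
{
    "AlternativeHypthesis": "difference in means not equals to 0",
    "ConfidenceInterval": "(-2.5465, -0.4535)",
    "ConfidenceLevel": 0.95,
    "alpha": 0.05000000000000004,
    "df": 19,
    "mean of the differences": -1.5,
    "p": 0.008000000000000007,
    "t": -3
}
"""


def summarize_t_test(raw):
    """Parse the JSON result and decide whether to reject the null hypothesis."""
    result = json.loads(raw)
    # The interval is stored as a string such as "(-2.5465, -0.4535)".
    low, high = (float(v) for v in result["ConfidenceInterval"].strip("()").split(","))
    return {
        "t": result["t"],
        "df": result["df"],
        "p": result["p"],
        "confidence_interval": (low, high),
        # Reject the null hypothesis when p is below the significance level.
        "reject_null": result["p"] < result["alpha"],
    }


summary = summarize_t_test(SAMPLE_OUTPUT)
print(summary)
```

For the sample values, p is below alpha, so the null hypothesis is rejected at the 0.95 confidence level.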