Two Sample T Test

The Two Sample T Test component is used to check whether the population means from two samples are significantly different from each other based on the principles of statistics. This topic describes how to configure parameters for the Two Sample T Test component provided by Machine Learning Designer (formerly known as Machine Learning Studio). This topic also provides an example on how to use the Two Sample T Test component.

Configure the component

You can use one of the following methods to configure the Two Sample T Test component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Two Sample T Test component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). The following table describes the parameters.

Tab	Parameter	Description
Fields Setting	Sample 1 Column	The column that contains Sample 1.
Fields Setting	Sample 2 Column	The column that contains Sample 2.
Parameters Setting	T Test Type	The type of T test to perform. Valid values: Independent T Test: checks whether the population means of two independent samples differ significantly. The two samples must be independent of each other and should generally follow a normal distribution. Paired T Test: checks whether the population means of two paired samples differ significantly.
Parameters Setting	Alternative Hypothesis Type	The type of alternative hypothesis. Valid values: two.sided: the difference between the two population means is not equal to the hypothesized value. less: the difference is less than the hypothesized value. greater: the difference is greater than the hypothesized value.
Parameters Setting	Confidence Level	The confidence level of the test result. Valid values: 0.8, 0.9, 0.95, 0.99, 0.995, and 0.999.
Parameters Setting	Hypothesized Mean	The hypothesized value of the difference between the two population means. Default value: 0.
Parameters Setting	Variances of Two Populations Are Equal	Specifies whether the variances of the two populations are equal. Valid values: true and false.
Parameters Setting	Cores	The number of cores. The value must be a positive integer. This parameter must be used with the Memory Size Per Core parameter. Valid values: 1 to 9999.
Parameters Setting	Memory Size Per Core	The memory size of each core. Unit: MB. The value must be a positive integer. Valid values: 1024 to 65536.
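
The following Python sketch illustrates how the T Test Type and Variances of Two Populations Are Equal settings change the statistic that is computed. It uses the standard textbook formulas and is not the component's implementation. The function name t_statistic and the sample values are hypothetical and are used only for illustration.

```python
import math
from statistics import mean, variance


def t_statistic(x, y, paired=False, var_equal=False, mu=0.0):
    """Return (t, df) for a two-sample t test using textbook formulas.

    Hypothetical helper for illustration only; it is not part of PAI.
    mu is the hypothesized difference between the two population means.
    """
    if paired:
        # Paired test: a one-sample test on the per-pair differences.
        if len(x) != len(y):
            raise ValueError("paired samples must have the same length")
        d = [a - b for a, b in zip(x, y)]
        n = len(d)
        se = math.sqrt(variance(d) / n)
        return (mean(d) - mu) / se, n - 1

    nx, ny = len(x), len(y)
    vx, vy = variance(x), variance(y)  # sample variances (n - 1 denominator)
    diff = mean(x) - mean(y) - mu
    if var_equal:
        # Equal variances: pooled variance, df = nx + ny - 2.
        df = nx + ny - 2
        pooled = ((nx - 1) * vx + (ny - 1) * vy) / df
        se = math.sqrt(pooled * (1 / nx + 1 / ny))
    else:
        # Unequal variances: Welch's test with Welch-Satterthwaite df.
        se = math.sqrt(vx / nx + vy / ny)
        df = (vx / nx + vy / ny) ** 2 / (
            (vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1)
        )
    return diff / se, df


# Hypothetical sample values.
sample_x = [1, 2, 3, 4, 5]
sample_y = [2, 4, 6, 8, 10]

t_pooled, df_pooled = t_statistic(sample_x, sample_y, var_equal=True)
t_welch, df_welch = t_statistic(sample_x, sample_y, var_equal=False)
t_paired, df_paired = t_statistic(sample_x, sample_y, paired=True)
print(t_pooled, df_pooled)
print(t_welch, df_welch)
print(t_paired, df_paired)
```

With the same input, the paired test uses fewer degrees of freedom than the independent test because it works on one difference per pair rather than on two separate samples.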

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

pai -name t_test 
    -project algo_public 
    -DxTableName=pai_t_test_all_type
    -DxColName=col1_double
    -DxTablePartitions=ds=2010/dt=1
    -DyTableName=pai_t_test_all_type
    -DyColName=col1_double
    -DyTablePartitions=ds=2010/dt=1 
    -DoutputTableName=pai_t_test_out
    -Dalternative=less
    -Dmu=47
    -DconfidenceLevel=0.95
    -Dpaired=false
    -DvarEqual=true

Parameter	Required	Description	Default value
xTableName	Yes	The name of Input Table x.	N/A
xTablePartitions	No	The partitions in Input Table x that are used in the T test. Supported formats: partition_name=value for a single-level partition, and name1=value1/name2=value2 for multi-level partitions. If you specify multiple partitions, separate them with commas (,).	All partitions
xColName	Yes	The column in Input Table x that is used in the T test. The value must be of the DOUBLE or INT type.	N/A
yTableName	Yes	The name of Input Table y.	N/A
yTablePartitions	No	The partitions in Input Table y that are used in the T test. Supported formats: partition_name=value for a single-level partition, and name1=value1/name2=value2 for multi-level partitions. If you specify multiple partitions, separate them with commas (,).	All partitions
yColName	Yes	The column in Input Table y that is used in the T test. The value must be of the DOUBLE or INT type.	N/A
paired	No	Specifies the type of T test. Valid values: true (paired T test) and false (independent T test).	false
alternative	No	The type of alternative hypothesis. Valid values: two.sided, less, and greater.	two.sided
mu	No	The hypothesized mean. The value must be of the DOUBLE type.	0
varEqual	No	Specifies whether the variances of two populations are equal. Valid values: true and false.	false
confidenceLevel	No	The confidence level of the test result. Valid values: 0.8, 0.9, 0.95, 0.99, 0.995, and 0.999.	0.95
coreNum	No	The number of cores. The value must be a positive integer. This parameter must be used with the memSizePerCore parameter. Valid values: 1 to 9999.	Determined by the system
memSizePerCore	No	The memory size of each core. Unit: MB. The value must be a positive integer. Valid values: 1024 to 65536.	Determined by the system
lifecycle	No	The lifecycle of the output table.	N/A
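
The parameters above can be assembled into a command string programmatically. The following Python sketch is one possible way to do so, assuming the command text is later pasted into or passed to the SQL Script component. The helper name build_t_test_command and its argument names are hypothetical. Only the -D parameter names, the valid values, and the default values come from the table and the sample command above.

```python
VALID_ALTERNATIVES = {"two.sided", "less", "greater"}
VALID_CONFIDENCE_LEVELS = {0.8, 0.9, 0.95, 0.99, 0.995, 0.999}


def build_t_test_command(x_table, x_col, y_table, y_col, output_table,
                         paired=False, alternative="two.sided", mu=0,
                         var_equal=False, confidence_level=0.95,
                         x_partitions=None, y_partitions=None):
    """Build the text of a `pai -name t_test` command.

    Hypothetical helper for illustration; it only formats text and
    does not submit anything to PAI.
    """
    if alternative not in VALID_ALTERNATIVES:
        raise ValueError(f"unsupported alternative: {alternative}")
    if confidence_level not in VALID_CONFIDENCE_LEVELS:
        raise ValueError(f"unsupported confidence level: {confidence_level}")

    args = [("xTableName", x_table), ("xColName", x_col)]
    if x_partitions:
        # Multiple partitions are separated with commas.
        args.append(("xTablePartitions", ",".join(x_partitions)))
    args += [("yTableName", y_table), ("yColName", y_col)]
    if y_partitions:
        args.append(("yTablePartitions", ",".join(y_partitions)))
    args += [
        ("outputTableName", output_table),
        ("alternative", alternative),
        ("mu", mu),
        ("confidenceLevel", confidence_level),
        ("paired", str(paired).lower()),      # true / false
        ("varEqual", str(var_equal).lower()),
    ]

    lines = ["pai -name t_test", "    -project algo_public"]
    lines += [f"    -D{name}={value}" for name, value in args]
    return "\n".join(lines)


command = build_t_test_command(
    "pai_test_input", "f0", "pai_test_input", "f1", "pai_t_test_out",
    alternative="less", mu=47, var_equal=True,
)
print(command)
```

Validating alternative and confidenceLevel before the command is submitted catches typos early. The partition parameters are omitted when no partitions are passed, which matches the documented default of using all partitions.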

If the input tables are non-partitioned tables, we recommend that you do not set the coreNum and memSizePerCore parameters and instead use the default values determined by the system. If computing resources are limited, use the following code to estimate the values of the two parameters:

def CalcCoreNumAndMem(row, centerCount, kOneCoreDataSize=1024):
    """Calculate the number of cores and the memory size of each core.

    Args:
        row: the number of rows in an input table.
        centerCount: the number of columns in an input table. This argument is not used in the calculation below.
        kOneCoreDataSize: the amount of data that each core can process. Unit: MB. The value must be a positive integer. Default value: 1024.
    Returns:
        coreNum, memSizePerCore
    Example:
        coreNum, memSizePerCore = CalcCoreNumAndMem(1000, 99, kOneCoreDataSize=2048)
    """
    kMBytes = 1024.0 * 1024.0
    # Number of cores: estimated data size in MB (row * 2 * 8 bytes)
    # divided by the data size per core, with a minimum of 1.
    coreNum = max(1, int(row * 2 * 8 / kMBytes / kOneCoreDataSize))
    # Memory size per core: twice the data size per core, with a minimum of 1024 MB.
    memSizePerCore = max(1024, int(kOneCoreDataSize * 2))
    return coreNum, memSizePerCore

Example

Test data

create table pai_test_input as
select * from
(
  select 1 as f0,2 as f1 from dual
  union all
  select 1 as f0,3 as f1 from dual
  union all
  select 1 as f0,4 as f1 from dual
  union all
  select 0 as f0,3 as f1 from dual
  union all
  select 0 as f0,4 as f1 from dual
)tmp;

PAI command

pai -name t_test 
    -project algo_public 
    -DxTableName=pai_test_input
    -DxColName=f0
    -DyTableName=pai_test_input
    -DyColName=f1
    -DoutputTableName=pai_t_test_out
    -Dalternative=less
    -Dmu=47
    -DconfidenceLevel=0.95
    -Dpaired=false
    -DvarEqual=true

Output

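The output table contains only one row and one column, and the value is a JSON string. The values in the following sample are for reference only and do not correspond to the preceding test data and command.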

{
    "AlternativeHypthesis": "difference in means not equals to 0",
    "ConfidenceInterval": "(-2.5465, -0.4535)",
    "ConfidenceLevel": 0.95,
    "alpha": 0.05000000000000004,
    "df": 19,
    "mean of the differences": -1.5,
    "p": 0.008000000000000007,
    "t": -3
}
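
The JSON string can be parsed with a standard JSON library for downstream use. The following Python sketch assumes the single cell of the output table has already been read into a string; how the table is read is outside the scope of this example. The variable names are hypothetical, and the field names are copied verbatim from the sample output above.

```python
import json

# The sample output above, as it would appear in the single output cell.
raw_output = """
{
    "AlternativeHypthesis": "difference in means not equals to 0",
    "ConfidenceInterval": "(-2.5465, -0.4535)",
    "ConfidenceLevel": 0.95,
    "alpha": 0.05000000000000004,
    "df": 19,
    "mean of the differences": -1.5,
    "p": 0.008000000000000007,
    "t": -3
}
"""

result = json.loads(raw_output)

# The confidence interval is a string such as "(-2.5465, -0.4535)".
ci_lower, ci_upper = (
    float(part) for part in result["ConfidenceInterval"].strip("()").split(",")
)

# Reject the null hypothesis when the p-value is below alpha.
reject_null = result["p"] < result["alpha"]

print(f"t = {result['t']}, df = {result['df']}, p = {result['p']}")
print(f"confidence interval: [{ci_lower}, {ci_upper}]")
print("reject null hypothesis" if reject_null else "fail to reject null hypothesis")
```

In this sample, p is smaller than alpha and the confidence interval does not contain 0, so both checks lead to the same conclusion for the two-sided hypothesis.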