The Stratified Sampling component stratifies the input data based on the values of a stratification column and randomly samples data in each stratum.
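
The idea can be illustrated with a minimal Python sketch. It is not the component's implementation: the function name `stratified_sample`, the in-memory list of dictionaries, sampling without replacement, capping the count at the stratum size, and truncating fractional counts are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(rows, strata_col, sample_size=None, sample_ratio=None, seed=1234567):
    """Group rows by rows[strata_col], then sample randomly inside each group."""
    if sample_size is None and sample_ratio is None:
        raise ValueError("set sample_size or sample_ratio")
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable

    # Step 1: stratify, one bucket per distinct value of the stratification column.
    strata = defaultdict(list)
    for row in rows:
        strata[row[strata_col]].append(row)

    # Step 2: sample each stratum independently.
    sampled = []
    for key in sorted(strata):  # assumes the stratum values are mutually comparable
        group = strata[key]
        if sample_size is not None:
            k = min(sample_size, len(group))   # assumption: cap at the stratum size
        else:
            k = int(len(group) * sample_ratio)  # assumption: truncate fractions
        sampled.extend(rng.sample(group, k))   # assumption: without replacement
    return sampled
```

For example, `stratified_sample(rows, "label", sample_ratio=0.5)` keeps half of the rows of every distinct `label` value, so the class proportions of the input are preserved in the sample.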

Configure the component

You can use one of the following methods to configure the Stratified Sampling component.

Method 1: Configure the component on the pipeline page

Configure the component parameters on the pipeline page of Machine Learning Designer.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Stratification Column | The column that is used for stratification. |
| Parameters Setting | Sample Size | The value must be a positive integer. |
| Parameters Setting | Sampling Fraction | The value must be a floating-point number. Valid values: (0,1). |
| Parameters Setting | Random Seed | The value is automatically generated by the system. The default value is 1234567. |
| Tuning | Cores | The value must be a positive integer. By default, the system determines the value. |
| Tuning | Memory Size per Core | The value must be a positive integer. Valid values: (1,65536). By default, the system determines the value. |

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name StratifiedSample \
    -project algo_public \
    -DinputTableName="test_input" \
    -DoutputTableName="test_output" \
    -DstrataColName="label" \
    -DsampleSize="A:200,B:300,C:500" \
    -DrandomSeed=1007 \
    -Dlifecycle=30;
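
When the command is generated from a script rather than typed by hand, a small helper can assemble it. The following is only a sketch: `build_pai_command` is a hypothetical helper, not part of any PAI SDK, and it only renders text in a layout similar to the example above. It does not submit anything.

```python
def build_pai_command(name, project, params):
    """Render 'PAI -name ... -project ... -Dkey=value ...;' as a multi-line string."""
    lines = [f"PAI -name {name}", f"    -project {project}"]
    for key, value in params.items():
        # Numbers are emitted bare and strings are double-quoted, as in the example.
        rendered = str(value) if isinstance(value, (int, float)) else f'"{value}"'
        lines.append(f"    -D{key}={rendered}")
    # Join with a trailing backslash on every line except the last, then terminate.
    return " \\\n".join(lines) + ";"


command = build_pai_command(
    "StratifiedSample",
    "algo_public",
    {
        "inputTableName": "test_input",
        "outputTableName": "test_output",
        "strataColName": "label",
        "sampleSize": "A:200,B:300,C:500",
        "randomSeed": 1007,
        "lifecycle": 30,
    },
)
```

The resulting string can then be pasted into the SQL Script component.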
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | No default value |
| inputTablePartitions | No | The partitions that are selected from the input table for sampling. The following formats are supported: Partition_name=value for a single partition; name1=value1/name2=value2 for multi-level partitions. Note: Separate multiple partitions with commas (,). | All partitions |
| outputTableName | Yes | The name of the output table. | No default value |
| strataColName | Yes | The name of the column that is used as the key for stratification. | No default value |
| sampleSize | No | The number of samples. If the value is a positive integer, it is the number of samples drawn from each stratum. If the value is a string, it must be in the format strata0:n0,strata1:n1, where the number after each colon (:) is the number of samples for the stratum named before the colon. Note: If both the sampleSize and sampleRatio parameters are empty, an error is returned. If both are specified, the sampleSize parameter takes precedence. | No default value |
| sampleRatio | No | The sampling proportion. If the value is a number, it must be a floating-point number between 0 and 1, and it is the sampling proportion of each stratum. If the value is a string, it must be in the format strata0:r0,strata1:r1, where the number after each colon (:) is the sampling proportion for the stratum named before the colon. | No default value |
| randomSeed | No | The random seed. The value must be a positive integer. | 123456 |
| lifecycle | No | The lifecycle of the output table. Valid values: [1,3650]. | No default value |
| coreNum | No | The number of cores used in computing. The value must be a positive integer. | Determined by the system |
| memSizePerCore | No | The memory size of each core. Valid values: (1,65536). Unit: MB. | Determined by the system |
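
The rules for sampleSize and sampleRatio described above can be sketched in Python as follows. This is an illustration of the documented behavior, not the component's code: the function names `parse_spec` and `resolve_sampling` are hypothetical, and treating an empty string the same as an unset parameter is an assumption.

```python
def parse_spec(value, cast):
    """Return one number, or a {stratum: number} dict for 'strata0:n0,strata1:n1'."""
    text = str(value).strip()
    if ":" not in text:
        return cast(text)  # a single value applies to every stratum
    spec = {}
    for item in text.split(","):
        # Split on the last colon: stratum name before it, amount after it.
        stratum, _, amount = item.rpartition(":")
        spec[stratum.strip()] = cast(amount)
    return spec


def resolve_sampling(sample_size=None, sample_ratio=None):
    """Apply the documented rules and return (mode, spec)."""
    size_empty = sample_size in (None, "")
    ratio_empty = sample_ratio in (None, "")
    if size_empty and ratio_empty:
        # Both parameters empty: the component returns an error.
        raise ValueError("either sampleSize or sampleRatio must be specified")
    if not size_empty:
        # sampleSize takes precedence when both are specified.
        return "size", parse_spec(sample_size, int)
    return "ratio", parse_spec(sample_ratio, float)
```

With the value from the command example, `resolve_sampling(sample_size="A:200,B:300,C:500")` yields the mode `"size"` and one count per stratum, while `resolve_sampling(sample_size=200, sample_ratio="A:0.5")` ignores the ratio because sampleSize is set.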