The Stratified Sampling component stratifies the input data based on the values of a stratification column and randomly samples data in each stratum.
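
As a rough illustration of what "stratify, then sample each stratum" means, the following Python sketch groups rows by a stratification column and draws the same fraction from every group. It is not the component's implementation: the function name stratified_sample, the in-memory list of dictionaries, and the truncating rounding of the per-stratum count are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_sample(rows, strata_col, fraction, seed=1007):
    """Group rows by strata_col, then draw the same fraction from each group.

    Hypothetical helper for illustration only; the per-stratum count is
    truncated with int(), which is an assumption, not documented behavior.
    """
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    strata = defaultdict(list)
    for row in rows:
        strata[row[strata_col]].append(row)

    sample = []
    for key in sorted(strata):  # stable stratum order
        group = strata[key]
        k = int(len(group) * fraction)
        sample.extend(rng.sample(group, k))  # sampling without replacement
    return sample


# 80 rows labelled "A" and 20 rows labelled "B"
rows = [{"id": i, "label": "A" if i < 80 else "B"} for i in range(100)]
picked = stratified_sample(rows, "label", 0.25)
```

With a fraction of 0.25, the sketch draws 20 rows from stratum A and 5 from stratum B, so the sample keeps the 4:1 label balance of the input.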

You can configure the component by using one of the following methods:
  • Use the Machine Learning Platform for AI console
    Tab | Parameter | Description
    Fields Setting | Stratification Column | The column that is used for stratification.
    Parameters Setting | Sample Size | The value must be a positive integer.
    Parameters Setting | Sampling Fraction | The value must be a floating-point number. Valid values: (0,1).
    Parameters Setting | Random Seed | The value is automatically generated by the system. The default value is 1234567.
    Tuning | Cores | The value must be a positive integer. By default, the system determines the value.
    Tuning | Memory Size per Core | The value must be a positive integer. Valid values: (1,65536). Unit: MB. By default, the system determines the value.
  • Use commands
    PAI -name StratifiedSample \
        -project algo_public \
        -DinputTableName="test_input" \
        -DoutputTableName="test_output" \
        -DstrataColName="label" \
        -DsampleSize="A:200,B:300,C:500" \
        -DrandomSeed=1007 \
        -Dlifecycle=30;
    Parameter | Required | Description | Default value
    inputTableName | Yes | The name of the input table. | N/A
    inputTablePartitions | No | The partitions of the input table that are used for sampling. Specify this parameter in one of the following formats: | All partitions
      • Partition_name=value
      • name1=value1/name2=value2: multi-level partitions
      Note: Separate multiple partitions with commas (,).
    outputTableName | Yes | The name of the output table. | N/A
    strataColName | Yes | The name of the column that is used as the key for stratification. | N/A
    sampleSize | No | The number of samples. | N/A
      • If the value is a positive integer, it specifies the number of samples to draw from each stratum.
      • If the value is a string, it must be in the format strata0:n0,strata1:n1. The value after each colon (:) is the number of samples to draw from the stratum named before the colon.
      Note:
      • If neither sampleSize nor sampleRatio is specified, an error is returned.
      • If both sampleSize and sampleRatio are specified, sampleSize takes precedence.
    sampleRatio | No | The sampling proportion. | N/A
      • If the value is a number, it must be a floating-point number between 0 and 1, and it specifies the sampling proportion for every stratum.
      • If the value is a string, it must be in the format strata0:r0,strata1:r1. The value after each colon (:) is the sampling proportion for the stratum named before the colon.
    randomSeed | No | The random seed. The value must be a positive integer. | 123456
    lifecycle | No | The lifecycle of the output table. Valid values: [1,3650]. | N/A
    coreNum | No | The number of cores. The value must be a positive integer. | Determined by the system
    memSizePerCore | No | The memory size of each core. Valid values: (1,65536). Unit: MB. | Determined by the system
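
The sampleSize and sampleRatio rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the component's code: the helper names parse_spec and resolve_counts are hypothetical, and three behaviors are guesses that the documentation does not confirm (a stratum missing from a per-stratum string gets 0 samples, a requested count is capped at the stratum size, and ratio-based counts are truncated).

```python
def parse_spec(value, cast):
    """Parse 'A:200,B:300' into {'A': 200, 'B': 300}.

    A bare number (no colon) is returned as a single value that applies
    to every stratum. Hypothetical helper for illustration.
    """
    text = str(value).strip()
    if ":" not in text:
        return cast(text)
    spec = {}
    for item in text.split(","):
        stratum, _, amount = item.partition(":")
        spec[stratum.strip()] = cast(amount.strip())
    return spec


def resolve_counts(stratum_sizes, sample_size=None, sample_ratio=None):
    """Return how many rows to draw from each stratum."""
    if sample_size is None and sample_ratio is None:
        # Mirrors the documented rule: both parameters empty is an error.
        raise ValueError("either sampleSize or sampleRatio must be set")

    counts = {}
    if sample_size is not None:
        # Documented rule: sampleSize takes precedence over sampleRatio.
        spec = parse_spec(sample_size, int)
        for stratum, total in stratum_sizes.items():
            wanted = spec if isinstance(spec, int) else spec.get(stratum, 0)
            counts[stratum] = min(wanted, total)  # assumption: cap at stratum size
    else:
        spec = parse_spec(sample_ratio, float)
        for stratum, total in stratum_sizes.items():
            ratio = spec if isinstance(spec, float) else spec.get(stratum, 0.0)
            counts[stratum] = int(total * ratio)  # assumption: truncate
    return counts


sizes = {"A": 1000, "B": 250, "C": 2000}
by_size = resolve_counts(sizes, sample_size="A:200,B:300,C:500")
by_ratio = resolve_counts(sizes, sample_ratio=0.5)
both = resolve_counts(sizes, sample_size=100, sample_ratio=0.5)
```

In this sketch, by_size asks for 300 rows from stratum B, which holds only 250, so the cap applies. The both call shows the precedence rule: the ratio is ignored because sampleSize is also set.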