The Weighted Sampling component generates sampling data based on the values of a weighted column. The values of the weighted column must be of the DOUBLE or BIGINT type. Each value is the weight of its record, and records with larger weights are more likely to be sampled. For example, if two records have the weights 1.2 and 1.0, the record with the weight 1.2 is more likely to be sampled.
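
The following Python sketch illustrates how weights can drive a draw. It is not the component's implementation, which this topic does not describe. The function name `weighted_sample` and its arguments are hypothetical. For sampling without replacement, the sketch assumes the Efraimidis-Spirakis key method, in which each record receives the key `u ** (1 / weight)` and the records with the largest keys are kept.

```python
import random


def weighted_sample(records, weights, k, replace=False, seed=None):
    """Draw k records so that larger weights are more likely to be picked.

    Illustrative only: the names and the algorithm are assumptions,
    not the documented behavior of the Weighted Sampling component.
    """
    if len(records) != len(weights):
        raise ValueError("records and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")

    rng = random.Random(seed)

    if replace:
        # With replacement: every draw is independent and proportional
        # to the weight. Weights do not need to be normalized.
        return rng.choices(records, weights=weights, k=k)

    # Without replacement: assign each record with a positive weight the
    # key u ** (1 / w), where u is uniform in [0, 1), then keep the k
    # records with the largest keys.
    keyed = [
        (rng.random() ** (1.0 / w), record)
        for record, w in zip(records, weights)
        if w > 0
    ]
    if k > len(keyed):
        raise ValueError("k exceeds the number of records with a positive weight")
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in keyed[:k]]


rows = ["r1", "r2", "r3", "r4"]
row_weights = [1.2, 1.0, 0.5, 3.0]
print(weighted_sample(rows, row_weights, k=2, seed=42))
print(weighted_sample(rows, row_weights, k=6, replace=True, seed=42))
```

In the sketch, passing `replace=True` allows the same record to appear more than once, which is why the second call can request more samples than there are records.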

You can configure the component by using one of the following methods:
  • Use the Machine Learning Platform for AI console
    | Tab | Parameter | Description |
    |---|---|---|
    | Parameters Setting | Sample Size | The value must be a positive integer. |
    | Parameters Setting | Sample Ratio | The value must be a floating-point number. Valid values: (0,1). |
    | Parameters Setting | Return Sample | By default, this check box is not selected. If you select this check box, sampling with replacement is enabled. |
    | Parameters Setting | Weight Column | You can select weighted columns from the drop-down list. The values of the weighted columns must be of the DOUBLE or BIGINT type. Each value represents the weight of an existing record. Normalization is not required. |
    | Parameters Setting | Random Number Seed | By default, the system determines the value. |
    | Tuning | Number of Cores | The value must be a positive integer. By default, the system determines the value. |
    | Tuning | Core Memory Allocation | The value must be a positive integer. Valid values: (1,65536). By default, the system determines the value. |
  • Use commands
    PAI -name WeightedSample \
        -project algo_public \
        -Dlifecycle="28" \
        -DoutputTableName="test2" \
        -DprobCol="previous" \
        -Dreplace="false" \
        -DsampleSize="500" \
        -DinputPartitions="pt=20150501" \
        -DinputTableName="bank_data_partition";
    | Parameter | Required | Description | Default value |
    |---|---|---|---|
    | inputTableName | Yes | The name of the input table. | N/A |
    | inputTablePartitions | No | The partitions that are selected from the input table for training. Specify this parameter in one of the following formats: Partition_name=value, or name1=value1/name2=value2 for multi-level partitions. Note: Separate multiple partitions with commas (,). | All partitions |
    | outputTableName | Yes | The name of the output table. | N/A |
    | sampleSize | No | The number of samples. Note: If both the sampleSize and sampleRatio parameters are empty, an error is returned. If both the sampleSize and sampleRatio parameters are not empty, the sampleSize parameter takes precedence. | N/A |
    | sampleRatio | No | The sampling proportion. The value must be a floating-point number. Valid values: (0,1). | N/A |
    | probCol | Yes | The weighted columns. Each value represents the weight of an existing record. Normalization is not required. The values of the weighted columns must be of the DOUBLE or BIGINT type. | N/A |
    | replace | No | Specifies whether to enable sampling with replacement. The value must be of the BOOLEAN type. | false, which indicates that sampling with replacement is not enabled. |
    | randomSeed | No | The random seed. The value must be a positive integer. | Determined by the system |
    | lifecycle | No | The lifecycle of the output table. Valid values: [1,3650]. | N/A |
    | coreNum | No | The number of cores. The value must be a positive integer. | Determined by the system |
    | memSizePerCore | No | The memory size of each core. Valid values: (1,65536). Unit: MB. | Determined by the system |
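
The interaction between sampleSize and sampleRatio can be sketched in Python as follows. The function `resolve_sample_count` is hypothetical and only mirrors the rules stated for these two parameters. How the component converts a ratio into a row count is not documented here, so truncating toward zero is an assumption.

```python
def resolve_sample_count(total_rows, sample_size=None, sample_ratio=None):
    """Return the number of rows to sample from a table of total_rows rows.

    Mirrors the documented rules: an error if both parameters are empty,
    and sampleSize takes precedence if both are specified.
    """
    if sample_size is None and sample_ratio is None:
        raise ValueError("specify sampleSize or sampleRatio")

    if sample_size is not None:
        # sampleSize wins even when sampleRatio is also specified.
        if not isinstance(sample_size, int) or sample_size <= 0:
            raise ValueError("sampleSize must be a positive integer")
        return sample_size

    # Valid values for sampleRatio are in the open interval (0,1).
    if not 0.0 < sample_ratio < 1.0:
        raise ValueError("sampleRatio must be greater than 0 and less than 1")
    # Assumption: the fractional part of the product is dropped.
    return int(total_rows * sample_ratio)


print(resolve_sample_count(1000, sample_size=500, sample_ratio=0.1))
print(resolve_sample_count(1000, sample_ratio=0.25))
```

The first call returns the sampleSize value because that parameter takes precedence. The second call falls back to the ratio.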