Weighted Sampling

The Weighted Sampling component samples records from the input table based on the values of a weighted column. The values of the weighted column must be of the DOUBLE or BIGINT type. Each value is the weight of its record, and normalization is not required. Records with larger weights are more likely to be sampled. For example, if two records have the weights 1.2 and 1.0, the record with the weight 1.2 is more likely to be sampled.
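The effect of weights on sampling can be illustrated with a short Python sketch. This is not the component's implementation, which the documentation does not describe. The function name `weighted_sample` and its arguments are hypothetical. Sampling with replacement uses the standard `random.choices`, and sampling without replacement uses the Efraimidis-Spirakis key method as a stand-in technique.

```python
import random


def weighted_sample(records, weights, k, replace=False, seed=None):
    """Draw k records so that larger weights are more likely to be picked."""
    if len(records) != len(weights):
        raise ValueError("records and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    rng = random.Random(seed)
    if replace:
        # With replacement: every draw is independent and proportional to weight.
        return rng.choices(records, weights=weights, k=k)
    # Without replacement (Efraimidis-Spirakis): give each record the key
    # u ** (1 / w) with u uniform in [0, 1), then keep the k largest keys.
    positive = [(r, w) for r, w in zip(records, weights) if w > 0]
    if k > len(positive):
        raise ValueError("k exceeds the number of records with positive weight")
    keyed = [(rng.random() ** (1.0 / w), r) for r, w in positive]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in keyed[:k]]


# Two records with weights 1.2 and 1.0, as in the example above.
counts = {"a": 0, "b": 0}
for trial_seed in range(20000):
    picked = weighted_sample(["a", "b"], [1.2, 1.0], k=1, seed=trial_seed)
    counts[picked[0]] += 1
print(counts)
```

Over many trials, the record with the weight 1.2 is picked more often than the record with the weight 1.0, but the lighter record is still selected in a large share of draws. The weights do not need to sum to 1.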

Configure the component

You can use one of the following methods to configure the Weighted Sampling component.

Method 1: Configure the component on the pipeline page

Configure the component parameters on the pipeline page of Machine Learning Designer.

Tab	Parameter	Description
Parameters Setting	Sample Size	The number of samples. The value must be a positive integer.
	Sampling Fraction	The sampling proportion. The value must be a floating-point number. Valid values: (0,1).
	Sampling with Replacement	By default, this check box is not selected. If you select this check box, sampling with replacement is enabled.
	Weight Columns	The weighted columns. The values of the weighted columns must be of the DOUBLE or BIGINT type. Each value represents the weight of an existing record. Normalization is not required.
	Random Seed	By default, the system determines the value.
Tuning	Cores	The value must be a positive integer. By default, the system determines the value.
Tuning	Memory Size per Core	The memory size of each core. The value must be a positive integer. Valid values: (1,65536). Unit: MB. By default, the system determines the value.

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name WeightedSample
    -project algo_public
    -Dlifecycle="28"
    -DoutputTableName="test2"
    -DprobCol="previous"
    -Dreplace="false"
    -DsampleSize="500"
    -DinputTablePartitions="pt=20150501"
    -DinputTableName="bank_data_partition";
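If the command text is generated by a script rather than typed by hand, it can be assembled from a dictionary of parameters. The following Python sketch only builds the string. The helper name `build_pai_command` is hypothetical and is not part of any PAI SDK. The parameter names and values are taken from the preceding command.

```python
def build_pai_command(name, project, params):
    """Assemble a PAI command string from a dict of -D parameters."""
    lines = [f"PAI -name {name}", f"    -project {project}"]
    for key, value in params.items():
        # Booleans are written in lowercase, like "false" in the command above.
        if isinstance(value, bool):
            value = str(value).lower()
        lines.append(f'    -D{key}="{value}"')
    # The command is terminated by a semicolon.
    return "\n".join(lines) + ";"


command = build_pai_command(
    "WeightedSample",
    "algo_public",
    {
        "lifecycle": 28,
        "outputTableName": "test2",
        "probCol": "previous",
        "replace": False,
        "sampleSize": 500,
        "inputTableName": "bank_data_partition",
    },
)
print(command)
```

The resulting string can be pasted into the SQL Script component. The helper does not escape double quotation marks inside values, so it is suitable only for simple table and column names.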

Parameter	Required	Description	Default value
inputTableName	Yes	The name of the input table.	No default value
inputTablePartitions	No	The partitions that are selected from the input table for sampling. The following formats are supported: partition_name=value for a single-level partition, and name1=value1/name2=value2 for multi-level partitions. Note: Separate multiple partitions with commas (,).	All partitions
outputTableName	Yes	The name of the output table.	No default value
sampleSize	No	The number of samples. The value must be a positive integer. Note: If both the sampleSize and sampleRatio parameters are empty, an error is returned. If both parameters are specified, the sampleSize parameter takes precedence.	No default value
sampleRatio	No	The sampling proportion. The value must be a floating-point number. Valid values: (0,1).	No default value
probCol	Yes	The weighted columns. Each value represents the weight of an existing record. Normalization is not required. The values of the weighted columns must be of the DOUBLE or BIGINT type.	No default value
replace	No	Specifies whether to enable sampling with replacement. The value must be of the BOOLEAN type.	false, which indicates that sampling with replacement is disabled
randomSeed	No	The random seed. The value must be a positive integer.	Determined by the system
lifecycle	No	The lifecycle of the output table. Valid values: [1,3650]. Unit: days.	No default value
coreNum	No	The number of cores used in computing. The value must be a positive integer.	Determined by the system
memSizePerCore	No	The memory size of each core. Valid values: (1,65536). Unit: MB.	Determined by the system
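The interaction between sampleSize and sampleRatio described in the table can be sketched as follows. This Python function is a hypothetical illustration of the documented rules: an error if both parameters are empty, and sampleSize taking precedence if both are specified. The rounding applied to the ratio is an assumption, because the documentation does not state how a fractional count is handled.

```python
def resolve_sample_count(total_rows, sample_size=None, sample_ratio=None):
    """Return the number of rows to sample under the documented rules."""
    if sample_size is None and sample_ratio is None:
        # Documented: an error is returned when both parameters are empty.
        raise ValueError("specify sampleSize or sampleRatio")
    if sample_size is not None:
        # Documented: sampleSize takes precedence over sampleRatio.
        if not isinstance(sample_size, int) or sample_size <= 0:
            raise ValueError("sampleSize must be a positive integer")
        return sample_size
    # Documented: sampleRatio must lie in the open interval (0, 1).
    if not 0.0 < sample_ratio < 1.0:
        raise ValueError("sampleRatio must be in (0, 1)")
    # Assumption: fractional counts are rounded down.
    return int(total_rows * sample_ratio)


print(resolve_sample_count(1000, sample_size=500, sample_ratio=0.1))
print(resolve_sample_count(1000, sample_ratio=0.25))
```

Checking these rules before submitting a command can catch a missing or out-of-range value earlier than a failed job would.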