The Stratified Sampling component stratifies the input data based on the values of a stratification column and randomly samples data in each stratum.
You can configure the component by using one of the following methods:
- Use the Machine Learning Platform for AI console
  | Tab | Parameter | Description |
  | --- | --- | --- |
  | Fields Setting | Stratification Column | The columns that are used for stratification. |
  | Parameters Setting | Sample Size | The value must be a positive integer. |
  | Parameters Setting | Sampling Fraction | The value must be a floating-point number. Valid values: (0,1). |
  | Parameters Setting | Random Seed | The value is automatically generated by the system. The default value is 1234567. |
  | Tuning | Cores | The value must be a positive integer. By default, the system determines the value. |
  | Tuning | Memory Size per Core | The value must be a positive integer. Valid values: (1,65536). By default, the system determines the value. |

- Use commands
  PAI -name StratifiedSample -project algo_public \
      -DinputTableName="test_input" \
      -DoutputTableName="test_output" \
      -DstrataColName="label" \
      -DsampleSize="A:200,B:300,C:500" \
      -DrandomSeed=1007 \
      -Dlifecycle=30;
  | Parameter | Required | Description | Default value |
  | --- | --- | --- | --- |
  | inputTableName | Yes | The name of the input table. | N/A |
  | inputTablePartitions | No | The partitions that are selected from the input table for sampling. Specify this parameter in one of the following formats: Partition_name=value, or name1=value1/name2=value2 for multi-level partitions. Note: Separate multiple partitions with commas (,). | All partitions |
  | outputTableName | Yes | The name of the output table. | N/A |
  | strataColName | Yes | The name of the column that is used as the key for stratification. | N/A |
  | sampleSize | No | The number of samples. If the value is a positive integer, it is the number of samples drawn from each stratum. If the value is a string, it must be in the format strata0:n0,strata1:n1, where the value after each colon (:) is the number of samples for the stratum named before the colon. | N/A |
  | sampleRatio | No | The sampling proportion. If the value is a number, it must be a floating-point number between 0 and 1 and is the sampling proportion of each stratum. If the value is a string, it must be in the format strata0:r0,strata1:r1, where the value after each colon (:) is the sampling proportion for the stratum named before the colon. | N/A |
  | randomSeed | No | The random seed. The value must be a positive integer. | 123456 |
  | lifecycle | No | The lifecycle of the output table. Valid values: [1,3650]. | N/A |
  | coreNum | No | The number of cores. The value must be a positive integer. | Determined by the system |
  | memSizePerCore | No | The memory size of each core. Valid values: (1,65536). Unit: MB. | Determined by the system |

  Note:
  - If both the sampleSize and sampleRatio parameters are empty, an error is returned.
  - If both the sampleSize and sampleRatio parameters are not empty, the sampleSize parameter takes precedence.
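
The sampleSize and sampleRatio rules described above can be illustrated with a small local sketch in Python. This is not the component's implementation. The function names parse_spec and stratified_sample are hypothetical, and several behaviors are assumptions made only for illustration: strata that are not listed in a string specification are skipped, a requested size larger than a stratum is capped at the stratum size, and a ratio is converted to a count by rounding.

```python
import random
from collections import defaultdict


def parse_spec(spec, cast):
    """Parse 'A:200,B:300' into {'A': 200, 'B': 300}.

    A bare number (for example 200 or 0.3) is returned as a single value
    that applies to every stratum.
    """
    text = str(spec).strip()
    if ":" not in text:
        return cast(text)
    result = {}
    for item in text.split(","):
        key, value = item.rsplit(":", 1)
        result[key.strip()] = cast(value)
    return result


def stratified_sample(rows, strata_col, sample_size=None, sample_ratio=None,
                      seed=123456):
    """Randomly sample rows (a list of dicts) within each stratum."""
    # Documented rule: an error is returned if both parameters are empty.
    if sample_size is None and sample_ratio is None:
        raise ValueError("set sample_size or sample_ratio")

    # Documented rule: sampleSize takes precedence when both are set.
    use_size = sample_size is not None
    spec = parse_spec(sample_size, int) if use_size else parse_spec(sample_ratio, float)

    # Group the rows by the value of the stratification column.
    strata = defaultdict(list)
    for row in rows:
        strata[str(row[strata_col])].append(row)

    rng = random.Random(seed)
    sampled = []
    for key, members in strata.items():
        target = spec.get(key) if isinstance(spec, dict) else spec
        if target is None:
            continue  # assumption: strata missing from the spec are skipped
        # Assumption: a ratio is turned into a count by rounding.
        k = target if use_size else round(len(members) * target)
        k = min(k, len(members))  # assumption: cap at the stratum size
        sampled.extend(rng.sample(members, k))
    return sampled


# Toy input: 10 rows in stratum A and 20 rows in stratum B.
rows = [{"id": i, "label": "A"} for i in range(10)]
rows += [{"id": i, "label": "B"} for i in range(10, 30)]
sample = stratified_sample(rows, "label", sample_size="A:3,B:5")
print(len(sample))
```

Passing sample_ratio="A:0.5,B:0.25" instead selects a proportion of each stratum, and passing the same seed twice reproduces the same sample, which mirrors the purpose of the randomSeed parameter.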