This topic describes the Split component provided by Machine Learning Designer. This component splits a dataset into two output tables, either randomly by ratio or by a threshold on a column, typically to generate datasets for training and testing.

Configure the component

You can configure the component by using one of the following methods:

Method 1: Use the Machine Learning Platform for AI (PAI) console

Configure the component on the pipeline configuration page of Machine Learning Designer in the PAI console.
| Tab | Parameter | Description |
| --- | --- | --- |
| Parameters Setting | Splitting Method | The method used to split the data. Valid values: Split by Ratio and Split by Threshold. |
| Parameters Setting | Splitting Fraction | The proportion of the data that is allocated to output table 1. Valid values: (0,1). |
| Parameters Setting | Random Seed | The random seed, which is automatically generated. |
| Parameters Setting | ID Column (Do Not Split Columns with the Same ID) | The ID column. Rows that have the same ID are not separated. Instead, they are randomly allocated together to output table 1 or output table 2. Note: This parameter is displayed only if you select Advanced Options. Only a single ID column can be selected. |
| Parameters Setting | Threshold Column | The threshold column. The data is split based on whether the values in this column reach a threshold. STRING-typed columns cannot be selected. |
| Parameters Setting | Threshold | The threshold applied to the column specified by Threshold Column. Rows whose value is less than the threshold go to output table 1. Rows whose value is greater than or equal to the threshold go to output table 2. Important: To split data by threshold, clear the settings that apply only to Split by Ratio, such as Splitting Fraction. |
| Tuning | Cores | The number of cores. The system automatically allocates cores based on the volume of input data. |
| Tuning | Memory Size per Core | The memory size of each core. The system automatically allocates the memory based on the volume of input data. Unit: MB. |
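
The splitting rules described above can be illustrated outside PAI. The following is a minimal sketch in plain Python, not the component's implementation. The function names, the dictionary-based rows, the shuffle-then-cut strategy for the ratio split, and applying the fraction to IDs rather than rows in the grouped variant are all assumptions made for illustration. Only the threshold rule (values less than the threshold go to output table 1, all others to output table 2) and the rule that rows with the same ID stay together come from the parameter descriptions.

```python
import random
from collections import defaultdict


def split_by_ratio(rows, fraction, seed=None):
    """Randomly send about `fraction` of the rows to output 1, the rest to output 2."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be in the open interval (0, 1)")
    shuffled = list(rows)
    # Assumption: shuffle, then cut. The real component may sample differently.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


def split_by_threshold(rows, column, threshold):
    """Output 1 gets values below the threshold, output 2 gets the rest."""
    out1 = [r for r in rows if r[column] < threshold]
    out2 = [r for r in rows if r[column] >= threshold]
    return out1, out2


def split_by_ratio_grouped(rows, fraction, id_column, seed=None):
    """Ratio split in which all rows sharing an ID land in the same output."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[id_column]].append(r)
    ids = list(groups)
    # Assumption: the fraction is applied to distinct IDs, not to rows.
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * fraction)
    out1 = [r for i in ids[:cut] for r in groups[i]]
    out2 = [r for i in ids[cut:] for r in groups[i]]
    return out1, out2
```

Passing the same `seed` reproduces the same split, which mirrors the purpose of the Random Seed parameter. The grouped variant shows why an ID column matters: without it, rows that belong to the same entity could end up in both the training and the test output.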

Method 2: Run a PAI command

Configure the component by running a PAI command. You can use the SQL Script component to run PAI commands. For more information, see SQL Script.
PAI -name split -project algo_public
    -DinputTableName=wpbc
    -Doutput1TableName=wpbc_split1
    -Doutput2TableName=wpbc_split2
    -Dfraction=0.25;
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | None |
| inputTablePartitions | No | The partitions of the input table to use as input. Supported formats: Partition_name=value, or name1=value1/name2=value2 for multi-level partitions. Note: If you specify multiple partitions, separate them with commas (,). | All partitions |
| output1TableName | Yes | The name of output table 1. | None |
| output1TablePartition | No | The name of the partition in output table 1. | Non-partitioned table |
| output2TableName | Yes | The name of output table 2. | None |
| output2TablePartition | No | The name of the partition in output table 2. | Non-partitioned table |
| fraction | No | The proportion of the data that is allocated to output table 1. Valid values: (0,1). | None |
| randomSeed | No | The random seed. The value must be a positive integer. | Determined by the system |
| idColName | No | The ID column. Rows that have the same ID are not separated across the two output tables. | None |
| thresholdColName | No | The threshold column. STRING-typed columns cannot be selected. | None |
| threshold | No | The threshold. Rows whose value in the threshold column is less than the threshold go to output table 1. All other rows go to output table 2. | None |
| lifecycle | No | The lifecycle of the output table. Valid values: [1,3650]. | None |
| coreNum | No | The number of cores. | Determined by the system |
| memSizePerCore | No | The memory size of each core. Unit: MB. Valid values: (1,65536). | Determined by the system |
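
When the command is generated from a script, the ranges listed in the parameter table can be checked before the command is submitted. The following is a sketch under stated assumptions: `validate_split_params` and `build_split_command` are hypothetical helper names, not part of any PAI SDK, and the checks encode only the ranges documented above. The sketch also assumes that a command uses exactly one mode, either `fraction` or `thresholdColName` together with `threshold`, following the note that ratio settings must be cleared for a threshold split.

```python
def validate_split_params(params):
    """Check a dict of Split parameters against the documented ranges."""
    for name in ("inputTableName", "output1TableName", "output2TableName"):
        if not params.get(name):
            raise ValueError(f"{name} is required")

    by_ratio = "fraction" in params
    by_threshold = "thresholdColName" in params or "threshold" in params
    # Assumption: exactly one of the two splitting modes is configured.
    if by_ratio == by_threshold:
        raise ValueError("set either fraction, or thresholdColName and threshold")
    if by_ratio and not 0 < params["fraction"] < 1:
        raise ValueError("fraction must be in (0, 1)")
    if by_threshold and not ("thresholdColName" in params and "threshold" in params):
        raise ValueError("thresholdColName and threshold must be set together")

    seed = params.get("randomSeed")
    if seed is not None and not (isinstance(seed, int) and seed > 0):
        raise ValueError("randomSeed must be a positive integer")
    lifecycle = params.get("lifecycle")
    if lifecycle is not None and not 1 <= lifecycle <= 3650:
        raise ValueError("lifecycle must be in [1, 3650]")
    mem = params.get("memSizePerCore")
    if mem is not None and not 1 < mem < 65536:
        raise ValueError("memSizePerCore must be in (1, 65536)")


def build_split_command(params):
    """Render the parameters as PAI command text, one -D option per line."""
    validate_split_params(params)
    lines = ["PAI -name split -project algo_public"]
    lines += [f"    -D{name}={value}" for name, value in params.items()]
    return "\n".join(lines) + ";"
```

The helper only produces text. Running the resulting command still requires a PAI environment, for example through the SQL Script component.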