This topic describes the Split component provided by Machine Learning Designer. This component randomly splits data to generate datasets for training and testing.

Split

You can configure the component by using the Machine Learning Platform for AI (PAI) console or a PAI command.
  • PAI console
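    | Tab | Parameter | Description |
    | --- | --- | --- |
    | Parameters Setting | Splitting Method | The method used to split the data: Split by Ratio or Split by Threshold. |
    | Parameters Setting | Splitting Fraction | The fraction of the input data that is written to output table 1. Valid values: (0,1). |
    | Parameters Setting | Random Seed | The random seed. By default, the system generates it automatically. |
    | Parameters Setting | ID Column (Do Not Split Columns with the Same ID) | The ID column. Rows that have the same ID are not split across the output tables. Note: This parameter is displayed only if you select Advanced Options. |
    | Parameters Setting | Threshold Column | The column that contains the threshold value. Columns of the STRING type are not supported. |
    | Parameters Setting | Threshold | The threshold. Rows in output table 1 have a threshold column value that is less than the threshold. Notice: If you split by threshold, leave Splitting Fraction empty. |
    | Tuning | Cores | The number of cores. The system automatically allocates the cores used for training based on the volume of input data. |
    | Tuning | Memory Size per Core | The memory size of each core. The system automatically allocates the memory based on the volume of input data. Unit: MB. |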
  • PAI command
    PAI -name split -project algo_public
        -DinputTableName=wpbc
        -Doutput1TableName=wpbc_split1
        -Doutput2TableName=wpbc_split2
        -Dfraction=0.25;
    | Parameter | Required | Description | Default value |
    | --- | --- | --- | --- |
    | inputTableName | Yes | The name of the input table. | None |
    | inputTablePartitions | No | The partitions that are selected from the input table for training. Supported formats: Partition_name=value, or name1=value1/name2=value2 for multi-level partitions. Note: If you specify multiple partitions, separate them with commas (,). | All partitions |
    | output1TableName | Yes | The name of output table 1. | None |
    | output1TablePartition | No | The names of the partitions in output table 1. | Non-partitioned table |
    | output2TableName | Yes | The name of output table 2. | None |
    | output2TablePartition | No | The names of the partitions in output table 2. | Non-partitioned table |
    | fraction | No | The fraction of the input data that is allocated to output table 1. Valid values: (0,1). | None |
    | randomSeed | No | The random seed. The value must be a positive integer. | Determined by the system |
    | idColName | No | The ID column. Rows that have the same ID are not split across the two output tables. | None |
    | thresholdColName | No | The column that contains the threshold value. Columns of the STRING type are not supported. | None |
    | threshold | No | The threshold. Rows whose value in the threshold column is less than the threshold are written to output table 1. | None |
    | lifecycle | No | The lifecycle of the output table. Valid values: [1,3650]. | None |
    | coreNum | No | The number of cores. | Determined by the system |
    | memSizePerCore | No | The memory size of each core. Unit: MB. Valid values: (1,65536). | Determined by the system |
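
The two splitting methods described above can be illustrated with a minimal Python sketch. It is not the component's implementation and does not call PAI. The function names `split_by_ratio` and `split_by_threshold`, the list-of-dicts row format, and the sample columns `id` and `score` are all hypothetical. The sketch also assumes that the fraction is applied to ID groups rather than to individual rows, and that rows equal to the threshold go to output 2; the component may behave differently in both respects.

```python
import random
from collections import defaultdict


def split_by_ratio(rows, fraction, random_seed=None, id_col=None):
    """Randomly split rows into two lists (loosely mirrors fraction, randomSeed, idColName)."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be in the open interval (0, 1)")
    # Group rows so that rows sharing an ID always land in the same output.
    # Without an ID column, every row forms its own group.
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = row[id_col] if id_col is not None else index
        groups[key].append(row)
    keys = list(groups)
    # A fixed seed makes the shuffle, and therefore the split, reproducible.
    random.Random(random_seed).shuffle(keys)
    # Assumption: the fraction is applied to groups, not to individual rows.
    cut = round(len(keys) * fraction)
    output1 = [row for key in keys[:cut] for row in groups[key]]
    output2 = [row for key in keys[cut:] for row in groups[key]]
    return output1, output2


def split_by_threshold(rows, threshold_col, threshold):
    """Send rows below the threshold to output 1 and all other rows to output 2."""
    output1 = [row for row in rows if row[threshold_col] < threshold]
    output2 = [row for row in rows if row[threshold_col] >= threshold]
    return output1, output2


# Hypothetical sample data: 8 rows, where each pair of rows shares an ID.
rows = [{"id": i // 2, "score": i} for i in range(8)]
train, test = split_by_ratio(rows, fraction=0.25, random_seed=42, id_col="id")
low, high = split_by_threshold(rows, threshold_col="score", threshold=3)
```

With four ID groups and a fraction of 0.25, one group (two rows) goes to the first output and the remaining three groups go to the second, so no ID appears in both. Calling `split_by_ratio` again with the same seed returns the same split.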