This topic describes the Split component provided by Machine Learning Studio. This component splits a dataset into two output datasets, typically one for training and one for testing. Data can be split randomly by ratio or split by a threshold.

Split

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI console
    | Tab | Parameter | Description |
    | --- | --- | --- |
    | Parameters Setting | Splitting Method | The method used to split data. Valid values: Split by Ratio and Split by Threshold. |
    | Parameters Setting | Splitting Fraction | Valid values: (0,1). |
    | Parameters Setting | Random Seed | The random seed. By default, the seed is automatically generated. |
    | Parameters Setting | ID Column (Do Not Split Columns with the Same ID) | The ID column. Rows that share the same ID value are kept together and are not split across the two outputs. Note: This parameter is available only when you select Advanced Options. |
    | Parameters Setting | Threshold Columns | The columns that contain the threshold value. Columns of the STRING type are not supported. |
    | Parameters Setting | Threshold | The data specified by Splitting Fraction must be excluded. |
    | Tuning | Cores | The number of cores. By default, cores are automatically allocated based on the volume of the input data. |
    | Tuning | Memory Size per Core | The memory size of each core, in MB. By default, the memory size is automatically allocated based on the volume of the input data. |
  • PAI command
    PAI -name split -project algo_public
        -DinputTableName=wpbc
        -Doutput1TableName=wpbc_split1
        -Doutput2TableName=wpbc_split2
        -Dfraction=0.25;
    | Parameter | Required | Description | Default value |
    | --- | --- | --- | --- |
    | inputTableName | Yes | The name of the input table. | No default value |
    | inputTablePartitions | No | The partitions selected from the input table. Supported formats: Partition_name=value, or name1=value1/name2=value2 for multi-level partitions. Note: If you specify multiple partitions, separate them with commas (,). | All partitions |
    | output1TableName | Yes | The name of output table 1. | No default value |
    | output1TablePartition | No | The names of the partitions in output table 1. | Output table 1 is a non-partitioned table. |
    | output2TableName | Yes | The name of output table 2. | No default value |
    | output2TablePartition | No | The names of the partitions in output table 2. | Output table 2 is a non-partitioned table. |
    | fraction | No | The fraction of the input data that is allocated to output table 1. Valid values: (0,1). | No default value |
    | randomSeed | No | The random seed, which must be a positive integer. | Automatically allocated |
    | idColName | No | The ID column. Rows that share the same ID value are not split across the two output tables. | No default value |
    | thresholdColName | No | The column that contains the threshold value. Columns of the STRING type are not supported. | No default value |
    | threshold | No | The threshold. | No default value |
    | lifecycle | No | The lifecycle of the output table. Valid values: [1,3650]. | No default value |
    | coreNum | No | The number of cores. | Automatically allocated |
    | memSizePerCore | No | The memory size of each core, in MB. Valid values: (1,65536). | Automatically allocated |
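
To make the meaning of fraction, randomSeed, idColName, and threshold more concrete, the following Python sketch mimics the two splitting methods on an in-memory list of rows. It is an illustration only, not the implementation used by the Split component. The function names split_by_ratio and split_by_threshold, the sample columns user and score, and the rule that values below the threshold go to the first output are all assumptions made for this example. In the sketch, each row (or each ID group) is assigned independently, so the first output receives approximately, not exactly, the requested fraction.

```python
import random
from collections import defaultdict


def split_by_ratio(rows, fraction, seed=None, id_key=None):
    """Randomly send about `fraction` of the rows (or ID groups) to output 1."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be in the open interval (0, 1)")
    rng = random.Random(seed)  # a fixed seed makes the split repeatable

    # Group rows so that rows sharing an ID always land in the same output.
    # Without an ID column, every row forms its own group.
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = row[id_key] if id_key is not None else index
        groups[key].append(row)

    out1, out2 = [], []
    for group in groups.values():
        target = out1 if rng.random() < fraction else out2
        target.extend(group)
    return out1, out2


def split_by_threshold(rows, column, threshold):
    """Assumed rule: values strictly below the threshold go to output 1."""
    out1 = [row for row in rows if row[column] < threshold]
    out2 = [row for row in rows if row[column] >= threshold]
    return out1, out2


# Hypothetical sample data: 100 rows, 10 distinct user IDs.
rows = [{"user": i % 10, "score": i} for i in range(100)]

# Ratio split that keeps all rows of one user together.
part1, part2 = split_by_ratio(rows, fraction=0.25, seed=42, id_key="user")

# Threshold split on a numeric column.
low, high = split_by_threshold(rows, column="score", threshold=30)
print(len(low), len(high))  # 30 70
```

Passing the same seed reproduces the same split, which mirrors the purpose of randomSeed. Grouping by ID before assigning rows is one simple way to guarantee that no ID appears in both outputs.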