One-hot encoding converts a feature that has multiple values into multiple binary features. The binary features are mutually exclusive: only one of them can be enabled at a time. After one-hot encoding, the output data consists of key-value pairs in the sparse format.
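The idea can be illustrated with a minimal Python sketch. It is not the component's implementation: the function name one_hot_kv and the choice to order values by their string form are assumptions made only for this illustration.

```python
def one_hot_kv(values):
    """Encode a list of categorical values as sparse key-value strings."""
    # Assign one index per distinct value, ordered by string form (assumption).
    categories = sorted({str(v) for v in values})
    index = {c: i for i, c in enumerate(categories)}
    # Each row enables exactly one binary feature, written as "<index>:1".
    return index, [f"{index[str(v)]}:1" for v in values]


index, encoded = one_hot_kv(["red", "green", "red", "blue"])
print(index)    # {'blue': 0, 'green': 1, 'red': 2}
print(encoded)  # ['2:1', '1:1', '2:1', '0:1']
```

Each row carries only the index of the feature that is enabled, which is why the sparse key-value form stays compact even when a column has many distinct values.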

Overview

The One-Hot Encoding component provides training and prediction features.
  • Training feature:
    • Input nodes: The first (left) input node of this component receives the training data. The second (right) input node does not need to be connected during training.
    • Output nodes: This component has two output nodes. The left one is the encoded output table and the right one is the output model table. The output model table is used to perform one-hot encoding for new data of the same type.
  • Prediction feature:

    The second (right) input node of the One-Hot Encoding component is used to import a one-hot encoding model. An existing one-hot encoding model can be used to encode new data.

Configure the component

You can configure the component by using one of the following methods:
  • Use the Machine Learning Platform for AI console
    Tab | Parameter | Description
    Fields Setting | Binarization Column | Required. The fields that require binarization.
    Fields Setting | Other Reserved Features | The fields that are reserved and exported as features in the key-value format. The fields must be of the DOUBLE type. They are not one-hot encoded, and their feature indexes start from 0.
    Fields Setting | Appending Columns | Optional. The columns that are appended to the output table.
    Parameters Setting | Lifecycle | The lifecycle of the output table. Default value: 7.
    Parameters Setting | Output table type | The type of the output table. Valid values: kv and table. If a large number of features require encoding, we recommend that you set this parameter to kv. If you set this parameter to table, the output table can contain a maximum of 1,024 columns. If the number of exported columns exceeds this limit, an error is reported.
    Parameters Setting | Worker number | The number of cores.
    Parameters Setting | Memory Size per Node | The memory size of each core. Unit: MB.
    Parameters Setting | Delete the encoding of the last enumeration | If you select this check box, the encoding of the last enumerated value is deleted, which ensures the linear independence of the encoded data.
    Parameters Setting | Ignore empty elements in the data to be encoded | If you select this check box, empty elements are not encoded.
  • Use commands
    PAI -name one_hot
        -project algo_public
        -DinputTable=one_hot_test
        -DbinaryCols=f0,f1,f2
        -DoutputModelTable=one_hot_model
        -DoutputTable=one_hot_output
        -Dlifecycle=28;
    Parameter | Required | Description | Default value
    inputTable | Yes | The name of the input table. | N/A
    inputTablePartitions | No | The partitions that are selected from the input table for training. | All partitions of the input table
    binaryCols | Yes | The fields that require one-hot encoding. These fields must be enumerated features. Their data types are not limited. | N/A
    reserveCols | No | The selected fields that are exported as features in the key-value format. The fields must be of the DOUBLE type. They are not one-hot encoded, and their feature indexes start from 0. | Empty string
    appendCols | No | The selected fields that are exported to the output table unchanged from the input table. | N/A
    outputTable | Yes | The output table that is generated after one-hot encoding. The encoding result is stored in the key-value format. | N/A
    inputModelTable | No | The input model table for one-hot encoding. Note: At least one of the inputModelTable and outputModelTable parameters must be a non-empty string. | Empty string
    outputModelTable | No | The output model table for one-hot encoding. Note: At least one of the inputModelTable and outputModelTable parameters must be a non-empty string. | Empty string
    lifecycle | No | The lifecycle of the output table. | 7
    dropLast | Yes | Specifies whether to delete the encoding result of the last enumerator. If this parameter is set to true, the linear independence of the encoded data is ensured. | false
    outputTableType | Yes | The type of the output table. Valid values: kv and table. If a large number of features require encoding, we recommend that you set this parameter to kv. If you set this parameter to table, the output table can contain a maximum of 1,024 columns. If the number of exported columns exceeds this limit, an error is reported. | kv
    ignoreNull | Yes | Specifies whether to ignore empty elements in the data that requires encoding. If this parameter is set to true, empty elements are not encoded. | false
    coreNum | No | The number of cores. | Determined by the system
    memSizePerCore | No | The memory size of each core. Unit: MB. Valid values: [2048, 64 × 1024]. | Determined by the system
    Instructions:
    • At least one of the inputModelTable and outputModelTable parameters must be a non-empty string. If the inputModelTable parameter is a non-empty string, the table that it specifies must be a non-empty model table.
    • The columns that require one-hot encoding can contain tens of millions of distinct values.
    • If you use the trained model to encode data later, you cannot change the values of the dropLast, ignoreNull, and reserveCols parameters, because the effects of these three parameters are built into the model. To change any of them, you must train the model again.
    • We recommend that you export the output table in the key-value format. If you use the table format, you can export a maximum of 1,024 columns. If the number of the exported columns exceeds the value, an error is reported and the encoding fails.
    • By default, the feature indexes in the key-value output table that is generated after one-hot encoding start from 0.
    • If you use the trained model to encode new data and a discrete value in the data cannot be found in the model mapping table, the value is ignored and is not encoded. To encode such values, you must train the model mapping table again.
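The behavior described in these instructions can be sketched in Python. This is a hedged illustration, not the component's code: the names fit and transform are hypothetical, reserveCols is left out for brevity, and two points are assumptions (values are ordered by their string form with null last, and under drop_last the value that is dropped is the last one in that order).

```python
def fit(rows, binary_cols, drop_last=False, ignore_null=False):
    """Build a (column, value) -> index mapping, in the spirit of a model table."""
    mapping, next_index = {}, 0
    for col in binary_cols:
        seen = {row[col] for row in rows}
        # Assumption: non-null values are ordered by string form, null comes last.
        keys = sorted(str(v) for v in seen if v is not None)
        if None in seen and not ignore_null:
            keys.append(None)
        if drop_last and keys:
            keys = keys[:-1]  # the dropped value is represented by "no feature set"
        for key in keys:
            mapping[(col, key)] = next_index
            next_index += 1
    return mapping


def transform(rows, binary_cols, mapping):
    """Encode rows as sparse key-value strings using an existing mapping."""
    encoded = []
    for row in rows:
        pairs = []
        for col in binary_cols:
            key = None if row[col] is None else str(row[col])
            idx = mapping.get((col, key))
            if idx is not None:  # unseen, dropped, or ignored values are skipped
                pairs.append(f"{idx}:1")
        encoded.append(",".join(pairs))
    return encoded


train = [{"f0": 12}, {"f0": 3}, {"f0": None}, {"f0": 34}]
new = [{"f0": 3}, {"f0": 99}, {"f0": None}]

model = fit(train, ["f0"])
print(transform(new, ["f0"], model))  # ['1:1', '', '3:1']
```

The value 99 does not appear in the training data, so it produces no key-value pair. Because drop_last and ignore_null change the mapping itself, a model that is built with one setting cannot be reused with another.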

Example

  1. Run the following PAI command to perform one-hot encoding:
    PAI -project projectxlib4
      -name one_hot
      -DinputTable=one_hot_yh
      -DbinaryCols=f0,f2,f4
      -DoutputModelTable=one_hot_model_8
      -DoutputTable=one_hot_in_table_1_output_8
      -DdropLast=false
      -DappendCols=f0,f2,f4
      -DignoreNull=false
      -DoutputTableType=table
      -DreserveCols=f3
      -DcoreNum=4
      -DmemSizePerCore=2048;
  2. Use the data listed in the following table as the test input.
    f0 | f1 | f2 | f3 | f4
    12 | prefix1 | 1970-09-15 12:50:22 | 0.1 | true
    12 | prefix3 | 1971-01-22 03:15:33 | 0.4 | false
    NULL | prefix3 | 1970-01-01 08:00:00 | 0.2 | NULL
    3 | NULL | 1970-01-01 08:00:00 | 0.3 | false
    34 | NULL | 1970-09-15 12:50:22 | 0.4 | NULL
    3 | prefix1 | 1970-09-15 12:50:22 | 0.2 | true
    3 | prefix1 | 1970-09-15 12:50:22 | 0.3 | false
    3 | prefix3 | 1970-01-01 08:00:00 | 0.2 | true
    3 | prefix3 | 1971-01-22 03:15:33 | 0.1 | false
    NULL | prefix3 | 1970-01-01 08:00:00 | 0.3 | false

    In the preceding input table, the f0 column is of the BIGINT type, the f1 column is of the STRING type, the f2 column is of the DATETIME type, the f3 column is of the DOUBLE type, and the f4 column is of the BOOLEAN type.

  3. Obtain the following model mapping table in the test result.
    col_name col_value mapping
    _reserve_ f3 0
    f0 12 1
    f0 3 2
    f0 34 3
    f0 null 4
    f2 22222222222 5
    f2 33333333333 6
    f2 4 7
    f4 0 8
    f4 1 9
    f4 null 10
    The top row in the model mapping table is the reserve row. Its column name is fixed to _reserve_, and it stores the information about the reserved fields. The remaining rows store the mapping information of the encoding.
    • Encoded table in the table format
      f0 f1 f3 f4 _reserve__f3_0 f0_12_1 f0_3_2 f0_34_3 f0_null_4 f2_22222222_5 f2_33333333_6 f2_4_7 f4_0_8 f4_1_9 f4_null_10
      12 prefix1 0.1 true 0.1 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
      12 prefix3 0.4 false 0.4 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0
      NULL prefix3 0.2 NULL 0.2 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0
      3 NULL 0.3 false 0.3 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
      34 NULL 0.4 NULL 0.4 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
      3 prefix1 0.2 true 0.2 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
      3 prefix1 0.3 false 0.3 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
      3 prefix3 0.2 true 0.2 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
      3 prefix3 0.1 false 0.1 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0
      NULL prefix3 0.3 false 0.3 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0
    • Encoded table in the key-value format
      f0 f1 f3 f4 kv
      12 prefix1 0.1 true 0:0.1,1:1,5:1,9:1
      12 prefix3 0.4 false 0:0.4,1:1,6:1,8:1
      NULL prefix3 0.2 NULL 0:0.2,4:1,7:1,10:1
      3 NULL 0.3 false 0:0.3,2:1,7:1,8:1
      34 NULL 0.4 NULL 0:0.4,3:1,5:1,10:1
      3 prefix1 0.2 true 0:0.2,2:1,5:1,9:1
      3 prefix1 0.3 false 0:0.3,2:1,5:1,8:1
      3 prefix3 0.2 true 0:0.2,2:1,7:1,9:1
      3 prefix3 0.1 false 0:0.1,2:1,6:1,8:1
      NULL prefix3 0.3 false 0:0.3,4:1,7:1,8:1
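The relationship between the model mapping table and the key-value output can be checked with a small Python sketch. It only replays the example. The f2 and f4 keys are inferred by comparing the input rows with the encoded output, because the model table stores these values in another form. The names MAPPING, RESERVE, and encode_row are hypothetical.

```python
# Model mapping table from the example: (col_name, value) -> mapping index.
# Assumption: the f2 and f4 keys are inferred from the encoded output tables.
MAPPING = {
    ("f0", "12"): 1, ("f0", "3"): 2, ("f0", "34"): 3, ("f0", None): 4,
    ("f2", "1970-09-15 12:50:22"): 5,
    ("f2", "1971-01-22 03:15:33"): 6,
    ("f2", "1970-01-01 08:00:00"): 7,
    ("f4", "false"): 8, ("f4", "true"): 9, ("f4", None): 10,
}
RESERVE = {"f3": 0}  # the _reserve_ row: f3 keeps its DOUBLE value at index 0


def encode_row(row):
    """Return the kv string for one input row (None stands for NULL)."""
    pairs = [f"{idx}:{row[col]}" for col, idx in RESERVE.items()]
    for col in ("f0", "f2", "f4"):
        idx = MAPPING.get((col, row[col]))
        if idx is not None:
            pairs.append(f"{idx}:1")
    return ",".join(pairs)


first = {"f0": "12", "f2": "1970-09-15 12:50:22", "f3": 0.1, "f4": "true"}
third = {"f0": None, "f2": "1970-01-01 08:00:00", "f3": 0.2, "f4": None}
print(encode_row(first))  # 0:0.1,1:1,5:1,9:1
print(encode_row(third))  # 0:0.2,4:1,7:1,10:1
```

The two printed strings match the kv column of the first and third rows in the encoded table in the key-value format.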

Scalability test

Test data: The number of samples is 200 million, and the number of enumerators is 100,000. The test data is listed in the following table.
f0 f1
94 prefix3689
9664 prefix5682
2062 prefix5530
9075 prefix9854
9836 prefix1764
5140 prefix1149
3455 prefix7272
2508 prefix7139
7993 prefix1551
5602 prefix4606
3132 prefix5767
The test result is listed in the following table.
Number of cores | Training time | Prediction time | Acceleration ratio (training/prediction)
5 | 84s | 181s | 1/1
10 | 60s | 93s | 1.4/1.95
20 | 46s | 56s | 1.8/3.23
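The acceleration ratios appear to be the 5-core time divided by the time measured with more cores. The following Python sketch recomputes them under that assumption. The names timings and ratios are hypothetical.

```python
# Timings from the result table: cores -> (training seconds, prediction seconds).
timings = {5: (84, 181), 10: (60, 93), 20: (46, 56)}
base_train, base_predict = timings[5]


def ratios(cores):
    """Return (training speedup, prediction speedup) relative to 5 cores."""
    train, predict = timings[cores]
    return round(base_train / train, 2), round(base_predict / predict, 2)


for cores in timings:
    print(cores, ratios(cores))
# 5 (1.0, 1.0)
# 10 (1.4, 1.95)
# 20 (1.83, 3.23)
```

With two decimal places, the training ratio for 20 cores is 1.83, which the table lists as 1.8.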
Usage notes when you perform the test in the console:
  • Use the One-Hot Encoding component to encode data. The following figure shows the experiment process. [Figure: Process]
  • Use the trained model of the One-Hot Encoding component to encode data. The following figure shows the experiment process. [Figure: Experiment]