
Data preprocessing

Last Updated: Aug 15, 2018

Weighted sampling

Weighted sampling generates sample data based on a weight column, which must be of the double or bigint type. Each row is sampled with probability proportional to its weight. For example, a row whose weight column value is 1.2 has a higher probability of being sampled than a row whose value is 1.0.

Parameter settings

Parameter box

weighted_sample_param

  • You can manually enter the number of samples (or the sampling ratio).
  • You can choose whether to enable sampling with replacement, which is disabled by default. To enable it, select the checkbox.
  • Select the weighted column from the drop-down list. The weighted column supports the double and bigint types.
  • The random seed can be configured. By default, it is automatically assigned.

PAI command

PAI -name WeightedSample -project algo_public \
    -DprobCol="previous" \
    -DsampleSize="500" \
    -DoutputTableName="test2" \
    -DinputPartitions="pt=20150501" \
    -DinputTableName="bank_data_partition";

Algorithm parameters

Parameter Description Option Default value
inputTableName (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions used for training in the input table, in the format of Partition_name=value. The multilevel format is name1=value1/name2=value2. If multiple partitions are specified, use commas (,) to separate them. NA All partitions in the input table
outputTableName (Required) Name of the output table NA NA
sampleSize (Optional) Number of samples Positive integer Null by default
sampleRatio (Optional) Sampling ratio Floating-point number in the range of (0,1) Null by default
probCol (Required) Weight column. Each value indicates the weight of a record. The weights do not need to be normalized. NA NA
replace (Optional) Indicates whether to enable sampling with replacement. The type is boolean. true / false false (disabled by default)
randomSeed (Optional) Random seed Positive integer Automatically generated by default
lifecycle (Optional) Life cycle of the output table Positive integer in the range of [1,3650] No life cycle for the output table
coreNum (Optional) Number of computing cores Positive integer Automatically assigned
memSizePerCore (Optional) Memory size of each core, in MB Positive integer in the range of (1, 65536) Automatically assigned

Note:

  • An error is reported if sampleSize and sampleRatio are both left blank.
  • If sampleSize and sampleRatio are both set, sampleSize prevails.
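The sampling behavior above can be sketched in Python. This is an illustration of the semantics, not the PAI implementation; `weighted_sample` and its arguments are hypothetical names that mirror the parameters above. The without-replacement branch uses the exponential-key (Efraimidis-Spirakis) trick, under which heavier rows sort earlier.

```python
import random

def weighted_sample(rows, weights, sample_size, replace=False, seed=None):
    """Draw sample_size rows with probability proportional to weight."""
    rng = random.Random(seed)
    if replace:
        return rng.choices(rows, weights=weights, k=sample_size)
    # Without replacement: sort by exponential keys with rate = weight;
    # larger weights give smaller expected keys, so they sort first.
    keyed = sorted(zip(rows, weights), key=lambda rw: rng.expovariate(rw[1]))
    return [r for r, _ in keyed[:sample_size]]
```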

Random sampling

Data is randomly sampled. Each sampling is independent.

Parameter settings

random_sample_param_setting

random_sample_opt

  • You can manually enter the number of samples (or the sampling ratio).
  • You can choose whether to enable sampling with replacement, which is disabled by default. To enable it, select the checkbox.
  • The random seed can be configured. By default, it is automatically assigned.
  • You can configure the number of concurrent computing cores and memory size. By default, they are automatically assigned.

PAI command

pai -name RandomSample -project algo_public \
    -DinputTableName=wbpc \
    -DoutputTableName=wpbc_sample \
    -DsampleSize=100 \
    -Dreplace=false \
    -DrandomSeed=1007;

Algorithm parameters

Parameter Description Option Default value
inputTableName (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions used for training in the input table, in the format of Partition_name=value.The multilevel format is name1=value1/name2=value2. If multiple partitions are specified, use commas (,) to separate them. NA All partitions in the input table
outputTableName (Required) Name of the output table NA NA
sampleSize (Optional) Number of samples NA Null by default
sampleRatio (Optional) Sampling ratio in the range of (0, 1) NA Null by default
replace (Optional) Indicates whether to enable sampling with replacement. The type is boolean. true / false false (disabled by default)
randomSeed (Optional) Random seed Positive integer Automatically generated by default
lifecycle (Optional) Life cycle of the output table Positive integer in the range of [1,3650] No life cycle for the output table
coreNum (Optional) Number of computing cores Positive integer Automatically assigned
memSizePerCore (Optional) Memory size of each core, in MB Positive integer in the range of (1, 65536) Automatically assigned

Note:

  • An error is reported if sampleSize and sampleRatio are both left blank.
  • If sampleSize and sampleRatio are both set, sampleSize prevails.
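The sampleSize/sampleRatio precedence and the replacement switch can be sketched as follows (a hypothetical `random_sample` helper, not the PAI implementation):

```python
import random

def random_sample(rows, sample_size=None, sample_ratio=None,
                  replace=False, seed=None):
    """Independent random sampling with optional replacement."""
    if sample_size is None and sample_ratio is None:
        raise ValueError("either sampleSize or sampleRatio must be set")
    # As in the component, sampleSize prevails when both are given.
    k = sample_size if sample_size is not None else int(len(rows) * sample_ratio)
    rng = random.Random(seed)
    return rng.choices(rows, k=k) if replace else rng.sample(rows, k)
```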

Filtering and mapping

Data can be filtered according to the filtering expression. Fields can be renamed.

Parameter settings

  1. Use WHERE conditions, similar to those in SQL statements, to filter data.
     The following operators are supported in filtering conditions: “=”, “!=”, “>”, “<”, “>=”, “<=”, “like”, and “rlike”.

  2. Rename fields.

PAI command

PAI -name Filter -project algo_public \
    -DoutTableName="test_9" \
    -DinputPartitions="pt=20150501" \
    -DinputTableName="bank_data_partition" \
    -Dfilter="age>=40";
  • name: Component name.
  • project: Project name, used to specify the space of the algorithm. The default value is algo_public. If you change the name, the system reports an error.
  • outTableName: Name of the output table.
  • inputPartitions: (Optional) Partitions used for training in the input table. If the entire table is selected, the value is None.
  • inputTableName: Name of the input table.
  • filter: WHERE filtering conditions. The following operators are supported: “=”, “!=”, “>”, “<”, “>=”, “<=”, “like”, and “rlike”.
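For illustration, a single `col OP value` condition with the operators listed above could be evaluated like this. This is a hypothetical mini-evaluator; the real component parses full WHERE-style expressions.

```python
import re

# One predicate per supported operator; "like" translates the SQL
# wildcards % and _ into a regular expression, "rlike" matches a regex.
OPS = {
    "=":  lambda a, b: a == b,
    "!=": lambda a, b: a != b,
    ">":  lambda a, b: a > b,
    "<":  lambda a, b: a < b,
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    "like":  lambda a, b: re.fullmatch(
        b.replace("%", ".*").replace("_", "."), str(a)) is not None,
    "rlike": lambda a, b: re.search(b, str(a)) is not None,
}

def filter_rows(rows, col, op, value):
    """Keep rows (dicts) whose column satisfies the condition."""
    return [r for r in rows if OPS[op](r[col], value)]
```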

Stratified sampling

A specified number or proportion of random samples is collected from each group of the dataset.

Parameter settings

Parameter box

stratified_sample

  • Select the group column from the drop-down list (a maximum of 100 groups are supported).
  • You can manually enter the number of samples of the group (or the sampling ratio).
  • You can configure the random seed. The default value is 1234567.
  • You can configure the number of concurrent computing cores and memory size. By default, they are automatically assigned.

PAI command

PAI -name StratifiedSample -project algo_public \
    -DinputTableName="test_input" \
    -DoutputTableName="test_output" \
    -DstrataColName="label" \
    -DsampleSize="A:200,B:300,C:500" \
    -DrandomSeed=1007 \
    -Dlifecycle=30;

Algorithm parameters

Parameter Description Option Default value
inputTableName (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions in the training input table. NA The entire table is selected by default.
outputTableName (Required) Name of the output table NA NA
strataColName (Required) Stratifying column. Used as the key for stratifying. NA NA
sampleSize (Optional) Sample size. An integer value specifies the same number of samples for every stratum. A string value in the format of strata0:n0,strata1:n1 specifies the number of samples for each stratum separately. NA NA
sampleRatio (Optional) Sampling ratio. A numeric value in the range of (0,1) specifies the same sampling ratio for every stratum. A string value in the format of strata0:r0,strata1:r1 specifies the sampling ratio for each stratum separately. NA NA
randomSeed (Optional) Random seed NA Default value: 123456
lifecycle (Optional) Life cycle of the output table NA This parameter is not set by default.
coreNum (Optional) Number of cores NA By default, the value is automatically assigned.
memSizePerCore (Optional) Memory size used by each core NA By default, the value is automatically assigned.

Note:

  • An error is reported if sampleSize and sampleRatio are both left blank.
  • If sampleSize and sampleRatio are both set, sampleSize prevails.
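The per-stratum rule can be sketched as follows (an illustrative `stratified_sample` helper, not the PAI implementation; `sample_size` accepts either an integer or a per-stratum dict, mirroring the parameter formats above):

```python
import random
from collections import defaultdict

def stratified_sample(rows, strata_col, sample_size, seed=1234567):
    """Sample each stratum independently; rows are dicts."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in rows:
        groups[r[strata_col]].append(r)   # stratify by the key column
    out = []
    for key, members in groups.items():
        n = sample_size if isinstance(sample_size, int) else sample_size.get(key, 0)
        out.extend(rng.sample(members, min(n, len(members))))
    return out
```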

Example

Source data

id outlook temperature humidity windy play
0 Sunny 85 85 false No
1 Sunny 80 90 true No
2 Overcast 83 86 false Yes
3 Rainy 70 96 false Yes
4 Rainy 68 80 false Yes
5 Rainy 65 70 true No
6 Overcast 64 65 true Yes
7 Sunny 72 95 false No
8 Sunny 69 70 false Yes
9 Rainy 75 80 false Yes
10 Sunny 75 70 true Yes
11 Overcast 72 90 true Yes
12 Overcast 81 75 false Yes
13 Rainy 71 91 true No
  1. Create an experiment.

    stratified_sampling_demo_exp

  2. Select the group column.

    stratified_sampling_demo_exp_select_strata_column
    Select the play column as the group column.

  3. Configure the number of samples.

    stratified_sampling_demo_exp_config_param
    Collect four samples from the group of “play=Yes” and three samples from the group of “play=No”.

  4. Sampling result.

    stratified_sampling_demo_exp_result

Join

Two tables are merged into one based on join conditions, and the output fields are selected. This is similar to the SQL JOIN statement.

Parameter settings

Set the parameters


  • Supported join types: left join, inner join, right join, and full join.
  • The join conditions must be equality conditions.
  • You can manually add or delete join conditions.

PAI command

No PAI command is provided.
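Although no PAI command exists for this component, the join semantics can be sketched in Python (a hypothetical `join` helper over lists of dicts, with a single equality condition; the real component supports multiple conditions):

```python
def join(left, right, key, how="inner"):
    """Merge two row lists on left[key] == right[key]."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    out = []
    for l in left:
        matches = index.get(l[key], [])
        if matches:
            out.extend({**l, **m} for m in matches)   # matched: merge fields
        elif how in ("left", "full"):
            out.append(dict(l))                        # keep unmatched left rows
    if how in ("right", "full"):
        left_keys = {l[key] for l in left}
        out.extend(dict(r) for r in right if r[key] not in left_keys)
    return out
```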

Merge columns

Merge the data in two tables by column. The two tables must have the same number of rows.

Parameter settings

Select input columns from the left table, and then select input columns from the right table.

  • The two tables selected must have the same number of rows.

  • The names of output columns selected from the left and right tables cannot be the same.

  • When selecting an output field column, you can change the output field name.

  • If no output columns are selected from the left or right table, the full table is output by default. If Whether to Automatically Rename the Output Column is selected, the duplicate column is renamed and then output.

PAI command

PAI -name AppendColumns -project algo_public \
    -DoutputTableColNames="petal_length,petal_width,petal_length2,petal_width2" \
    -DautoRenameCol="false" \
    -DoutputTableName="pai_temp_770_6840_1" \
    -DinputTableNames="iris_twopartition,iris_twopartition" \
    -DinputPartitionsInfoList="dt=20150125/dp=20150124;dt=20150124/dp=20150123" \
    -DselectedColNamesList="petal_length,petal_width;sepal_length,sepal_width";
  • name: Component name.

  • project: Project name, used to specify the space of the algorithm. The default value is algo_public. If you change the name, the system reports an error.

  • outputTableColNames: Names of the columns in the output table, separated by commas (,). If autoRenameCol is set to true, this parameter is invalid.

  • autoRenameCol: (optional) Indicates whether to automatically rename the columns in the output table. If the value is true, the columns are renamed. If the value is false, the columns are not renamed. The default value is false.

  • outputTableName: Name of the output table.

  • inputTableNames: Name of the input table. If multiple input tables are provided, use commas (,) to separate the names.

  • inputPartitionsInfoList: (optional) List of the selected partitions corresponding to the input table. Partitions of the same table are separated by commas (,) and partitions of different tables are separated by semicolons (;).

  • selectedColNamesList: Names of columns selected for input. Names of columns in the same table are separated by commas (,) and names of columns in different tables are separated by semicolons (;).
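The column-wise merge, the row-count check, and the auto-rename rule can be sketched as follows (a hypothetical `append_columns` helper; the numeric suffix mirrors names such as petal_length2 in the command above):

```python
def append_columns(tables, selected, auto_rename=False):
    """Column-wise merge of row-aligned tables (lists of dicts)."""
    n = len(tables[0])
    if any(len(t) != n for t in tables):
        raise ValueError("all input tables must have the same number of rows")
    out = [{} for _ in range(n)]
    for ti, (table, cols) in enumerate(zip(tables, selected)):
        for i, row in enumerate(table):
            for c in cols:
                name = c
                if name in out[i]:                 # duplicate output column
                    if not auto_rename:
                        raise ValueError("duplicate output column: " + c)
                    name = f"{c}{ti + 1}"          # e.g. petal_length -> petal_length2
                out[i][name] = row[c]
    return out
```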

UNION

Merges the data in two tables by row. The number and types of the fields selected for output from the left and right tables must be the same. This component integrates the functions of UNION and UNION ALL.

Parameter settings

Adjust the parameters as follows:


  • During the union operation, the numbers of columns selected from the left and right tables must be the same, and the types of corresponding columns must be the same.

  • You can manually enter the filtering condition for the selected fields in the condition box based on requirements (full table by default). The following operators are supported: “=”, “!=”, “>”, “<”, “>=”, “<=”, “like”, and “rlike”.

  • Distinct is selected by default. If it is selected, duplicate rows in the output table are removed (UNION); if it is cleared, duplicates are kept (UNION ALL).

Select the columns to union from the left table and from the right table.

PAI command

No PAI command is provided.
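Although no PAI command exists for this component, the UNION/UNION ALL semantics can be sketched in Python (a hypothetical `union` helper that maps right-table columns onto left-table column names positionally):

```python
def union(left_rows, right_rows, distinct=True):
    """Row-wise merge of two row lists; column counts must match."""
    cols_l, cols_r = list(left_rows[0]), list(right_rows[0])
    if len(cols_l) != len(cols_r):
        raise ValueError("column counts must match")
    mapping = dict(zip(cols_r, cols_l))   # align right columns positionally
    merged = list(left_rows) + [
        {mapping[k]: v for k, v in r.items()} for r in right_rows
    ]
    if distinct:                          # UNION; otherwise UNION ALL
        seen, out = set(), []
        for r in merged:
            key = tuple(sorted(r.items()))
            if key not in seen:
                seen.add(key)
                out.append(r)
        return out
    return merged
```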

Append ID columns

Appends an ID column as the first column of the data table and saves the result as a new table. The type of the appended ID column is bigint.

Parameter settings

Skipped

PAI command

PAI -name AppendId -project algo_public \
    -DIDColName="append_id" \
    -DoutputTableName="test_11" \
    -DinputTableName="bank_data" \
    -DselectedColNames="age,campaign,cons_conf_idx,cons_price_idx,emp_var_rate,euribor3m,nr_employed,pdays,poutcome,previous,y";
  • name: Component name.

  • project: Project name, used to specify the space of the algorithm. The default value is algo_public. If you change the name, the system reports an error.

  • IDColName: Name of the appended ID column, which is numbered 0, 1, 2, 3, and so on.

  • outputTableName: Name of the output table.

  • inputTableName: Name of the input table.

  • selectedColNames: Names of the fields to be retained. If multiple fields are selected, separate them with commas (,).
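The semantics can be sketched as follows (a hypothetical `append_id` helper; IDs are 0-based, matching the IDColName description above):

```python
def append_id(rows, id_col="append_id", selected=None):
    """Prepend a 0-based ID column to each retained row."""
    out = []
    for i, row in enumerate(rows):
        kept = {c: row[c] for c in (selected or row.keys())}
        out.append({id_col: i, **kept})   # ID goes first
    return out
```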

Split

Background

Splits an input table or partition into two output tables at a specified ratio.

Algorithm component

  • The Split component has two output ports.

    split_icon_intro

  • As shown in the figure, if the ratio is 0.6, the output table on the left contains 60% of the data, and the output table on the right contains the remaining 40%.

    splig_param_setting

PAI command

pai -name split -project algo_public \
    -DinputTableName=wbpc \
    -Doutput1TableName=wpbc_split1 \
    -Doutput2TableName=wpbc_split2 \
    -Dfraction=0.25;

Parameter settings

Parameter Description Option Default value
inputTableName (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions used for training in the input table, in the format of Partition_name=value. The multilevel format is name1=value1/name2=value2. If multiple partitions are specified, use commas (,) to separate them. NA All partitions in the input table
output1TableName (Required) Name of output table 1 NA NA
output1TablePartition (Optional) Partitions in output table 1 NA Output table 1 is not partitioned.
output2TableName (Required) Name of output table 2 NA NA
output2TablePartition (Optional) Partitions in output table 2 NA Output table 2 is not partitioned.
fraction (Required) Proportion of data allocated to output table 1 after splitting (0,1) NA
lifecycle (Optional) Life cycle of the output table Positive integer in the range of [1,3650] No life cycle for the output table
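The split rule can be sketched as follows (a hypothetical `split` helper; like the component, output table 1 receives the `fraction` share of the rows):

```python
import random

def split(rows, fraction, seed=None):
    """Randomly partition rows into two tables."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]   # (table 1, table 2)
```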

Fill in missing values

Replaces null values or specified values with the maximum, minimum, average, or a custom value. A missing-value configuration list defines which values in the input table are treated as missing and how they are filled.

  • A numeric missing value can be replaced with the maximum, minimum, average, or a custom value.
  • A character-type null or empty string can be replaced with a custom value.
  • The missing values to be filled can be null values, empty strings, or custom values.
  • If empty strings are treated as missing values, the type of the target column must be string.
  • For numeric columns, you can define a custom replacement value or directly choose the maximum, minimum, or average value.

UI for filling in missing values


The two input ports correspond to the following parameters:

  • inputTableName: Name of the input table to be filled.

  • inputParaTableName: Name of the configuration input table, that is, the parameter list generated by the missing value filling node. Based on this parameter, configuration parameters in one table can be applied to a new table.

The two output ports correspond to the following parameters:

  • outputTableName: Name of the output table after being filled.

  • outputParaTableName: Name of the output parameter table, which is applied to other datasets.

UI of missing value filling parameters

On the UI, the filled field, original value, and new value correspond to the column name, original value, and new value parts of the configs parameter.


PAI command

PAI -name FillMissingValues -project algo_public \
    -Dconfigs="poutcome,null-empty,testing" \
    -DoutputTableName="test_3" \
    -DinputPartitions="pt=20150501" \
    -DinputTableName="bank_data_partition";

Algorithm parameters

Parameter Description Option Default value
inputTableName (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions in the training input table. NA The entire table is selected by default.
outputTableName (Required) Name of the output table NA NA
configs (Required) Configuration of missing value filling, for example: “col1,null,3.14;col2,empty,hello;col3,empty-null,world”, where null indicates a null value and empty indicates an empty string. If empty strings are treated as missing values, the type of the target column must be string. To fill with the maximum, minimum, or average value, use the variables max, min, and mean. To fill with a custom value, use the user-defined format, for example, “col4,user-defined,str,str123”. NA NA
outputParaTableName (Required) Configuration output table NA NA
inputParaTableName (Optional) Configuration input table NA No input by default
lifecycle (Optional) Life cycle of the output table NA This parameter is not set by default.
coreNum (Optional) Number of cores NA By default, the value is automatically assigned.
memSizePerCore (Optional) Memory size used by each core NA By default, the value is automatically assigned.
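The filling strategies for a single column can be sketched as follows (a hypothetical `fill_missing` helper; the real component parses the configs string and handles multiple columns and types):

```python
def fill_missing(rows, col, strategy, custom=None):
    """Fill None values in one column: 'mean', 'max', 'min', or 'user-defined'."""
    present = [r[col] for r in rows if r[col] is not None]
    if strategy == "mean":
        value = sum(present) / len(present)
    elif strategy == "max":
        value = max(present)
    elif strategy == "min":
        value = min(present)
    else:                       # user-defined replacement
        value = custom
    return [{**r, col: value if r[col] is None else r[col]} for r in rows]
```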

Example

Test data

SQL statement for data generation

drop table if exists fill_missing_values_test_input;
create table fill_missing_values_test_input(
    col_string string,
    col_bigint bigint,
    col_double double,
    col_boolean boolean,
    col_datetime datetime);
insert overwrite table fill_missing_values_test_input
select
    *
from
(
    select
        '01' as col_string,
        10 as col_bigint,
        10.1 as col_double,
        True as col_boolean,
        cast('2016-07-01 10:00:00' as datetime) as col_datetime
    from dual
    union all
    select
        cast(null as string) as col_string,
        11 as col_bigint,
        10.2 as col_double,
        False as col_boolean,
        cast('2016-07-02 10:00:00' as datetime) as col_datetime
    from dual
    union all
    select
        '02' as col_string,
        cast(null as bigint) as col_bigint,
        10.3 as col_double,
        True as col_boolean,
        cast('2016-07-03 10:00:00' as datetime) as col_datetime
    from dual
    union all
    select
        '03' as col_string,
        12 as col_bigint,
        cast(null as double) as col_double,
        False as col_boolean,
        cast('2016-07-04 10:00:00' as datetime) as col_datetime
    from dual
    union all
    select
        '04' as col_string,
        13 as col_bigint,
        10.4 as col_double,
        cast(null as boolean) as col_boolean,
        cast('2016-07-05 10:00:00' as datetime) as col_datetime
    from dual
    union all
    select
        '05' as col_string,
        14 as col_bigint,
        10.5 as col_double,
        True as col_boolean,
        cast(null as datetime) as col_datetime
    from dual
) tmp;

Input data description

col_string col_bigint col_double col_boolean col_datetime
04 13 10.4 NULL 2016-07-05 10:00:00
02 NULL 10.3 true 2016-07-03 10:00:00
03 12 NULL false 2016-07-04 10:00:00
NULL 11 10.2 false 2016-07-02 10:00:00
01 10 10.1 true 2016-07-01 10:00:00
05 14 10.5 true NULL

Running command

drop table if exists fill_missing_values_test_input_output;
drop table if exists fill_missing_values_test_input_model_output;
PAI -name FillMissingValues \
    -project algo_public \
    -Dconfigs="col_double,null,mean;col_string,null-empty,str_type_empty;col_bigint,null,max;col_boolean,null,true;col_datetime,null,2016-07-06 10:00:00" \
    -DoutputParaTableName="fill_missing_values_test_input_model_output" \
    -Dlifecycle="28" \
    -DoutputTableName="fill_missing_values_test_input_output" \
    -DinputTableName="fill_missing_values_test_input";
drop table if exists fill_missing_values_test_input_output_using_model;
drop table if exists fill_missing_values_test_input_output_using_model_model_output;
PAI -name FillMissingValues \
    -project algo_public \
    -DoutputParaTableName="fill_missing_values_test_input_output_using_model_model_output" \
    -DinputParaTableName="fill_missing_values_test_input_model_output" \
    -Dlifecycle="28" \
    -DoutputTableName="fill_missing_values_test_input_output_using_model" \
    -DinputTableName="fill_missing_values_test_input";

Running result

fill_missing_values_test_input_output

+------------+------------+------------+-------------+--------------+
| col_string | col_bigint | col_double | col_boolean | col_datetime |
+------------+------------+------------+-------------+--------------+
| 04 | 13 | 10.4 | true | 2016-07-05 10:00:00 |
| 02 | 14 | 10.3 | true | 2016-07-03 10:00:00 |
| 03 | 12 | 10.3 | false | 2016-07-04 10:00:00 |
| str_type_empty | 11 | 10.2 | false | 2016-07-02 10:00:00 |
| 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 |
| 05 | 14 | 10.5 | true | 2016-07-06 10:00:00 |
+------------+------------+------------+-------------+--------------+

fill_missing_values_test_input_model_output

+------------+------------+
| feature | json |
+------------+------------+
| col_string | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty", "replaced_value": "str_type_empty"}} |
| col_bigint | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null", "replaced_value": 14}} |
| col_double | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null", "replaced_value": 10.3}} |
| col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null", "replaced_value": 1}} |
| col_datetime | {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null", "replaced_value": 1467770400000}} |
+------------+------------+

fill_missing_values_test_input_output_using_model

+------------+------------+------------+-------------+--------------+
| col_string | col_bigint | col_double | col_boolean | col_datetime |
+------------+------------+------------+-------------+--------------+
| 04 | 13 | 10.4 | true | 2016-07-05 10:00:00 |
| 02 | 14 | 10.3 | true | 2016-07-03 10:00:00 |
| 03 | 12 | 10.3 | false | 2016-07-04 10:00:00 |
| str_type_empty | 11 | 10.2 | false | 2016-07-02 10:00:00 |
| 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 |
| 05 | 14 | 10.5 | true | 2016-07-06 10:00:00 |
+------------+------------+------------+-------------+--------------+

fill_missing_values_test_input_output_using_model_model_output

+------------+------------+
| feature | json |
+------------+------------+
| col_string | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty", "replaced_value": "str_type_empty"}} |
| col_bigint | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null", "replaced_value": 14}} |
| col_double | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null", "replaced_value": 10.3}} |
| col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null", "replaced_value": 1}} |
| col_datetime | {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null", "replaced_value": 1467770400000}} |
+------------+------------+

Normalization

  • Normalize one or multiple columns in a table and save the generated data in a new table.

  • Linear function transformation is supported, and the expression is y=(x-MinValue)/(MaxValue-MinValue), where MaxValue and MinValue respectively indicate the maximum value and minimum value of the samples.

  • You can choose whether to retain the original columns. If you select the corresponding check box, the original columns are retained, and processed columns are renamed.

  • Click Select Field and select the columns to be normalized. The double and bigint types are supported.
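The min-max formula and the keepOriginal behavior can be sketched as follows (an illustrative `normalize` helper, not the PAI implementation; nulls pass through unchanged):

```python
def normalize(rows, cols, keep_original=False):
    """Min-max scale: y = (x - min) / (max - min)."""
    out = [dict(r) for r in rows]
    for c in cols:
        vals = [r[c] for r in rows if r[c] is not None]
        lo, hi = min(vals), max(vals)
        # keepOriginal=true: write to a renamed column and keep the original.
        target = f"normalized_{c}" if keep_original else c
        for r in out:
            r[target] = None if r[c] is None else (r[c] - lo) / (hi - lo)
    return out
```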

Normalization UI


The two input ports correspond to the following parameters:

  • inputTableName: Name of the input table to be normalized.

  • inputParaTableName: Name of the configuration input table, that is, the parameter list generated by the normalization node. This parameter can be used to apply configurations of one table to a new table.

The two output ports correspond to the following parameters:

  • outputTableName: Name of the output table after normalization.

  • outputParaTableName: Name of the output parameter table, which is applied to other datasets.

Normalization parameter UI

The keepOriginal parameter specifies whether to retain the original columns.


PAI command

PAI -name Normalize -project algo_public \
    -DkeepOriginal="true" \
    -DoutputTableName="test_4" \
    -DinputPartitions="pt=20150501" \
    -DinputTableName="bank_data_partition" \
    -DselectedColNames="emp_var_rate,euribor3m";

Algorithm parameters

Parameter Description Option Default value
inputTableName (Required) Name of the input table NA NA
selectedColNames (Optional) Names of the columns selected from the input table NA All columns are selected by default.
inputTablePartitions (Optional) Partitions in the training input table. NA The entire table is selected by default.
outputTableName (Required) Name of the output table NA NA
outputParaTableName (Required) Configuration output table NA NA
inputParaTableName (Optional) Configuration input table NA No input by default
keepOriginal (Optional) Indicates whether to retain the original columns. If keepOriginal=true, the original columns are retained and the processed columns are renamed with the normalized_ prefix. If keepOriginal=false, the processed columns keep their original names and the original values are replaced. NA Default value: false
lifecycle (Optional) Life cycle of the output table NA This parameter is not set by default.
coreNum (Optional) Number of cores NA By default, the value is automatically assigned.
memSizePerCore (Optional) Memory size used by each core NA By default, the value is automatically assigned.

Example

Test data

SQL statement for data generation

drop table if exists normalize_test_input;
create table normalize_test_input(
    col_string string,
    col_bigint bigint,
    col_double double,
    col_boolean boolean,
    col_datetime datetime);
insert overwrite table normalize_test_input
select
    *
from
(
    select
        '01' as col_string,
        10 as col_bigint,
        10.1 as col_double,
        True as col_boolean,
        cast('2016-07-01 10:00:00' as datetime) as col_datetime
    from dual
    union all
    select
        cast(null as string) as col_string,
        11 as col_bigint,
        10.2 as col_double,
        False as col_boolean,
        cast('2016-07-02 10:00:00' as datetime) as col_datetime
    from dual
    union all
    select
        '02' as col_string,
        cast(null as bigint) as col_bigint,
        10.3 as col_double,
        True as col_boolean,
        cast('2016-07-03 10:00:00' as datetime) as col_datetime
    from dual
    union all
    select
        '03' as col_string,
        12 as col_bigint,
        cast(null as double) as col_double,
        False as col_boolean,
        cast('2016-07-04 10:00:00' as datetime) as col_datetime
    from dual
    union all
    select
        '04' as col_string,
        13 as col_bigint,
        10.4 as col_double,
        cast(null as boolean) as col_boolean,
        cast('2016-07-05 10:00:00' as datetime) as col_datetime
    from dual
    union all
    select
        '05' as col_string,
        14 as col_bigint,
        10.5 as col_double,
        True as col_boolean,
        cast(null as datetime) as col_datetime
    from dual
) tmp;

Input data description

col_string col_bigint col_double col_boolean col_datetime
01 10 10.1 true 2016-07-01 10:00:00
NULL 11 10.2 false 2016-07-02 10:00:00
02 NULL 10.3 true 2016-07-03 10:00:00
03 12 NULL false 2016-07-04 10:00:00
04 13 10.4 NULL 2016-07-05 10:00:00
05 14 10.5 true NULL

Running command

drop table if exists normalize_test_input_output;
drop table if exists normalize_test_input_model_output;
PAI -name Normalize \
    -project algo_public \
    -DoutputParaTableName="normalize_test_input_model_output" \
    -Dlifecycle="28" \
    -DoutputTableName="normalize_test_input_output" \
    -DinputTableName="normalize_test_input" \
    -DselectedColNames="col_double,col_bigint" \
    -DkeepOriginal="true";
drop table if exists normalize_test_input_output_using_model;
drop table if exists normalize_test_input_output_using_model_model_output;
PAI -name Normalize \
    -project algo_public \
    -DoutputParaTableName="normalize_test_input_output_using_model_model_output" \
    -DinputParaTableName="normalize_test_input_model_output" \
    -Dlifecycle="28" \
    -DoutputTableName="normalize_test_input_output_using_model" \
    -DinputTableName="normalize_test_input";

Running result

normalize_test_input_output

+------------+------------+------------+-------------+--------------+-----------------------+-----------------------+
| col_string | col_bigint | col_double | col_boolean | col_datetime | normalized_col_bigint | normalized_col_double |
+------------+------------+------------+-------------+--------------+-----------------------+-----------------------+
| 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | 0.0 | 0.0 |
| NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 | 0.25 | 0.2499999999999989 |
| 02 | NULL | 10.3 | true | 2016-07-03 10:00:00 | NULL | 0.5000000000000022 |
| 03 | 12 | NULL | false | 2016-07-04 10:00:00 | 0.5 | NULL |
| 04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 | 0.75 | 0.7500000000000011 |
| 05 | 14 | 10.5 | true | NULL | 1.0 | 1.0 |
+------------+------------+------------+-------------+--------------+-----------------------+-----------------------+

normalize_test_input_model_output

+------------+------------+
| feature | json |
+------------+------------+
| col_bigint | {"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}} |
| col_double | {"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}} |
+------------+------------+

normalize_test_input_output_using_model

+------------+------------+------------+-------------+--------------+
| col_string | col_bigint | col_double | col_boolean | col_datetime |
+------------+------------+------------+-------------+--------------+
| 01 | 0.0 | 0.0 | true | 2016-07-01 10:00:00 |
| NULL | 0.25 | 0.2499999999999989 | false | 2016-07-02 10:00:00 |
| 02 | NULL | 0.5000000000000022 | true | 2016-07-03 10:00:00 |
| 03 | 0.5 | NULL | false | 2016-07-04 10:00:00 |
| 04 | 0.75 | 0.7500000000000011 | NULL | 2016-07-05 10:00:00 |
| 05 | 1.0 | 1.0 | true | NULL |
+------------+------------+------------+-------------+--------------+

normalize_test_input_output_using_model_model_output

  1. +------------+------------+
  2. | feature | json |
  3. +------------+------------+
  4. | col_bigint | {"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}} |
  5. | col_double | {"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}} |
  6. +------------+------------+
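The min/max parameters stored in the model table above can be replayed outside PAI. The following is a minimal Python sketch of that replay; the `apply_normalize` helper is illustrative only, not a PAI API:

```python
import json

# Parameters exactly as stored in the model table above for col_bigint.
model_json = '{"name": "normalize", "type": "bigint", "paras": {"min": 10, "max": 14}}'
paras = json.loads(model_json)["paras"]

def apply_normalize(value, paras):
    """Min-max normalization: (x - min) / (max - min); NULLs pass through."""
    if value is None:
        return None
    return (value - paras["min"]) / (paras["max"] - paras["min"])

print([apply_normalize(v, paras) for v in [10, 11, None, 14]])  # [0.0, 0.25, None, 1.0]
```

Applying the stored parameters to new data in this way reproduces the normalized_col_bigint column shown in the result tables above.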

Standardization

  • Standardize one or multiple columns in a table and save the generated data in a new table.

  • The formula used for standardization is (X - Mean)/(Standard deviation).

  • Mean: Average value of samples.

  • Standard deviation: Sample standard deviation. Because the samples are only a subset of the whole population, dividing by n would underestimate the population deviation; the sample standard deviation therefore divides by n - 1 (Bessel's correction), which moderately enlarges the estimate so that it is closer to the overall level.

  • Formula of the sample standard deviation: s = sqrt( (1 / (n - 1)) * Σ (Xi - X̄)² ), where the sum runs over i = 1, …, n and X̄ indicates the average value of the samples X1, X2, …, Xn.

  • You can choose whether to retain the original columns. If you select the corresponding check box, the original columns are retained, and processed columns are renamed.

  • Click Select Field and select the columns to be standardized. Columns of the double and bigint types are supported.
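As a check on the formula above, the following minimal Python sketch (the `standardize` helper is illustrative, not the PAI implementation) reproduces the stdized_col_bigint values from this section's example, with NULLs passed through:

```python
import math

def standardize(values):
    """(X - mean) / sample standard deviation; NULLs are skipped and passed through."""
    xs = [v for v in values if v is not None]
    mean = sum(xs) / len(xs)
    # Sample standard deviation: divide by n - 1 (Bessel's correction).
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))
    return [None if v is None else (v - mean) / std for v in values]

# col_bigint values from the example below; mean = 12, sample std ≈ 1.5811.
print(standardize([10, 11, None, 12, 13, 14]))
```

The first value comes out to about -1.2649, matching the example output table.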

Standardization UI

image

The two input ports correspond to the following parameters respectively:

  • inputTableName: Name of the input table to be standardized.

  • inputParaTableName: Name of the input parameter table, that is, the parameter list generated by a previous standardization node. The configuration recorded in this table can be applied to a new table.

The two output ports correspond to the following parameters respectively:

  • outputTableName: Name of the output table after standardization.

  • outputParaTableName: Name of the output parameter table, which can be applied to other datasets.

UI of standardization parameters

The keepOriginal parameter specifies whether to retain the original columns.

image

PAI command

  1. PAI -name Standardize -project algo_public
  2. -DkeepOriginal="false"
  3. -DoutputTableName="test_5"
  4. -DinputPartitions="pt=20150501"
  5. -DinputTableName="bank_data_partition"
  6. -DselectedColNames="euribor3m,pdays";

Algorithm parameters

Parameter Description Option Default value
inputTableName (Required) Name of the input table NA NA
selectedColNames (Optional) Names of the columns selected from the input table NA All columns are selected by default.
inputTablePartitions (Optional) Partitions in the training input table. NA The entire table is selected by default.
outputTableName (Required) Name of the output table NA NA
outputParaTableName (Required) Configuration output table NA NA
inputParaTableName (Optional) Configuration input table NA No input by default
keepOriginal (Optional) Indicates whether to retain the original columns. If keepOriginal=true, the original columns are retained and processed columns are renamed with the stdized_ prefix. If keepOriginal=false, processed columns replace the original columns and keep their original names. NA Default value: false
lifecycle (Optional) Life cycle of the output table NA This parameter is not set by default.
coreNum (Optional) Number of cores NA By default, the value is automatically assigned.
memSizePerCore (Optional) Memory size used by each core NA By default, the value is automatically assigned.

Example

Test data

SQL statement for data generation

  1. drop table if exists standardize_test_input;
  2. create table standardize_test_input(
  3. col_string string,
  4. col_bigint bigint,
  5. col_double double,
  6. col_boolean boolean,
  7. col_datetime datetime);
  8. insert overwrite table standardize_test_input
  9. select
  10. *
  11. from
  12. (
  13. select
  14. '01' as col_string,
  15. 10 as col_bigint,
  16. 10.1 as col_double,
  17. True as col_boolean,
  18. cast('2016-07-01 10:00:00' as datetime) as col_datetime
  19. from dual
  20. union all
  21. select
  22. cast(null as string) as col_string,
  23. 11 as col_bigint,
  24. 10.2 as col_double,
  25. False as col_boolean,
  26. cast('2016-07-02 10:00:00' as datetime) as col_datetime
  27. from dual
  28. union all
  29. select
  30. '02' as col_string,
  31. cast(null as bigint) as col_bigint,
  32. 10.3 as col_double,
  33. True as col_boolean,
  34. cast('2016-07-03 10:00:00' as datetime) as col_datetime
  35. from dual
  36. union all
  37. select
  38. '03' as col_string,
  39. 12 as col_bigint,
  40. cast(null as double) as col_double,
  41. False as col_boolean,
  42. cast('2016-07-04 10:00:00' as datetime) as col_datetime
  43. from dual
  44. union all
  45. select
  46. '04' as col_string,
  47. 13 as col_bigint,
  48. 10.4 as col_double,
  49. cast(null as boolean) as col_boolean,
  50. cast('2016-07-05 10:00:00' as datetime) as col_datetime
  51. from dual
  52. union all
  53. select
  54. '05' as col_string,
  55. 14 as col_bigint,
  56. 10.5 as col_double,
  57. True as col_boolean,
  58. cast(null as datetime) as col_datetime
  59. from dual
  60. ) tmp;

Input data description

col_string col_bigint col_double col_boolean col_datetime
01 10 10.1 true 2016-07-01 10:00:00
NULL 11 10.2 false 2016-07-02 10:00:00
02 NULL 10.3 true 2016-07-03 10:00:00
03 12 NULL false 2016-07-04 10:00:00
04 13 10.4 NULL 2016-07-05 10:00:00
05 14 10.5 true NULL

Running command

  1. drop table if exists standardize_test_input_output;
  2. drop table if exists standardize_test_input_model_output;
  3. PAI -name Standardize
  4. -project algo_public
  5. -DoutputParaTableName="standardize_test_input_model_output"
  6. -Dlifecycle="28"
  7. -DoutputTableName="standardize_test_input_output"
  8. -DinputTableName="standardize_test_input"
  9. -DselectedColNames="col_double,col_bigint"
  10. -DkeepOriginal="true";
  11. drop table if exists standardize_test_input_output_using_model;
  12. drop table if exists standardize_test_input_output_using_model_model_output;
  13. PAI -name Standardize
  14. -project algo_public
  15. -DoutputParaTableName="standardize_test_input_output_using_model_model_output"
  16. -DinputParaTableName="standardize_test_input_model_output"
  17. -Dlifecycle="28"
  18. -DoutputTableName="standardize_test_input_output_using_model"
  19. -DinputTableName="standardize_test_input";

Running result

standardize_test_input_output

  1. +------------+------------+------------+-------------+--------------+--------------------+--------------------+
  2. | col_string | col_bigint | col_double | col_boolean | col_datetime | stdized_col_bigint | stdized_col_double |
  3. +------------+------------+------------+-------------+--------------+--------------------+--------------------+
  4. | 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | -1.2649110640673518 | -1.2649110640683832 |
  5. | NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 | -0.6324555320336759 | -0.6324555320341972 |
  6. | 02 | NULL | 10.3 | true | 2016-07-03 10:00:00 | NULL | 0.0 |
  7. | 03 | 12 | NULL | false | 2016-07-04 10:00:00 | 0.0 | NULL |
  8. | 04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 | 0.6324555320336759 | 0.6324555320341859 |
  9. | 05 | 14 | 10.5 | true | NULL | 1.2649110640673518 | 1.2649110640683718 |
  10. +------------+------------+------------+-------------+--------------+--------------------+--------------------+

standardize_test_input_model_output

  1. +------------+------------+
  2. | feature | json |
  3. +------------+------------+
  4. | col_bigint | {"name": "standardize", "type":"bigint", "paras":{"mean":12, "std": 1.58113883008419}} |
  5. | col_double | {"name": "standardize", "type":"double", "paras":{"mean":10.3, "std": 0.1581138830082909}} |
  6. +------------+------------+
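The stored mean/std parameters can likewise be replayed on new data. The following minimal Python sketch is illustrative; `apply_standardize` is not a PAI API:

```python
import json

# Parameters exactly as stored in the model table above for col_bigint.
model_json = '{"name": "standardize", "type": "bigint", "paras": {"mean": 12, "std": 1.58113883008419}}'
paras = json.loads(model_json)["paras"]

def apply_standardize(value, paras):
    """(x - mean) / std; NULLs pass through."""
    if value is None:
        return None
    return (value - paras["mean"]) / paras["std"]

print(apply_standardize(10, paras))  # ≈ -1.2649, as in the result table below
```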

standardize_test_input_output_using_model

  1. +------------+------------+------------+-------------+--------------+
  2. | col_string | col_bigint | col_double | col_boolean | col_datetime |
  3. +------------+------------+------------+-------------+--------------+
  4. | 01 | -1.2649110640673515 | -1.264911064068383 | true | 2016-07-01 10:00:00 |
  5. | NULL | -0.6324555320336758 | -0.6324555320341971 | false | 2016-07-02 10:00:00 |
  6. | 02 | NULL | 0.0 | true | 2016-07-03 10:00:00 |
  7. | 03 | 0.0 | NULL | false | 2016-07-04 10:00:00 |
  8. | 04 | 0.6324555320336758 | 0.6324555320341858 | NULL | 2016-07-05 10:00:00 |
  9. | 05 | 1.2649110640673515 | 1.2649110640683716 | true | NULL |
  10. +------------+------------+------------+-------------+--------------+

standardize_test_input_output_using_model_model_output

  1. +------------+------------+
  2. | feature | json |
  3. +------------+------------+
  4. | col_bigint | {"name": "standardize", "type":"bigint", "paras":{"mean":12, "std": 1.58113883008419}} |
  5. | col_double | {"name": "standardize", "type":"double", "paras":{"mean":10.3, "std": 0.1581138830082909}} |
  6. +------------+------------+

Type conversion

Convert the types of table fields.

PAI command

  1. PAI -name type_transform
  2. -project algo_public
  3. -DinputTable="pai_dense"
  4. -DselectedCols="gail,loss,work_year"
  5. -Dpre_type="double"
  6. -Dnew_type="bigint"
  7. -DoutputTable="pai_temp_2250_20272_1"
  8. -Dlifecycle="28";

Algorithm parameters

Parameter Description Option Default value
inputTable (Required) Name of the input table NA NA
inputTablePartitions (Optional) Partitions used for training in the input table, in the format of Partition_name=value.The multilevel format is name1=value1/name2=value2. If multiple partitions are specified, use commas (,) to separate them. NA All partitions in the input table
outputTable (Required) Name of the type conversion result table NA NA
selectedCols (Required) Feature columns that require type conversion. All selected columns must be of the same type. NA NA
pre_type Original field type. It must be the same as the type of the fields selected in selectedCols; otherwise, a field type inconsistency error is reported. NA NA
new_type New field type set by the user NA NA
lifecycle (Optional) Life cycle of the outputTable. Default value: 7 NA 7
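The pre_type consistency check can be pictured with a small Python sketch. Both `check_pre_type` and the schema dict are illustrative assumptions, not PAI APIs:

```python
def check_pre_type(schema, selected_cols, pre_type):
    """Raise if any selected field's type differs from pre_type,
    mirroring the field type inconsistency error described above."""
    for col in selected_cols:
        if schema[col] != pre_type:
            raise ValueError(
                f"field type inconsistency: {col} is {schema[col]}, expected {pre_type}")

# Hypothetical column types for the pai_dense table used in the PAI command above.
schema = {"gail": "double", "loss": "double", "work_year": "double"}
check_pre_type(schema, ["gail", "loss", "work_year"], "double")  # passes silently
```

Selecting a bigint pre_type against these double columns would raise the error instead.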

Use demo

  1. Drag a data table reading component and configure the data table pai_dense as follows:

    image

    image

  2. Drag a type conversion component and, in the parameter configuration column on the right, select the features to be converted. For example, in the following figure, three columns whose original data type is double are selected (the original type must match the type of the selected fields) and converted into the bigint type.

    image

    image

  3. Right-click the component and select Run. The field types are changed in the output table.

    image

KV2Table

Convert data in the specified kv (key:value) format into the common table format. Each key is converted into a column name of the table, and the corresponding value becomes the value of that column in the corresponding row.

KV table format definition: The key is the column name index, and the value type can be bigint or double. This sparse format can serve as input to algorithm components such as logistic regression and linear regression. The key type can also be string. A custom key_map table (the mapping between column names and keys) can be fed into this component. Whether or not a key_map table is provided, the component outputs the mapping between column names and keys used in the conversion.

kv
1:10;2:20;3:30

KeyMap table format definition: A three-column table that contains the mapping between column names and indexes, plus type information. The col_name, col_index, and col_datatype columns must all be of the string type, and col_datatype defaults to double if it is not provided.

col_name col_index col_datatype
col1 1 bigint
col2 2 double
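The conversion of one kv record can be sketched in Python. The sketch below is illustrative, not the component's implementation; it assumes the default delimiters (":" and ",") and the kv_&lt;key&gt; column naming shown in the example output, and sums duplicate keys as described in the important notes:

```python
def kv_to_row(kv, kv_col_name="kv", kv_delimiter=":", item_delimiter=","):
    """Parse one kv record into a {column name: value} row; duplicate keys are summed."""
    row = {}
    for item in kv.split(item_delimiter):
        key, value = item.split(kv_delimiter)
        name = kv_col_name + "_" + key
        row[name] = row.get(name, 0.0) + float(value)
    return row

print(kv_to_row("1:1,2:2,3:-3.3"))  # {'kv_1': 1.0, 'kv_2': 2.0, 'kv_3': -3.3}
```

This matches the first row of the output table in the example below.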

PAI command

  1. PAI -name KVToTable
  2. -project algo_public
  3. -DinputTableName=test
  4. -DoutputTableName=test_out
  5. -DoutputKeyMapTableName=test_keymap_out
  6. -DkvColName=kv;

Parameter description

Parameter Description Value range Required/Optional, default value
inputTableName Name of the input table NA (Required) The table cannot be empty.
kvColName kv column name Select only one column Required
outputTableName Name of the output table NA Required
outputKeyMapTableName Name of the output index table NA Required
inputKeyMapTableName Name of the input index table NA Optional, “”
appendColName Name of the appended column Multiple columns can be selected Optional, “”
inputTablePartitions Input table partitions NA Optional, “”
kvDelimiter Delimiter between the key and the value NA (Optional) Default value: “:”
itemDelimiter Delimiter between key-value pairs NA (Optional) Default value: “,”
top1200 Indicates whether to keep only the first 1200 columns NA (Optional) The default value is true. If the value is false, an error is reported when the number of columns exceeds the maximum value.
lifecycle Life cycle An integer equal to or more than -1 (Optional) Default value: -1, which indicates that life cycle is not set.
coreNum Number of cores An integer more than 0 (Optional) Default value: -1, which indicates that the number of started instances is determined based on the input data volume.
memSizePerCore Memory size (100,64*1024) (Optional) Default value: -1, which indicates that the memory size is determined based on the input data volume.

Example

Data generation

  1. drop table if exists test;
  2. create table test as
  3. select
  4. *
  5. from
  6. (
  7. select '1:1,2:2,3:-3.3' as kv from dual
  8. union all
  9. select '1:10,2:20,3:-33.3' as kv from dual
  10. ) tmp;

Running command

  1. PAI -name KVToTable
  2. -project algo_public
  3. -DinputTableName=test
  4. -DoutputTableName=test_out
  5. -DoutputKeyMapTableName=test_keymap_out
  6. -DkvColName=kv;

Output description

Output table

kv_1 kv_2 kv_3
1.0 2.0 -3.3
10.0 20.0 -33.3

Output mapping table

col_name col_index col_type
kv_1 1 double
kv_2 2 double
kv_3 3 double

Algorithm scale

The converted columns include those derived from the kv column plus the appended columns; the kv-derived columns are output before the appended columns. The maximum ODPS column number is 1200. When the number of converted columns exceeds this limit, the first 1200 columns are output if top1200 is set to true; otherwise, an error is reported.

The number of data records cannot exceed 100 million.

Important notes

  • If the key_map table is input, the converted columns are the keys that exist in both the key_map table and the key-value table.
  • The converted column type must be numeric.
  • If the key_map table is input, the type of the converted key column must be the same as that in the key_map table. If the key_map table is not input, the type of the converted key column is double.
  • If the key_map table is not input, the name of a converted key column is the kv column name + "_" + key (for example, kv_1). If the key contains any of "%,&()*+-./;<>=?", an error is reported.
  • Column name conflict: an error is reported if an appended column is specified and its name is the same as that of a converted key column.
  • If a row contains duplicate keys, the values are added.
  • If the column name contains more than 128 characters, only the first 128 characters are kept.

Table2KV

Convert a common table into the kv (key:value) format. Null values in the original table are not reflected in the new table. You can specify columns to be retained in the new table; they are written to the output in their original format.
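The row-level conversion can be sketched in Python as the inverse of KV2Table. The sketch below is illustrative, not the component's implementation; it assumes the default delimiters (":" and ",") and an index mapping like the key_map table in Example 2, so columns missing from the index and NULL values are dropped:

```python
def table_to_kv(row, col_index, kv_delimiter=":", item_delimiter=","):
    """Serialize one row to the kv format; columns absent from the
    index mapping and NULL values are dropped."""
    items = [f"{col_index[name]}{kv_delimiter}{value}"
             for name, value in row.items()
             if name in col_index and value is not None]
    return item_delimiter.join(items)

# Index mapping as in the example output key_map table (col0 is not mapped).
index = {"col1": 1, "col2": 2}
print(table_to_kv({"col0": 1, "col1": 1.1, "col2": 2}, index))  # 1:1.1,2:2
```

This reproduces the first output row (rowid 0) of the example below.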

PAI command

  1. PAI -name TableToKV
  2. -project algo_public
  3. -DinputTableName=maple_tabletokv_basic_input
  4. -DoutputTableName=maple_tabletokv_basic_output
  5. -DselectedColNames=col0,col1,col2
  6. -DappendColNames=rowid;

Parameter description

Parameter Description Value range Required/Optional, default value
inputTableName Input table Table name Required
inputTablePartitions Partitions used for training in the input table, in the format of partition_name=value. The multilevel partition name format is name1=value1/name2=value2. If you specify multiple partitions, separate them with a comma (,). NA (Optional) Default: all partitions
selectedColNames Names of the selected columns, which must be of the bigint or double type Column name (Optional) The entire table is selected by default.
appendColNames Column to be retained. It is written in the output table in its original format. Column name (Optional) Default value: null
outputTableName Name of the output KV table Table name Required
kvDelimiter Delimiter between the key and the value Character (Optional) Default value: “:”
itemDelimiter Delimiter between key-value pairs Character (Optional) Default value: “,”
convertColToIndexId Indicates whether to convert columns into numbers. 1: yes; 0: no 0 or 1 (Optional) Default value: 0
inputKeyMapTableName Input index table name. This parameter is valid only when convertColToIndexId=1. If this parameter is not set, a set of IDs are automatically generated. Table name (Optional) Default value: “”
outputKeyMapTableName Output index table name. This parameter is required only when convertColToIndexId=1. Table name Specified by convertColToIndexId
lifecycle (Optional) Life cycle of the output table Positive integer No life cycle
coreNum Number of nodes Used together with the memSizePerCore parameter, positive integer in the range of [1, 9999] (Optional) Automatically calculated by default
memSizePerCore Memory size of each node, in MB Positive integer in the range of [1024, 64*1024] (Optional) Automatically calculated by default

Example

Data generation

rowid kv
0 col0:1,col1:1.1,col2:2
1 col0:0,col1:1.2,col2:3
2 col0:1,col1:2.3
3 col0:1,col1:0.0,col2:4

Running command

  1. PAI -name TableToKV
  2. -project algo_public
  3. -DinputTableName=maple_tabletokv_basic_input
  4. -DoutputTableName=maple_tabletokv_basic_output
  5. -DselectedColNames=col0,col1,col2
  6. -DappendColNames=rowid;

Output description

Output table:
maple_tabletokv_basic_output

rowid:bigint kv:string
0 1:1.1,2:2
1 1:1.2,2:3
2 1:2.3
3 1:0.0,2:4

Example 2

Running command

  1. PAI -name TableToKV
  2. -project projectxlib4 -DinputTableName=maple_tabletokv_basic_input
  3. -DoutputTableName=maple_tabletokv_basic_output
  4. -DselectedColNames=col0,col1,col2 -DappendColNames=rowid
  5. -DconvertColToIndexId=1
  6. -DinputKeyMapTableName=maple_test_tabletokv_basic_map_input
  7. -DoutputKeyMapTableName=maple_test_tabletokv_basic_map_output;

Output description

Output table:
maple_test_tabletokv_basic_map_output

col_name:string col_index:string col_datatype:string
col1 1 bigint
col2 2 double

Important notes

  • If the key_map table is input, the converted columns are the keys that exist in both the key_map table and the key-value table.

  • If the key_map table is input and its column types differ from those in the input table, the output key_map table uses the types specified by the user.

  • The type of the columns that need to be converted into kv in the input table must be bigint or double.
