Feature engineering is essential to model training in machine learning, and finding feature crosses that benefit a model is a key part of it. Algorithm engineers generally spend a lot of effort on feature engineering. Machine Learning Platform for AI (PAI) provides the Auto Feature Cross component to help you find effective feature crosses. You can then combine the features in these crosses to optimize your model. This topic describes how to use the Auto Feature Cross component.

Flowchart

The Auto Feature Cross component is developed based on the deep learning framework TensorFlow. This component involves intensive parallel computing at the underlying layer and requires GPU resources. Only the China (Beijing) and China (Shanghai) regions support the Auto Feature Cross component.

The following figure shows the process of automatic feature engineering.

Process of automatic feature engineering

Note: You can create an experiment that contains the Auto Feature Cross component by using a template in the Templates section on the Home page. In this case, you must set the Output path parameter of the Auto Feature Cross component to the URL of an Object Storage Service (OSS) bucket within your account.

1.Authorize PAI to access your GPU resources and OSS bucket

  1. Log on to the PAI console. In the left-side navigation pane, choose Model Training > Visualized Modeling (Machine Learning Studio). On the page that appears, find the project in which you want to perform operations and click Machine Learning in the Actions column.
  2. On the page that appears, click Settings in the left-side navigation pane. On the Settings page, select Authorize Machine Learning Platform for AI to access my OSS resources and Pay by used on the General tab.

2.Bin data

The Auto Feature Cross component supports only the BIGINT data type. However, raw data in most business scenarios is of the DOUBLE data type, as shown in the following figure.

DOUBLE data type

In this case, you must use the SQL Script or One Hot Encoding component to convert the raw data from the DOUBLE type to the BIGINT type. In addition, you must use the Feature Discretization component to assign the feature values in each interval to a separate bin. The following figure shows the data after binning.

Data after binning
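
To make the idea of binning concrete, the following Python sketch maps DOUBLE values to integer bin indexes by using equal-width intervals. It is only an illustration: the function name equal_width_bins and the sample oldpeak values are hypothetical, and the Feature Discretization component may use a different binning strategy.

```python
from typing import List


def equal_width_bins(values: List[float], num_bins: int) -> List[int]:
    """Map each value to an integer bin index in the range [0, num_bins).

    Assumption: equal-width intervals between the minimum and maximum value.
    """
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so they all fall into the first bin.
        return [0 for _ in values]
    width = (hi - lo) / num_bins
    # The maximum value would otherwise land in bin num_bins, so clamp it
    # into the last valid bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


# Hypothetical raw values for one feature column.
oldpeak = [0.0, 1.4, 2.3, 3.6, 6.2]
bins = equal_width_bins(oldpeak, 5)
print(bins)
```

With five bins, every result is an integer from 0 to 4, which matches the value range used in the rest of this topic.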

3.Determine the range of feature values

The basic process of feature crossing includes the following steps: generate feature vectors for features, create feature crosses, verify the feature crosses, and then select effective feature crosses. Before you generate feature vectors, you must know the maximum feature value in each feature space. Examples:
  • The maximum value of the thalach feature is 4.
  • The maximum value of the oldpeak feature is 3.
  • The maximum value of the ca feature is 4.
Determine the range of feature values

You can execute the following SQL statement to obtain the maximum value of each feature:

select max(feature) from table;
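
If the table has many feature columns, you can generate one statement that returns all maximum values at a time. The following Python sketch builds such a statement. The helper name build_max_query and the table name heart_binned are hypothetical and are used only for illustration.

```python
from typing import List


def build_max_query(columns: List[str], table: str) -> str:
    """Build a SQL statement that returns the maximum value of each column."""
    if not columns:
        raise ValueError("at least one column is required")
    parts = [f"max({c}) as max_{c}" for c in columns]
    return "select " + ", ".join(parts) + f" from {table};"


# heart_binned is a placeholder table name.
query = build_max_query(["thalach", "oldpeak", "ca"], "heart_binned")
print(query)
```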

In the sample data of this topic, the maximum value of all features after binning is 4.

Maximum value of all features after binning

You must set the Feature length parameter of the Auto Feature Cross component in the format shown in the following figure. In the format, 5 indicates a left-closed, right-open interval [0,5) that includes 4.

Feature length
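
Because each interval is left-closed and right-open, the length of a feature is its maximum value plus 1. The following Python sketch derives the list of lengths from the maximum values. The helper name feature_lengths is hypothetical, and the sketch assumes that bin values start from 0.

```python
from typing import List


def feature_lengths(max_values: List[int]) -> str:
    """Return the lengths as a bracketed list, one length per feature.

    A maximum value m needs the interval [0, m + 1), so the length is m + 1.
    """
    for m in max_values:
        if m < 0:
            raise ValueError("maximum values must be non-negative")
    return "[" + ",".join(str(m + 1) for m in max_values) + "]"


# The sample data has 13 features, each with a maximum value of 4.
lengths = feature_lengths([4] * 13)
print(lengths)
```

For the sample data, the result is a list of thirteen 5s, which is the value that the sample PAI command in this topic passes as -Dfeature_meta.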

4.Prepare training data and test data

In this topic, the training data is the same as the test data. In actual use, the test data can differ from the training data, provided that the fields in the test data are the same as the fields in the training data.
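
Before you run the component, you may want to confirm that the two tables have the same fields. The following Python sketch compares two lists of field names. The helper name field_differences is hypothetical, and how you obtain the field names from your tables is outside the scope of this sketch.

```python
from typing import List, Tuple


def field_differences(
    train_fields: List[str], test_fields: List[str]
) -> Tuple[List[str], List[str]]:
    """Return (fields missing from test data, fields only in test data)."""
    train, test = set(train_fields), set(test_fields)
    return sorted(train - test), sorted(test - train)


train_fields = ["sex", "cp", "fbs", "ifhealth"]
test_fields = ["sex", "cp", "ifhealth"]
missing, extra = field_differences(train_fields, test_fields)
print(missing, extra)
```

Both lists are empty when the test data has exactly the same fields as the training data.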

5.Fields Setting tab

  • Set the parameters on the Fields Setting tab
    In the Auto Feature Cross component, the input port on the left is used to import training data and the input port on the right is used to import test data.
    • Feature selection: the feature columns that are selected for feature crossing.
    • if sparse data: specifies whether the input data is in the sparse format. This check box is cleared by default, which means that the data is in the dense format.
    • Label: the label column that is used to determine whether a feature cross is effective.
    • Output path: the URL of the OSS bucket that stores the generated model.
  • Set the parameters on the Parameters Setting tab
    • Ergodic number: the number of iterations.
    • Feature order: the maximum number of features in each feature cross. For example, 3 indicates that each feature cross involves a maximum of three features.
You can also run the following PAI command:
PAI -name fives_ext -project algo_public     
    -DlabelColName="ifhealth"   // The label column that is used to determine whether a feature cross is effective.    
    -Dmetric_file="metric_log.log" // The name of the system log file.    
    -Dfeature_meta="[5,5,5,5,5,5,5,5,5,5,5,5,5]"     
    -DtrainTable="odps://Project name/tables/Table name"      
    -Dbuckets="oss://{oss_bucket}/"     
    -Dthreshold="0.5"     
    -Dk="3"     
    -DossHost="oss-cn-beijing-internal.aliyuncs.com" // The region in which OSS is activated.    
    -Demb_dims="16"     
    -DenableSparse="0"     
    -Dtemp_anneal_steps="30000"     
    -DfeatureColName="sex,cp,fbs,restecg,exang,slop,thal,age,trestbps,chol,thalach,oldpeak,ca"    // The feature columns that are selected for feature crossing.    
    -DtestTable="odps://Project name/tables/Table name"
    -Darn="acs:ram::********:role/aliyunodpspaidefaultrole"  // The Alibaba Cloud Resource Name (ARN) of the RAM role.
    -Depochs="1500"     
    -DcheckpointDir="oss://{oss_bucket}/{path}/";
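
If you generate the command from a script, a small helper can assemble it from a dictionary of parameters. The following Python sketch shows one way to do this. The helper name build_pai_command is hypothetical, only a few of the parameters from the preceding command are included, and the sketch only builds the command text. It does not submit the command.

```python
from typing import Dict


def build_pai_command(name: str, project: str, params: Dict[str, str]) -> str:
    """Assemble a PAI command with one -D option per parameter."""
    lines = [f"PAI -name {name} -project {project}"]
    for key, value in params.items():
        lines.append(f'    -D{key}="{value}"')
    # The statement ends with a semicolon, as in the preceding command.
    return "\n".join(lines) + ";"


# Parameter names and values are copied from the preceding command.
params = {
    "labelColName": "ifhealth",
    "feature_meta": "[5,5,5,5,5,5,5,5,5,5,5,5,5]",
    "k": "3",
    "epochs": "1500",
}
command = build_pai_command("fives_ext", "algo_public", params)
print(command)
```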

View the feature crosses

In the root directory of your OSS bucket, find the interactions.json file. The root directory is the path that is specified by the -Dbuckets parameter.

The file shows the effective feature crosses that are created by the Auto Feature Cross component.

Feature crosses in the interactions.json file

You can create other feature crosses based on the feature crosses in the file. Examples:
  • [0,1] indicates that the cross of the first and second features is effective. The indexes start from 0 and follow the order of the features in the input table.
  • [8, 6, 5] indicates that the cross of the ninth, seventh, and sixth features is effective.
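
The following Python sketch shows one way to interpret these index lists and to build a crossed feature from binned values. It rests on several assumptions: the file content is a JSON array of index lists, the features appear in the input table in the same order as in -DfeatureColName, and each crossed value is encoded by a mixed-radix scheme chosen here for illustration. The helper names cross_names and cross_value and the sample row are hypothetical.

```python
import json
from typing import List, Sequence

# Assumption: same order as the feature columns in the input table.
FEATURES = ["sex", "cp", "fbs", "restecg", "exang", "slop", "thal",
            "age", "trestbps", "chol", "thalach", "oldpeak", "ca"]


def cross_names(cross: Sequence[int], features: Sequence[str]) -> List[str]:
    """Translate zero-based feature indexes into feature names."""
    return [features[i] for i in cross]


def cross_value(row: Sequence[int], cross: Sequence[int],
                lengths: Sequence[int]) -> int:
    """Combine the binned values of the crossed features into one integer.

    Each distinct combination of bin values maps to a distinct integer.
    """
    value = 0
    for i in cross:
        if not 0 <= row[i] < lengths[i]:
            raise ValueError(f"value of feature {i} is out of range")
        value = value * lengths[i] + row[i]
    return value


# Assumed file content: a JSON array of index lists.
interactions = json.loads("[[0, 1], [8, 6, 5]]")
names = [cross_names(c, FEATURES) for c in interactions]

# A hypothetical binned row and the feature lengths from the sample data.
row = [1, 3, 0, 1, 0, 2, 4, 2, 3, 1, 2, 0, 1]
lengths = [5] * 13
crossed = [cross_value(row, c, lengths) for c in interactions]
print(names, crossed)
```

You can add each crossed value to your table as a new feature column and then retrain your model.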