Feature engineering is an essential part of model training in machine learning. It helps identify feature crosses that benefit models, but it typically requires a large amount of manual effort from algorithm engineers. Machine Learning Platform for AI (PAI) provides the Auto Feature Cross component to help you find effective feature crosses. This topic describes how to use the Auto Feature Cross component.

Flowchart

The Auto Feature Cross component is developed based on the deep learning framework TensorFlow. The component performs intensive parallel computing at the underlying layer and therefore requires GPU resources. The Auto Feature Cross component is supported only in the China (Beijing) and China (Shanghai) regions.

The following flowchart shows the process of automatic feature engineering.
Note: You can create an experiment that contains the Auto Feature Cross component by using a template in the Templates section on the Home page. In this case, you must set the Output path parameter of the Auto Feature Cross component to the URL of the Object Storage Service (OSS) bucket under your account.

1. Authorize PAI to access your GPU resources and OSS bucket

  1. Log on to the PAI console. In the left-side navigation pane, choose Model Training > Studio-Modeling Visualization. On the page that appears, find the project in which you want to perform operations and click Machine Learning in the Operation column.
  2. On the page that appears, click Settings in the left-side navigation pane. On the Settings page, select the Authorize Machine Learning Platform for AI to access my OSS resources and Pay by used check boxes in the General pane.

2. Bin data

The Auto Feature Cross component supports only the BIGINT data type. However, raw data in most business scenarios is of the DOUBLE data type, as shown in the following figure.

DOUBLE data type

In this case, you must use the SQL Script or One Hot Encoding component to convert the raw data from the DOUBLE type to the BIGINT type. In addition, you must use the Feature Discretization component to bin the feature values into discrete intervals. The following figure shows the data after binning.

Data after binning
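
The Feature Discretization component handles the binning for you. If you want to see what the conversion looks like in SQL, the following is a minimal sketch for a single DOUBLE feature that you could run in the SQL Script component. The table name heart_disease_raw and the bin boundaries are assumptions for illustration; oldpeak is one of the feature columns used in this topic.

select
  cast(
    case
      when oldpeak < 1.0 then 0
      when oldpeak < 2.0 then 1
      when oldpeak < 3.0 then 2
      when oldpeak < 4.0 then 3
      else 4
    end as bigint
  ) as oldpeak -- BIGINT bin index in the left-closed, right-open range [0,5)
from heart_disease_raw; -- hypothetical table name

In most cases, you use the Feature Discretization component to produce the bins. The sketch only illustrates the BIGINT output that the Auto Feature Cross component expects.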

3. Determine the range of feature values

The basic process of feature crossing includes the following steps: generate feature vectors for features, create feature crosses, verify the feature crosses, and then select effective feature crosses. Before you generate feature vectors, you must know the maximum feature value in each feature space. Example:
  • The maximum value of the thalach feature is 4.
  • The maximum value of the oldpeak feature is 3.
  • The maximum value of the ca feature is 4.
Determine the range of feature values

You can execute the following SQL statement to obtain the maximum value of each feature:

select max(feature) from table;

In the sample data of this topic, the maximum value of all features after binning is 4.

Maximum value of all features after binning
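
To check several feature columns in one query, you can run a statement similar to the following in the SQL Script component. The table name heart_disease_binned is an assumption; the column names are feature columns from the sample data.

select
  max(thalach) as max_thalach,
  max(oldpeak) as max_oldpeak,
  max(ca)      as max_ca
from heart_disease_binned; -- hypothetical table name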

You must specify the Feature length parameter of the Auto Feature Cross component in the format shown in the following figure. In the format, 5 indicates a left-closed, right-open interval [0,5), which includes the maximum value 4. For the 13 features in the sample data, this corresponds to the feature_meta value [5,5,5,5,5,5,5,5,5,5,5,5,5] in the PAI command that is shown later in this topic.

Feature length

4. Prepare training data and test data

In this topic, the training data is the same as the test data. In practice, the test data can differ from the training data, provided that the fields in the test data are the same as the fields in the training data.
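
If you want to use different test data, one way to split a single table is a random split in the SQL Script component, as in the following sketch. The table names and the 80/20 ratio are assumptions; both output tables keep the same fields, which satisfies the preceding requirement.

-- Materialize the random number first so that the two splits are complementary.
create table heart_disease_split as
select *, rand() as rnd from heart_disease_binned; -- hypothetical source table

create table heart_disease_train as
select * from heart_disease_split where rnd < 0.8;

create table heart_disease_test as
select * from heart_disease_split where rnd >= 0.8;

You do not need to select the extra rnd column in the Feature selection parameter of the component.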

5. Configure the Auto Feature Cross component

  • Set the parameters on the Fields Setting tab
    In the Auto Feature Cross component, the input port on the left is used to import training data, and the input port on the right is used to import test data. You must set the following parameters on the Fields Setting tab based on your needs:
    • Feature selection: the feature fields that will be involved in feature crossing.
    • if sparse data: specifies whether the input data is sparse. By default, this check box is cleared, which indicates that the data is dense.
    • Label: the label column that is used to determine whether a feature cross is effective.
    • Output path: the URL of the OSS bucket that stores the generated model.
  • Set the parameters on the Parameters Setting tab
    • Ergodic number: the number of iterations.
    • Feature order: the maximum number of features in each feature cross. For example, 3 indicates that each feature cross involves a maximum of three features.
In this example, the Auto Feature Cross component runs the following PAI command to create feature crosses:
PAI -name fives_ext -project algo_public
    -DlabelColName="ifhealth"   // The label column that is used to determine whether a feature cross is effective.
    -Dmetric_file="metric_log.log" // The name of the system log file.
    -Dfeature_meta="[5,5,5,5,5,5,5,5,5,5,5,5,5]" // The value range of each feature. 5 indicates the left-closed, right-open interval [0,5).
    -DtrainTable="odps://Project name/tables/Table name" // The training table.
    -Dbuckets="oss://{oss_bucket}/" // The root directory of the OSS bucket that stores the output.
    -Dthreshold="0.5"
    -Dk="3" // The maximum number of features in each feature cross.
    -DossHost="oss-cn-beijing-internal.aliyuncs.com" // The OSS endpoint of the region in which OSS is activated.
    -Demb_dims="16"
    -DenableSparse="0" // Specifies whether the input data is sparse. A value of 0 indicates dense data.
    -Dtemp_anneal_steps="30000"
    -DfeatureColName="sex,cp,fbs,restecg,exang,slop,thal,age,trestbps,chol,thalach,oldpeak,ca" // The feature fields that will be involved in feature crossing.
    -DtestTable="odps://Project name/tables/Table name" // The test table.
    -Darn="acs:ram::********:role/aliyunodpspaidefaultrole" // The ARN of the RAM role.
    -Depochs="1500" // The number of training epochs.
    -DcheckpointDir="oss://{oss_bucket}/{path}/"; // The OSS path that stores the model checkpoints.

View the feature crosses

Find the interactions.json file in the root directory of your OSS bucket. The root directory is the path that you specified in the buckets parameter.

The file shows the effective feature crosses that are created by the Auto Feature Cross component.

Feature crosses in the interactions.json file

You can create crossed features based on the feature crosses in the file, as shown in the sketch after the following examples:
  • [0,1] indicates that the cross of the first and second features is effective. The feature indexes follow the order of the feature columns in the input table.
  • [8, 6, 5] indicates that the cross of the ninth, seventh, and sixth features is effective.
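
For example, if the feature columns of the input table are in the same order as the featureColName parameter in the preceding command (sex, cp, fbs, ...), the cross [0,1] corresponds to sex and cp. The following is a minimal SQL sketch that materializes this crossed feature; the table name heart_disease_binned and the output column name sex_cp_cross are assumptions for illustration.

select
  *,
  concat(cast(sex as string), '_', cast(cp as string)) as sex_cp_cross -- hypothetical crossed feature
from heart_disease_binned; -- hypothetical table name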