Feature engineering is essential to model training in machine learning. Feature engineering helps you find feature crosses for model optimization. Generally, algorithm engineers need to spend a lot of effort on feature engineering. Machine Learning Platform for AI (PAI) provides the Auto Feature Cross component to help you find effective feature crosses. You can combine the features that form the feature crosses to optimize your model. This topic describes how to use the Auto Feature Cross component.

Flowchart

The Auto Feature Cross component is developed based on the deep learning framework TensorFlow. This component involves intensive parallel computing under the hood and requires GPU resources. Only the China (Beijing) and China (Shanghai) regions support the Auto Feature Cross component.

The following figure shows the process of automatic feature engineering.
Note You can create an experiment that contains the Auto Feature Cross component by using a template in the Templates section of the Home page. In this case, you must set the Output path parameter of the Auto Feature Cross component to the endpoint of an Object Storage Service (OSS) bucket within your account.

1. Authorize PAI to access your GPU resources and OSS bucket

  1. Log on to the PAI console and go to the Visualized Modeling (Machine Learning Studio) page. For more information, see Use DataWorks tasks to schedule experiments in Machine Learning Studio.
  2. On the page that appears, click Settings in the left-side navigation pane. On the Settings page, select Authorize Machine Learning Platform for AI to access my OSS resources and enable GPU computing on the General tab.

2. Bin data

The Auto Feature Cross component supports only the BIGINT data type. However, raw data in most business scenarios is of the DOUBLE data type, as shown in the following figure.

DOUBLE data type

In this case, you must use the SQL Script or One Hot Encoding component to convert the raw data from the DOUBLE type to the BIGINT type. In addition, you must use the Feature Discretization component to discretize feature values that fall in different intervals into different bins. The following figure shows the data after binning.

Data after binning
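If you use the SQL Script component for the conversion, a statement similar to the following sketch bins a DOUBLE feature into BIGINT values. The table name heart_disease_raw and the bin boundaries are hypothetical examples; use the intervals that the Feature Discretization component produces for your own data.

SELECT
  CAST(
    CASE
      WHEN oldpeak < 1.0 THEN 0
      WHEN oldpeak < 2.0 THEN 1
      WHEN oldpeak < 3.0 THEN 2
      WHEN oldpeak < 4.0 THEN 3
      ELSE 4
    END AS BIGINT
  ) AS oldpeak  -- The binned feature value of the BIGINT type.
FROM heart_disease_raw;  -- Hypothetical table that stores the raw DOUBLE data.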

3. Determine the range of feature values

The basic process of feature crossing includes the following steps: generate feature vectors, create feature crosses, verify the feature crosses, and then select effective feature crosses. Before you generate feature vectors, you must know the maximum feature value in each feature space. Examples:
  • The maximum value of the thalach feature is 4.
  • The maximum value of the oldpeak feature is 3.
  • The maximum value of the ca feature is 4.
Determine the range of feature values

You can execute the following SQL statement to query the maximum value of each feature:

select max(feature) from table;

In the sample data of this topic, the maximum value after binning is 4 for all features.

Maximum values of all features after binning
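To query the maximum values of all feature columns in one statement, you can run SQL similar to the following sketch. The table name heart_disease_binned is a hypothetical example; replace it with the name of your binned table.

SELECT
  MAX(sex) AS max_sex,
  MAX(cp) AS max_cp,
  MAX(fbs) AS max_fbs,
  MAX(restecg) AS max_restecg,
  MAX(exang) AS max_exang,
  MAX(slop) AS max_slop,
  MAX(thal) AS max_thal,
  MAX(age) AS max_age,
  MAX(trestbps) AS max_trestbps,
  MAX(chol) AS max_chol,
  MAX(thalach) AS max_thalach,
  MAX(oldpeak) AS max_oldpeak,
  MAX(ca) AS max_ca
FROM heart_disease_binned;  -- Hypothetical table that stores the binned data.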

Therefore, you must set the Feature length parameter of the Auto Feature Cross component to [5,5,5,5,5,5,5,5,5,5,5,5,5], as shown in the following figure. Each value of 5 indicates the left-closed, right-open interval [0,5), which covers the maximum feature value 4.

Feature length

4. Prepare training data and test data

In this topic, the training data is the same as the test data. In actual use, the test data can differ from the training data, provided that the fields in the test data are the same as the fields in the training data.
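For example, if you want to create a separate test table that has the same fields as the training table, you can run SQL similar to the following sketch. The table names heart_disease_train and heart_disease_test are hypothetical examples; in this topic, the test data is simply a copy of the training data.

-- Create a test table that contains the same fields as the training table.
CREATE TABLE IF NOT EXISTS heart_disease_test AS
SELECT
  sex, cp, fbs, restecg, exang, slop, thal,
  age, trestbps, chol, thalach, oldpeak, ca,
  ifhealth  -- The label column.
FROM heart_disease_train;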

5. Configure the Auto Feature Cross component

  • Set the parameters on the Fields Setting tab
    In the Auto Feature Cross component, the input port on the left is used to import training data and the input port on the right is used to import test data.
    • Feature selection: the feature columns that are selected for feature crossing.
    • if sparse data: specifies whether the input data is sparse. By default, this check box is not selected, which means that the data is dense.
    • Label: the label column that is used to determine whether a feature cross is effective.
    • Output path: the endpoint of the OSS bucket that stores the generated model.
  • Set the parameters on the Parameters Setting tab
    • Ergodic number: the number of iterations.
    • Feature order: the maximum number of features in each feature cross. For example, a value of 3 indicates that each feature cross involves a maximum of three features.
You can also run the following PAI command:
PAI -name fives_ext -project algo_public     
    -DlabelColName="ifhealth"   // The label column that is used to determine whether a feature cross is effective.    
    -Dmetric_file="metric_log.log" // The name of the system log file.    
    -Dfeature_meta="[5,5,5,5,5,5,5,5,5,5,5,5,5]"  // The value ranges of the features. This setting corresponds to the Feature length parameter.    
    -DtrainTable="odps://Project name/tables/Table name"  // The table that stores the training data.      
    -Dbuckets="oss://{oss_bucket}/"     
    -Dthreshold="0.5"     
    -Dk="3"     
    -DossHost="oss-cn-beijing-internal.aliyuncs.com" // The endpoint of OSS in the region where OSS is activated.    
    -Demb_dims="16"     
    -DenableSparse="0"  // Specifies whether the input data is sparse. A value of 0 indicates dense data.     
    -Dtemp_anneal_steps="30000"     
    -DfeatureColName="sex,cp,fbs,restecg,exang,slop,thal,age,trestbps,chol,thalach,oldpeak,ca"    // The feature columns that are selected for feature crossing.    
    -DtestTable="odps://Project name/tables/Table name"  // The table that stores the test data.     
    -Darn="acs:ram::********:role/aliyunodpspaidefaultrole"  // The Alibaba Cloud Resource Name (ARN) of the RAM role.    
    -Depochs="1500"     
    -DcheckpointDir="oss://{oss_bucket}/{path}/";  // The OSS path that stores the model checkpoints.

View the feature crosses

In the root directory of your OSS bucket, find the interactions.json file. The root directory is specified by the Dbuckets parameter.

The file shows the effective feature crosses that are created by the Auto Feature Cross component.

Feature crosses in the interactions.json file
You can create combined features based on the feature crosses in the file, as shown in the SQL sketch after the following examples:
  • [0,1] indicates that the cross of the first and second features is effective. The indices refer to features in the same order as the feature columns in the input table.
  • [8, 6, 5] indicates that the cross of the ninth, seventh, and sixth features is effective.
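For example, if the feature columns appear in the order that is specified by the DfeatureColName parameter, the [0,1] cross corresponds to the sex and cp features. The following SQL sketch shows one way to generate the combined feature. The table name heart_disease_binned and the output column name sex_cp_cross are hypothetical examples.

-- Combine the first and second features into a single crossed feature.
SELECT
  *,
  CONCAT(CAST(sex AS STRING), '_', CAST(cp AS STRING)) AS sex_cp_cross
FROM heart_disease_binned;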