
Artificial Intelligence Recommendation: Data fields, feature fields, and FG features in EasyRec

Last Updated: Oct 29, 2024

The EasyRec algorithm framework involves three key concepts: data field, feature field, and Feature Generator (FG) feature. This topic describes these concepts and their differences.

Figure: Role of FG in offline training and online inference of EasyRec

Terms

FeatureStore: FeatureStore is a feature management tool provided by Platform for AI (PAI). You can use this tool to store and manage features in offline and online systems. For more information, see Overview of FeatureStore.

FG: FG is designed to ensure consistency between offline and online feature processing. FG can generate the following features: ID features, raw features, combo features, lookup features, match features, sequence features, and overlap features. Lookup features and sequence features are commonly used. For more information, see RTP FG.

User features: The user features shown in the preceding figure include those obtained from both offline and online systems. In the lower-left corner of the preceding figure, the PAI-Rec engine reads user features by using a FeatureStore SDK.

Item features: The EasyRec processor (the scoring service) reads item features by using a FeatureStore SDK.

Assemble inputs: The user features in requests and the item features in the cache are assembled for feature generation. The generated features are then fed into a TensorFlow model for scoring.

EasyRec processor: The EasyRec processor is deployed on Elastic Algorithm Service (EAS) of PAI and is used to score recommendation, advertising, and search models. It can load EasyRec deep learning models and apply performance optimizations.

easyrec.conf: This file describes the data fields, feature types, and network structures used by EasyRec deep learning models.

Pipelines

Offline pipeline: FeatureStore uses UserViews (multiple user-side feature views), ItemViews (multiple item-side feature views), and a Label Table (MaxCompute tables with training labels) to construct the model feature table (the training sample table). FG (the MapReduce JAR package fg_on_odps-1.3.59-jar-with-dependencies.jar) works with the fg.json file to transform the model feature table into the result table rank_sample_fg_encoded. The model is trained on PAI based on the easyrec.conf file, exported to and stored in Object Storage Service (OSS), and then loaded as a TensorFlow model for scoring.

Online pipeline: The PAI-Rec engine obtains the user features and the item IDs to be scored (this part of the logic is not shown in the preceding figure) and requests the EasyRec processor to assemble features. The EasyRec processor then uses FG to perform online feature transformation, scores the items, and returns the scores.

1. Data fields and feature fields in the EasyRec configuration file

1.1. Data field configuration: data_config

data_config specifies the names and types of the original data fields in the easyrec.conf file, as well as the methods for imputing missing values in these fields. For more information, see Data fields in EasyRec. The value types of data fields include int, double, and string. Data can be stored in various formats, such as CSV files, MaxCompute tables, and Kafka data streams.

  • Example: data of the key-value type

Missing value imputation: The missing value is imputed with a key-value pair. In this example, the key-value pair is -1024:0.

Logic for offline training and online prediction: The missing value is imputed in the TensorFlow model.

input_fields: {
  input_name: "prop_kv"
  input_type: STRING
  default_val: "-1024:0"
}
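
The following is a minimal plain-Python sketch of this imputation logic. It assumes a comma-separated key:value string format, and the helper name parse_kv_field is hypothetical; EasyRec's internal implementation differs.

from typing import Optional

DEFAULT_VAL = "-1024:0"  # matches default_val in the example above

def parse_kv_field(raw_value: Optional[str]) -> dict:
    """Impute a missing prop_kv value, then parse "k1:v1,k2:v2" pairs."""
    if not raw_value:
        raw_value = DEFAULT_VAL  # missing value imputation
    pairs = {}
    for item in raw_value.split(","):
        key, _, value = item.partition(":")
        pairs[int(key)] = float(value)
    return pairs

print(parse_kv_field(None))           # {-1024: 0.0}
print(parse_kv_field("3:0.5,7:1.2"))  # {3: 0.5, 7: 1.2}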

1.2. Feature field configuration: feature_config

data_config mainly defines data fields and labels, whereas feature_config specifies how these data fields are parsed and used by models. For example, a field x of the STRING type can be interpreted in feature_config as an ID feature, tag feature, or sequence feature.

  • Example: ID feature

 features {
   input_names: "user_id"
   feature_type: IdFeature
   embedding_dim: 32
   hash_bucket_size: 100000
 }

Description: The values of user_id are hash-mapped to 100,000 buckets. Each bucket ID is mapped to a 32-dimensional vector through model training.

Logic for offline training and online prediction: The values of user_id are transformed into 32-dimensional vectors in the TensorFlow model.
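
The following is a minimal TensorFlow sketch of this hash-and-embed logic. The sample IDs are made up, and the real EasyRec implementation differs.

import tensorflow as tf

hash_bucket_size = 100000  # matches the config above
embedding_dim = 32

# Hash each string ID into one of 100,000 buckets.
user_ids = tf.constant(["u_1001", "u_2002"])
bucket_ids = tf.strings.to_hash_bucket_fast(user_ids, hash_bucket_size)

# Look up a trainable 32-dimensional embedding for each bucket ID.
embedding_table = tf.Variable(
    tf.random.normal([hash_bucket_size, embedding_dim]))
user_vectors = tf.nn.embedding_lookup(embedding_table, bucket_ids)
print(user_vectors.shape)  # (2, 32)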

  • Example: raw feature

features {
 input_names: "ctr"
 feature_type: RawFeature
 boundaries: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
 embedding_dim: 8
}

Preprocessing: The Binning component of PAI is used to obtain a set of boundaries. These boundaries are then configured for the feature in feature_config.

Logic for offline training and online prediction: The ctr values are bucketized based on the boundaries to generate bucket IDs, and the bucket IDs are then mapped to embedding vectors.
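
The following is a minimal TensorFlow sketch of this bucketization. The sample values are made up, and the real EasyRec implementation differs.

import tensorflow as tf

boundaries = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
embedding_dim = 8

# Bucketize each ctr value; len(boundaries) + 1 = 11 buckets in total.
ctr = tf.constant([0.05, 0.35, 0.95])
bucket_ids = tf.raw_ops.Bucketize(input=ctr, boundaries=boundaries)
# bucket_ids: [0, 3, 9]

# Map each bucket ID to a trainable 8-dimensional embedding.
embedding_table = tf.Variable(
    tf.random.normal([len(boundaries) + 1, embedding_dim]))
ctr_vectors = tf.nn.embedding_lookup(embedding_table, bucket_ids)
print(ctr_vectors.shape)  # (3, 8)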

  • Example: tag feature

features {
  input_names: "tags"
  feature_type: TagFeature
  separator: "|"
  hash_bucket_size: 100000
  embedding_dim: 24
}

For example, the tags of an article can be Entertainment|Funny|Popular, where the vertical bar (|) is used as the separator.

Logic for offline training and online prediction: Hash embedding is performed on tag features in the TensorFlow model to generate embedding vectors. Then, average pooling is performed on the vectors to generate an average embedding vector.
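
The following is a minimal TensorFlow sketch of this tag-feature logic. The sample tags come from the example above; the real EasyRec implementation differs.

import tensorflow as tf

hash_bucket_size = 100000
embedding_dim = 24

# Split the tag string on the separator, then hash each tag into a bucket.
tags = tf.constant("Entertainment|Funny|Popular")
tag_list = tf.strings.split(tags, sep="|")
bucket_ids = tf.strings.to_hash_bucket_fast(tag_list, hash_bucket_size)

# Embed each tag, then average-pool into a single 24-dimensional vector.
embedding_table = tf.Variable(
    tf.random.normal([hash_bucket_size, embedding_dim]))
tag_vectors = tf.nn.embedding_lookup(embedding_table, bucket_ids)
avg_vector = tf.reduce_mean(tag_vectors, axis=0)
print(avg_vector.shape)  # (24,)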

2. Lookup feature transformation in FG

FG supports a variety of feature combinations and transformation methods. The following is an example of lookup feature transformation.

{
  "map": "user:map_brand_click_kv",
  "key": "item:brand",
  "feature_name": "map_brand_click_count",
  "feature_type": "lookup_feature",
  "needDiscrete": false
}

  • Offline data processing: The fg_on_odps-1.3.59-jar-with-dependencies.jar package is required in offline systems. The item feature brand is used as a key to query the value of the user feature map_brand_click_kv, and the obtained value is used as the value of the new feature map_brand_click_count. In this example, map_brand_click_kv indicates the number of user clicks on each brand, and map_brand_click_count indicates the number of user clicks on the brand of the current item.

  • Online prediction: The EasyRec processor (scoring service) uses FG to calculate the value of map_brand_click_count and then sends the value to the TensorFlow model for prediction.
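
The following is a minimal plain-Python sketch of this lookup transformation. The data values are made up, and FG's real implementation differs.

# map_brand_click_kv: the user's click counts per brand (user-side map).
map_brand_click_kv = {"nike": 12.0, "adidas": 3.0}

# brand: the brand of the item being scored (item-side key).
brand = "nike"

# Use the item-side key to look up a value in the user-side map;
# fall back to 0.0 when the brand has never been clicked.
map_brand_click_count = map_brand_click_kv.get(brand, 0.0)
print(map_brand_click_count)  # 12.0, then fed to the TensorFlow model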

3. FAQ

3.1. Where do boundaries come from and what do they mean?

In the following example from the EasyRec configuration file, the configuration for the CTR feature contains the boundaries parameter.

feature_configs: {
 input_names: "CTR"
 feature_type: RawFeature
 boundaries: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
 embedding_dim: 8
}

The value of the boundaries parameter is obtained by using the Binning component of PAI.

In the TensorFlow model, when the CTR feature is discretized, the boundaries are regarded as a series of intervals: (-inf, 0.1), [0.1, 0.2), [0.2, 0.3), [0.3, 0.4), [0.4, 0.5), [0.5, 0.6), [0.6, 0.7), [0.7, 0.8), [0.8, 0.9), [0.9, 1.0), and [1.0, +inf). For example, a CTR value of 0.35 falls into [0.3, 0.4) and is assigned bucket ID 3. After you configure the boundaries parameter, you must also configure the embedding_dim parameter to convert the bucket IDs into embedding vectors.

3.2. How is FG used to perform feature transformation in an online system? How can consistency be ensured between offline and online systems?

The fg.json file describes the feature transformation process. Both offline and online systems run the same FG code with the same fg.json file, which ensures that the feature transformation logic is consistent. Online systems use FG to perform real-time feature transformation inside the EasyRec processor.
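
The following is a minimal plain-Python sketch of this principle, assuming a simplified fg.json that contains only a lookup feature. The function names are hypothetical, and FG's real implementation differs.

import json

# Both the offline sample-generation job and the online scoring service
# load the same fg.json and call the same function, so the feature
# transformation logic cannot diverge between the two systems.
fg_config = json.loads("""
{"features": [{"map": "user:map_brand_click_kv",
               "key": "item:brand",
               "feature_name": "map_brand_click_count",
               "feature_type": "lookup_feature"}]}
""")

def generate_features(inputs, config):
    """Apply every transformation described by the config (lookup only here)."""
    out = {}
    for feat in config["features"]:
        if feat["feature_type"] == "lookup_feature":
            kv_map = inputs[feat["map"].split(":", 1)[1]]  # user-side map
            key = inputs[feat["key"].split(":", 1)[1]]     # item-side key
            out[feat["feature_name"]] = kv_map.get(key, 0.0)
    return out

inputs = {"map_brand_click_kv": {"nike": 12.0}, "brand": "nike"}
print(generate_features(inputs, fg_config))  # {'map_brand_click_count': 12.0}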

3.3. How is missing value imputation implemented for features? Where do I configure missing value imputation?

When you configure the easyrec.conf file, you can set the default_val parameter for each data field in data_config to impute missing values. During model training, missing values are imputed with the value of the default_val parameter when data is read. During online inference, missing values must be imputed before the model can be invoked.

If FG is used, missing value imputation is configured in the fg.json file for both online and offline systems.