FG (FeatureGenerator) is the process of transforming raw data into features for a model. This process ensures that the output from offline and online sample generation is consistent. Feature generation uses various types of FG operators to transform one or more features.
Feature generation focuses only on transformations required for both offline and online sample generation. If a transformation is required only during the offline stage, do not define it as an FG operation. The following diagram shows the position of the FG module within a recommendation system architecture.
The feature generation process consists of a series of feature transformation operators (FG operators). These operators run in parallel according to the topology of a directed acyclic graph (DAG) defined in the configuration file.
Configuration file example
The features list is used to configure the feature operators. Each feature operator must include feature_name and feature_type. For more information about other configuration items, see Built-in feature operators.
The reserves configuration item specifies the fields to pass through during offline tasks. These fields are output directly without any feature transformation.
{
"features": [
{
"feature_name": "goods_id",
"feature_type": "id_feature",
"value_type": "string",
"expression": "item:goods_id",
"default_value": "-1024",
"need_prefix": false
},
{
"feature_name": "color_pair",
"feature_type": "combo_feature",
"value_type": "string",
"expression": ["user:query_color", "item:color"],
"default_value": "",
"need_prefix": false
},
{
"feature_name": "current_price",
"feature_type": "raw_feature",
"value_type": "double",
"expression": "item:current_price",
"default_value": "0",
"need_prefix": false
},
{
"feature_name": "usr_cate1_clk_cnt_1d",
"feature_type": "lookup_feature",
"map": "user:usr_cate1_clk_cnt_1d",
"key": "item:cate1",
"need_discrete": false,
"need_key": false,
"default_value": "0",
"combiner": "max",
"need_prefix": false,
"value_type": "double"
},
{
"feature_name": "recommend_match",
"feature_type": "overlap_feature",
"method": "is_contain",
"query": "user:query_recommend",
"title": "item:recommend",
"default_value": "0"
},
{
"feature_name": "norm_title",
"feature_type": "text_normalizer",
"expression": "item:title",
"max_length": 512,
"parameter": 0,
"remove_space": false,
"is_gbk_input": false,
"is_gbk_output": false
},
{
"feature_name": "title_terms",
"feature_type": "tokenize_feature",
"expression": "feature:norm_title",
"default_value": "",
"vocab_file": "tokenizer.json",
"output_type": "word_id",
"output_delim": ","
},
{
"feature_name": "query_title_match_ratio",
"feature_type": "overlap_feature",
"method": "query_common_ratio",
"query": "user:query_terms",
"title": "feature:title_terms",
"default_value": "0"
},
{
"feature_name": "title_term_match_ratio",
"feature_type": "overlap_feature",
"method": "title_common_ratio",
"query": "user:query_terms",
"title": "feature:title_terms",
"default_value": "0"
},
{
"feature_name": "term_proximity_min_cover",
"feature_type": "overlap_feature",
"method": "proximity_min_cover",
"query": "user:query_terms",
"title": "feature:title_terms",
"default_value": "0"
}
],
"input_alias": {
"non_exist_field1": "exist_field1",
"non_exist_field2": "exist_field2"
},
"reserves": [
"request_id",
"user_id",
"is_click",
"is_pay",
"sample_weight",
"event_unix_time"
]
}
The input_alias configuration item is a dictionary that maps feature input names. It maps a field name that might not exist to an actual field name. Note: The input_alias configuration is supported from version 1.0.0. You can typically skip this configuration.
Use case 1: Setting a shorter alias for a long field name.
Use case 2: Setting an alias for the second instance when a custom feature operator uses the same parameter twice.
The same input field can be reused across different features but not within a single feature transformation. You can configure input_alias to bypass this limitation.
For example, if a custom feature operator requires the same field A for two different input parameters, you can configure two inputs, A and B, and then configure an input_alias entry that maps "B": "A". During execution, the framework replaces the custom feature operator's parameters (A, B) with (A, A).
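The alias mechanism described above can be sketched in a few lines. This is an illustrative sketch, not the actual FG implementation; the function name `resolve_inputs` and the `row` dictionary are hypothetical.

```python
# Illustrative sketch of input_alias resolution (hypothetical helper,
# not FG's actual code): each requested input name is redirected to
# the real field it aliases before the operator runs.
def resolve_inputs(requested, input_alias, row):
    """Return the values for the requested input names, applying aliases."""
    resolved = []
    for name in requested:
        real_name = input_alias.get(name, name)  # fall back to the name itself
        resolved.append(row[real_name])
    return resolved

row = {"A": 3.0}
# The operator asks for (A, B); "B" is aliased to "A", so both
# parameters receive the value of field A.
values = resolve_inputs(["A", "B"], {"B": "A"}, row)
```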
Input domains
An input domain indicates the source entity of the input. The following four types are supported:
user: User-side features, such as user profiles and user-dimension statistical features.
context: Contextual features that change frequently, such as time, location, and weather.
item: Item-side features, such as static content features and item-dimension statistical features.
feature: Indicates that the current input is the output of another feature transformation.
The feature input domain is used to configure the dependencies between feature operators, which form a DAG. The framework executes these feature transformations in parallel according to the topological order. The following figure shows the resulting topology:
By default, the output of intermediate nodes in the DAG is not included in the final FG output. You can use the stub_type configuration item to change this behavior.
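The dependency-driven scheduling described above can be sketched with Kahn's algorithm. This is a simplified illustration, not the FG scheduler: it only scans the expression field for feature: inputs, whereas real operators also declare dependencies through fields such as map, key, query, and title.

```python
from collections import deque

# Illustrative sketch (not the FG scheduler): derive a DAG from inputs
# in the "feature:" domain and compute one valid topological order with
# Kahn's algorithm. Features with no mutual dependency could run in
# parallel at the same depth. Simplified: only the "expression" field
# is scanned for dependencies, and every referenced parent must exist.
def topo_order(features):
    deps = {}
    for f in features:
        inputs = f.get("expression", [])
        if isinstance(inputs, str):
            inputs = [inputs]
        deps[f["feature_name"]] = [
            e.split(":", 1)[1] for e in inputs if e.startswith("feature:")
        ]
    indegree = {name: len(d) for name, d in deps.items()}
    children = {name: [] for name in deps}
    for name, d in deps.items():
        for parent in d:
            children[parent].append(name)
    queue = deque(n for n, k in indegree.items() if k == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    return order

features = [
    {"feature_name": "norm_title", "expression": "item:title"},
    {"feature_name": "title_terms", "expression": "feature:norm_title"},
]
# norm_title must come before title_terms in any valid order.
order = topo_order(features)
```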
Multi-value types and separators
FG supports complex input types, such as Array and Map, which are consistent with the complex types in MaxCompute.
You can use the chr(29) separator for multi-value string features.
For example, in v1^]v2^]v3, ^] represents the multi-value separator. This is a single character with the ASCII code "\x1D", not two characters. To enter this character in Emacs, press C-q C-5. In vi, press C-v C-5.
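A quick way to verify that the separator is the single character chr(29), not the two characters ^ and ]:

```python
# The multi-value separator is the single character chr(29) ("\x1D"),
# which terminals often render as "^]".
MULTI_VAL_SEP = chr(29)  # ASCII 0x1D

raw = MULTI_VAL_SEP.join(["v1", "v2", "v3"])  # "v1^]v2^]v3" on screen
values = raw.split(MULTI_VAL_SEP)
```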
Feature binning (discretization)
The framework supports the following six types of binning operations:
hash_bucket_size: Hashes the feature transformation result and performs a modulo operation.
vocab_list: Converts the feature transformation result into an index in a list.
vocab_dict: Converts the feature transformation result into a value in a dictionary. The value must be convertible to the int64 type.
vocab_file: Reads a vocab_list or vocab_dict from a file.
boundaries: Specifies binning boundaries and converts the feature transformation result into the corresponding bucket ID.
num_buckets: Directly uses the feature transformation result as the binning bucket ID.
hash_bucket_size
Hashes the feature transformation result and performs a modulo operation. This applies to any type of feature value.
Result range: [0, hash_bucket_size). The binning result for an empty feature is hash(default_value) % hash_bucket_size.
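The hash-and-modulo step can be sketched as follows. Note that this is illustrative only: FG uses std::hash or CityHash, while crc32 here is a stand-in, so the bucket IDs will not match FG's actual output.

```python
import zlib

# Illustrative sketch of hash bucketization. crc32 is a stand-in for
# FG's actual hash function (std::hash or CityHash), used only to show
# the hash-then-modulo step and the empty-feature fallback.
def hash_bucketize(value, hash_bucket_size, default_value="default value"):
    if value is None or value == "":
        value = default_value  # empty features fall back to default_value
    return zlib.crc32(str(value).encode("utf-8")) % hash_bucket_size

bucket = hash_bucketize("goods_123", 128000)
# bucket always falls within [0, hash_bucket_size)
```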
{
"hash_bucket_size": 128000,
"default_value": "default value"
}
vocab_list
Performs binning based on a vocabulary list. It maps the input to an index in the vocabulary. The binning result is the index of the feature value in the vocab_list array.
The element type of the vocab_list array must be the same as the configured value_type.
num_oov_bucket: A non-negative integer that specifies the number of out-of-vocabulary buckets. All out-of-vocabulary inputs are assigned IDs in the range [vocabulary_size, vocabulary_size + num_oov_bucket) based on a hash of the input value. You cannot specify a positive num_oov_bucket together with default_bucketize_value.
default_bucketize_value: The integer ID to return for out-of-vocabulary feature values. This cannot be specified if num_oov_bucket is positive. The default value is vocab_list.size().
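The vocab_list lookup and the two mutually exclusive out-of-vocabulary strategies can be sketched as follows. The function is hypothetical and the hash used for OOV buckets is a stand-in, not FG's actual function.

```python
import zlib

# Illustrative sketch of vocab_list bucketization (hypothetical helper).
# crc32 stands in for FG's actual OOV hash function.
def vocab_list_bucketize(value, vocab_list, num_oov_bucket=0,
                         default_bucketize_value=None):
    # The two OOV strategies are mutually exclusive.
    assert not (num_oov_bucket > 0 and default_bucketize_value is not None)
    try:
        return vocab_list.index(value)  # in-vocabulary: index in the list
    except ValueError:
        pass
    size = len(vocab_list)
    if num_oov_bucket > 0:  # hash OOV values into [size, size + num_oov_bucket)
        return size + zlib.crc32(str(value).encode("utf-8")) % num_oov_bucket
    if default_bucketize_value is not None:
        return default_bucketize_value
    return size  # default: vocab_list.size()

vocab = ["", "<OOV>", "token1", "token2"]
```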
{
"vocab_list": [
"",
"<OOV>",
"token1",
"token2",
"token3",
"token4"
],
"num_oov_bucket": 0,
"default_bucketize_value": 1
}
vocab_dict
The binning result is the value in the vocab_dict dictionary that corresponds to the feature value. This supports mapping different feature values to the same binning result.
The key type of the vocab_dict dictionary must be the same as the configured value_type. The values in vocab_dict must be convertible to the int64 type.
num_oov_bucket: A non-negative integer that specifies the number of out-of-vocabulary buckets. All out-of-vocabulary inputs are assigned IDs in the range [vocabulary_size, vocabulary_size + num_oov_bucket) based on a hash of the input value. You cannot specify a positive num_oov_bucket together with default_bucketize_value.
default_bucketize_value: The integer ID to return for out-of-vocabulary feature values. This cannot be specified if num_oov_bucket is positive. The default value is vocab_dict.size().
{
"vocab_dict": {
"token1": 1,
"token2": 2,
"token3": 3,
"token4": 1
},
"num_oov_bucket": 0,
"default_bucketize_value": 4
}
vocab_file
Reads a vocab_list or vocab_dict from a file.
{
"vocab_file": "vocab.txt",
"num_oov_bucket": 0,
"default_bucketize_value": 4
}
vocab_file: The file path. The file contains a vocabulary list, with one vocabulary term per line. You can optionally specify a mapped value. Relative paths are supported. When you deploy the online service, place the file in the same directory as fg.json.
If only a token is present on a line, the token is mapped to its line number, starting from 0. If a value is present, the token and value are separated by a whitespace character, such as a space or a tab. The value must be of the int64 type.
The meanings of num_oov_bucket and default_bucketize_value are the same as described previously.
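The file format described above can be sketched with a small parser. This is an illustrative sketch under the stated format assumptions, not FG's actual loader; the function name `load_vocab_file` is hypothetical.

```python
# Illustrative sketch of vocab_file parsing (hypothetical helper):
# one term per line with an optional whitespace-separated int64 value;
# lines without a value map to their 0-based line number.
def load_vocab_file(path):
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f):
            parts = line.rstrip("\n").split()
            if len(parts) == 1:
                vocab[parts[0]] = line_no        # token only: line number
            elif len(parts) >= 2:
                vocab[parts[0]] = int(parts[1])  # token + explicit value
    return vocab
```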
boundaries
Bins numeric features based on specified boundaries. This method discretizes dense numeric input into buckets defined by the boundaries.
The element type of the boundaries array must be the same as the configured value_type. Buckets are inclusive of the left boundary and exclusive of the right boundary.
For example, boundaries=[0., 1., 2.] generates the buckets (-inf, 0), [0, 1), [1, 2), and [2, +inf).
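This left-inclusive, right-exclusive bucketing matches the behavior of bisect.bisect_right in Python, which makes the bucket-ID assignment easy to check:

```python
import bisect

# boundaries [0., 1., 2.] yield bucket IDs
#   0: (-inf, 0), 1: [0, 1), 2: [1, 2), 3: [2, +inf)
# bisect_right implements exactly this left-inclusive, right-exclusive rule.
def bucketize(value, boundaries):
    return bisect.bisect_right(boundaries, value)

boundaries = [0.0, 1.0, 2.0]
```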
{
"boundaries": [0.0, 1.0, 2.0],
"default_value": -1
}
num_buckets
Directly uses the feature transformation result as the binning bucket ID. This is suitable for cases where the feature value can be converted to an integer.
Result range: [0, num_buckets). If the feature value is outside the configured range, it is assigned the default_bucketize_value.
{
"num_buckets": 128000,
"default_bucketize_value": 127999
}
Built-in feature operators
The configuration method for each feature operator is different. All feature operators that can be leaf nodes of the DAG support feature binning configuration.
For more information, see Built-in feature operators.
Feature type | Description |
id_feature | Categorical feature |
raw_feature | Numerical feature |
expr_feature | Expression feature |
combo_feature | Combination feature |
lookup_feature | Dictionary lookup feature |
match_feature | Primary-secondary key dictionary lookup feature |
overlap_feature | Overlap feature |
sequence_feature | Sequence feature |
text_normalizer | Text normalization |
tokenize_feature | Text tokenization feature |
bm25_feature | BM25 text relevance feature |
kv_dot_product | KV vector inner product |
str_replace_feature | String replacement |
regex_replace_feature | Regular expression replacement |
slice_feature | Array slicing |
Operator combinations
You can combine various built-in operators by configuring a DAG to perform powerful feature transformations.
Example 1: Calculate the average of the first four elements in a sequence
{
"features": [
{
"feature_name": "top_n_prices",
"feature_type": "sequence_raw_feature",
"expression": "user:clk_prices",
"separator": ",",
"sequence_length": 4,
"stub_type": true
},
{
"feature_name": "top_n_avg_price",
"feature_type": "expr_feature",
"expression":"reduce_mean(top_n_prices)",
"default_value": "-1",
"variables":["feature:top_n_prices"]
}
]
}
Example 2: Calculate the average of elements that meet a condition in a sequence
{
"features": [
{
"feature_name": "valid_list",
"feature_type": "expr_feature",
"expression":"clk_times < 10",
"variables":["user:clk_times"],
"value_dimension": 5
},
{
"feature_name": "top_n_prices",
"feature_type": "bool_mask_feature",
"expression": ["user:clk_prices", "feature:valid_list"],
"value_type": "float",
"separator": ","
},
{
"feature_name": "top_n_avg_price",
"feature_type": "expr_feature",
"expression":"reduce_mean(top_n_prices)",
"default_value": "-1",
"variables":["feature:top_n_prices"]
}
]
}
Note: In the preceding example, clk_prices and clk_times are two parallel sequences.
Custom feature operators
Custom feature operators can be dynamically loaded and executed by the framework as plugins.
For more information, see Custom feature operators.
Performance optimization tips
The performance of the FG module is closely related to its configuration. The general principle is to minimize unnecessary data (feature) transformations.
If data manipulation and transformation can be performed in the offline or near-line stages, avoid performing them in the FG stage (online service).
For better performance, follow these tips.
For structured input data, prioritize using complex types from MaxCompute tables, such as Map and Array, instead of the STRING type. This reduces the overhead of string parsing.
In online services, such as EasyRec Processor or TorchEasyRec Processor, use FeatureStore with FeatureDB as the online storage to enable support for complex types.
For lookup_feature, use the Map type for the map field.
For sequence_feature, overlap_feature, and bm25_feature, use Array type inputs.
Avoid using match_feature because it does not support complex types. Use lookup_feature instead by combining pkey and skey.
Avoid the overhead of data type conversion.
Do not set the value_type of raw_feature to a type other than float unless you have a specific reason. For lookup_feature input, ensure that the key type of the Map matches the type of the query field. When you configure feature binning of the num_buckets type, set value_type to int64.
If the optimal type for a data column varies across scenarios, consider adding a copy of the column with a different type. For example, a column may need to be of the BIGINT type when used as the query field for lookup_feature, but of the STRING type when used as part of a combo_feature. In this case, add a copy of the column with the appropriate type: one copy BIGINT, the other STRING. The following is an example SQL statement:
SELECT int_data, int_data as str_data FROM ....
Use feature dependencies (DAG mode) to reuse components whenever possible.
Global configurations
Configuration item | Type | Default value | Description |
USE_CITY_HASH_TO_BUCKETIZE | string | 'false' | Specifies whether to use CityHash as the hash function for feature binning. |
USE_MULTIPLICATIVE_HASH | string | 'false' | Specifies whether to use multiplicative hashing instead of the modulo operation for feature hashing. Enabling this option is recommended. |
DISABLE_FG_PRECISION | string | 'true' | Specifies whether to disable the precision constraint for floating-point features. If not disabled, FG retains only 6 decimal places for floating-point features. |
DISABLE_STRING_TRIM | string | 'false' | Specifies whether to disable trimming of leading and trailing spaces after splitting multi-value string features. |
MONITOR_CUSTOM_OP_EVERY_N_SECONDS | string | '0' | Specifies whether to monitor the performance of custom operators, expressed as the interval, in seconds, at which performance data is printed. |
IGNORE_CUSTOM_OP_EXCEPTION | string | 'false' | Specifies whether to ignore exceptions thrown within custom operators. |
Note: The preceding configurations must be consistent across all execution environments, including offline and online, and training and inference. Otherwise, scoring inconsistencies between offline and online environments may occur.
Hash collision rate
On a specific dataset with 26 features of different cardinalities, hash_bucket_size was set to 10 times the cardinality for each feature. The test results are as follows:
Hash type | Total feature cardinality | Total number of bins | Hash collision rate |
std::hash | 882774549 | 840065238 | 4.8381% |
cityhash | 882774549 | 840072446 | 4.8373% |
std+cityhash | 882774549 | 840075948 | 4.8369% |
cityhash+multiplicative | 882774549 | 840072195 | 4.8373% |
std+multiplicative | 882774549 | 840077306 | 4.8367% |
In summary, you can combine std::hash and MultiplicativeHash to optimize model performance. std::hash is enabled by default. MultiplicativeHash is disabled by default to ensure backward compatibility. To use it, you must enable it manually.
Additionally, CityHash is theoretically a more uniform hashing method, but it did not show a significant advantage on this dataset. You can test it further on your own datasets.
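The idea behind multiplicative hashing can be sketched as follows. This is an illustration of the general technique (Fibonacci hashing with Knuth's 64-bit constant), not FG's exact implementation, whose constants and details may differ.

```python
# Illustrative sketch of multiplicative (Fibonacci) hashing, assuming
# Knuth's 64-bit golden-ratio constant. Instead of taking a plain
# modulo, the raw hash is scrambled by multiplication and then scaled
# into the bucket range. Not FG's exact implementation.
KNUTH_64 = 0x9E3779B97F4A7C15  # floor(2^64 / golden ratio)

def multiplicative_hash(key_hash, num_buckets):
    mixed = (key_hash * KNUTH_64) % (1 << 64)  # scramble the raw hash
    return (mixed * num_buckets) >> 64         # scale into [0, num_buckets)

bucket = multiplicative_hash(882774549, 128000)
```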
Online scoring service configuration
You can configure this through server-side environment variables. You can set them in the service configuration of EasyRec Processor or TorchEasyRec Processor.
{
"processor_envs": [
{
"name": "USE_MULTIPLICATIVE_HASH",
"value": "true"
}
]
}
Offline job configuration
To execute FG offline tasks in a MaxCompute environment, see Use FG in offline tasks.
Refer to the following code:
from pyfg100 import run_on_odps
fg_task = run_on_odps.FgTask(...)
fg_task.add_fg_setting('USE_CITY_HASH_TO_BUCKETIZE', 'false')
fg_task.add_fg_setting('USE_MULTIPLICATIVE_HASH', 'true')
fg_task.run(o)
Configuration when using the pyfg API
When you use the pyfg API, for example, to perform FG during training, you can configure it as follows.
import pyfg
pyfg.set_env('USE_MULTIPLICATIVE_HASH', 'true')
pyfg.set_env('USE_CITY_HASH_TO_BUCKETIZE', 'false')