FG (FeatureGenerator) is the process of transforming raw data into features for a model. This process ensures that the output from offline and online sample generation is consistent. Feature generation uses various types of FG operators to transform one or more features.
Feature generation focuses only on transformations required for both offline and online sample generation. If a transformation is required only during the offline stage, do not define it as an FG operation. The following diagram shows the position of the FG module within a recommendation system architecture.
The feature generation process consists of a series of feature transformation operators (FG operators). These operators run in parallel according to the topology of a directed acyclic graph (DAG) defined in the configuration file.
Configuration file example
The features list is used to configure the feature operators. Each feature operator must include feature_name and feature_type. For more information about other configuration items, see Built-in feature operators.
The reserves configuration item specifies the fields to pass through during offline tasks. These fields are output directly without any feature transformation.
{
"features": [
{
"feature_name": "goods_id",
"feature_type": "id_feature",
"value_type": "string",
"expression": "item:goods_id",
"default_value": "-1024",
"need_prefix": false
},
{
"feature_name": "color_pair",
"feature_type": "combo_feature",
"value_type": "string",
"expression": ["user:query_color", "item:color"],
"default_value": "",
"need_prefix": false
},
{
"feature_name": "current_price",
"feature_type": "raw_feature",
"value_type": "double",
"expression": "item:current_price",
"default_value": "0",
"need_prefix": false
},
{
"feature_name": "usr_cate1_clk_cnt_1d",
"feature_type": "lookup_feature",
"map": "user:usr_cate1_clk_cnt_1d",
"key": "item:cate1",
"need_discrete": false,
"need_key": false,
"default_value": "0",
"combiner": "max",
"need_prefix": false,
"value_type": "double"
},
{
"feature_name": "recommend_match",
"feature_type": "overlap_feature",
"method": "is_contain",
"query": "user:query_recommend",
"title": "item:recommend",
"default_value": "0"
},
{
"feature_name": "norm_title",
"feature_type": "text_normalizer",
"expression": "item:title",
"max_length": 512,
"parameter": 0,
"remove_space": false,
"is_gbk_input": false,
"is_gbk_output": false
},
{
"feature_name": "title_terms",
"feature_type": "tokenize_feature",
"expression": "feature:norm_title",
"default_value": "",
"vocab_file": "tokenizer.json",
"output_type": "word_id",
"output_delim": ","
},
{
"feature_name": "query_title_match_ratio",
"feature_type": "overlap_feature",
"method": "query_common_ratio",
"query": "user:query_terms",
"title": "feature:title_terms",
"default_value": "0"
},
{
"feature_name": "title_term_match_ratio",
"feature_type": "overlap_feature",
"method": "title_common_ratio",
"query": "user:query_terms",
"title": "feature:title_terms",
"default_value": "0"
},
{
"feature_name": "term_proximity_min_cover",
"feature_type": "overlap_feature",
"method": "proximity_min_cover",
"query": "user:query_terms",
"title": "feature:title_terms",
"default_value": "0"
}
],
"input_alias": {
"non_exist_field1": "exist_field1",
"non_exist_field2": "exist_field2"
},
"reserves": [
"request_id",
"user_id",
"is_click",
"is_pay",
"sample_weight",
"event_unix_time"
]
}
The input_alias configuration item is a dictionary that maps feature input names. It maps a field name that might not exist to an actual field name. Note: The input_alias configuration is supported from version 1.0.0. You can typically skip this configuration.
Use case 1: Setting a shorter alias for a long field name.
Use case 2: Setting an alias for the second instance when a custom feature operator uses the same parameter twice.
The same input field can be reused across different features but not within a single feature transformation. You can configure input_alias to bypass this limitation.
For example, if a custom feature operator requires the same field A for two different input parameters, you can configure two inputs, A and B, and then configure an input_alias entry that maps "B": "A". During execution, the framework replaces the custom feature operator's parameters (A, B) with (A, A).
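The alias mechanism described above can be sketched in a few lines. This is an illustrative sketch, not the actual FG implementation; the function name `resolve_inputs` and the `row` dictionary are hypothetical.

```python
# Illustrative sketch of input_alias resolution (hypothetical helper,
# not FG's actual code): each requested input name is redirected to
# the real field it aliases before the operator runs.
def resolve_inputs(requested, input_alias, row):
    """Return the values for the requested input names, applying aliases."""
    resolved = []
    for name in requested:
        real_name = input_alias.get(name, name)  # fall back to the name itself
        resolved.append(row[real_name])
    return resolved

row = {"A": 3.0}
# The operator asks for (A, B); "B" is aliased to "A", so both
# parameters receive the value of field A.
values = resolve_inputs(["A", "B"], {"B": "A"}, row)
```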
Input domains
An input domain indicates the source entity of the input. The following four types are supported:
user: User-side features, such as user profiles and user-dimension statistical features.
context: Contextual features that change frequently, such as time, location, and weather.
item: Item-side features, such as static content features and item-dimension statistical features.
feature: Indicates that the current input is the output of another feature transformation.
The feature input domain is used to configure the dependencies between feature operators, which form a DAG. The framework executes these feature transformations in parallel according to the topological order. The following figure shows the resulting topology:
By default, the output of intermediate nodes in the DAG is not included in the final FG output. You can use the stub_type configuration item to change this behavior.
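The dependency-driven scheduling described above can be sketched with Kahn's algorithm. This is a simplified illustration, not the FG scheduler: it only scans the expression field for feature: inputs, whereas real operators also declare dependencies through fields such as map, key, query, and title.

```python
from collections import deque

# Illustrative sketch (not the FG scheduler): derive a DAG from inputs
# in the "feature:" domain and compute one valid topological order with
# Kahn's algorithm. Features with no mutual dependency could run in
# parallel at the same depth. Simplified: only the "expression" field
# is scanned for dependencies, and every referenced parent must exist.
def topo_order(features):
    deps = {}
    for f in features:
        inputs = f.get("expression", [])
        if isinstance(inputs, str):
            inputs = [inputs]
        deps[f["feature_name"]] = [
            e.split(":", 1)[1] for e in inputs if e.startswith("feature:")
        ]
    indegree = {name: len(d) for name, d in deps.items()}
    children = {name: [] for name in deps}
    for name, d in deps.items():
        for parent in d:
            children[parent].append(name)
    queue = deque(n for n, k in indegree.items() if k == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    return order

features = [
    {"feature_name": "norm_title", "expression": "item:title"},
    {"feature_name": "title_terms", "expression": "feature:norm_title"},
]
# norm_title must come before title_terms in any valid order.
order = topo_order(features)
```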
Multi-value types and separators
FG supports complex input types, such as Array and Map, which are consistent with the complex types in MaxCompute.
You can use the chr(29) separator for multi-value string features.
For example, in v1^]v2^]v3, ^] represents the multi-value separator. This is a single character with the ASCII code "\x1D", not two characters. To enter this character in Emacs, press C-q C-5. In vi, press C-v C-5.
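A quick way to verify that the separator is the single character chr(29), not the two characters ^ and ]:

```python
# The multi-value separator is the single character chr(29) ("\x1D"),
# which terminals often render as "^]".
MULTI_VAL_SEP = chr(29)  # ASCII 0x1D

raw = MULTI_VAL_SEP.join(["v1", "v2", "v3"])  # "v1^]v2^]v3" on screen
values = raw.split(MULTI_VAL_SEP)
```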
Feature binning (discretization)
The framework supports the following six types of binning operations:
hash_bucket_size: Hashes the feature transformation result and performs a modulo operation.
vocab_list: Converts the feature transformation result into an index in a list.
vocab_dict: Converts the feature transformation result into a value in a dictionary. The value must be convertible to the int64 type.
vocab_file: Reads a vocab_list or vocab_dict from a file.
boundaries: Specifies binning boundaries and converts the feature transformation result into the corresponding bucket ID.
num_buckets: Directly uses the feature transformation result as the binning bucket ID.
hash_bucket_size
Hashes the feature transformation result and performs a modulo operation. This applies to any type of feature value.
Result range: [0, hash_bucket_size). The binning result for an empty feature is hash(default_value) % hash_bucket_size.
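The hash-and-modulo step can be sketched as follows. Note that this is illustrative only: FG uses std::hash or CityHash, while crc32 here is a stand-in, so the bucket IDs will not match FG's actual output.

```python
import zlib

# Illustrative sketch of hash bucketization. crc32 is a stand-in for
# FG's actual hash function (std::hash or CityHash), used only to show
# the hash-then-modulo step and the empty-feature fallback.
def hash_bucketize(value, hash_bucket_size, default_value="default value"):
    if value is None or value == "":
        value = default_value  # empty features fall back to default_value
    return zlib.crc32(str(value).encode("utf-8")) % hash_bucket_size

bucket = hash_bucketize("goods_123", 128000)
# bucket always falls within [0, hash_bucket_size)
```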
{
"hash_bucket_size": 128000,
"default_value": "default value"
}
vocab_list
Performs binning based on a vocabulary list. It maps the input to an index in the vocabulary. The binning result is the index of the feature value in the vocab_list array.
The element type of the vocab_list array must be the same as the configured value_type.
num_oov_bucket: A non-negative integer that specifies the number of out-of-vocabulary buckets. All out-of-vocabulary inputs are assigned IDs in the range [vocabulary_size, vocabulary_size + num_oov_bucket) based on a hash of the input value. You cannot specify a positive num_oov_bucket together with default_bucketize_value.
default_bucketize_value: The integer ID to return for out-of-vocabulary feature values. This cannot be specified if num_oov_bucket is positive. The default value is vocab_list.size().
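The vocab_list lookup and the two mutually exclusive out-of-vocabulary strategies can be sketched as follows. The function is hypothetical and the hash used for OOV buckets is a stand-in, not FG's actual function.

```python
import zlib

# Illustrative sketch of vocab_list bucketization (hypothetical helper).
# crc32 stands in for FG's actual OOV hash function.
def vocab_list_bucketize(value, vocab_list, num_oov_bucket=0,
                         default_bucketize_value=None):
    # The two OOV strategies are mutually exclusive.
    assert not (num_oov_bucket > 0 and default_bucketize_value is not None)
    try:
        return vocab_list.index(value)  # in-vocabulary: index in the list
    except ValueError:
        pass
    size = len(vocab_list)
    if num_oov_bucket > 0:  # hash OOV values into [size, size + num_oov_bucket)
        return size + zlib.crc32(str(value).encode("utf-8")) % num_oov_bucket
    if default_bucketize_value is not None:
        return default_bucketize_value
    return size  # default: vocab_list.size()

vocab = ["", "<OOV>", "token1", "token2"]
```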
{
"vocab_list": [
"",
"<OOV>",
"token1",
"token2",
"token3",
"token4"
],
"num_oov_bucket": 0,
"default_bucketize_value": 1
}
vocab_dict
The binning result is the value in the vocab_dict dictionary that corresponds to the feature value. This supports mapping different feature values to the same binning result.
The key type of the vocab_dict dictionary must be the same as the configured value_type. The values in vocab_dict must be convertible to the int64 type.
num_oov_bucket: A non-negative integer that specifies the number of out-of-vocabulary buckets. All out-of-vocabulary inputs are assigned IDs in the range [vocabulary_size, vocabulary_size + num_oov_bucket) based on a hash of the input value. You cannot specify a positive num_oov_bucket together with default_bucketize_value.
default_bucketize_value: The integer ID to return for out-of-vocabulary feature values. This cannot be specified if num_oov_bucket is positive. The default value is vocab_dict.size().
{
"vocab_dict": {
"token1": 1,
"token2": 2,
"token3": 3,
"token4": 1
},
"num_oov_bucket": 0,
"default_bucketize_value": 4
}
vocab_file
Reads a vocab_list or vocab_dict from a file.
{
"vocab_file": "vocab.txt",
"num_oov_bucket": 0,
"default_bucketize_value": 4
}
vocab_file: The file path. The file contains a vocabulary list, with one vocabulary term per line. You can optionally specify a mapped value. Relative paths are supported. When you deploy the online service, place the file in the same directory as fg.json.
If only a token is present on a line, the token is mapped to its line number, starting from 0. If a value is present, the token and value are separated by a whitespace character, such as a space or a tab. The value must be of the int64 type.
The meanings of num_oov_bucket and default_bucketize_value are the same as described previously.
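The file format described above can be sketched with a small parser. This is an illustrative sketch under the stated format assumptions, not FG's actual loader; the function name `load_vocab_file` is hypothetical.

```python
# Illustrative sketch of vocab_file parsing (hypothetical helper):
# one term per line with an optional whitespace-separated int64 value;
# lines without a value map to their 0-based line number.
def load_vocab_file(path):
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f):
            parts = line.rstrip("\n").split()
            if len(parts) == 1:
                vocab[parts[0]] = line_no        # token only: line number
            elif len(parts) >= 2:
                vocab[parts[0]] = int(parts[1])  # token + explicit value
    return vocab
```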
boundaries
Bins numeric features based on specified boundaries. This method discretizes dense numeric input into buckets defined by the boundaries.
The element type of the boundaries array must be the same as the configured value_type. Buckets are inclusive of the left boundary and exclusive of the right boundary.
For example, boundaries=[0., 1., 2.] generates the buckets (-inf, 0), [0, 1), [1, 2), and [2, +inf).
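This left-inclusive, right-exclusive bucketing matches the behavior of bisect.bisect_right in Python, which makes the bucket-ID assignment easy to check:

```python
import bisect

# boundaries [0., 1., 2.] yield bucket IDs
#   0: (-inf, 0), 1: [0, 1), 2: [1, 2), 3: [2, +inf)
# bisect_right implements exactly this left-inclusive, right-exclusive rule.
def bucketize(value, boundaries):
    return bisect.bisect_right(boundaries, value)

boundaries = [0.0, 1.0, 2.0]
```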
{
"boundaries": [0.0, 1.0, 2.0],
"default_value": -1
}
num_buckets
Directly uses the feature transformation result as the binning bucket ID. This is suitable for cases where the feature value can be converted to an integer.
Result range: [0, num_buckets). If the feature value is outside the configured range, it is assigned the default_bucketize_value.
{
"num_buckets": 128000,
"default_bucketize_value": 127999
}
Built-in feature operators
The configuration method for each feature operator is different. All feature operators that can be leaf nodes of the DAG support feature binning configuration.
For more information, see Built-in feature operators.
Feature type | Description |
id_feature | Categorical feature |
raw_feature | Numerical feature |
expr_feature | Expression feature |
combo_feature | Combination feature |
lookup_feature | Dictionary lookup feature |
match_feature | Primary-secondary key dictionary lookup feature |
overlap_feature | Overlap feature |
sequence_feature | Sequence feature |
text_normalizer | Text normalization |
tokenize_feature | Text tokenization feature |
bm25_feature | BM25 text relevance feature |
kv_dot_product | KV vector inner product |
str_replace_feature | String replacement |
regex_replace_feature | Regular expression replacement |
slice_feature | Array slicing |
Operator combinations
You can combine various built-in operators by configuring a DAG to perform powerful feature transformations.
Example 1: Calculate the average of the first four elements in a sequence
{
"features": [
{
"feature_name": "top_n_prices",
"feature_type": "sequence_raw_feature",
"expression": "user:clk_prices",
"separator": ",",
"sequence_length": 4,
"stub_type": true
},
{
"feature_name": "top_n_avg_price",
"feature_type": "expr_feature",
"expression":"reduce_mean(top_n_prices)",
"default_value": "-1",
"variables":["feature:top_n_prices"]
}
]
}
Example 2: Calculate the average of elements that meet a condition in a sequence
{
"features": [
{
"feature_name": "valid_list",
"feature_type": "expr_feature",
"expression":"clk_times < 10",
"variables":["user:clk_times"],
"value_dimension": 5
},
{
"feature_name": "top_n_prices",
"feature_type": "bool_mask_feature",
"expression": ["user:clk_prices", "feature:valid_list"],
"value_type": "float",
"separator": ","
},
{
"feature_name": "top_n_avg_price",
"feature_type": "expr_feature",
"expression":"reduce_mean(top_n_prices)",
"default_value": "-1",
"variables":["feature:top_n_prices"]
}
]
}
Note: In the preceding example, clk_prices and clk_times are two parallel sequences.
Custom feature operators
Custom feature operators can be dynamically loaded and executed by the framework as plugins.
For more information, see Custom feature operators.
Performance optimization tips
The performance of the FG module is closely related to its configuration. The general principle is to minimize unnecessary data (feature) transformations.
If data manipulation and transformation can be performed in the offline or near-line stages, avoid performing them in the FG stage (online service).
For better performance, follow these tips.
For structured input data, prioritize using complex types from MaxCompute tables, such as Map and Array, instead of the STRING type. This reduces the overhead of string parsing.
In online services, such as EasyRec Processor or TorchEasyRec Processor, use FeatureStore with FeatureDB as the online storage to enable support for complex types.
For lookup_feature, use the Map type for the map field.
For sequence_feature, overlap_feature, and bm25_feature, use Array type inputs.
Avoid using match_feature because it does not support complex types. Use lookup_feature instead by combining pkey and skey.
Avoid the overhead of data type conversion.
Do not set the value_type of raw_feature to a type other than float unless you have a specific reason. For lookup_feature input, ensure that the key type of the Map matches the type of the query field. When you configure feature binning of the num_buckets type, set value_type to int64.
If the optimal type for a data column varies across scenarios, consider adding a copy of the column with a different type. For example, a column may need to be of the BIGINT type when used as the query field for lookup_feature, but of the STRING type when used as part of a combo_feature. In this case, add a copy of the column with the appropriate type: one copy BIGINT, the other STRING. The following is an example SQL statement:
SELECT int_data, int_data as str_data FROM ....
Use feature dependencies (DAG mode) to reuse components whenever possible.
Global configurations
Configuration item | Type | Default value | Description |
USE_CITY_HASH_TO_BUCKETIZE | string | 'false' | Specifies whether to use CityHash as the hash function for feature binning. |
USE_MULTIPLICATIVE_HASH | string | 'false' | Specifies whether to use multiplicative hashing instead of the modulo operation for feature hashing. Enabling this option is recommended. |
DISABLE_FG_PRECISION | string | 'true' | Specifies whether to disable the precision constraint for floating-point features. If not disabled, FG retains only 6 decimal places for floating-point features. |
DISABLE_STRING_TRIM | string | 'false' | Specifies whether to disable trimming of leading and trailing spaces after splitting multi-value string features. |
MONITOR_CUSTOM_OP_EVERY_N_SECONDS | string | '0' | Specifies whether to monitor the performance of custom operators, expressed as the interval, in seconds, at which performance data is printed. |
IGNORE_CUSTOM_OP_EXCEPTION | string | 'false' | Specifies whether to ignore exceptions thrown within custom operators. |
Note: The preceding configurations must be consistent across all execution environments, including offline and online, and training and inference. Otherwise, scoring inconsistencies between offline and online environments may occur.
Hash collision rate
On a specific dataset with 26 features of different cardinalities, hash_bucket_size was set to 10 times the cardinality for each feature. The test results are as follows:
Hash type | Total feature cardinality | Total number of bins | Hash collision rate |
std::hash | 882774549 | 840065238 | 4.8381% |
cityhash | 882774549 | 840072446 | 4.8373% |
std+cityhash | 882774549 | 840075948 | 4.8369% |
cityhash+multiplicative | 882774549 | 840072195 | 4.8373% |
std+multiplicative | 882774549 | 840077306 | 4.8367% |
In summary, you can combine std::hash and MultiplicativeHash to optimize model performance. std::hash is enabled by default. MultiplicativeHash is disabled by default to ensure backward compatibility. To use it, you must enable it manually.
Additionally, CityHash is theoretically a more uniform hashing method, but it did not show a significant advantage on this dataset. You can test it further on your own datasets.
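The idea behind multiplicative hashing can be sketched as follows. This is an illustration of the general technique (Fibonacci hashing with Knuth's 64-bit constant), not FG's exact implementation, whose constants and details may differ.

```python
# Illustrative sketch of multiplicative (Fibonacci) hashing, assuming
# Knuth's 64-bit golden-ratio constant. Instead of taking a plain
# modulo, the raw hash is scrambled by multiplication and then scaled
# into the bucket range. Not FG's exact implementation.
KNUTH_64 = 0x9E3779B97F4A7C15  # floor(2^64 / golden ratio)

def multiplicative_hash(key_hash, num_buckets):
    mixed = (key_hash * KNUTH_64) % (1 << 64)  # scramble the raw hash
    return (mixed * num_buckets) >> 64         # scale into [0, num_buckets)

bucket = multiplicative_hash(882774549, 128000)
```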
Online scoring service configuration
You can configure this through server-side environment variables. You can set them in the service configuration of EasyRec Processor or TorchEasyRec Processor.
{
"processor_envs": [
{
"name": "USE_MULTIPLICATIVE_HASH",
"value": "true"
}
]
}
Offline job configuration
To execute FG offline tasks in a MaxCompute environment, see Use FG in offline tasks.
Refer to the following code:
from pyfg100 import run_on_odps
fg_task = run_on_odps.FgTask(...)
fg_task.add_fg_setting('USE_CITY_HASH_TO_BUCKETIZE', 'false')
fg_task.add_fg_setting('USE_MULTIPLICATIVE_HASH', 'true')
fg_task.run(o)
Configuration when using the pyfg API
When you use the pyfg API, for example, to perform FG during training, you can configure it as follows.
import pyfg
pyfg.set_env('USE_MULTIPLICATIVE_HASH', 'true')
pyfg.set_env('USE_CITY_HASH_TO_BUCKETIZE', 'false')