Built-in Feature Operators Overview - Artificial Intelligence Recommendation

id_feature

Overview

The id_feature operator processes a discrete feature. It handles both a single-value discrete feature, such as a user ID, and a multi-value discrete feature, such as the colors available for an item.

Configuration

{
  "feature_type": "id_feature",
  "feature_name": "item_is_main",
  "expression": "item:is_main",
  "need_prefix": true,
  "separator": "\u001D",
  "default_value": ""
}

Parameter	Required	Description
feature_name	Yes	The name of the output feature. This name is also used as a prefix in the generated feature value.
expression	Yes	The source field used to generate the feature.
need_prefix	No	Specifies whether to prepend the `feature_name` to the output value. Valid values: `true`: The `feature_name` is prepended. `false` (default): The `feature_name` is not prepended.
value_type	No	The data type of the output feature. The default is `string`.
separator	No	The multi-value separator for the input feature. The default is `\u001D`. Only a single character is supported.
default_value	No	The default value to use when the input feature is empty.
weighted	No	Marks whether the input is in the key:value format. If set to `true`, both feature values and weights are output (Map type).
value_dimension	No	Truncates the output when a feature has multiple values. The default value is `0`, which indicates no truncation. If the value is `1`, the schema type of the output table is `value_type`. Otherwise, it is `array<value_type>`.
stub_type	No	If set to `true`, the configured feature transform is used only as an intermediate result in the pipeline and is not output to the model. The default is `false`.

This operator supports feature binning. For configuration details, see Feature binning (discretization).
This operator supports multi-value inputs of type array.

Example

The following example shows the input and output for the item:is_main feature with different configurations.

Type	Value	Output feature
int64_t	100	item_is_main_100
double	5.2	item_is_main_5.2
string	abc	item_is_main_abc
Multi-value string	abc^]bcd	[item_is_main_abc, item_is_main_bcd]
Multi-value int	123^]456	[item_is_main_123, item_is_main_456]

The ^] symbol represents the multi-value separator. This is a single character with the ASCII code "\x1D", which can also be written as "\u001d".
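
The rows in the table above can be approximated with a short Python sketch. The function name `id_feature_sketch` and its signature are hypothetical and only mirror the documented parameters; this is not the operator's actual implementation.

```python
SEP = "\x1d"  # default multi-value separator, the same character as "\u001d"


def id_feature_sketch(value, feature_name, need_prefix=True,
                      separator=SEP, default_value=""):
    """Hypothetical helper: split a raw field value and optionally prefix each part."""
    text = "" if value is None else str(value)
    if text == "":
        text = default_value
    parts = text.split(separator) if text != "" else []
    if need_prefix:
        parts = [f"{feature_name}_{part}" for part in parts]
    # A single value is returned unwrapped; multi-value input yields a list.
    return parts[0] if len(parts) == 1 else parts


print(id_feature_sketch(100, "item_is_main"))
print(id_feature_sketch("abc\x1dbcd", "item_is_main"))
```

With `need_prefix=False`, the same input values would be returned without the `item_is_main_` prefix.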

raw_feature

Overview

The raw_feature operator processes a continuous feature. It supports numeric types such as int, float, and double, and handles both single-value and multi-value continuous features.

Configuration

{
 "feature_type" : "raw_feature",
 "feature_name" : "ctr",
 "expression" : "item:ctr",
 "normalizer" : "method=log10"
}

Parameter	Required	Description
feature_name	Yes	Specifies the feature name.
expression	Yes	The source field the feature depends on. Valid sources are `user`, `item`, or `context`.
normalizer	No	The normalization method. For details, see the Normalizer section.
value_type	No	The data type of the output feature. Default: `float`.
separator	No	The separator for a multi-value input feature. The default is `\u001D`. The separator must be a single character.
default_value	No	The default value for an empty input feature.
value_dimension	No	The dimension of the output field. The default is `1`. This parameter can be used to truncate the output in an offline task. If the dimension is `1`, the output table schema type is `value_type`. Otherwise, the schema type is `array<value_type>`.
stub_type	No	The default is `false`. If set to `true`, this feature transform serves only as an intermediate result and is not output to the model.

This operator supports feature binning. For configuration details, see Feature binning (discretization).
This operator supports multi-value array inputs.

Example

^] represents the multi-value separator, which is a single character with the ASCII encoding "\x1D", not two characters.

Type	Value	Output feature
int64_t	100	100
double	100.1	100.1
Multi-value int	123^]456	[123, 456] (The input field's dimension must match the one specified in the `value_dimension` parameter.)

Normalizer

The raw_feature and match_feature operators support four types of normalizers: minmax, zscore, log10, and expression. The configuration and calculation methods are as follows:

minmax
Configuration example: method=minmax,min=2.1,max=2.2
formula: x' = (x - min) / (max - min)
zscore
Configuration example: method=zscore,mean=0.0,standard_deviation=10.0
formula: x' = (x - mean) / standard_deviation
log10
Configuration example: method=log10,threshold=1e-10,default=-10
formula: x' = log10(x) if x > threshold; otherwise, x' = default
expression
Configuration example: method=expression,expr=sign(x)
formula: a custom function or expression that you define, in which the variable x represents the input value.
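
The first three formulas can be written out directly in Python. This is a minimal sketch; the function names are hypothetical, and the default `threshold` and `default` values are simply copied from the configuration example above rather than taken from the operator's source.

```python
import math


def minmax(x, min_value, max_value):
    # x' = (x - min) / (max - min)
    return (x - min_value) / (max_value - min_value)


def zscore(x, mean, standard_deviation):
    # x' = (x - mean) / standard_deviation
    return (x - mean) / standard_deviation


def log10_norm(x, threshold=1e-10, default=-10.0):
    # x' = log10(x) if x > threshold; otherwise x' = default
    return math.log10(x) if x > threshold else default


print(zscore(5.0, mean=0.0, standard_deviation=10.0))  # 0.5
print(log10_norm(100.0))                               # 2.0
print(log10_norm(0.0))                                 # -10.0
```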

expr_feature

Overview

The expr_feature operator evaluates a mathematical expression and returns the result as a feature value. This operator supports batch computing and broadcasting.

Important: All inputs must be convertible to the double data type.

Configuration

{
  "feature_type" : "expr_feature",
  "feature_name" : "ctr_sigmoid",
  "value_type": "float",
  "expression" : "sigmoid(pv/(1+click))",
  "variables": ["item:pv", "item:click"]
}

When pv = 2, click = 3, the value of the preceding expression feature is 0.6224593312.
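
The stated value can be checked by hand with a few lines of Python, assuming `sigmoid` denotes the standard logistic function 1 / (1 + e^(-x)):

```python
import math


def sigmoid(x):
    # Standard logistic function (assumed to match the built-in sigmoid).
    return 1.0 / (1.0 + math.exp(-x))


pv, click = 2.0, 3.0
ctr_sigmoid = sigmoid(pv / (1 + click))  # sigmoid(0.5)
print(round(ctr_sigmoid, 10))  # 0.6224593312
```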

Parameter	Required	Description
feature_name	Yes	Specifies the name of the output feature.
expression	Yes	Specifies the mathematical expression to evaluate.
variables	Yes	Specifies the variables, or input fields, used in the expression. The source for each variable must be `user`, `item`, or `context`.
value_type	No	Specifies the data type of the output feature. Valid values are `float`, `double`, `int32`, and `int64`. The default value is `float`.
separator	No	Specifies the separator for multi-valued `string` input fields. The default value is `\u001D`. Only a single character is supported.
default_value	No	Specifies the default value to use when an input feature is empty.
value_dimension	No	Specifies the dimension of the output field, which can be used to truncate or pad the output. The default value is `0`. The schema type of the output table is `value_type` if the value is `1`, or `array<value_type>` otherwise.
stub_type	No	If set to `true`, this feature transform serves only as an intermediate result in the pipeline and is not passed to the model. The default value is `false`.

Examples

{
    "feature_name": "expr_feat",
    "feature_type": "expr_feature",
    "value_type": "float",
    "expression": "a+b",
    "variables": ["a", "b"],
    "value_dimension": 3
}

Scalar and vector computation (Broadcasting)
- When a=1 and b=[1, 2, 6], the result is [2, 3, 7].
Vector-to-vector element-wise computation
- When a=[3, 2, 1] and b=[1, 2, 6], the result is [4, 4, 7].
Temporary Variables and Comma Expressions
- For example: x=roundp(a),(a-x)*b. In this example, x is a temporary variable and does not need to be configured in variables.
- A comma expression is evaluated from left to right, and it returns the value of the rightmost sub-expression as the final result.
- To reduce memory overhead, you can reuse existing variables as temporary variables when semantically appropriate.
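
The two broadcasting cases above can be illustrated with a small Python sketch. The helper `broadcast_add` is hypothetical and only models the `a+b` expression on plain Python lists; it does not model `value_dimension` truncation or padding.

```python
def broadcast_add(a, b):
    """Element-wise a + b, where either operand may be a scalar or a list."""
    a_is_vec = isinstance(a, list)
    b_is_vec = isinstance(b, list)
    if not a_is_vec and not b_is_vec:
        return a + b
    if a_is_vec and b_is_vec and len(a) != len(b):
        raise ValueError("vector operands must have the same length")
    n = len(a) if a_is_vec else len(b)
    left = a if a_is_vec else [a] * n    # broadcast scalar a
    right = b if b_is_vec else [b] * n   # broadcast scalar b
    return [x + y for x, y in zip(left, right)]


print(broadcast_add(1, [1, 2, 6]))          # [2, 3, 7]
print(broadcast_add([3, 2, 1], [1, 2, 6]))  # [4, 4, 7]
```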

Combine expression and sequence features

{
  "features": [
    {
      "feature_name": "sphere_distance",
      "feature_type": "expr_feature",
      "expression": "sphere_dist(click_id_lng,click_id_lat,j_lng,j_lat)",
      "variables": ["user:click_id_lng", "user:click_id_lat", "item:j_lng", "item:j_lat"],
      "default_value": "0",
      "value_dimension": 3,
      "stub_type": true
    },
    {
      "feature_name": "time_diff",
      "feature_type": "expr_feature",
      "variables": ["user:cur_time", "user:clk_time_seq"],
      "expression": "cur_time-clk_time_seq",
      "default_value": "0",
      "separator": ";",
      "value_dimension": 3,
      "stub_type": true
    },
    {
      "sequence_name": "click_seq",
      "sequence_length": 3,
      "sequence_delim": ";",
      "sequence_pk": "user:click_item",
      "features": [
        {
          "feature_name": "spherical_distance",
          "feature_type": "raw_feature",
          "expression": "feature:sphere_distance",
          "default_value": "0.0"
        },
        {
          "feature_name": "time_diff_seq",
          "feature_type": "id_feature",
          "expression": "feature:time_diff",
          "default_value": "0.0",
          "num_buckets": 10000
        }
      ]
    }
  ]
}

Expressions

Built-in functions (scalar)

Function name	Number of parameters	Description
rnd	0	Generates a random number in the range [0, 1).
sin	1	Calculates the sine of a number.
cos	1	Calculates the cosine of a number.
tan	1	Calculates the tangent of a number.
asin	1	Calculates the arcsine of a number.
acos	1	Calculates the arccosine of a number.
atan	1	Calculates the arctangent of a number.
sinh	1	Calculates the hyperbolic sine of a number.
cosh	1	Calculates the hyperbolic cosine of a number.
tanh	1	Calculates the hyperbolic tangent of a number.
asinh	1	Calculates the inverse hyperbolic sine of a number.
acosh	1	Calculates the inverse hyperbolic cosine of a number.
atanh	1	Calculates the inverse hyperbolic tangent of a number.
log2	1	Calculates the base-2 logarithm of a number.
log10	1	Calculates the base-10 logarithm of a number.
log	1	Calculates the natural logarithm (base e) of a number.
ln	1	Calculates the natural logarithm (base e) of a number.
exp	1	Raises Euler's number (e) to the power of a number.
sqrt	1	Calculates the square root of a number.
sign	1	Returns the sign of a number: -1 for negative, 1 for positive, or 0 for zero.
abs	1	Calculates the absolute value of a number.
rint	1	Rounds a number to the nearest integer.
round	1	Rounds a number to the nearest integer using the 'round half away from zero' method.
roundp	2	Rounds a number to a specified precision. For example, `roundp(3.14159, 2)` returns `3.14`.
mod	2	Calculates the remainder of a division.
floor	1	Rounds a number down to the nearest integer.
ceil	1	Rounds a number up to the nearest integer.
trunc	1	Truncates a number to an integer by removing its fractional part.
sigmoid	1	Calculates the sigmoid of a number.
sphere_dist	4	Calculates the spherical distance between two GPS points. Arguments: `lng1`, `lat1`, `lng2`, `lat2`.
haversine	4	Calculates the Haversine distance between two GPS points. Arguments: `lng1`, `lat1`, `lng2`, `lat2`.
min	Variable	Returns the minimum value from a list of arguments.
max	Variable	Returns the maximum value from a list of arguments.
sum	Variable	Returns the sum of all arguments.
avg	Variable	Returns the average value of all arguments.

Note: These built-in functions support batch computing and broadcasting.

Built-in vector operation functions

Function name	Number of parameters	Description
len	1	Returns the length (number of elements) of a vector.
l2_norm	1	Performs L2 normalization on a vector.
squared_norm	1	Calculates the squared L2 norm of a vector.
dot	2	Calculates the dot product of two vectors.
euclid_dist	2	Calculates the Euclidean distance between two vectors.
corr	2	Calculates the Pearson correlation coefficient between two vectors.
std_dev	1	Calculates the sample standard deviation of a vector (dividing by n-1).
pop_std_dev	1	Calculates the population standard deviation of a vector (dividing by n).
variance	1	Calculates the sample variance of a vector (dividing by n-1).
pop_variance	1	Calculates the population variance of a vector (dividing by n).
reduce_min	1	Returns the minimum value in a vector.
reduce_max	1	Returns the maximum value in a vector.
reduce_sum	1	Returns the sum of all elements in a vector.
reduce_mean	1	Returns the average value of all elements in a vector.
reduce_prod	1	Returns the product of all elements in a vector.

Note: If an expression includes a built-in vector operation function, all other variables in the expression must be scalars.
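
The difference between the sample and population variants (dividing by n-1 versus n) can be seen with Python's standard `statistics` module. This is only an analogy: it is assumed here that `std_dev`, `pop_std_dev`, `variance`, and `pop_variance` follow the usual textbook definitions given in the table.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0]

sample_std = statistics.stdev(values)          # divides by n-1, like std_dev
population_std = statistics.pstdev(values)     # divides by n, like pop_std_dev
sample_var = statistics.variance(values)       # divides by n-1, like variance
population_var = statistics.pvariance(values)  # divides by n, like pop_variance

print(sample_var, population_var)
```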

Built-in binary operators

Operator	Description	Priority
=	Assignment. This special operator modifies one of its arguments and applies only to variables.	0
||	Logical OR	1
&&	Logical AND	2
|	Bitwise OR	3
&	Bitwise AND	4
<=	Less than or equal to	5
>=	Greater than or equal to	5
!=	Not equal to	5
==	Equal to	5
>	Greater than	5
<	Less than	5
+	Addition	6
-	Subtraction	6
*	Multiplication	7
/	Division	7
%	Modulo	7
^	Raises x to the power of y	8

Built-in ternary operator

Supports if-then-else logic using C-style syntax. It uses lazy evaluation, which means it evaluates only the necessary branch of the expression.

Operator	Description	Syntax
?:	If-then-else operator	condition ? value_if_true : value_if_false

Built-in constants

Constant	Description	Value
_pi	The mathematical constant pi (π).	3.141592653589793
_e	The mathematical constant e, also known as Euler's number.	2.718281828459045

combo_feature

Overview

The combo_feature operator creates a feature combination, or a Cartesian product, from multiple input Fields or expressions. This process is also known as feature crossing. You can think of the id_feature operator as a special case of combo_feature where only one Field is used for the crossing. Typically, the Fields involved in the crossing come from different data sources, such as when crossing a user feature with an item feature.

Configuration

{
  "feature_type" : "combo_feature",
  "feature_name" : "comb_age_item",
  "expression" : ["user:age_class", "item:item_id"],
  "need_prefix": true,
  "separator": "\u001D",
  "default_value": ""
}

Parameter	Required	Description
feature_name	Yes	Specifies the prefix for the output feature.
expression	Yes	An array that specifies the source Fields the feature depends on.
need_prefix	No	Indicates whether to prepend the `feature_name` as a prefix. Valid values: `true`: Prepends the prefix. `false` (default): Does not prepend the prefix.
value_type	No	Specifies the data type of the output feature. The default value is `string`.
separator	No	Specifies the multi-value separator for input features. The default value is `\u001D`. The separator must be a single character.
default_value	No	Specifies the default value to use when an input feature is empty.
value_dimension	No	Specifies the dimension of the output field, which can be used in offline tasks to truncate the output. The default value is `0`. If the value is `1`, the schema type of the output table is `value_type`. Otherwise, the schema type is `array<value_type>`.
stub_type	No	The default value is `false`. If set to `true`, the pipeline uses the configured feature transform only as an intermediate result and does not pass it to the model.

This operator supports feature binning. For more information, see Feature binning (discretization).
This operator supports multi-value inputs of the array type.

Example

The ^] symbol represents the multi-value separator. This symbol is a single character with the ASCII code \x1D, not two separate characters.

user:age_class	item:item_id	Output feature
123	45678	comb_age_item_123_45678
abc, bcd	45678	[comb_age_item_abc_45678, comb_age_item_bcd_45678]
abc, bcd	12345^]45678	[comb_age_item_abc_12345, comb_age_item_abc_45678, comb_age_item_bcd_12345, comb_age_item_bcd_45678]

The number of output features is calculated as:

|F1| * |F2| * ... * |Fn|

Where |Fn| represents the number of values in the nth input Field.
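
The crossing in the example table can be sketched with `itertools.product`. The function name `combo_feature_sketch` is hypothetical, and the inputs are assumed to be already split into lists of string values.

```python
from itertools import product


def combo_feature_sketch(feature_name, fields, need_prefix=True):
    """Cross every value of every input field (Cartesian product).

    fields: one list of string values per input field.
    """
    outputs = []
    for combination in product(*fields):
        joined = "_".join(combination)
        outputs.append(f"{feature_name}_{joined}" if need_prefix else joined)
    return outputs


age_class = ["abc", "bcd"]
item_id = "12345\x1d45678".split("\x1d")  # multi-value input split on \x1D
crossed = combo_feature_sketch("comb_age_item", [age_class, item_id])
print(len(crossed))  # 4, which is |F1| * |F2| = 2 * 2
```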

lookup_feature

Overview

The lookup_feature operator is similar to match_feature. It retrieves a value from a set of key-value pairs.

This operator requires the map and key parameters:

map is a dictionary type or a field of the MultiString type, where each string has the format "k1:v1".
The key can be a field of any type. An array-type input is recommended for multiple keys. To generate a feature, the value of the key is retrieved, converted to the key type of the map, and then matched against the key-value pairs in the map field to obtain the final feature.

Configuration

{
  "feature_type": "lookup_feature",
  "feature_name": "item_match_item",
  "map": "item:item_attr",
  "key": "item:item_value",
  "need_discrete": true,
  "need_key": true
}

Parameter	Required	Description
feature_name	Yes	Specifies the prefix for the output feature.
map	Yes	Specifies the dictionary that contains the set of key-value pairs.
key	Yes	Specifies the key to look up in the dictionary.
value_type	No	Specifies the data type of the output feature. The default is `string`.
separator	No	Specifies the multi-value separator for the `key` field of type string. The default value is "\u001D", and the separator can only be a single character.
default_value	No	Specifies the default value to use when the input `key` is empty or not found in the `map`.
need_prefix	No	Controls whether to prepend the `feature_name` as a prefix to the output. `true`: The prefix is prepended. `false` (default): The prefix is not prepended.
need_key	No	Controls whether to prepend the `key` as a prefix to the output value. This parameter applies only when `value_type` is `string`. `true`: The prefix is prepended. `false` (default): The prefix is not prepended.
normalizer	No	Specifies the normalization method. This parameter works like the `normalizer` parameter in the raw_feature operator.
combiner	No	Specifies the aggregation method to merge values retrieved from multiple keys. Valid values: `sum` (default), `avg/mean`, `max`, and `min`.
need_discrete	No	Controls whether to return multiple values as a discrete array. If set to `true`, the operator outputs all retrieved values and ignores the `combiner`. Default: `false`.
value_dimension	No	Specifies the dimension of the output feature. This parameter can be used to truncate the output in an offline task. `0` (default): No truncation is performed. For a value of 1, the schema type of the output table is `value_type`, or `array<value_type>` otherwise.
stub_type	No	If set to `true`, the configured feature transform is used only as an intermediate result and is not output to the model. Default: `false`.

This operator supports feature binning. For configuration details, see Feature binning (discretization).
The map parameter accepts a dictionary object, and the key parameter accepts an array.

Example

Based on the configuration above, assume the following input data:

item_attr : "k1:v1^]k2:v2^]k3:v3"

^] represents the multi-value separator. It is a single character with the ASCII encoding "\x1D", not two characters. You can enter this character in emacs by pressing C-q C-5, and in vim by pressing C-v C-5. Here, item_attr is a multi-value string.

When using a string for the map parameter, multiple key-value pairs must be provided as a multi-value string, not a single string.

item_value : "k2"

The feature transformation result is item_match_item_k2_v2.

Example with `need_prefix` set to `true`

feature_name: fg
map: {"k1:123", "k2:234", "k3:3"}
key: {"k1"}
Result: feature={"fg_123"}

Example with `need_prefix` set to `false`

map: {"k1:123", "k2:234", "k3:3"}
key: {"k1"}
Result: feature={123}

Combining results

If you provide multiple keys, you can configure the combiner parameter to merge the retrieved values. Valid aggregation methods include sum, mean, max, and min.

If you want to use a combiner, you must set need_discrete to false. In this case, the value must be a numeric type or a string that can be converted to a numeric value.
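
The combiner behavior described above can be sketched as follows. The helper names are hypothetical, and two points are assumptions rather than documented behavior: keys that are missing from the map are skipped, and `default_value` is returned only when no key matches.

```python
SEP = "\x1d"  # default multi-value separator

COMBINERS = {
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
    "mean": lambda values: sum(values) / len(values),
    "max": max,
    "min": min,
}


def parse_map(multi_value, separator=SEP):
    """Turn 'k1:v1<SEP>k2:v2' into a dict of strings."""
    table = {}
    for pair in multi_value.split(separator):
        key, _, value = pair.partition(":")
        table[key] = value
    return table


def lookup_combined(map_field, keys, combiner="sum", default_value=0.0):
    """Look up several keys and merge the numeric hits (need_discrete=false)."""
    table = parse_map(map_field)
    hits = [float(table[key]) for key in keys if key in table]
    if not hits:
        return default_value
    return COMBINERS[combiner](hits)


item_attr = "k1:123\x1dk2:234\x1dk3:3"
print(lookup_combined(item_attr, ["k1", "k3"]))          # 126.0
print(lookup_combined(item_attr, ["k1", "k3"], "mean"))  # 63.0
```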

match_feature

Overview

The match_feature operator transforms features by looking up values in a two-level nested map.

Configuration

Configure this operator in JSON format.

{
  "feature_name": "user__l1_ctr_1",
  "feature_type": "match_feature",
  "category": "ALL",
  "need_discrete": false,
  "item": "item:category_level1",
  "user": "user:l1_ctr_1",
  "match_type": "hit"
}

user: The data source, which is a two-level nested map encoded as a string.
- | is the separator between items in the first-level map, and ^ is the separator between the key and value in the first-level map.
- , is the separator between items in the second-level map, and : is the separator between a key and its value.
category: The primary key for the first-level map lookup.
ALL is a wildcard character that matches all key values at this level.
item: The secondary key for the second-level map lookup.
ALL is a wildcard character that matches all key values at this level.
need_discrete
- true: The operator returns a composite string of the feature name and keys. The model uses this string as the feature and ignores the matched value.
- false (default): The operator returns only the matched feature value. The model uses this value directly.
match_type
- hit: Returns a single matched feature. The operator queries the first-level map with the category value, and then queries the resulting second-level map with the item value to get a single result. For single-level matching, you can set the key in the first-level map to ALL and also set the category parameter to ALL.
- multihit: Allows the category and item fields to use the ALL wildcard, which can return multiple matched values.
normalizer: Optional. The normalization method. It has the same meaning as the normalizer parameter in raw_feature and takes effect only when need_discrete=false.
show_category: Specifies whether to prepend the category prefix to the query result. The default value is true when need_discrete=true and match_type=hit. Otherwise, the default value is false.
show_item: Specifies whether to prepend the item prefix to the query result. The default value is true when need_discrete=true and match_type=hit. Otherwise, the default value is false.
value_type: Optional. Specifies the data type of the output feature. The default value is string.
separator: Optional. Specifies the multi-value separator for the key field of the string type. The default value is "\u001D". The separator must be a single character.
default_value: Optional. Specifies the default value to use when the input feature is empty.
value_dimension: Optional. The default value is 0. This parameter can be used in offline tasks to truncate the output. If the value is 1, the schema type of the output table is value_type. Otherwise, the schema type is array<value_type>.
stub_type: Optional. The default value is false. If you set this parameter to true, the pipeline uses the configured feature transformation only as an intermediate result and does not pass it to the model.

Examples

User feature: Nested dictionary

For example, the string 50011740^50011740:0.2,36806676:0.3,122572685:0.5|50006842^16788:0.1 is converted into a two-level map as follows:

{
  "50011740": {
    "50011740": 0.2,
    "36806676": 0.3,
    "122572685": 0.5
  },
  "50006842": {
    "16788": 0.1
  }
}
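
The conversion above follows directly from the four separators (`|`, `^`, `,`, and `:`). A minimal Python sketch, with a hypothetical function name and assuming well-formed input:

```python
def parse_nested_map(encoded):
    """Decode 'cat^k:v,k:v|cat^k:v' into a two-level dict of floats."""
    nested = {}
    for group in encoded.split("|"):           # first-level items
        category, _, items = group.partition("^")
        inner = {}
        for item in items.split(","):          # second-level items
            key, _, value = item.partition(":")
            inner[key] = float(value)
        nested[category] = inner
    return nested


encoded = "50011740^50011740:0.2,36806676:0.3,122572685:0.5|50006842^16788:0.1"
print(parse_nested_map(encoded))
```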

`hit` match type

{
  "feature_name": "brand_hit",
  "feature_type": "match_feature",
  "category": "item:auction_root_category",
  "need_discrete": true,
  "item": "item:brand_id",
  "user": "user:user_brand_tags_hit",
  "match_type": "hit"
}

Assume the field values are as follows:

Parameter	Value
user_brand_tags_hit	50011740^107287172:0.2,36806676:0.3,122572685:0.5|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19
auction_root_category	50006842
brand_id	30068

When need_discrete is true, the operator first queries user_brand_tags_hit with the auction_root_category value (50006842), which returns 16788816:0.1,10122:0.2,29889:0.3,30068:19. It then queries that result with the brand_id (30068) to get the value 19. The final result is brand_hit_50006842_30068_19.
When need_discrete is false, the result is 19.0.
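
The two-step lookup described above can be traced with a short Python sketch. The function `match_hit_sketch` is hypothetical; it covers only exact `hit` matching and does not model wildcard expansion, `show_category`, `show_item`, or normalization.

```python
def match_hit_sketch(user_value, category, item, feature_name, need_discrete):
    """Query the first-level map by category, then the second-level map by item."""
    for group in user_value.split("|"):
        first_key, _, entries = group.partition("^")
        if first_key != category:
            continue
        for entry in entries.split(","):
            second_key, _, value = entry.partition(":")
            if second_key == item:
                if need_discrete:
                    return f"{feature_name}_{category}_{item}_{value}"
                return float(value)
    return None  # no match; the real operator would apply default_value


tags = ("50011740^107287172:0.2,36806676:0.3,122572685:0.5"
        "|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19")
print(match_hit_sketch(tags, "50006842", "30068", "brand_hit", True))
print(match_hit_sketch(tags, "50006842", "30068", "brand_hit", False))
```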

If you use only single-layer matching, you must change the value of category in the configuration above to ALL. Assume that the fields have the following values:

Parameter	Value
user_brand_tags_hit	ALL^16788816:40,10122:40,29889:20,30068:20
brand_id	30068

When need_discrete is true, the result is brand_hit_ALL_30068_20.
When need_discrete is false, the result is 20.0.

In this case, you can also use lookup_feature instead. The value of user_brand_tags_hit must then be in the format "16788816:40^]10122:40^]29889:20^]30068:20". '^]' is the multi-value separator, which is the non-printable character \u001d.

Because the lookup_feature operator supports complex input types like maps and arrays, it offers better performance.

overlap_feature

Overview

The overlap_feature operator calculates string matching metrics between two text inputs. For example, in search applications, you can use it to determine if a query is contained within a title.

Method	Description
query_common_ratio	Calculates the ratio of common terms between the `query` and `title` to the total number of terms in the `query`. Returns a value in the range [0, 1].
title_common_ratio	Calculates the ratio of common terms between the `query` and `title` to the total number of terms in the `title`. Returns a value in the range [0, 1].
is_contain	Checks if the `query` is fully contained within the `title` while preserving the term order. Valid values: `0`: Not contained. `1`: Contained.
is_equal	Checks if the `query` and `title` are identical. Valid values: `0`: Not identical. `1`: Identical.
index_of	Returns the starting position of the entire `query`'s first occurrence within the `title`. Returns -1.0 if the `query` is not found.
proximity_min_cover	Calculates the proximity of `query` terms in the `title`. The returned value is in the range [0, length(title)]. A value of 0 indicates that at least one term cannot be matched.
proximity_min_dist	Calculates the proximity of `query` terms in the `title` based on the minimum pairwise distance. The returned value is in the range [0, length(title) + 1]. A value of length(title) + 1 indicates that no matching term pairs were found.
proximity_max_dist	Calculates the proximity of `query` terms in the `title` based on the maximum pairwise distance. The returned value is in the range [0, length(title) + 1]. A value of length(title) + 1 indicates that no matching term pairs were found.
proximity_avg_dist	Calculates the proximity of `query` terms in the `title` based on the average pairwise distance. The returned value is in the range [0, length(title) + 1]. A value of length(title) + 1 indicates that no matching term pairs were found.

The calculation methods for these term proximity measures are based on the paper "An Exploration of Proximity Measures in Information Retrieval".

Assume that the term sequence of the title (document) is: t1,t2,t1,t3,t5,t4,t2,t3,t4

MinCover is defined as the length of the shortest document segment that covers each query term at least once.
MinDist (Minimum pair distance): Calculates the minimum of all pairwise distances. For example, if the pairwise distances are 1, 2, and 3, then MinDist = min(1, 2, 3) = 1.
MaxDist (Maximum pair distance): The opposite of MinDist. It calculates the maximum of all pairwise distances. For example, if the pairwise distances are 1, 2, and 3, then MaxDist = max(1, 2, 3) = 3.
AveDist (Average pair distance): Calculates the average of all pairwise distances. For example, if the pairwise distances are 1, 2, and 3, then AveDist = (1 + 2 + 3) / 3 = 2.

Note that all aggregate operators (MinDist, MaxDist, and AveDist) are defined based on the pairwise distances between matching query terms. When a document matches only one query term, MinDist, AveDist, and MaxDist are all defined as the length of the document.
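
The definitions above can be made concrete with a brute-force Python sketch applied to the sample term sequence. The function names are hypothetical, the query `t1, t2, t4` is chosen only for illustration, and the pairwise distance is assumed to be the smallest position gap between any occurrences of the two terms. The operator's special return values for unmatched terms (such as 0 or length(title) + 1) are not modeled.

```python
from itertools import combinations


def min_cover(doc, query):
    """Length of the shortest segment covering every matched query term."""
    needed = set(query) & set(doc)
    if not needed:
        return None
    best = None
    for start in range(len(doc)):
        seen = set()
        for end in range(start, len(doc)):
            if doc[end] in needed:
                seen.add(doc[end])
            if seen == needed:
                length = end - start + 1
                if best is None or length < best:
                    best = length
                break
    return best


def pair_distances(doc, query):
    """Closest distance between occurrences, for each pair of matched terms."""
    positions = {term: [i for i, t in enumerate(doc) if t == term]
                 for term in set(query) if term in doc}
    distances = []
    for a, b in combinations(sorted(positions), 2):
        distances.append(min(abs(i - j) for i in positions[a] for j in positions[b]))
    return distances


doc = "t1,t2,t1,t3,t5,t4,t2,t3,t4".split(",")
query = ["t1", "t2", "t4"]
dists = pair_distances(doc, query)
print(min_cover(doc, query))                           # 5
print(min(dists), max(dists), sum(dists) / len(dists))
```

For this query the pairwise distances are 1, 3, and 1, so MinDist is 1, MaxDist is 3, and AveDist is 5/3.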

Configuration

{
  "feature_type" : "overlap_feature",
  "feature_name" : "is_contain",
  "query" : "user:attr1",
  "title" : "item:attr2",
  "method" : "is_contain",
  "separator" : " ",
  "normalizer" : ""
}

Parameter	Required	Description
feature_type	Yes	The type of the feature. Must be `overlap_feature`.
feature_name	Yes	The prefix for the output feature name.
query	Yes	The source field for the `query`. This field must be a multi-value string.
title	Yes	The source field for the `title`. This field must be a multi-value string.
method	Yes	The calculation method. Valid values include `query_common_ratio`, `title_common_ratio`, `is_contain`, `is_equal`, `index_of`, and proximity measures.
separator	No	The delimiter for the input. If you do not specify a value, the default is `chr(29)`.
normalizer	No	The normalization method. This parameter has the same function as the `normalizer` parameter in the raw_feature operator.
stub_type	No	Defaults to `false`. If set to `true`, this feature is used only as an intermediate result and is not included in the final model output.

The overlap_feature operator returns a value of type float.

Example 1

Given a query of "high,high2,fiberglass,abc" and a title of "high,quality,fiberglass,tube,for,golf,bag", the operator returns the following results:

Method	Value
query_common_ratio	0.5
title_common_ratio	0.28
is_contain	0.0
is_equal	0.0
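
The two ratios in Example 1 can be recomputed with a short Python sketch. The function name is hypothetical, and it is assumed that common terms are counted as distinct terms while the totals are the plain term counts. Under these assumptions title_common_ratio is 2/7 ≈ 0.2857, which is consistent with the 0.28 shown above if that value is truncated.

```python
def common_ratios(query_terms, title_terms):
    """Return (query_common_ratio, title_common_ratio)."""
    common = set(query_terms) & set(title_terms)
    return len(common) / len(query_terms), len(common) / len(title_terms)


query_terms = "high,high2,fiberglass,abc".split(",")
title_terms = "high,quality,fiberglass,tube,for,golf,bag".split(",")

query_ratio, title_ratio = common_ratios(query_terms, title_terms)
is_equal = 1.0 if query_terms == title_terms else 0.0
print(query_ratio, title_ratio, is_equal)
```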

Example 2

With method=index_of and a title of "the cat sat on the mat", the operator returns the following results:

Query	Value
the cat	0.0
sat	2.0
the mat	4.0
cap	-1.0
gap	-1.0
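
The values in Example 2 are consistent with a term-level search for the query as a contiguous run inside the title, with positions counted from 0. A minimal sketch under that assumption, using a hypothetical function name and a space separator:

```python
def index_of_sketch(query_terms, title_terms):
    """Term index of the first contiguous occurrence of the query, or -1.0."""
    n, m = len(query_terms), len(title_terms)
    if n == 0:
        return -1.0
    for start in range(m - n + 1):
        if title_terms[start:start + n] == query_terms:
            return float(start)
    return -1.0


title = "the cat sat on the mat".split(" ")
for query in ["the cat", "sat", "the mat", "cap"]:
    print(query, index_of_sketch(query.split(" "), title))
```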

sequence_feature

Overview

A user's behavior history is a critical feature. This history is typically represented as a Sequence, such as a click Sequence or purchase Sequence. The entities that form a Sequence can be the items themselves or their properties.

How to configure

For example, to process a user's click Sequence with a length of 50, you can extract the item_id, price, and ts features for each item in the Sequence. In this case, ts is calculated as request_time - event_time. The following example shows the configuration:

{
  "sequence_name": "click_50_seq",
  "sequence_length": 50,
  "sequence_delim": ";",
  "sequence_pk": "user:click_50_seq",
  "features": [
    {
        "feature_name": "item_id",
        "feature_type": "id_feature",
        "value_type": "string",
        "expression": "item:item_id"
    },
    {
        "feature_name": "price",
        "feature_type": "raw_feature",
        "expression": "item:price"
    },
    {
        "feature_name": "ts",
        "feature_type": "raw_feature",
        "expression": "user:ts"
    },
    {
      "feature_name": "time_diff_seq",
      "feature_type": "custom_feature",
      "operator_name": "SeqExpr",
      "operator_lib_file": "3rdparty/lib64/libseq_expr.so",
      "expression": ["user:cur_time", "user:clk_time_seq"],
      "formula": "cur_time - clk_time_seq",
      "sequence_fields": ["clk_time_seq"],
      "default_value": "0",
      "value_type": "double",
      "is_op_thread_safe": false,
      "value_dimension": 1
    }
  ]
}

sequence_name: The name of the Sequence.
sequence_length: The maximum length of the Sequence.
sequence_delim: The separator between elements in the Sequence.
sequence_pk: The sequence primary key. For example, user:click_50_seq stores the 50 most recent item IDs that a user clicked. The Model Inference Service uses this field as a key to query side info.
- The request parameters for the Online Inference Service (EAS Processor) must include a feature whose key is the value of sequence_pk.
  - For example: click_50_seq: 5410233389955966;1832586 (the separator is the value of the sequence_delim configuration)
    - In the example above, the value of the click_50_seq feature is 5410233389955966;1832586.
- Item-side sub-features of the Sequence are not required in the request to the Model Inference Service.
  - The Model Inference Service uses the sequence_pk field as a key to query the item's side info.
  - For example, in this configuration, the item_id, price features in the sequence feature are not passed to the inference service in the request. Instead, the Processor uses the fg SDK to retrieve and concatenate these features from its item cache. This ensures that the data format is consistent with the format used during offline training.
- User-side sub-features of the Sequence are required in the request to the Model Inference Service.
  - The feature name is ${sequence_name}__${input_name}, for example: click_50_seq__ts.
  - ${input_name} is typically configured with the expression option, but this may vary for different sub-feature types. ${input_name} does not include an input domain prefix, such as item: or user:.
features: The side info of a sequence, including information such as the static attribute values of an item and behavioral time information.
- sequence_fields: Specifies the field name of the input sequence. The value is a string or a [string] array.
  - When the feature operator has only one input field, the content of that field must be a sequence. In this case, you do not need to configure sequence_fields.
  - If a feature operator has multiple input fields and you do not configure sequence_fields, all item-side features (such as item:XXX) are assumed to be sequence input fields.
- The input table for offline training must contain all columns corresponding to the sub-features.
  - When a column is a sequence (refer to the rules for sequence_fields), it is named ${sequence_name}__${input_name}.
    - For example, in this sample configuration, the offline table requires four columns: click_50_seq__item_id, click_50_seq__price, click_50_seq__ts, and click_50_seq__clk_time_seq.
    - The recommended type for a column in an offline table is the array type for better performance. The string type that uses sequence_delim as an element separator is also supported.
  - When the column is not a sequence, it is named ${input_name} without a prefix.
    - For example, in this configuration, the offline table requires one non-sequence column: cur_time.
  - You can use the global configuration input_alias to set a shorter alias for a long column name (see the example below).
- Supports binning operations. For the configuration method, see Feature Binning (Discretization). When binning is configured, the output element type is int64, and the shape is determined by the value_dimension configuration.
- value_dimension (also abbreviated as value_dim): Specifies the dimension of each element in the Sequence. For a sequence_raw_feature, the output type is array<float> when this parameter is set to 1, and array<array<float>> for other values. For a sequence_id_feature, the output type is array<string> when this parameter is set to 1, and array<array<string>> for other values. The default value is 0.
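
The column naming rules above can be sketched as a small helper. This is only an illustration: the function `offline_column_name` and its arguments are hypothetical names, not part of the fg SDK, and whether an input counts as a sequence is passed in explicitly rather than derived from sequence_fields.

```python
def offline_column_name(sequence_name: str, expression: str, is_sequence: bool) -> str:
    """Return the offline-table column name for one input of a sequence sub-feature."""
    # Drop the input domain prefix ("user:", "item:", "context:") if present.
    input_name = expression.split(":", 1)[-1]
    # Sequence inputs are prefixed with the sequence name; other inputs are not.
    return f"{sequence_name}__{input_name}" if is_sequence else input_name


columns = [
    offline_column_name("click_50_seq", "item:item_id", True),
    offline_column_name("click_50_seq", "user:ts", True),
    offline_column_name("click_50_seq", "user:cur_time", False),
]
print(columns)  # ['click_50_seq__item_id', 'click_50_seq__ts', 'cur_time']
```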

You can configure any feature as a sub-feature of a Sequence Feature. The following example shows the configuration:

{
  "features": [
    {
      "sequence_name": "common_seq",
      "sequence_length": 50,
      "sequence_delim": ";",
      "sequence_pk": "user:click_50_seq",
      "features": [
        {
          "feature_name": "item_id",
          "feature_type": "id_feature",
          "value_type": "String",
          "expression": "item:item_id",
          "value_dimension": 1
        },
        {
          "feature_name": "price",
          "feature_type": "raw_feature",
          "expression": "item:price"
        },
        {
          "feature_name": "ts",
          "feature_type": "raw_feature",
          "expression": "user:ts"
        },
        {
          "feature_name": "expr_feat",
          "feature_type": "expr_feature",
          "expression": "a > b",
          "variables": ["item:a", "item:b"],
          "sequence_fields": "a",
          "default_value": "0",
          "value_dimension": 1
        },
        {
          "feature_name": "lookup_feat",
          "feature_type": "lookup_feature",
          "map": "user:dict",
          "key": "item:prop",
          "separator": ",",
          "default_value": "0",
          "value_type": "float",
          "combiner": "sum",
          "boundaries": [0.0, 0.15, 0.5]
        },
        {
          "feature_name": "match_feat",
          "feature_type": "match_feature",
          "user": "user:nested_dict",
          "category": "item:pkey",
          "item": "item:skey",
          "separator": "\u001D",
          "default_value": "0",
          "matchType": "hit",
          "value_type": "float",
          "value_dimension": 1
        },
        {
          "feature_name": "bm25_score",
          "feature_type": "bm25_feature",
          "separator": " ",
          "default_value": "0",
          "query": "user:query",
          "document": "item:document",
          "sequence_fields": "query",
          "document_number": 100,
          "avg_doc_length": 6,
          "term_doc_freq_dict": {
            "this": 30,
            "example": 10,
            "document": 15
          }
        },
        {
          "feature_name": "overlap_feat",
          "feature_type": "overlap_feature",
          "query": "user:query2",
          "title": "item:title2",
          "sequence_fields": "query2",
          "method": "index_of",
          "separator": " ",
          "default_value": "-1"
        },
        {
          "feature_type": "kv_dot_product",
          "feature_name": "query_doc_sim",
          "query": "user:query3",
          "document": "item:title",
          "sequence_fields": "query3",
          "separator": "|",
          "default_value": "0"
        },
        {
          "feature_name": "seg_feat",
          "feature_type": "tokenize_feature",
          "expression": "input_a",
          "default_value": "0",
          "output_type": "word",
          "tokenizer_type": "sentencepiece",
          "vocab_file": "spmodel.model"
        },
        {
          "feature_name": "txt_norm",
          "feature_type": "text_normalizer",
          "expression": "input",
          "default_value": "",
          "parameter": 28
        },
        {
          "feature_name": "seq_combo_feat",
          "feature_type": "combo_feature",
          "expression": ["user:tags", "item:cat"],
          "sequence_fields": ["tags"],
          "separator": "_",
          "default_value": "0",
          "value_dimension": 1
        },
        {
          "feature_name": "norm_str",
          "feature_type": "str_replace_feature",
          "expression": ["user:profile"],
          "default_value": "",
          "replace_file": "synonyms.txt",
          "replacements": {
            "|": "",
            "aa": "x",
            "a": "X"
          },
          "value_dimension": 1
        },
        {
          "feature_name": "query_tokens",
          "feature_type": "regex_replace_feature",
          "expression": ["user:query_tokens"],
          "default_value": "",
          "value_type": "string",
          "regex_pattern": [ "\\|", "#", "\\(.*\\)" ],
          "replacement": "",
          "value_dimension": 1
        },
        {
          "feature_name": "slice",
          "feature_type": "slice_feature",
          "value_type": "int32",
          "expression": ["context:array"],
          "slice": "0:3",
          "value_dimension": 3,
          "num_buckets": 100000
        },
        {
          "feature_name": "mask_feature",
          "feature_type": "bool_mask_feature",
          "value_type": "float",
          "expression": [
            "user:click_items",
            "item:is_valid"
          ]
        },
        {
          "feature_name": "time_diff_seq",
          "feature_type": "custom_feature",
          "operator_name": "SeqExpr",
          "operator_lib_file": "3rdparty/lib64/libseq_expr.so",
          "expression": ["user:cur_time", "user:clk_time_seq"],
          "formula": "cur_time - clk_time_seq",
          "sequence_fields": ["clk_time_seq"],
          "default_value": "0",
          "value_type": "double",
          "is_op_thread_safe": false,
          "value_dimension": 1
        }
      ]
    }
  ],
  "input_alias": {
    "common_seq__clk_time_seq": "clk_time_seq"
  }
}

Note: The input_alias parameter is used to configure an alias for an input field in the format "origin_field": "alias_field". This allows you to replace the original input field name with a shorter one.

Flattened configuration

Generally, you can create the sequence version by adding the sequence_ prefix to a non-sequence feature type (feature_type). Note that you must generally configure a default_value for sequence features.

Examples:

sequence_id_feature: The output value is of the string type. If you need a different type, use slice_feature instead.
sequence_raw_feature: The output value type is float. If you need other types, use slice_feature instead.
sequence_match_feature
sequence_lookup_feature
sequence_expr_feature
sequence_combo_feature
sequence_overlap_feature
sequence_bm25_feature
sequence_kv_dot_product
sequence_text_normalizer
sequence_tokenize_feature
sequence_combine_feature: This Feature Operator only has a Sequence version.

Special case 1: Some feature transformation types have both Sequence and non-sequence versions.

You can activate the corresponding version by configuring is_sequence: true/false.

In this case, you do not need to add the sequence_ prefix to the feature_type parameter.

Examples:

str_replace_feature
regex_replace_feature
custom_feature

Special case 2: Some feature transformation types only have a Sequence version.

In this case, the feature_type parameter does not require the sequence_ prefix.

Examples:

slice_feature
bool_mask_feature

For these two special cases, you can add the following optional parameters:

sequence_length: The maximum length of the Sequence. Any excess elements are truncated. The default value is -1, which indicates no truncation.
sequence_delim: The separator between sequence elements. The default value is ;.

The following example shows the configuration:

{
  "feature_name": "clk_seq__item_id",
  "feature_type": "sequence_id_feature",
  "sequence_name": "clk_seq",
  "sequence_length": 50,
  "sequence_delim": ";",
  "expression": "item:clk_item_seq",
  "separator": "\u001D",
  "default_value": ""
},
{
  "feature_name": "clk_seq__item_price",
  "feature_type": "sequence_raw_feature",
  "sequence_name": "clk_seq",
  "sequence_length": 50,
  "sequence_delim": ";",
  "expression": "item:clk_item_prices",
  "separator": "\u001D",
  "default_value": "0"
},
{
  "feature_name": "test",
  "feature_type": "sequence_lookup_feature",
  "map": "user:prefer_tags",
  "key": "item:tags",
  "sequence_length": 2,
  "separator": ",",
  "default_value": "-1024",
  "value_type": "int32",
  "normalizer": "method=expression,expr=x+1",
  "combiner": "sum",
  "default_bucketize_value": 50,
  "num_buckets": 10000
},
{
  "feature_name": "test",
  "feature_type": "sequence_combo_feature",
  "separator": "_",
  "default_value": "0",
  "expression": ["user:f1", "item:f2"],
  "hash_bucket_size": 10000
}

In the example above, the input fields clk_item_seq and clk_item_prices must be a Sequence. This can be an array or a string whose elements are separated by the character specified by sequence_delim.

With this configuration, the Online Inference Service does not query side info. You must provide the complete input in the request.
The input field names for sequence features in a flat format remain the same as configured and are not prefixed with ${sequence_name}__.
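
As an illustration of how a string-encoded sequence input is structured, the following sketch splits a value first by sequence_delim and then by separator. The function `parse_sequence` is a hypothetical helper, not the operator's implementation, and it assumes that truncation keeps the first sequence_length elements, which the text above does not specify.

```python
from typing import List


def parse_sequence(raw: str, sequence_delim: str = ";", separator: str = "\u001D",
                   sequence_length: int = -1) -> List[List[str]]:
    """Split a string-encoded sequence into elements, then each element into values."""
    if not raw:
        return []
    elements = raw.split(sequence_delim)
    # A negative sequence_length means no truncation.
    # Assumption: truncation keeps the leading elements.
    if sequence_length >= 0:
        elements = elements[:sequence_length]
    return [element.split(separator) for element in elements]


parsed = parse_sequence("101;102\u001D103;104", sequence_length=2)
print(parsed)  # [['101'], ['102', '103']]
```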

Online feature generation

You can obtain behavior side info in two ways. The first way is to retrieve it from the item cache of the EasyRec Processor, using the field specified in sequence_pk as the primary key to look up item properties. The second way is to provide the corresponding field values in the request. For example, the "ts" field in the preceding configuration is calculated as request_time - event_time (the recommendation request time minus the user behavior time). Because this value changes with the request time, it must be obtained from the request.

user_features {
  key: "click_50_seq"
  value {
    string_feature: "9008721;34926279;22487529;73379;840804;911247;31999202;7421440;4911004;40866551"
  }
}

user_features {
  key: "click_50_seq__ts"
  value {
    string_feature: "23;113;401363;401369;401375;401405;486678;486803;486922;486969"
  }
}
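
A minimal sketch of how a caller might build the request-side "ts" value, assuming integer timestamps in the same unit and ";" as sequence_delim. The function `build_ts_feature` and the sample timestamps are hypothetical.

```python
from typing import Iterable


def build_ts_feature(request_time: int, event_times: Iterable[int],
                     sequence_delim: str = ";") -> str:
    """Encode request_time - event_time for each behavior as a delimited string."""
    return sequence_delim.join(str(request_time - t) for t in event_times)


ts_value = build_ts_feature(1000, [977, 887, 600])
print(ts_value)  # 23;113;400
```

The order of the differences must match the order of the item IDs in the sequence_pk field so that each element lines up with its side info.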

sequence_combine_feature

Introduction

The sequence_combine_feature operator combines the multiple values for each element in a sequence feature. It transforms a multi-value sequence into a single-value sequence by aggregating the multiple values of each element into a single value using a specified combiner.

Key capabilities

Multi-value combination: Combines the multiple values of each element in a sequence into a single value.
Flexible combination strategies: Supports multiple combination strategies, including sum, mean, max, min, and count.
Value Map: Supports a value map to convert string identifiers to numeric values, which is useful for processing behavioral event sequences.
Dual separator support: Supports separate configurations for the sequence delimiter and the multi-value separator.

Configuration

Basic configuration (numeric combination)

{
  "feature_name": "seq_combine_feat",
  "feature_type": "sequence_combine_feature",
  "expression": "user:behavior_seq",
  "combiner": "sum",
  "separator": "|",
  "sequence_delim": ";"
}

Configuration with Value Map (Behavioral Events)

{
  "feature_name": "behavior_score",
  "feature_type": "sequence_combine_feature",
  "expression": "user:action_events",
  "combiner": "sum",
  "separator": "|",
  "sequence_delim": ";",
  "value_map": {
    "expo": 1,
    "click": 2,
    "buy": 4
  }
}

The value map is applied first, followed by the combine operation.

Parameters

Parameter	Required	Description
feature_name	Yes	The name of the output feature.
feature_type	Yes	Specifies the feature type. Must be set to `sequence_combine_feature`.
expression	Yes	The source of the input feature.
combiner	No	The combination strategy. Possible values: `sum`, `mean`, `max`, `min`, and `count`. Default: `sum`.
value_map	No	A map for converting strings to numeric values. The value map is applied first, followed by the combine operation.
separator	No	The multi-value separator. Default: `\u001D`. Only a single character is supported.
sequence_delim	No	The sequence delimiter for string inputs. This parameter is not required for array inputs and defaults to an empty string. Only a single character is supported.
default_value	No	The default value to use when the input is empty.
stub_type	No	Default: `false`. When set to `true`, the feature is used only as an intermediate result and is not output to the model.

Examples

Example 1: Basic numeric combination (sum)

Configuration:

{
  "feature_name": "score_sum",
  "feature_type": "sequence_combine_feature",
  "expression": "user:scores",
  "combiner": "sum",
  "separator": ",",
  "sequence_delim": ";"
}

Input and output:

Input	Output	Description
`"1,2,3;4,5;6"`	`[6, 9, 6]`	The operator calculates `1+2+3=6`, `4+5=9`, and `6=6`.
`"10;20,30"`	`[10, 50]`	The operator calculates `10=10` and `20+30=50`.
`["1,2,3", "4,5", "6"]`	`[6, 9, 6]`	The input is an array of strings.
`[[1,2,3], [4,5], [6]]`	`[6, 9, 6]`	The input is an array of arrays.

Example 2: Behavioral Event Sequence (with Value Map)

Configuration:

{
  "feature_name": "behavior_weight",
  "feature_type": "sequence_combine_feature",
  "expression": "user:actions",
  "combiner": "sum",
  "separator": "|",
  "sequence_delim": ";",
  "value_map": {
    "expo": 1,
    "click": 2,
    "buy": 4
  }
}

Input and output:

Input	Output	Description
`"expo|click|buy"`	`[7]`	The operator calculates `1+2+4=7`.
`"click"`	`[2]`	The mapped value is `2`.
`"expo|click"`	`[3]`	The operator calculates `1+2=3`.
`"expo|click|buy;expo;click"`	`[7, 1, 2]`	The input string contains multiple records separated by `;`.
`["expo|click", "expo", "click|buy"]`	`[3, 1, 6]`	The input array contains multiple records.
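
The behavior shown in the two example tables can be approximated with the following sketch. It is not the operator's implementation: the function `sequence_combine` and the `COMBINERS` table are hypothetical, outputs are plain Python floats, and default_value handling for empty input is omitted.

```python
COMBINERS = {
    "sum": sum,
    "mean": lambda values: sum(values) / len(values),
    "max": max,
    "min": min,
    "count": len,
}


def sequence_combine(value, combiner="sum", separator="\u001D",
                     sequence_delim=";", value_map=None):
    """Combine the multiple values of each sequence element into one value."""
    # String input is split into elements by sequence_delim; arrays are used as is.
    elements = value.split(sequence_delim) if isinstance(value, str) else list(value)
    result = []
    for element in elements:
        # Each element is either a separator-joined string or an array of values.
        parts = element.split(separator) if isinstance(element, str) else list(element)
        # The value map is applied first, then the combiner.
        numbers = [float(value_map[p]) if value_map and p in value_map else float(p)
                   for p in parts]
        result.append(COMBINERS[combiner](numbers))
    return result


weights = {"expo": 1, "click": 2, "buy": 4}
print(sequence_combine("1,2,3;4,5;6", separator=","))  # [6.0, 9.0, 6.0]
print(sequence_combine("expo|click|buy;expo;click", separator="|", value_map=weights))  # [7.0, 1.0, 2.0]
```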

tokenize_feature

Overview

The tokenize_feature operator tokenizes an input string. It returns either the tokenized string or the corresponding token IDs. This operator supports tokenizer.json files from the tokenizers-cpp library.

For more information about the vocabulary file format, see these resources:

1. https://github.com/huggingface/tokenizers

2. https://github.com/mlc-ai/tokenizers-cpp

Configuration

{
    "feature_name": "title_token",
    "feature_type": "tokenize_feature",
    "expression": "item:title",
    "default_value": "",
    "vocab_file": "tokenizer.json",
    "tokenizer_type": "sentencepiece",
    "output_type": "word_id",
    "output_delim": ","
}

Parameter	Required	Description
feature_name	Yes	The unique name for the output feature.
expression	Yes	Specifies the source field that the feature depends on. The source must be user, item, or context.
vocab_file	Yes	The path to the vocabulary file.
default_value	No	The default value for the input string.
tokenizer_type	No	The tokenizer type. Set this to 'sentencepiece' to use the SentencePiece tokenizer. If unspecified, the system determines the appropriate Hugging Face tokenizer based on the 'vocab_file' content.
output_type	No	`word_id`: Outputs the token IDs. `word`: Outputs the tokenized string.
output_delim	No	The separator for the `word_id` or `word` output. This parameter applies only to offline tasks.
stub_type	No	Defaults to `false`. If set to `true`, the feature transform acts only as an intermediate result in the pipeline and is not output to the model.

Example

When output_type is word_id, the operator converts an input string into a comma-separated string of token IDs.

Type	item:title	Output feature
string	It is good today!	1147,310,1175,3063,2

Vocabulary file examples

File name	Tokenizer type	Download link
bert-base-chinese-vocab.json	WordPiece	Download link
tokenizer.json	BPE	Download link
spiece.model	sentencepiece	Download link

text_normalizer

Overview

The text_normalizer operator performs text normalization, including case conversion, Traditional-to-Simplified Chinese conversion, full-width to half-width character conversion, special character filtering, GBK and UTF-8 encoding conversion, and Chinese character splitting.

Configuration

{
    "feature_name": "txt_norm",
    "feature_type": "text_normalizer",
    "expression": "item:title",
    "stop_char_file": "stop_char.txt",
    "max_length": 256,
    "parameter": 0,
    "remove_space": false,
    "is_gbk_input": false,
    "is_gbk_output": false
}

Parameter	Required	Description
feature_name	Yes	The feature name.
expression	Yes	The source field that the feature depends on. The source must be `user`, `item`, or `context`.
stop_char_file	No	Specifies the path to a file of special characters to remove. If omitted, the system uses its built-in list.
max_length	No	If the input text length exceeds this value, the operator skips normalization and returns the original text.
remove_space	No	Specifies whether to remove spaces.
is_gbk_input	No	Specifies whether the input is GBK-encoded. If false, the operator assumes the input is UTF-8.
is_gbk_output	No	Specifies whether the output is GBK-encoded. If false, the operator encodes the output as UTF-8.
parameter	No	Text normalization options.
default_value	No	The default value to use when the input feature is empty.

Note:

The stop_char_file must use GBK encoding.
Each line in the stop_char_file must contain only one character to ensure successful filtering.

Text normalization options

To configure the parameter field, sum the numeric values of the desired options from the list below.

For example, to convert uppercase to lowercase, full-width to half-width, Traditional to Simplified Chinese, and filter special characters, set parameter = 4 + 8 + 16 + 32 = 60.

The default value for the parameter is 60.

#define __NORMALIZED_LOWER2UPPER__ 		2 			/* Convert lowercase to uppercase. */
#define __NORMALIZED_UPPER2LOWER__ 		4 			/* Convert uppercase to lowercase. */
#define __NORMALIZED_SBC2DBC__ 			8 			/* Convert full-width to half-width characters. */
#define __NORMALIZED_BIG52GBK__			16 			/* Convert Traditional Chinese to Simplified Chinese. */
#define __NORMALIZED_FILTER__ 			32 			/* Filter special characters. */
#define __NORMALIZED_SPLITCHARS__		512 		/* Split Chinese characters into single characters, separated by spaces. */
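
Because each option value is a distinct power of two, summing options is equivalent to setting bits in a mask. The sketch below decodes a parameter value into option names under that assumption. The dictionary `NORMALIZE_FLAGS` and the function `decode_parameter` are hypothetical helpers for checking a configuration, not part of the operator.

```python
NORMALIZE_FLAGS = {
    2: "LOWER2UPPER",    # lowercase to uppercase
    4: "UPPER2LOWER",    # uppercase to lowercase
    8: "SBC2DBC",        # full-width to half-width
    16: "BIG52GBK",      # Traditional to Simplified Chinese
    32: "FILTER",        # filter special characters
    512: "SPLITCHARS",   # split Chinese characters
}


def decode_parameter(parameter: int) -> list:
    """List the option names whose bit is set in the parameter value."""
    return [name for bit, name in NORMALIZE_FLAGS.items() if parameter & bit]


print(decode_parameter(60))  # ['UPPER2LOWER', 'SBC2DBC', 'BIG52GBK', 'FILTER']
print(decode_parameter(28))  # ['UPPER2LOWER', 'SBC2DBC', 'BIG52GBK']
```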

Example

{
  "feature_name": "txt_norm",
  "feature_type": "text_normalizer",
  "expression": "input_a",
  "parameter": 28
}

Input: ["正則生成代碼", "Html過濾工具", "正則表達式語法速查", "The Cat／"]
Output: ["正则生成代码", "html过滤工具", "正则表达式语法速查", "the cat/"]

bm25_feature

Features

The BM25 (Best Matching) algorithm is a mainstream text matching algorithm in information retrieval, typically used for search relevance scoring. It first parses a query into terms $q_{i}$ . Then, for each search result D, it calculates the relevance score of each term $q_{i}$ for D. Finally, it calculates the final relevance score of the query for D as a weighted sum of the relevance scores for each term $q_{i}$ .

For Chinese, query tokenization serves as morpheme analysis: each word (term) is treated as a morpheme $q_{i}$.

The general formula for the BM25 algorithm is:

$score(Q, d) = \sum_{i=1}^{n} w_{i} \cdot R(q_{i}, d)$

In this formula, $Q$ represents a query, $q_{i}$ is the $i$-th term in the query, $d$ is a document, $w_{i}$ is the weight of $q_{i}$, and $R(q_{i}, d)$ is the relevance score of $q_{i}$ to document $d$.

Term importance

There are several methods for weighting a term's relevance to a document. A common method is Inverse Document Frequency (IDF). The formula is:

$IDF(q_{i}) = \log \frac{N - n(q_{i}) + 0.5}{n(q_{i}) + 0.5}$

Where $N$ is the total number of documents in the corpus, and $n(q_{i})$ is the number of documents containing the term $q_{i}$.

The definition of IDF shows that, for a given document collection, the more documents that contain $q_{i}$, the lower the weight of $q_{i}$. In other words, if many documents contain $q_{i}$, it has little power to distinguish between documents, so it matters less when determining relevance.

Term relevance

The relevance score between a term $q_{i}$ and a document $d$ , denoted as $R (q_{i}, d)$ , has the following general form in the BM25 algorithm:

$R (q_{i}, d) = \frac{f _{i} \cdot ( k _{1} + 1 )}{f _{i} + K} \cdot \frac{q f _{i} \cdot ( k _{2} + 1 )}{q f _{i} + k _{2}}$

$K = k_{1} \cdot (1 - b + b \cdot \frac{dl}{avgdl})$

In this formula, $k_{1}$, $k_{2}$, and $b$ are adjustment factors that are set based on experience. Typical values are $k_{1} = 1.2$, $b = 0.75$, and $k_{2} = 0$. $f_{i}$ is the frequency of $q_{i}$ in document $d$, and $qf_{i}$ is the frequency of $q_{i}$ in the query. $dl$ is the length of document $d$, and $avgdl$ is the average length of all documents. Because $q_{i}$ appears only once in the query in most cases, $qf_{i} = 1$ and the second factor equals 1, so the formula can be simplified to:

$R(q_{i}, d) = \frac{f_{i} \cdot (k_{1} + 1)}{f_{i} + K}$

The definition of $K$ shows that the parameter $b$ adjusts the impact of document length on relevance: the larger the value of $b$, the greater the impact of document length on the relevance score, and vice versa. The longer a document is relative to the average length, the larger the value of $K$ and the lower the relevance score. A longer document is more likely to contain $q_{i}$, so for the same $f_{i}$ value, a long document is less relevant to $q_{i}$ than a short one.

In summary, the relevance score formula for the BM25 algorithm is as follows:

$score(Q, d) = \sum_{i=1}^{n} IDF(q_{i}) \cdot \frac{f_{i} \cdot (k_{1} + 1)}{f_{i} + k_{1} \cdot (1 - b + b \cdot \frac{dl}{avgdl})}$

The BM25 formula provides significant flexibility in algorithm design, allowing for various methods of calculating search relevance scores based on different approaches to tokenization, term weighting, and term-document relevance.
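
The summary formula can be written out as a short sketch. It assumes that the query and document are already tokenized, that document length is the number of tokens, and that a term missing from the frequency table has a document count of 0. The function `bm25_score` and its argument names are hypothetical and only mirror the configuration parameters by convention, so the operator's actual handling of these cases may differ.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, term_doc_freq, document_number,
               avg_doc_length, k1=1.2, b=0.75):
    """Score a tokenized document against a tokenized query with simplified BM25."""
    term_freq = Counter(doc_terms)
    # K grows with the document length relative to the average length.
    K = k1 * (1.0 - b + b * len(doc_terms) / avg_doc_length)
    score = 0.0
    for term in query_terms:
        f = term_freq.get(term, 0)
        if f == 0:
            continue  # a term absent from the document contributes nothing
        n = term_doc_freq.get(term, 0)  # assumption: unknown terms count as 0 documents
        idf = math.log((document_number - n + 0.5) / (n + 0.5))
        score += idf * f * (k1 + 1.0) / (f + K)
    return score


doc_freq = {"this": 30, "example": 10, "document": 15}
doc = ["this", "is", "an", "example", "document", "example"]
print(bm25_score(["this", "example"], doc, doc_freq, document_number=100, avg_doc_length=6))
```

Note that a term appearing in more than half of the documents yields a negative IDF under this formula; some implementations clamp or shift the value, which this sketch does not do.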

Configuration

{
  "feature_type": "bm25_feature",
  "feature_name": "query_doc_relevance",
  "query": "user:query",
  "document": "item:title",
  "term_doc_freq_file": "term_doc_freq.txt",
  "avg_doc_length": 100.0,
  "k1": 1.2,
  "b": 0.75,
  "separator": "\u001D",
  "default_value": ""
}

Parameter	Required	Description
feature_name	Yes	The name of the output feature.
query	Yes	The source field for the query.
document	Yes	The source field for the document.
term_doc_freq_file	No	The file path to the term document frequency data. The file contains one term and its document count per line, in the format `termdocument_count`.
term_doc_freq_dict	No	An alternative to `term_doc_freq_file`, provided as a dictionary where each key is a term and its value is the document count.
k1	No	A parameter of the BM25 algorithm, typically between 1.2 and 2.0. Default: 1.2.
b	No	A parameter of the BM25 algorithm. Default: 0.75.
separator	No	A single-character separator for multi-valued input features. Default: `\u001D`.
normalizer	No	The normalization method. For details, see the raw_feature configuration.
default_value	No	The value to use when the input feature is empty.
stub_type	No	Default: false. If `true`, the system treats this feature transformation as an intermediate result and excludes it from the final model.

The term_doc_freq_file and term_doc_freq_dict parameters are mutually exclusive. If both are specified, term_doc_freq_file takes precedence.
When using this feature in an online service, place the term_doc_freq_file in the same directory as fg.json.

kv_dot_product

Overview

Computes the dot product of two key-value vectors or the size of the intersection of two sets.

Configuration

{
  "feature_type": "kv_dot_product",
  "feature_name": "query_doc_sim",
  "query": "user:query",
  "document": "item:title",
  "separator": "|",
  "default_value": "0"
}

Parameter	Required	Description
feature_name	Yes	The name of the output feature.
query	Yes	The source of the query field.
document	Yes	The source of the document field.
separator	No	The separator for multi-value input features. The default is `\u001D`. This must be a single character.
kv_delimiter	No	The separator between key-value pairs in the input feature. The default is `:`. This must be a single character.
normalizer	No	Specifies the normalization method. For details, see the configuration of the raw_feature operator.
default_value	No	Specifies the value to use if an input feature is empty.
stub_type	No	Defaults to `false`. If `true`, this feature transformation is used only as an intermediate result and is not output to the model.

This operator supports complex input types such as arrays and maps. Use complex types for optimal performance.
If an input entry does not have a value part, its value defaults to 1.0. This behavior can be used to calculate the size of the intersection between two sets.
If you do not configure default_value, the default value is set to 0.

Example

Query	Document	Output
"a:0.5|b:0.5"	"d:0.5|b:0.5"	0.25
["a:0.5", "b:0.5"]	["d:0.5", "b:0.5"]	0.25
{"a":0.5, "b":0.5}	{"d":0.5, "b":0.5}	0.25
["a:0.5", "b:0.5"]	{"d":0.5, "b":0.5}	0.25
["a", "b", "c"]	["a", "b", "d"]	2.0
["a", "b", "c"]	"a|b|d"	2.0
["a", "b", "c"]	{"a":0.5, "b":0.5}	1.0
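
The rows above can be reproduced with the following sketch, which normalizes each input (string, array, or map) into a key-to-weight dictionary and then sums the products over shared keys. The helpers `to_kv` and `kv_dot_product` are hypothetical and do not cover default_value or normalizer handling.

```python
def to_kv(value, separator="\u001D", kv_delimiter=":"):
    """Normalize a string, list, or dict input into a {key: weight} dict."""
    if isinstance(value, dict):
        return {str(k): float(v) for k, v in value.items()}
    entries = value.split(separator) if isinstance(value, str) else list(value)
    result = {}
    for entry in entries:
        if not entry:
            continue  # skip empty entries, for example from an empty string
        key, found, weight = entry.partition(kv_delimiter)
        # An entry without a value part gets weight 1.0.
        result[key] = float(weight) if found else 1.0
    return result


def kv_dot_product(query, document, separator="\u001D", kv_delimiter=":"):
    """Sum the products of weights over keys present in both inputs."""
    q = to_kv(query, separator, kv_delimiter)
    d = to_kv(document, separator, kv_delimiter)
    return float(sum(weight * d[key] for key, weight in q.items() if key in d))


print(kv_dot_product("a:0.5|b:0.5", "d:0.5|b:0.5", separator="|"))  # 0.25
print(kv_dot_product(["a", "b", "c"], ["a", "b", "d"]))              # 2.0
```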

str_replace_feature

Overview

The str_replace_feature operator replaces all matched substrings in an input string with their specified replacements.

Note: Overlapping matches are replaced greedily.

Configuration

{
  "feature_name": "norm_str",
  "feature_type": "str_replace_feature",
  "expression": ["user:query"],
  "default_value": "",
  "replacements": {
    "brown": "box",
    "dogs": "jugs",
    "fox": "with",
    "jumped": "five",
    "over": "dozen",
    "quick": "my",
    "the": "pack",
    "the lazy": "liquor",
    "|": "",
    "aa": "x",
    "a": "X"
  },
  "value_dimension": 1
}

Parameter	Description
feature_name	Required. Specifies the name of the output feature.
expression	Required. Specifies the source field that the feature depends on.
default_value	Optional. The default value for an empty input.
replacements	Optional. Required if `replace_file` is not set. A dictionary that maps original text to replacement text.
replace_file	Optional. This parameter is required if `replacements` is not set. The value is a dictionary file where each line contains an `original text \t replacement text` pair separated by a tab character (`\t`).
is_sequence	Optional. Specifies whether the input is a sequence feature. The default value is `false`.
sequence_length	Optional. Specifies the maximum length of the sequence. The operator truncates sequences that exceed this length.
sequence_delim	Optional. Specifies the delimiter for sequence elements. This parameter applies only to string inputs.
separator	Optional. This parameter applies only when `is_sequence=true`. It specifies the single-character separator for multi-value inputs. The default value is `\u001D`.
value_dimension	Optional. Specifies the dimension of the output feature. In offline tasks, this parameter is used to truncate the output. The default value is `0`.
stub_type	Optional. When set to `true`, the operator uses the configured feature transformation only as an intermediate result in the pipeline and does not output it to the model. The default value is `false`.

You can configure both replace_file and replacements. Their replacement dictionaries are merged, and replacements has a higher priority.
This operator supports binning operations. For more information, see the Feature Binning (Discretization) documentation.
- hash_bucket_size: Hashes the feature transformation result and performs a modulo operation.
- vocab_list: Bins the input based on a vocabulary and maps the input to an index in the vocabulary.
- vocab_dict: The binning result is the value in vocab_dict that corresponds to the feature value.
- vocab_file: Reads the vocab_list or vocab_dict from a file.
This operator supports multi-value array inputs.

Example

The following table shows the execution results of the preceding configuration.

user:query	Output feature
the quick brown fox jumped over the lazy dogs	pack my box with five dozen liquor jugs
aaa	xX
Feature|Generation|Tool|is|very|useful	FeatureGenerationToolisveryuseful

regex_replace_feature

Overview

The regex_replace_feature operator is a feature transformation that replaces substrings matching a regular expression with a specified replacement string.

You can configure multiple patterns. Substrings that match any of the specified patterns are replaced.

Configuration

{
  "feature_name": "query",
  "feature_type": "regex_replace_feature",
  "expression": ["user:query"],
  "regex_pattern": "\\|",
  "replacement": " ",
  "default_value": ""
}

Parameter	Description
feature_name	Required. Name of the output feature.
expression	Required. The source field this feature depends on.
default_value	Optional. The default value to use when the input feature is empty.
regex_pattern	Required. The regular expression for matching the text to be replaced. To match any of several patterns, specify a list of regular expressions.
replacement	Optional. The replacement string. If this parameter is left empty, the matched text is removed.
replace_all	Optional. Specifies whether to perform a global replacement. The default value is `true`. If set to `false`, only the first match is replaced.
icase	Optional. Specifies whether regular expression matching ignores case. The default value is `false`, which means matching is case-sensitive.
is_sequence	Optional. Specifies whether the feature is a sequence feature. The default value is `false`.
sequence_length	Optional. Specifies the maximum length of the sequence. Sequences longer than this value are truncated.
sequence_delim	Optional. Specifies the separator between sequence elements. This parameter applies only to string inputs.
separator	Optional. This parameter applies only when `is_sequence=true`. It specifies the separator for multi-valued inputs. The default value is `\u001D`. Only a single character is allowed.
value_dimension	Optional. In offline tasks, this parameter is used to truncate the output. The default value is `0`.
stub_type	Optional. The default value is `false`. When set to `true`, the pipeline uses the configured feature transformation only as an intermediate result and does not output the result to the model.

This feature supports binning operations. For configuration details, see the Feature Binning (discretization) document:
- hash_bucket_size: Hashes and applies a modulo operation to the feature transformation result.
- vocab_list: Bins the input based on a vocabulary list and maps the input to an index in the list.
- vocab_dict: Maps the feature value to a corresponding value in the vocab_dict dictionary.
- vocab_file: Reads a vocab_list or vocab_dict from a file.
This feature supports multi-valued inputs in the form of an array.
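The binning options listed above can be pictured with a minimal Python sketch. This is an illustration only, not the operator's implementation: the function names are hypothetical, `zlib.crc32` stands in for a hash function that this document does not specify, and returning `-1` for out-of-vocabulary values is an assumption.

```python
import zlib


def hash_bucket(value, hash_bucket_size):
    """hash_bucket_size: hash the transformed value, then take the modulo.

    zlib.crc32 is only a stand-in; the real hash function is not documented here.
    """
    return zlib.crc32(str(value).encode("utf-8")) % hash_bucket_size


def vocab_list_index(value, vocab_list, default_index=-1):
    """vocab_list: map the value to its position in the vocabulary list.

    The fallback for unknown values (-1) is an assumption.
    """
    try:
        return vocab_list.index(value)
    except ValueError:
        return default_index


def vocab_dict_lookup(value, vocab_dict, default_value=-1):
    """vocab_dict: map the value to the entry stored under it in the dictionary."""
    return vocab_dict.get(value, default_value)


bucket = hash_bucket("China People Republic", 1000)
index = vocab_list_index("b", ["a", "b", "c"])
mapped = vocab_dict_lookup("b", {"a": 10, "b": 20})
```

A vocab_file would simply supply the list or dictionary from a file instead of inline configuration.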

Example

user:query	Output feature
China|People|Republic	China People Republic
Feature|Generation|Tool|Is great	Feature Generation Tool Is great
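The replacement behavior in this example can be approximated in Python with `re.sub`. This is a rough sketch under stated assumptions: the helper name `regex_replace` is hypothetical, and the operator's regular expression dialect may differ from Python's `re` module.

```python
import re


def regex_replace(value, regex_pattern, replacement="",
                  replace_all=True, icase=False, default_value=""):
    """Approximate regex_replace_feature for a single string input."""
    # An empty input falls back to default_value.
    if value is None or value == "":
        return default_value
    flags = re.IGNORECASE if icase else 0
    # In re.sub, count=0 replaces every match; count=1 replaces only the first.
    count = 0 if replace_all else 1
    return re.sub(regex_pattern, replacement, value, count=count, flags=flags)


# Same pattern as the JSON config above: "\\|" in JSON is the regex \| (a literal pipe).
result = regex_replace("China|People|Republic", r"\|", " ")
print(result)  # China People Republic
```

With `replace_all=False`, only the first pipe would be replaced, giving `China People|Republic`.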

bool_mask_feature

Overview

Filters the elements of an input sequence with a boolean mask, similar to tf.boolean_mask(tensor, mask).

It is essentially a sequence feature.

Configuration

{
  "feature_name": "mask_feature",
  "feature_type": "bool_mask_feature",
  "value_type": "float",
  "expression": [
    "user:click_items",
    "item:is_valid"
  ],
  "sequence_delim": ","
}

Parameter	Description
feature_name	Required. Specifies the prefix for the output feature.
expression	Required. A list of source fields that this feature uses. The second element in the list is the mask.
default_value	Optional. The default value to use when the input feature is empty. If omitted, the default is `0` for numeric `value_type`s.
value_type	Required. Specifies the data type of the output feature.
sequence_length	Optional. The maximum sequence length. Longer sequences are truncated.
sequence_delim	Optional. The separator for sequence elements. This parameter is only required for string inputs.
separator	Optional. The separator for multi-value inputs. The default value is `\u001D`. Only a single character is allowed.
value_dimension	Optional. The default value is `0`. Used to truncate the output in offline tasks.
normalizer	Optional. Specifies the normalization method. This parameter applies only to numeric features. For more information, see the `raw_feature` operator.
stub_type	Optional. The default value is `false`. If set to `true`, the pipeline uses this feature transformation only as an intermediate result and does not output it to the model.

This operator supports binning. For configuration details, see Feature binning (discretization).
This operator supports multi-value inputs, including arrays and nested arrays.

Examples

Input	Mask	Output
"123,456,90,80"	"true,false,true,false"	["123", "90"]
"123,456,90,80"	[1, 0, 1, 0]	["123", "90"]
[1, 2, 3, 4]	[1, 0, 1, 0]	[1, 3]
[1, 2, 3, 4]	"true,false,true,false"	[1, 3]
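The rows in this table can be reproduced with a short Python sketch. It is an illustration only: the helper names are hypothetical, and treating the strings "true" and "1" as truthy mask values is an assumption based on the examples.

```python
def _to_bool(flag):
    """Interpret one mask element; the accepted string spellings are an assumption."""
    if isinstance(flag, str):
        return flag.strip().lower() in ("true", "1")
    return bool(flag)


def bool_mask(values, mask, sequence_delim=","):
    """Keep values[i] only where mask[i] is truthy, like tf.boolean_mask on 1-D input."""
    # String inputs are split on sequence_delim; list inputs are used as-is.
    if isinstance(values, str):
        values = values.split(sequence_delim)
    if isinstance(mask, str):
        mask = mask.split(sequence_delim)
    return [v for v, flag in zip(values, mask) if _to_bool(flag)]


print(bool_mask("123,456,90,80", "true,false,true,false"))  # ['123', '90']
print(bool_mask([1, 2, 3, 4], [1, 0, 1, 0]))                # [1, 3]
```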

Usage with expression features

{
  "features": [
    {
      "feature_name": "mask",
      "feature_type": "expr_feature",
      "expression": "price>100",
      "variables": ["item:price"],
      "value_dimension": 3
    },
    {
      "feature_name": "filter_list",
      "feature_type": "bool_mask_feature",
      "expression": [
        "user:click_items",
        "feature:mask"
      ],
      "num_buckets": 10000
    }
  ]
}
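Conceptually, this configuration chains two steps: the expr_feature produces one boolean per element, and the bool_mask_feature uses that result as its mask. The sketch below illustrates the data flow only. The function name and sample values are hypothetical, and it assumes that `item:price` and `user:click_items` are aligned element by element; the `num_buckets` binning step is left out.

```python
def filter_clicks_by_price(click_items, prices, threshold=100):
    """Two-step pipeline: build a mask from prices, then filter the clicked items."""
    # Step 1 (expr_feature "price>100"): one boolean per price.
    mask = [price > threshold for price in prices]
    # Step 2 (bool_mask_feature): keep the items whose mask entry is True.
    return [item for item, keep in zip(click_items, mask) if keep]


# Hypothetical sample data with three elements, matching value_dimension = 3.
filtered = filter_clicks_by_price([1001, 1002, 1003], [80.0, 120.0, 250.0])
print(filtered)  # [1002, 1003]
```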

slice_feature

Overview

This operator slices an input array using Python-style syntax or retrieves an element at a specific index.

Essentially, it is a sequence feature.

Configuration

{
  "feature_name": "test_feature",
  "feature_type": "slice_feature",
  "value_type": "float",
  "expression": [
    "user:click_items"
  ],
  "slice": "2:4"
}

Parameter	Required	Description
feature_name	Yes	The name of the output feature.
expression	Yes	The source field for the feature. The input must be a list.
slice	Yes	Either a single integer, which selects the element at that index of the input array, or a Python-style slice string in the format `start:stop:step`.
default_value	No	The value to use when the input feature is empty. If not set, the default is `0` for numeric `value_type`s.
value_type	Yes	The data type of the output feature.
sequence_length	No	The maximum sequence length. Sequences longer than this are truncated.
sequence_delim	No	The separator for sequence elements. Required only if the input is a string.
separator	No	The separator for multi-value inputs. Defaults to `\u001D`. Only a single character is supported.
value_dimension	No	The output dimension. Defaults to `0`. In offline tasks, this parameter can truncate the output.
normalizer	No	The normalization method. Applies only to numeric features. For details, see the `raw_feature` operator.
stub_type	No	Indicates if the feature is a stub. Defaults to `false`. If `true`, the feature acts as an intermediate result and is excluded from the model output.
placeholder	No	A special value in a sequence feature that is used to fill empty slots and pad dimensions. The default value for floating-point numbers is `NaN`. For integers, the default is the minimum value of the corresponding type. For more information, see the placeholder configuration item of the custom feature operator.

This operator supports binning. For configuration details, see Feature Binning (Discretization).
This operator supports multi-value inputs, including arrays and nested arrays.

Example

When you set sequence_delim="," and value_dimension=1, the input and output are as follows:

Input	slice	Output
"123,456,90,80"	0	"123"
"123,456,90,80"	2	"90"
"123,456,90,80"	1:3	["456", "90"]
[1, 2, 3, 4]	:2	[1, 2]
[1, 2, 3, 4]	2:	[3, 4]
[1, 2, 3, 4]	1:4:2	[2, 4]
[1, 2, 3, 4]	::-1	[4, 3, 2, 1]
[1, 2, 3, 4]	2:-1:-1	[3, 2, 1]
[1, 2, 3, 4]	:	[1, 2, 3, 4]
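Because the slice string follows Python syntax, most rows in this table can be checked against Python's built-in `slice`. The helper below is a hypothetical sketch, not the operator's implementation. One difference is worth knowing: in plain Python, `[1, 2, 3, 4][2:-1:-1]` evaluates to `[]`, so the `2:-1:-1` row above reflects the operator's documented output rather than Python's own handling of a negative stop index.

```python
def apply_slice(values, spec, sequence_delim=","):
    """Apply an index ("2") or a slice string ("start:stop:step") to a sequence."""
    # String inputs are split on sequence_delim; list inputs are used as-is.
    if isinstance(values, str):
        values = values.split(sequence_delim)
    spec = str(spec).strip()
    if ":" not in spec:
        # A single number selects one element.
        return values[int(spec)]
    # Empty fields (as in ":2" or "::-1") become None, which slice() treats as "unset".
    parts = [int(p) if p.strip() else None for p in spec.split(":")]
    return values[slice(*parts)]


print(apply_slice("123,456,90,80", "1:3"))  # ['456', '90']
print(apply_slice([1, 2, 3, 4], "1:4:2"))   # [2, 4]
print(apply_slice([1, 2, 3, 4], "::-1"))    # [4, 3, 2, 1]
```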

Operator	Description	Syntax
?:	If-then-else operator	`condition ? value_if_true : value_if_false`

Constant	Description	Value
_pi	The mathematical constant pi (π).	3.141592653589793
_e	The mathematical constant e, also known as Euler's number.	2.718281828459045
