Artificial Intelligence Recommendation: Built-in feature operators

Last Updated: Mar 23, 2026

id_feature

Overview

The id_feature operator processes a discrete feature. It handles both a single-value discrete feature, such as a user ID, and a multi-value discrete feature, such as the colors available for an item.

Configuration

{
  "feature_type": "id_feature",
  "feature_name": "item_is_main",
  "expression": "item:is_main",
  "need_prefix": true,
  "separator": "\u001D",
  "default_value": ""
}

| Parameter | Required | Description |
| --- | --- | --- |
| feature_name | Yes | The name of the output feature. When need_prefix is true, this name is also prepended to each generated feature value. |
| expression | Yes | The source field used to generate the feature. |
| need_prefix | No | Specifies whether to prepend the feature_name to the output value. Valid values: true (the feature_name is prepended) and false (default; the feature_name is not prepended). |
| value_type | No | The data type of the output feature. The default is string. |
| separator | No | The multi-value separator for the input feature. The default is \u001D. Only a single character is supported. |
| default_value | No | The default value to use when the input feature is empty. |
| weighted | No | Specifies whether the input is in the key:value format. If set to true, both feature values and weights are output (Map type). |
| value_dimension | No | Truncates the output when a feature has multiple values. The default is 0, which indicates no truncation. If the value is 1, the schema type of the output table is value_type. Otherwise, it is array<value_type>. |
| stub_type | No | If set to true, the configured feature transform is used only as an intermediate result in the pipeline and is not output to the model. The default is false. |

  • This operator supports feature binning. For configuration details, see Feature binning (discretization).

  • This operator supports multi-value inputs of type array.

Example

The following example shows the input and output for the item:is_main feature with different configurations.

| Type | Value | Output feature |
| --- | --- | --- |
| int64_t | 100 | item_is_main_100 |
| double | 5.2 | item_is_main_5.2 |
| string | abc | item_is_main_abc |
| Multi-value string | abc^]bcd | [item_is_main_abc, item_is_main_bcd] |
| Multi-value int | 123^]456 | [item_is_main_123, item_is_main_456] |

The ^] symbol represents the multi-value separator. This is a single character with the ASCII code "\x1D", which can also be written as "\u001d".
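
The rows in the example above can be approximated with a short Python sketch. It only illustrates the split-then-prefix behavior described in this section: the helper name id_feature_sketch and its arguments are hypothetical, it always returns a list (even for a single value), and options such as weighted, value_dimension, and binning are not modelled.

```python
SEP = "\x1d"  # default multi-value separator: one character, ASCII code 0x1D


def id_feature_sketch(value, feature_name, need_prefix=True,
                      separator=SEP, default_value=""):
    """Hypothetical helper: split a raw field and optionally prefix each value."""
    text = "" if value is None else str(value)
    if text == "":
        text = default_value  # fall back when the input is empty
    parts = [p for p in text.split(separator) if p]
    if need_prefix:
        parts = [f"{feature_name}_{p}" for p in parts]
    return parts


print(id_feature_sketch(100, "item_is_main"))           # ['item_is_main_100']
print(id_feature_sketch("abc\x1dbcd", "item_is_main"))  # ['item_is_main_abc', 'item_is_main_bcd']
```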

raw_feature

Overview

The raw_feature operator processes a continuous feature. It supports numeric types such as int, float, and double, and handles both single-value and multi-value continuous features.

Configuration

{
 "feature_type" : "raw_feature",
 "feature_name" : "ctr",
 "expression" : "item:ctr",
 "normalizer" : "method=log10"
}

| Parameter | Required | Description |
| --- | --- | --- |
| feature_name | Yes | The feature name. |
| expression | Yes | The source field the feature depends on. Valid sources are user, item, or context. |
| normalizer | No | The normalization method. For details, see the Normalizer section. |
| value_type | No | The data type of the output feature. The default is float. |
| separator | No | The separator for a multi-value input feature. The default is \u001D. The separator must be a single character. |
| default_value | No | The default value for an empty input feature. |
| value_dimension | No | The dimension of the output field. The default is 1. This parameter can be used to truncate the output in an offline task. If the dimension is 1, the output table schema type is value_type. Otherwise, the schema type is array<value_type>. |
| stub_type | No | The default is false. If set to true, this feature transform serves only as an intermediate result and is not output to the model. |

Example

^] represents the multi-value separator, which is a single character with the ASCII encoding "\x1D", not two characters.

| Type | Value | Output feature |
| --- | --- | --- |
| int64_t | 100 | 100 |
| double | 100.1 | 100.1 |
| Multi-value int | 123^]456 | [123, 456] (The input field's dimension must match the one specified in the value_dimension parameter.) |

Normalizer

The raw_feature and match_feature operators support four types of normalizers: minmax, zscore, log10, and expression. Their configuration and formulas are as follows:

  • minmax

    Configuration example: method=minmax,min=2.1,max=2.2

    Formula: x' = (x - min) / (max - min)

  • zscore

    Configuration example: method=zscore,mean=0.0,standard_deviation=10.0

    Formula: x' = (x - mean) / standard_deviation

  • log10

    Configuration example: method=log10,threshold=1e-10,default=-10

    Formula: x' = log10(x) if x > threshold; otherwise, x' = default

  • expression

    Configuration example: method=expression,expr=sign(x)

    Formula: A custom function or expression that you define. The variable x represents the input value.
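
The three closed-form normalizers can be checked with a small Python sketch. This is an illustration under stated assumptions, not the operator's implementation: the function names are hypothetical, the fallback values used when threshold and default are omitted are simply copied from the configuration example (the actual defaults are not stated here), and the expression method is left out.

```python
import math


def parse_normalizer(spec):
    """Parse a string such as 'method=minmax,min=2.1,max=2.2' into a dict."""
    return dict(item.split("=", 1) for item in spec.split(","))


def normalize(x, spec):
    """Hypothetical helper applying the minmax, zscore, or log10 formula to x."""
    cfg = parse_normalizer(spec)
    method = cfg["method"]
    if method == "minmax":
        lo, hi = float(cfg["min"]), float(cfg["max"])
        return (x - lo) / (hi - lo)
    if method == "zscore":
        return (x - float(cfg["mean"])) / float(cfg["standard_deviation"])
    if method == "log10":
        # Assumed fallbacks, taken from the configuration example above.
        threshold = float(cfg.get("threshold", "1e-10"))
        default = float(cfg.get("default", "-10"))
        return math.log10(x) if x > threshold else default
    raise ValueError(f"unsupported method in this sketch: {method}")


print(normalize(5.0, "method=zscore,mean=0.0,standard_deviation=10.0"))  # 0.5
print(normalize(1000.0, "method=log10,threshold=1e-10,default=-10"))     # 3.0
print(normalize(0.0, "method=log10,threshold=1e-10,default=-10"))        # -10.0
```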

expr_feature

Overview

The expr_feature operator evaluates a mathematical expression and returns the result as a feature value. This operator supports batch computing and broadcasting.

Important: All inputs must be convertible to the double data type.

Configuration

{
  "feature_type" : "expr_feature",
  "feature_name" : "ctr_sigmoid",
  "value_type": "float",
  "expression" : "sigmoid(pv/(1+click))",
  "variables": ["item:pv", "item:click"]
}

When pv = 2, click = 3, the value of the preceding expression feature is 0.6224593312.
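
The value quoted above can be reproduced by hand: pv/(1+click) = 2/4 = 0.5, and sigmoid(0.5) = 1/(1 + e^-0.5). A minimal Python check, assuming sigmoid denotes the standard logistic function:

```python
import math


def sigmoid(z):
    """Standard logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


pv, click = 2.0, 3.0
value = sigmoid(pv / (1 + click))
print(round(value, 10))  # 0.6224593312
```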

| Parameter | Required | Description |
| --- | --- | --- |
| feature_name | Yes | The name of the output feature. |
| expression | Yes | The mathematical expression to evaluate. |
| variables | Yes | The variables, or input fields, used in the expression. The source for each variable must be user, item, or context. |
| value_type | No | The data type of the output feature. Valid values are float, double, int32, and int64. The default is float. |
| separator | No | The separator for multi-value string input fields. The default is \u001D. Only a single character is supported. |
| default_value | No | The default value to use when an input feature is empty. |
| value_dimension | No | The dimension of the output field. The default is 0. This parameter can be used to truncate or pad the output. If the value is 1, the schema type of the output table is value_type. Otherwise, it is array<value_type>. |
| stub_type | No | If set to true, this feature transform serves only as an intermediate result in the pipeline and is not passed to the model. The default is false. |

Examples

{
    "feature_name": "expr_feat",
    "feature_type": "expr_feature",
    "value_type": "float",
    "expression": "a+b",
    "variables": ["a", "b"],
    "value_dimension": 3
}
  • Scalar and vector computation (Broadcasting)

    • When a=1 and b=[1, 2, 6], the result is [2, 3, 7].

  • Vector-to-vector element-wise computation

    • When a=[3, 2, 1] and b=[1, 2, 6], the result is [4, 4, 7].

  • Temporary variables and comma expressions

    • For example: x=roundp(a),(a-x)*b. In this example, x is a temporary variable and does not need to be configured in variables.

    • A comma expression is evaluated from left to right, and it returns the value of the rightmost sub-expression as the final result.

    • To reduce memory overhead, you can reuse existing variables as temporary variables when semantically appropriate.
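
The two a+b cases above follow the usual broadcasting rule: a scalar is repeated to the length of the vector, and two vectors are combined element by element. A minimal Python sketch of that rule (the function name is hypothetical, and rejecting vectors of different lengths is an assumption of this sketch, not documented behavior):

```python
def broadcast_add(a, b):
    """Element-wise a + b where either operand may be a scalar or a list."""
    a_is_vec, b_is_vec = isinstance(a, list), isinstance(b, list)
    if not a_is_vec and not b_is_vec:
        return a + b  # plain scalar addition
    if a_is_vec and b_is_vec and len(a) != len(b):
        raise ValueError("vector lengths differ")
    n = len(a) if a_is_vec else len(b)
    av = a if a_is_vec else [a] * n  # repeat the scalar to the vector length
    bv = b if b_is_vec else [b] * n
    return [x + y for x, y in zip(av, bv)]


print(broadcast_add(1, [1, 2, 6]))          # [2, 3, 7]
print(broadcast_add([3, 2, 1], [1, 2, 6]))  # [4, 4, 7]
```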

Combine expression and sequence features

{
  "features": [
    {
      "feature_name": "sphere_distance",
      "feature_type": "expr_feature",
      "expression": "sphere_dist(click_id_lng,click_id_lat,j_lng,j_lat)",
      "variables": ["user:click_id_lng", "user:click_id_lat", "item:j_lng", "item:j_lat"],
      "default_value": "0",
      "value_dimension": 3,
      "stub_type": true
    },
    {
      "feature_name": "time_diff",
      "feature_type": "expr_feature",
      "variables": ["user:cur_time", "user:clk_time_seq"],
      "expression": "cur_time-clk_time_seq",
      "default_value": "0",
      "separator": ";",
      "value_dimension": 3,
      "stub_type": true
    },
    {
      "sequence_name": "click_seq",
      "sequence_length": 3,
      "sequence_delim": ";",
      "sequence_pk": "user:click_item",
      "features": [
        {
          "feature_name": "spherical_distance",
          "feature_type": "raw_feature",
          "expression": "feature:sphere_distance",
          "default_value": "0.0"
        },
        {
          "feature_name": "time_diff_seq",
          "feature_type": "id_feature",
          "expression": "feature:time_diff",
          "default_value": "0.0",
          "num_buckets": 10000
        }
      ]
    }
  ]
}

Expressions

  • Built-in functions (scalar)

    | Function name | Number of parameters | Description |
    | --- | --- | --- |
    | rnd | 0 | Generates a random number in the range [0, 1). |
    | sin | 1 | Calculates the sine of a number. |
    | cos | 1 | Calculates the cosine of a number. |
    | tan | 1 | Calculates the tangent of a number. |
    | asin | 1 | Calculates the arcsine of a number. |
    | acos | 1 | Calculates the arccosine of a number. |
    | atan | 1 | Calculates the arctangent of a number. |
    | sinh | 1 | Calculates the hyperbolic sine of a number. |
    | cosh | 1 | Calculates the hyperbolic cosine of a number. |
    | tanh | 1 | Calculates the hyperbolic tangent of a number. |
    | asinh | 1 | Calculates the inverse hyperbolic sine of a number. |
    | acosh | 1 | Calculates the inverse hyperbolic cosine of a number. |
    | atanh | 1 | Calculates the inverse hyperbolic tangent of a number. |
    | log2 | 1 | Calculates the base-2 logarithm of a number. |
    | log10 | 1 | Calculates the base-10 logarithm of a number. |
    | log | 1 | Calculates the natural logarithm (base e) of a number. |
    | ln | 1 | Calculates the natural logarithm (base e) of a number. |
    | exp | 1 | Raises Euler's number (e) to the power of a number. |
    | sqrt | 1 | Calculates the square root of a number. |
    | sign | 1 | Returns the sign of a number: -1 for negative, 1 for positive, or 0 for zero. |
    | abs | 1 | Calculates the absolute value of a number. |
    | rint | 1 | Rounds a number to the nearest integer. |
    | round | 1 | Rounds a number to the nearest integer using the 'round half away from zero' method. |
    | roundp | 2 | Rounds a number to a specified precision. For example, roundp(3.14159, 2) returns 3.14. |
    | mod | 2 | Calculates the remainder of a division. |
    | floor | 1 | Rounds a number down to the nearest integer. |
    | ceil | 1 | Rounds a number up to the nearest integer. |
    | trunc | 1 | Truncates a number to an integer by removing its fractional part. |
    | sigmoid | 1 | Calculates the sigmoid of a number. |
    | sphere_dist | 4 | Calculates the spherical distance between two GPS points. Arguments: lng1, lat1, lng2, lat2. |
    | haversine | 4 | Calculates the Haversine distance between two GPS points. Arguments: lng1, lat1, lng2, lat2. |
    | min | Variable | Returns the minimum value from a list of arguments. |
    | max | Variable | Returns the maximum value from a list of arguments. |
    | sum | Variable | Returns the sum of all arguments. |
    | avg | Variable | Returns the average value of all arguments. |

    Note: These built-in functions support batch computing and broadcasting.

  • Built-in vector operation functions

    | Function name | Number of parameters | Description |
    | --- | --- | --- |
    | len | 1 | Returns the length (number of elements) of a vector. |
    | l2_norm | 1 | Performs L2 normalization on a vector. |
    | squared_norm | 1 | Calculates the squared L2 norm of a vector. |
    | dot | 2 | Calculates the dot product of two vectors. |
    | euclid_dist | 2 | Calculates the Euclidean distance between two vectors. |
    | corr | 2 | Calculates the Pearson correlation coefficient between two vectors. |
    | std_dev | 1 | Calculates the sample standard deviation of a vector (dividing by n-1). |
    | pop_std_dev | 1 | Calculates the population standard deviation of a vector (dividing by n). |
    | variance | 1 | Calculates the sample variance of a vector (dividing by n-1). |
    | pop_variance | 1 | Calculates the population variance of a vector (dividing by n). |
    | reduce_min | 1 | Returns the minimum value in a vector. |
    | reduce_max | 1 | Returns the maximum value in a vector. |
    | reduce_sum | 1 | Returns the sum of all elements in a vector. |
    | reduce_mean | 1 | Returns the average value of all elements in a vector. |
    | reduce_prod | 1 | Returns the product of all elements in a vector. |

    Note: If an expression includes a built-in vector operation function, all other variables in the expression must be scalars.

  • Built-in binary operators

    Operator

    Description

    Priority

    =

    Assignment. This special operator modifies one of its arguments and applies only to variables.

    0

    ||

    Logical OR

    1

    &&

    Logical AND

    2

    |

    Bitwise OR

    3

    &

    Bitwise AND

    4

    <=

    Less than or equal to

    5

    >=

    Greater than or equal to

    5

    !=

    Not equal to

    5

    ==

    Equal to

    5

    >

    Greater than

    5

    <

    Less than

    5

    +

    Addition

    6

    -

    Subtraction

    6

    *

    Multiplication

    7

    /

    Division

    7

    %

    Modulo

    7

    ^

    Raises x to the power of y

    8

  • Built-in ternary operator

    Supports if-then-else logic using C-style syntax.

    It uses lazy evaluation, which means it evaluates only the necessary branch of the expression.

    | Operator | Description | Syntax |
    | --- | --- | --- |
    | ?: | If-then-else operator | condition ? value_if_true : value_if_false |

  • Built-in constants

    | Constant | Description | Value |
    | --- | --- | --- |
    | _pi | The mathematical constant pi (π). | 3.141592653589793 |
    | _e | The mathematical constant e, also known as Euler's number. | 2.718281828459045 |

combo_feature

Overview

The combo_feature operator creates a feature combination (a Cartesian product) from multiple input fields or expressions. This process is also known as feature crossing. You can think of the id_feature operator as a special case of combo_feature in which only one field is crossed. Typically, the fields being crossed come from different data sources, for example a user feature crossed with an item feature.

Configuration

{
  "feature_type" : "combo_feature",
  "feature_name" : "comb_age_item",
  "expression" : ["user:age_class", "item:item_id"],
  "need_prefix": true,
  "separator": "\u001D",
  "default_value": ""
}

| Parameter | Required | Description |
| --- | --- | --- |
| feature_name | Yes | The prefix for the output feature. |
| expression | Yes | An array that specifies the source fields the feature depends on. |
| need_prefix | No | Specifies whether to prepend the feature_name as a prefix. Valid values: true (the prefix is prepended) and false (default; the prefix is not prepended). |
| value_type | No | The data type of the output feature. The default is string. |
| separator | No | The multi-value separator for input features. The default is \u001D. The separator must be a single character. |
| default_value | No | The default value to use when an input feature is empty. |
| value_dimension | No | The dimension of the output field. The default is 0. This parameter can be used in offline tasks to truncate the output. If the value is 1, the schema type of the output table is value_type. Otherwise, the schema type is array<value_type>. |
| stub_type | No | The default is false. If set to true, the pipeline uses the configured feature transform only as an intermediate result and does not pass it to the model. |

Example

The ^] symbol represents the multi-value separator. This symbol is a single character with the ASCII code \x1D, not two separate characters.

| user:age_class | item:item_id | Output feature |
| --- | --- | --- |
| 123 | 45678 | comb_age_item_123_45678 |
| abc, bcd | 45678 | [comb_age_item_abc_45678, comb_age_item_bcd_45678] |
| abc, bcd | 12345^]45678 | [comb_age_item_abc_12345, comb_age_item_abc_45678, comb_age_item_bcd_12345, comb_age_item_bcd_45678] |

The number of output features is calculated as:

|F1| * |F2| * ... * |Fn|

where |Fn| is the number of values in the nth input field.
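
The crossing in the example above is a plain Cartesian product, which the following Python sketch reproduces with itertools.product. The helper name and its arguments are hypothetical, and joining the crossed values with "_" is inferred from the example outputs rather than from a stated rule.

```python
from itertools import product

SEP = "\x1d"  # default multi-value separator (one character)


def combo_feature_sketch(fields, feature_name, need_prefix=True, separator=SEP):
    """Hypothetical helper: cross every value of every input field."""
    value_lists = [str(field).split(separator) for field in fields]
    crossed = []
    for combo in product(*value_lists):  # |F1| * |F2| * ... * |Fn| combinations
        joined = "_".join(combo)
        crossed.append(f"{feature_name}_{joined}" if need_prefix else joined)
    return crossed


result = combo_feature_sketch(["abc\x1dbcd", "12345\x1d45678"], "comb_age_item")
print(len(result))  # 4
print(result[0])    # comb_age_item_abc_12345
```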

lookup_feature

Overview

The lookup_feature operator is similar to match_feature. It retrieves a value from a set of key-value pairs.

This operator requires the map and key parameters:

  • map is a dictionary type or a field of the MultiString type, where each string has the format "k1:v1".

  • The key can be a field of any type. An array-type input is recommended for multiple keys. To generate a feature, the value of the key is retrieved, converted to the key type of the map, and then matched against the key-value pairs in the map field to obtain the final feature.

Configuration

{
  "feature_type": "lookup_feature",
  "feature_name": "item_match_item",
  "map": "item:item_attr",
  "key": "item:item_value",
  "need_discrete": true,
  "need_key": true
}

| Parameter | Required | Description |
| --- | --- | --- |
| feature_name | Yes | The prefix for the output feature. |
| map | Yes | The dictionary that contains the set of key-value pairs. |
| key | Yes | The key to look up in the dictionary. |
| value_type | No | The data type of the output feature. The default is string. |
| separator | No | The multi-value separator for a key field of type string. The default is "\u001D". The separator must be a single character. |
| default_value | No | The default value to use when the input key is empty or not found in the map. |
| need_prefix | No | Specifies whether to prepend the feature_name as a prefix to the output. Valid values: true (the prefix is prepended) and false (default; the prefix is not prepended). |
| need_key | No | Specifies whether to prepend the key as a prefix to the output value. This parameter applies only when value_type is string. Valid values: true (the prefix is prepended) and false (default; the prefix is not prepended). |
| normalizer | No | The normalization method. This parameter works like the normalizer parameter in the raw_feature operator. |
| combiner | No | The aggregation method used to merge values retrieved from multiple keys. Valid values: sum (default), avg/mean, max, and min. |
| need_discrete | No | Specifies whether to return multiple values as a discrete array. If set to true, the operator outputs all retrieved values and ignores the combiner. The default is false. |
| value_dimension | No | The dimension of the output feature. This parameter can be used to truncate the output in an offline task. The default is 0, which indicates no truncation. If the value is 1, the schema type of the output table is value_type. Otherwise, it is array<value_type>. |
| stub_type | No | If set to true, the configured feature transform is used only as an intermediate result and is not output to the model. The default is false. |

  • This operator supports feature binning. For configuration details, see Feature binning (discretization).

  • The map parameter accepts a dictionary object, and the key parameter accepts an array.

Example

Based on the configuration above, assume the following input data:

item_attr : "k1:v1^]k2:v2^]k3:v3"

^] represents the multi-value separator. It is a single character with the ASCII encoding "\x1D", not two characters. You can enter this character in emacs by pressing C-q C-5, and in vim by pressing C-v C-5. Here, item_attr is a multi-value string.

When using a string for the map parameter, multiple key-value pairs must be provided as a multi-value string, not a single string.

item_value : "k2"

The feature transformation result is item_match_item_k2_v2.

Example with need_prefix set to true

feature_name: fg
map: {"k1:123", "k2:234", "k3:3"}
key: {"k1"}
Result: feature={"fg_123"}

Example with need_prefix set to false

map: {"k1:123", "k2:234", "k3:3"}
key: {"k1"}
Result: feature={123}

Combining results

If you provide multiple keys, you can configure the combiner parameter to merge the retrieved values. Valid aggregation methods include sum, mean, max, and min.

If you want to use a combiner, you must set need_discrete to false. In this case, the value must be a numeric type or a string that can be converted to a numeric value.
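
The combiner behavior can be sketched in Python as follows. This is an illustration only: the helper names are hypothetical, the map is assumed to arrive as a multi-value string of "k:v" items, and keys that are missing from the map are simply skipped before aggregation, which is an assumption of this sketch.

```python
SEP = "\x1d"  # default multi-value separator (one character)


def parse_map(raw, separator=SEP):
    """Turn 'k1:v1<SEP>k2:v2' into {'k1': 'v1', 'k2': 'v2'}."""
    return dict(item.split(":", 1) for item in raw.split(separator) if ":" in item)


def lookup_combined(raw_map, keys, combiner="sum", default_value=0.0):
    """Hypothetical helper: look up several keys and merge the numeric hits."""
    table = parse_map(raw_map)
    hits = [float(table[k]) for k in keys if k in table]
    if not hits:
        return default_value  # nothing matched
    if combiner == "sum":
        return sum(hits)
    if combiner in ("avg", "mean"):
        return sum(hits) / len(hits)
    if combiner == "max":
        return max(hits)
    if combiner == "min":
        return min(hits)
    raise ValueError(f"unknown combiner: {combiner}")


raw = "k1:123\x1dk2:234\x1dk3:3"
print(lookup_combined(raw, ["k1", "k3"], "sum"))   # 126.0
print(lookup_combined(raw, ["k1", "k3"], "mean"))  # 63.0
```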

match_feature

Overview

The match_feature operator transforms features by looking up values in a two-level nested map.

Configuration

Configure this operator in JSON format.

{
  "feature_name": "user__l1_ctr_1",
  "feature_type": "match_feature",
  "category": "ALL",
  "need_discrete": false,
  "item": "item:category_level1",
  "user": "user:l1_ctr_1",
  "match_type": "hit"
}
  • user: The data source, which is a two-level nested map encoded as a string.

    • | is the separator between items in the first-level map, and ^ is the separator between the key and value in the first-level map.

    • , is the separator between items in the second-level map, and : is the separator between a key and its value.

  • category: The primary key for the first-level map lookup.

    ALL is a wildcard character that matches all key values at this level.

  • item: The secondary key for the second-level map lookup.

    ALL is a wildcard character that matches all key values at this level.

  • need_discrete

    • true: The operator returns a composite string of the feature name and keys. The model uses this string as the feature and ignores the matched value.

    • false (default): The operator returns only the matched feature value. The model uses this value directly.

  • match_type

    • hit: Returns a single matched feature. The operator queries the first-level map with the category value, and then queries the resulting second-level map with the item value to get a single result. For single-level matching, you can set the key in the first-level map to ALL and also set the category parameter to ALL.

    • multihit: Allows the category and item fields to use the ALL wildcard, which can return multiple matched values.

  • normalizer

    Optional. The normalization method. It has the same meaning as the normalizer parameter in raw_feature and takes effect only when need_discrete=false.

  • show_category

    Specifies whether to prepend the category prefix to the query result. Defaults to true when need_discrete=true and match_type=hit, and false otherwise.

  • show_item

    Specifies whether to add the item prefix to the query result. The default value is true when need_discrete=true and match_type=hit. Otherwise, the default value is false.

  • value_type

    Optional. Specifies the data type of the output feature. The default value is string.

  • separator

    Optional. Specifies the multi-value separator for the key field of the string type, which defaults to "\u001D" and must be a single character.

  • default_value

    Optional. Specifies the default value to use when the input feature is empty.

  • value_dimension

    Optional, with a default value of 0. This parameter can be used in offline tasks to truncate the output. If the value is 1, the schema type of the output table is value_type. Otherwise, the schema type is array<value_type>.

  • stub_type

    Optional. The default value is false. If you set this parameter to true, the pipeline uses the configured feature transformation only as an intermediate result and does not pass it to the model.

Examples

User feature: Nested dictionary

For example, the string 50011740^50011740:0.2,36806676:0.3,122572685:0.5|50006842^16788:0.1 is converted into a two-level map as follows:

{
  "50011740": {
    "50011740": 0.2,
    "36806676": 0.3,
    "122572685": 0.5
  },
  "50006842": {
    "16788": 0.1
  }
}

hit match type

{
  "feature_name": "brand_hit",
  "feature_type": "match_feature",
  "category": "item:auction_root_category",
  "need_discrete": true,
  "item": "item:brand_id",
  "user": "user:user_brand_tags_hit",
  "match_type": "hit"
}

Assume the field values are as follows:

Parameter

Value

user_brand_tags_hit

50011740^107287172:0.2,36806676:0.3,122572685:0.5|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19

auction_root_category

50006842

brand_id

30068

  • When need_discrete is true, the operator first queries user_brand_tags_hit with the auction_root_category value (50006842), which returns 16788816:0.1,10122:0.2,29889:0.3,30068:19. It then queries that result with the brand_id (30068) to get the value 19. The final result is brand_hit_50006842_30068_19.

  • When need_discrete is false, the result is 19.0.

If you use only single-layer matching, you must change the value of category in the configuration above to ALL. Assume that the fields have the following values:

Parameter

Value

user_brand_tags_hit

ALL^16788816:40,10122:40,29889:20,30068:20

brand_id

30068

  • When need_discrete is true, the result is brand_hit_ALL_30068_20.

  • When need_discrete is false, the result is 20.0.

In this case, you can also use lookup_feature. The value of user_brand_tags_hit must then be in the format "16788816:40^]10122:40^]29889:20^]30068:20", where ^] is the multi-value separator, the non-printable character \u001d.

Because the lookup_feature operator supports complex input types like maps and arrays, it offers better performance.
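
The hit lookups in the match_feature examples above can be traced with a short Python sketch. The helper names are hypothetical, only match_type=hit is covered, and the output string assumes the default show_category and show_item behavior described for need_discrete=true.

```python
def parse_nested(raw):
    """Parse 'c1^k1:v1,k2:v2|c2^k3:v3' into a two-level dict of strings."""
    nested = {}
    for block in raw.split("|"):          # first-level items
        category, _, items = block.partition("^")
        nested[category] = dict(
            item.split(":", 1) for item in items.split(",") if ":" in item
        )
    return nested


def match_hit(raw, category, item, feature_name="", need_discrete=False):
    """Hypothetical helper: two-step lookup by category, then by item."""
    value = parse_nested(raw).get(category, {}).get(item)
    if value is None:
        return None  # no match; default_value handling is not modelled
    if need_discrete:
        return f"{feature_name}_{category}_{item}_{value}"
    return float(value)


tags = ("50011740^107287172:0.2,36806676:0.3,122572685:0.5"
        "|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19")
print(match_hit(tags, "50006842", "30068", "brand_hit", need_discrete=True))
# brand_hit_50006842_30068_19
print(match_hit(tags, "50006842", "30068"))  # 19.0
```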

overlap_feature

Overview

The overlap_feature operator calculates string matching metrics between two text inputs. For example, in search applications, you can use it to determine if a query is contained within a title.

| Method | Description |
| --- | --- |
| query_common_ratio | Calculates the ratio of common terms between the query and title to the total number of terms in the query. Returns a value in the range [0, 1]. |
| title_common_ratio | Calculates the ratio of common terms between the query and title to the total number of terms in the title. Returns a value in the range [0, 1]. |
| is_contain | Checks if the query is fully contained within the title while preserving the term order. Returns 1 if it is contained and 0 if it is not. |
| is_equal | Checks if the query and title are identical. Returns 1 if they are identical and 0 if they are not. |
| index_of | Returns the starting position of the first occurrence of the entire query within the title. Returns -1.0 if the query is not found. |
| proximity_min_cover | Calculates the proximity of query terms in the title. The returned value is in the range [0, length(title)]. A value of 0 indicates that at least one term cannot be matched. |
| proximity_min_dist | Calculates the proximity of query terms in the title based on the minimum pairwise distance. The returned value is in the range [0, length(title) + 1]. A value of length(title) + 1 indicates that no matching term pairs were found. |
| proximity_max_dist | Calculates the proximity of query terms in the title based on the maximum pairwise distance. The returned value is in the range [0, length(title) + 1]. A value of length(title) + 1 indicates that no matching term pairs were found. |
| proximity_avg_dist | Calculates the proximity of query terms in the title based on the average pairwise distance. The returned value is in the range [0, length(title) + 1]. A value of length(title) + 1 indicates that no matching term pairs were found. |

The calculation methods for these term proximity measures are based on the paper "An Exploration of Proximity Measures in Information Retrieval".

Assume that the term sequence of the title (document) is: t1,t2,t1,t3,t5,t4,t2,t3,t4

  • MinCover is defined as the length of the shortest document segment that covers each query term at least once.

  • MinDist (Minimum pair distance): Calculates the minimum of all pairwise distances. For example, if the pairwise distances are 1, 2, and 3, then MinDist = min(1, 2, 3) = 1.

  • MaxDist (Maximum pair distance): The opposite of MinDist. It calculates the maximum of all pairwise distances. For example, if the pairwise distances are 1, 2, and 3, then MaxDist = max(1, 2, 3) = 3.

  • AveDist (Average pair distance): Calculates the average of all pairwise distances. For example, if the pairwise distances are 1, 2, and 3, then AveDist = (1 + 2 + 3) / 3 = 2.

Note that all aggregate operators (MinDist, MaxDist, and AveDist) are defined based on the pairwise distances between matching query terms. When a document matches only one query term, MinDist, AveDist, and MaxDist are all defined as the length of the document.
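
To make these definitions concrete, the following Python sketch computes the measures for the title sequence above. Several points are assumptions of this sketch rather than statements from this document: the query (t1, t4, t5) is made up for illustration, the distance between two terms is taken as the smallest gap between any of their positions, and the fallback values for fewer than two matched terms are not modelled.

```python
from itertools import combinations


def min_cover(query, title):
    """Length of the shortest title segment containing every query term, else 0."""
    need = set(query)
    best = 0
    for start in range(len(title)):
        seen = set()
        for end in range(start, len(title)):
            if title[end] in need:
                seen.add(title[end])
            if seen == need:
                length = end - start + 1
                best = length if best == 0 else min(best, length)
                break  # longer segments from this start cannot be shorter
    return best


def pairwise_distances(query, title):
    """Smallest position gap for each pair of query terms found in the title."""
    positions = {}
    for idx, term in enumerate(title):
        positions.setdefault(term, []).append(idx)
    matched = [t for t in dict.fromkeys(query) if t in positions]
    return [
        min(abs(i - j) for i in positions[a] for j in positions[b])
        for a, b in combinations(matched, 2)
    ]


title = "t1,t2,t1,t3,t5,t4,t2,t3,t4".split(",")
query = ["t1", "t4", "t5"]  # hypothetical query, not taken from the source
dists = pairwise_distances(query, title)
print(min(dists), max(dists), sum(dists) / len(dists))  # 1 3 2.0
print(min_cover(query, title))                          # 4
```

With this made-up query the pairwise distances happen to be 1, 2, and 3, matching the numbers used in the MinDist, MaxDist, and AveDist examples above.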

Configuration

{
  "feature_type" : "overlap_feature",
  "feature_name" : "is_contain",
  "query" : "user:attr1",
  "title" : "item:attr2",
  "method" : "is_contain",
  "separator" : " ",
  "normalizer" : ""
}

| Parameter | Required | Description |
| --- | --- | --- |
| feature_type | Yes | The type of the feature. Must be overlap_feature. |
| feature_name | Yes | The prefix for the output feature name. |
| query | Yes | The source field for the query. This field must be a multi-value string. |
| title | Yes | The source field for the title. This field must be a multi-value string. |
| method | Yes | The calculation method. Valid values: query_common_ratio, title_common_ratio, is_contain, is_equal, index_of, proximity_min_cover, proximity_min_dist, proximity_max_dist, and proximity_avg_dist. |
| separator | No | The delimiter for the input. If you do not specify a value, the default is chr(29), that is, the character \u001D. |
| normalizer | No | The normalization method. This parameter has the same function as the normalizer parameter in the raw_feature operator. |
| stub_type | No | Defaults to false. If set to true, this feature is used only as an intermediate result and is not included in the final model output. |

The overlap_feature operator returns a value of type float.

Example 1

Given a query of "high,high2,fiberglass,abc" and a title of "high,quality,fiberglass,tube,for,golf,bag", the operator returns the following results:

| Method | Value |
| --- | --- |
| query_common_ratio | 0.5 |
| title_common_ratio | 0.28 |
| is_contain | 0.0 |
| is_equal | 0.0 |

Example 2

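With method set to index_of and a title of "the cat sat on the mat", the operator returns the following results:

| Query | Value |
| --- | --- |
| the cat | 0.0 |
| sat | 2.0 |
| the mat | 4.0 |
| cap | -1.0 |
| gap | -1.0 |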
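
The two examples above can be reproduced with a short Python sketch. The function names mirror the method names, but the code is only an illustration: counting each term occurrence separately in the ratios is an assumption, and is_contain, is_equal, and the proximity methods are left out.

```python
def query_common_ratio(query, title):
    """Share of query terms that also occur in the title."""
    title_terms = set(title)
    return sum(1 for term in query if term in title_terms) / len(query)


def title_common_ratio(query, title):
    """Share of title terms that also occur in the query."""
    query_terms = set(query)
    return sum(1 for term in title if term in query_terms) / len(title)


def index_of(query, title):
    """Term index at which the whole query first occurs in the title, else -1.0."""
    width = len(query)
    for start in range(len(title) - width + 1):
        if title[start:start + width] == query:
            return float(start)
    return -1.0


query = "high,high2,fiberglass,abc".split(",")
title = "high,quality,fiberglass,tube,for,golf,bag".split(",")
print(query_common_ratio(query, title))  # 0.5
print(title_common_ratio(query, title))  # 2/7, shown as 0.28 in Example 1

sentence = "the cat sat on the mat".split(" ")
print(index_of("the mat".split(" "), sentence))  # 4.0
```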

sequence_feature

Overview

A user's behavior history is a critical feature. This history is typically represented as a sequence, such as a click sequence or a purchase sequence. The entities that form a sequence can be the items themselves or their properties.

How to configure

For example, to process a user's click sequence with a length of 50, you can extract the item_id, price, and ts features for each item in the sequence. Here, ts is calculated as request_time - event_time. The following example shows the configuration:

{
  "sequence_name": "click_50_seq",
  "sequence_length": 50,
  "sequence_delim": ";",
  "sequence_pk": "user:click_50_seq",
  "features": [
    {
        "feature_name": "item_id",
        "feature_type": "id_feature",
        "value_type": "string",
        "expression": "item:item_id"
    },
    {
        "feature_name": "price",
        "feature_type": "raw_feature",
        "expression": "item:price"
    },
    {
        "feature_name": "ts",
        "feature_type": "raw_feature",
        "expression": "user:ts"
    },
    {
      "feature_name": "time_diff_seq",
      "feature_type": "custom_feature",
      "operator_name": "SeqExpr",
      "operator_lib_file": "3rdparty/lib64/libseq_expr.so",
      "expression": ["user:cur_time", "user:clk_time_seq"],
      "formula": "cur_time - clk_time_seq",
      "sequence_fields": ["clk_time_seq"],
      "default_value": "0",
      "value_type": "double",
      "is_op_thread_safe": false,
      "value_dimension": 1
    }
  ]
}
  • sequence_name: The name of the sequence.

  • sequence_length: The maximum length of the sequence.

  • sequence_delim: The separator between elements in the sequence.

  • sequence_pk: The sequence primary key. For example, user:click_50_seq stores the 50 most recent item IDs that a user clicked. The Model Inference Service uses this field as a key to query side info.

    • The request parameters for the Online Inference Service (EAS Processor) must include a feature whose key is the value of sequence_pk.

      • For example: click_50_seq: 5410233389955966;1832586 (the separator is the value of the sequence_delim configuration)

        • In the example above, the value of the click_50_seq feature is 5410233389955966;1832586.

    • Item-side sub-features of the sequence are not required in the request to the Model Inference Service.

      • The Model Inference Service uses sequence_pk as a key to query the item's side info.

      • For example, in this configuration, the item_id and price sub-features of the sequence feature are not passed to the inference service in the request. Instead, the Processor uses the fg SDK to retrieve these features from its item cache and concatenate them. This keeps the data format consistent with the format used during offline training.

    • User-side sub-features of the sequence are required in the request to the Model Inference Service.

      • The feature name is ${sequence_name}__${input_name}, for example: click_50_seq__ts.

      • ${input_name} is typically configured with the expression option, but this may vary for different sub-feature types. ${input_name} does not include an input domain prefix, such as item: or user:.

  • features: The side info of a sequence, including information such as the static attribute values of an item and behavioral time information.

    • sequence_fields: Specifies the field name of the input sequence. The value is a string or a [string] array.

      • When the feature operator has only one input field, the content of that field must be a sequence. In this case, you do not need to configure sequence_fields.

      • If a feature operator has multiple input fields and you do not configure sequence_fields, all item-side features (such as item:XXX) are assumed to be sequence input fields.

    • The input table for offline training must contain all columns corresponding to the sub-features.

      • When a column is a sequence (see the rules for sequence_fields), it is named ${sequence_name}__${input_name}.

        • For example, in this sample configuration, the offline table requires four columns: click_50_seq__item_id, click_50_seq__price, click_50_seq__ts, and click_50_seq__clk_time_seq.

        • The recommended type for a column in an offline table is the array type for better performance. The string type that uses sequence_delim as an element separator is also supported.

      • When the column is not a sequence, it is named ${input_name} without a prefix.

        • For example, in this configuration, the offline table requires one non-sequence column: cur_time.

      • You can use the global configuration input_alias to set a shorter alias for a long column name (see the example below).

    • Sub-features support binning operations. For the configuration method, see Feature Binning (Discretization). When binning is configured, the output element type is int64, and the shape is determined by the value_dimension configuration.

    • value_dimension (also abbreviated as value_dim): Specifies the dimension of each element in the Sequence. For a sequence_raw_feature, the output type is array<float> when this parameter is set to 1, and array<array<float>> for other values. For a sequence_id_feature, the output type is array<string> when this parameter is set to 1, and array<array<string>> for other values. The default value is 0.
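
The column naming rules above can be sketched in Python as follows. This is an illustration only: the helper name offline_columns is hypothetical, and the sketch reads inputs only from the expression option, whereas other sub-feature types (for example, lookup_feature with map and key) declare their inputs through different options.

```python
def offline_columns(seq_cfg):
    """Derive offline table column names from a sequence feature config.

    Assumption: every sub-feature lists its inputs under "expression".
    Returns (sequence_columns, non_sequence_columns).
    """
    name = seq_cfg["sequence_name"]
    seq_cols, plain_cols = [], []
    for feat in seq_cfg["features"]:
        expr = feat["expression"]
        inputs = [expr] if isinstance(expr, str) else list(expr)
        fields = feat.get("sequence_fields")
        if isinstance(fields, str):
            fields = [fields]
        for inp in inputs:
            side, _, input_name = inp.partition(":")  # drop "item:" / "user:"
            if len(inputs) == 1:
                is_seq = True                  # a single input must be a sequence
            elif fields is not None:
                is_seq = input_name in fields  # explicit sequence_fields
            else:
                is_seq = side == "item"        # default: item-side inputs
            col = f"{name}__{input_name}" if is_seq else input_name
            target = seq_cols if is_seq else plain_cols
            if col not in target:
                target.append(col)
    return seq_cols, plain_cols


config = {
    "sequence_name": "click_50_seq",
    "features": [
        {"feature_name": "item_id", "expression": "item:item_id"},
        {"feature_name": "price", "expression": "item:price"},
        {"feature_name": "ts", "expression": "user:ts"},
        {
            "feature_name": "time_diff_seq",
            "expression": ["user:cur_time", "user:clk_time_seq"],
            "sequence_fields": ["clk_time_seq"],
        },
    ],
}
seq_cols, plain_cols = offline_columns(config)
print(seq_cols)
print(plain_cols)
```

For the sample configuration, the sketch produces the four prefixed sequence columns and the single unprefixed column cur_time described above.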

You can configure any feature as a sub-feature of a sequence feature. The following example shows the configuration:

{
  "features": [
    {
      "sequence_name": "common_seq",
      "sequence_length": 50,
      "sequence_delim": ";",
      "sequence_pk": "user:click_50_seq",
      "features": [
        {
          "feature_name": "item_id",
          "feature_type": "id_feature",
          "value_type": "String",
          "expression": "item:item_id",
          "value_dimension": 1
        },
        {
          "feature_name": "price",
          "feature_type": "raw_feature",
          "expression": "item:price"
        },
        {
          "feature_name": "ts",
          "feature_type": "raw_feature",
          "expression": "user:ts"
        },
        {
          "feature_name": "expr_feat",
          "feature_type": "expr_feature",
          "expression": "a > b",
          "variables": ["item:a", "item:b"],
          "sequence_fields": "a",
          "default_value": "0",
          "value_dimension": 1
        },
        {
          "feature_name": "lookup_feat",
          "feature_type": "lookup_feature",
          "map": "user:dict",
          "key": "item:prop",
          "separator": ",",
          "default_value": "0",
          "value_type": "float",
          "combiner": "sum",
          "boundaries": [0.0, 0.15, 0.5]
        },
        {
          "feature_name": "match_feat",
          "feature_type": "match_feature",
          "user": "user:nested_dict",
          "category": "item:pkey",
          "item": "item:skey",
          "separator": "\u001D",
          "default_value": "0",
          "matchType": "hit",
          "value_type": "float",
          "value_dimension": 1
        },
        {
          "feature_name": "bm25_score",
          "feature_type": "bm25_feature",
          "separator": " ",
          "default_value": "0",
          "query": "user:query",
          "document": "item:document",
          "sequence_fields": "query",
          "document_number": 100,
          "avg_doc_length": 6,
          "term_doc_freq_dict": {
            "this": 30,
            "example": 10,
            "document": 15
          }
        },
        {
          "feature_name": "overlap_feat",
          "feature_type": "overlap_feature",
          "query": "user:query2",
          "title": "item:title2",
          "sequence_fields": "query2",
          "method": "index_of",
          "separator": " ",
          "default_value": "-1"
        },
        {
          "feature_type": "kv_dot_product",
          "feature_name": "query_doc_sim",
          "query": "user:query3",
          "document": "item:title",
          "sequence_fields": "query3",
          "separator": "|",
          "default_value": "0"
        },
        {
          "feature_name": "seg_feat",
          "feature_type": "tokenize_feature",
          "expression": "input_a",
          "default_value": "0",
          "output_type": "word",
          "tokenizer_type": "sentencepiece",
          "vocab_file": "spmodel.model"
        },
        {
          "feature_name": "txt_norm",
          "feature_type": "text_normalizer",
          "expression": "input",
          "default_value": "",
          "parameter": 28
        },
        {
          "feature_name": "seq_combo_feat",
          "feature_type": "combo_feature",
          "expression": ["user:tags", "item:cat"],
          "sequence_fields": ["tags"],
          "separator": "_",
          "default_value": "0",
          "value_dimension": 1
        },
        {
          "feature_name": "norm_str",
          "feature_type": "str_replace_feature",
          "expression": ["user:profile"],
          "default_value": "",
          "replace_file": "synonyms.txt",
          "replacements": {
            "|": "",
            "aa": "x",
            "a": "X"
          },
          "value_dimension": 1
        },
        {
          "feature_name": "query_tokens",
          "feature_type": "regex_replace_feature",
          "expression": ["user:query_tokens"],
          "default_value": "",
          "value_type": "string",
          "regex_pattern": [ "\\|", "#", "\\(.*\\)" ],
          "replacement": "",
          "value_dimension": 1
        },
        {
          "feature_name": "slice",
          "feature_type": "slice_feature",
          "value_type": "int32",
          "expression": ["context:array"],
          "slice": "0:3",
          "value_dimension": 3,
          "num_buckets": 100000
        },
        {
          "feature_name": "mask_feature",
          "feature_type": "bool_mask_feature",
          "value_type": "float",
          "expression": [
            "user:click_items",
            "item:is_valid"
          ]
        },
        {
          "feature_name": "time_diff_seq",
          "feature_type": "custom_feature",
          "operator_name": "SeqExpr",
          "operator_lib_file": "3rdparty/lib64/libseq_expr.so",
          "expression": ["user:cur_time", "user:clk_time_seq"],
          "formula": "cur_time - clk_time_seq",
          "sequence_fields": ["clk_time_seq"],
          "default_value": "0",
          "value_type": "double",
          "is_op_thread_safe": false,
          "value_dimension": 1
        }
      ]
    }
  ],
  "input_alias": {
    "common_seq__clk_time_seq": "clk_time_seq"
  }
}

Note: The input_alias parameter is used to configure an alias for an input field in the format "origin_field": "alias_field". This allows you to replace the original input field name with a shorter one.

Flattened configuration

Generally, you can create the sequence version by adding the sequence_ prefix to a non-sequence feature type (feature_type). Note that you must generally configure a default_value for sequence features.

Examples:

Special case 1: Some feature transformation types have both sequence and non-sequence versions.

You can activate the corresponding version by configuring is_sequence: true/false.

In this case, you do not need to add the sequence_ prefix to the feature_type parameter.

Examples:

Special case 2: Some feature transformation types have only a sequence version.

In this case, the feature_type parameter does not require the sequence_ prefix.

Examples:

For these two special cases, you can add the following optional parameters:

  • sequence_length: The maximum length of the sequence. Any excess elements are truncated. The default value is -1, which indicates no truncation.

  • sequence_delim: The separator between sequence elements. The default value is ;.

The following example shows the configuration:

{
  "feature_name": "clk_seq__item_id",
  "feature_type": "sequence_id_feature",
  "sequence_name": "clk_seq",
  "sequence_length": 50,
  "sequence_delim": ";",
  "expression": "item:clk_item_seq",
  "separator": "\u001D",
  "default_value": ""
},
{
  "feature_name": "clk_seq__item_price",
  "feature_type": "sequence_raw_feature",
  "sequence_name": "clk_seq",
  "sequence_length": 50,
  "sequence_delim": ";",
  "expression": "item:clk_item_prices",
  "separator": "\u001D",
  "default_value": "0"
},
{
  "feature_name": "test",
  "feature_type": "sequence_lookup_feature",
  "map": "user:prefer_tags",
  "key": "item:tags",
  "sequence_length": 2,
  "separator": ",",
  "default_value": "-1024",
  "value_type": "int32",
  "normalizer": "method=expression,expr=x+1",
  "combiner": "sum",
  "default_bucketize_value": 50,
  "num_buckets": 10000
},
{
  "feature_name": "test",
  "feature_type": "sequence_combo_feature",
  "separator": "_",
  "default_value": "0",
  "expression": ["user:f1", "item:f2"],
  "hash_bucket_size": 10000
}

In the example above, the input fields clk_item_seq and clk_item_prices must be sequences. Each can be an array or a string whose elements are separated by the character specified by sequence_delim.

  • With this configuration, the Online Inference Service does not query side info. You must provide the complete input in the request.

  • The input field names for sequence features in a flat format remain the same as configured and are not prefixed with ${sequence_name}__.

Online feature generation

You can obtain behavior side info in two ways. The first way is to retrieve it from the item cache of the EasyRec Processor, using the field specified in sequence_pk as the primary key to look up item properties. The second way is to provide the corresponding field values in the request. For example, the ts field in the preceding configuration is calculated as request_time - event_time (the recommendation request time minus the user behavior time). Because this value changes with the request time, it must be obtained from the request.

user_features {
  key: "click_50_seq"
  value {
    string_feature: "9008721;34926279;22487529;73379;840804;911247;31999202;7421440;4911004;40866551"
  }
}

user_features {
  key: "click_50_seq__ts"
  value {
    string_feature: "23;113;401363;401369;401375;401405;486678;486803;486922;486969"
  }
}
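
As a rough illustration of how a client might assemble these two request features, the following Python sketch joins the item IDs and the per-event ts values with the sequence delimiter. The helper name build_sequence_request and the plain dictionary output are assumptions; the actual request is a protobuf message, and only the key naming (${sequence_name} and ${sequence_name}__ts) and the ts = request_time - event_time rule come from the text above.

```python
def build_sequence_request(sequence_name, item_ids, event_times, request_time,
                           sequence_delim=";"):
    """Build the user-side request features for a click sequence.

    Hypothetical helper: returns a plain dict of feature key -> string value.
    """
    if len(item_ids) != len(event_times):
        raise ValueError("item_ids and event_times must have the same length")
    # ts depends on the request time, so it cannot come from the item cache.
    ts_values = [request_time - event_time for event_time in event_times]
    return {
        sequence_name: sequence_delim.join(str(i) for i in item_ids),
        f"{sequence_name}__ts": sequence_delim.join(str(v) for v in ts_values),
    }


features = build_sequence_request(
    "click_50_seq",
    item_ids=[9008721, 34926279],
    event_times=[977, 887],
    request_time=1000,
)
print(features)
```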

sequence_combine_feature

Introduction

The sequence_combine_feature operator combines the multiple values for each element in a sequence feature. It transforms a multi-value sequence into a single-value sequence by aggregating the multiple values of each element into a single value using a specified combiner.

Key capabilities

  • Multi-value combination: Combines the multiple values of each element in a sequence into a single value.

  • Flexible combination strategies: Supports multiple combination strategies, including sum, mean, max, min, and count.

  • Value map: Supports a value map that converts string identifiers to numeric values, which is useful for processing behavioral event sequences.

  • Dual separator support: Supports separate configurations for the sequence delimiter and the multi-value separator.

Configuration

Basic configuration (numeric combination)

{
  "feature_name": "seq_combine_feat",
  "feature_type": "sequence_combine_feature",
  "expression": "user:behavior_seq",
  "combiner": "sum",
  "separator": "|",
  "sequence_delim": ";"
}

Configuration with Value Map (Behavioral Events)

{
  "feature_name": "behavior_score",
  "feature_type": "sequence_combine_feature",
  "expression": "user:action_events",
  "combiner": "sum",
  "separator": "|",
  "sequence_delim": ";",
  "value_map": {
    "expo": 1,
    "click": 2,
    "buy": 4
  }
}

The value map is applied first, followed by the combine operation.

Parameters

Parameter

Required

Description

feature_name

Yes

The name of the output feature.

feature_type

Yes

Specifies the feature type. Must be set to sequence_combine_feature.

expression

Yes

The source of the input feature.

combiner

No

The combination strategy. Possible values: sum, mean, max, min, and count. Default: sum.

value_map

No

A map for converting strings to numeric values. The value map is applied first, followed by the combine operation.

separator

No

The multi-value separator. Default: \u001D. Only a single character is supported.

sequence_delim

No

The sequence delimiter for string inputs. This parameter is not required for array inputs and defaults to an empty string. Only a single character is supported.

default_value

No

The default value to use when the input is empty.

stub_type

No

Default: false. When set to true, the feature is used only as an intermediate result and is not output to the model.

Examples

Example 1: Basic numeric combination (sum)

Configuration:

{
  "feature_name": "score_sum",
  "feature_type": "sequence_combine_feature",
  "expression": "user:scores",
  "combiner": "sum",
  "separator": ",",
  "sequence_delim": ";"
}

Input and output:

Input

Output

Description

"1,2,3;4,5;6"

[6, 9, 6]

The operator calculates 1+2+3=6, 4+5=9, and 6=6.

"10;20,30"

[10, 50]

The operator calculates 10=10 and 20+30=50.

["1,2,3", "4,5", "6"]

[6, 9, 6]

The input is an array of strings.

[[1,2,3], [4,5], [6]]

[6, 9, 6]

The input is an array of arrays.

Example 2: Behavioral Event Sequence (with Value Map)

Configuration:

{
  "feature_name": "behavior_weight",
  "feature_type": "sequence_combine_feature",
  "expression": "user:actions",
  "combiner": "sum",
  "separator": "|",
  "sequence_delim": ";",
  "value_map": {
    "expo": 1,
    "click": 2,
    "buy": 4
  }
}

Input and output:

Input

Output

Description

"expo|click|buy"

[7]

The operator calculates 1+2+4=7.

"click"

[2]

The mapped value is 2.

"expo|click"

[3]

The operator calculates 1+2=3.

"expo|click|buy;expo;click"

[7, 1, 2]

The input string contains multiple records separated by ;.

["expo|click", "expo", "click|buy"]

[3, 1, 6]

The input array contains multiple records.
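
The two examples can be approximated with the following Python sketch. It is not the operator's implementation: the function name, the float conversion, and the error raised for identifiers missing from value_map are assumptions. It follows the documented order, applying the value map before the combiner.

```python
COMBINERS = {
    "sum": sum,
    "mean": lambda values: sum(values) / len(values),
    "max": max,
    "min": min,
    "count": len,
}


def sequence_combine(value, combiner="sum", separator="\u001d",
                     sequence_delim=";", value_map=None):
    """Collapse each multi-value element of a sequence into one number."""
    # A string input is split into elements; an array input is used as is.
    elements = value.split(sequence_delim) if isinstance(value, str) else list(value)
    combined = []
    for element in elements:
        parts = element.split(separator) if isinstance(element, str) else list(element)
        if value_map is not None:
            # Assumption: an identifier missing from value_map raises KeyError.
            numbers = [value_map[part] for part in parts]
        else:
            numbers = [float(part) for part in parts]
        combined.append(COMBINERS[combiner](numbers))
    return combined


print(sequence_combine("1,2,3;4,5;6", separator=","))
print(sequence_combine("expo|click|buy;expo;click", separator="|",
                       value_map={"expo": 1, "click": 2, "buy": 4}))
```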

tokenize_feature

Overview

The tokenize_feature operator tokenizes an input string. It returns either the tokenized string or the corresponding token IDs. This operator supports tokenizer.json files from the tokenizers-cpp library.

For more information about the vocabulary file format, see these resources:

1. https://github.com/huggingface/tokenizers

2. https://github.com/mlc-ai/tokenizers-cpp

Configuration

{
    "feature_name": "title_token",
    "feature_type": "tokenize_feature",
    "expression": "item:title",
    "default_value": "",
    "vocab_file": "tokenizer.json",
    "tokenizer_type": "sentencepiece",
    "output_type": "word_id",
    "output_delim": ","
}

Parameter

Required

Description

feature_name

Yes

The unique name for the output feature.

expression

Yes

Specifies the source field that the feature depends on. The source must be user, item, or context.

vocab_file

Yes

The path to the vocabulary file.

default_value

No

The default value for the input string.

tokenizer_type

No

The tokenizer type. Set this to 'sentencepiece' to use the SentencePiece tokenizer. If unspecified, the system determines the appropriate Hugging Face tokenizer based on the 'vocab_file' content.

output_type

No

  • word_id: Outputs the token IDs.

  • word: Outputs the tokenized string.

output_delim

No

The separator for the word_id or word output. This parameter applies only to offline tasks.

stub_type

No

Defaults to false. If set to true, the feature transform acts only as an intermediate result in the pipeline and is not output to the model.

Example

When output_type is word_id, the operator converts an input string into a comma-separated string of token IDs.

Type

item:title

Output feature

string

It is good today!

1147,310,1175,3063,2

Vocabulary file examples

File name

Tokenizer type

Download link

bert-base-chinese-vocab.json

WordPiece

Download link

tokenizer.json

BPE

Download link

spiece.model

sentencepiece

Download link

text_normalizer

Overview

The text_normalizer operator performs text normalization, including case conversion, Traditional-to-Simplified Chinese conversion, full-width to half-width character conversion, special character filtering, GBK and UTF-8 encoding conversion, and Chinese character splitting.

Configuration

{
    "feature_name": "txt_norm",
    "feature_type": "text_normalizer",
    "expression": "item:title",
    "stop_char_file": "stop_char.txt",
    "max_length": 256,
    "parameter": 0,
    "remove_space": false,
    "is_gbk_input": false,
    "is_gbk_output": false
}

Parameter

Required

Description

feature_name

Yes

The feature name.

expression

Yes

The source field that the feature depends on. The source must be user, item, or context.

stop_char_file

No

Specifies the path to a file of special characters to remove. If omitted, the system uses its built-in list.

max_length

No

If the input text length exceeds this value, the operator skips normalization and returns the original text.

remove_space

No

Specifies whether to remove spaces.

is_gbk_input

No

Specifies whether the input is GBK-encoded. If false, the operator assumes the input is UTF-8.

is_gbk_output

No

Specifies whether the output is GBK-encoded. If false, the operator encodes the output as UTF-8.

parameter

No

Text normalization options.

default_value

No

The default value to use when the input feature is empty.

Note:

  • The stop_char_file must use GBK encoding.

  • Each line in the stop_char_file must contain only one character to ensure successful filtering.

Text normalization options

To configure the parameter field, sum the numeric values of the desired options from the list below.

For example, to convert uppercase to lowercase, full-width to half-width, Traditional to Simplified Chinese, and filter special characters, set parameter = 4 + 8 + 16 + 32 = 60.

The default value for the parameter is 60.

#define __NORMALIZED_LOWER2UPPER__ 		2 			/* Convert lowercase to uppercase. */
#define __NORMALIZED_UPPER2LOWER__ 		4 			/* Convert uppercase to lowercase. */
#define __NORMALIZED_SBC2DBC__ 			8 			/* Convert full-width to half-width characters. */
#define __NORMALIZED_BIG52GBK__			16 			/* Convert Traditional Chinese to Simplified Chinese. */
#define __NORMALIZED_FILTER__ 			32 			/* Filter special characters. */
#define __NORMALIZED_SPLITCHARS__		512 		/* Split Chinese characters into single characters, separated by spaces. */
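
Because each option is a distinct power of two, summing the selected options is equivalent to combining them with a bitwise OR, and a parameter value can be decoded by testing each bit. The following Python sketch shows both directions. The dictionary and function names are hypothetical; only the numeric values come from the list above.

```python
NORMALIZE_OPTIONS = {
    "LOWER2UPPER": 2,    # lowercase to uppercase
    "UPPER2LOWER": 4,    # uppercase to lowercase
    "SBC2DBC": 8,        # full-width to half-width
    "BIG52GBK": 16,      # Traditional to Simplified Chinese
    "FILTER": 32,        # filter special characters
    "SPLITCHARS": 512,   # split Chinese characters
}


def encode_parameter(*names):
    """Sum the values of the selected options (each option counted once)."""
    return sum(NORMALIZE_OPTIONS[name] for name in set(names))


def decode_parameter(parameter):
    """List the option names whose bit is set in `parameter`."""
    return [name for name, bit in NORMALIZE_OPTIONS.items() if parameter & bit]


print(encode_parameter("UPPER2LOWER", "SBC2DBC", "BIG52GBK", "FILTER"))
print(decode_parameter(28))
```

Decoding 28 gives the three options 4, 8, and 16, which is why the example below converts case, width, and script but does not filter special characters.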

Example

{
  "feature_name": "txt_norm",
  "feature_type": "text_normalizer",
  "expression": "input_a",
  "parameter": 28
}
  • Input: ["正則生成代碼", "Html過濾工具", "正則表達式語法速查", "The Cat/"]

  • Output: ["正则生成代码", "html过滤工具", "正则表达式语法速查", "the cat/"]

bm25_feature

Features

The BM25 (Best Matching) algorithm is a mainstream text matching algorithm in information retrieval, typically used for search relevance scoring. It first parses a query into terms $q_i$. Then, for each search result $d$, it calculates the relevance score of each term $q_i$ for $d$. Finally, it calculates the relevance score of the query for $d$ as a weighted sum of the relevance scores of the terms $q_i$.

For Chinese, query tokenization serves as morpheme analysis, treating each word (term) as a morpheme $q_i$.

The general formula for the BM25 algorithm is:

$$\mathrm{Score}(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d)$$

In this formula, $Q$ represents a query, $q_i$ is the $i$-th term in the query, $d$ is a document, $W_i$ is the weight of $q_i$, and $R(q_i, d)$ is the relevance score of $q_i$ to document $d$.

Term importance

There are several methods for weighting a term's relevance to a document. A common method is Inverse Document Frequency (IDF). The formula is:

$$\mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$

Where $N$ is the total number of documents in the corpus, and $n(q_i)$ is the number of documents containing the term $q_i$.

The definition of IDF shows that for a given document collection, the more documents that contain $q_i$, the lower the weight of $q_i$. In other words, if many documents contain $q_i$, the distinguishing power of $q_i$ is low, so $q_i$ is less important for determining relevance.

Term relevance

The relevance score between a term $q_i$ and a document $d$, denoted as $R(q_i, d)$, has the following general form in the BM25 algorithm:

$$R(q_i, d) = \frac{f_i \cdot (k_1 + 1)}{f_i + K} \cdot \frac{qf_i \cdot (k_2 + 1)}{qf_i + k_2}, \qquad K = k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)$$

In this formula, $k_1$, $k_2$, and $b$ are adjustment factors that are set based on experience. Typically, $k_1$ is between 1.2 and 2.0 and $b = 0.75$. $f_i$ is the frequency of $q_i$ in document $d$, and $qf_i$ is the frequency of $q_i$ in the query. $dl$ is the length of document $d$, and $avgdl$ is the average length of all documents. Because $q_i$ appears only once in the query in most cases, $qf_i = 1$, and the formula can be simplified to:

$$R(q_i, d) = \frac{f_i \cdot (k_1 + 1)}{f_i + K}$$

The definition of $K$ shows that the parameter $b$ adjusts the impact of document length on relevance. The larger the value of $b$, the greater the impact of document length on the relevance score, and vice versa. The longer the relative document length, the larger the value of $K$, and the lower the relevance score. A longer document is more likely to contain $q_i$. Therefore, for the same $f_i$ value, a long document has lower relevance to $q_i$ than a short document.

In summary, the relevance score formula for the BM25 algorithm is as follows:

$$\mathrm{Score}(Q, d) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f_i \cdot (k_1 + 1)}{f_i + k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}$$

The BM25 formula provides significant flexibility in algorithm design, allowing for various methods of calculating search relevance scores based on different approaches to tokenization, term weighting, and term-document relevance.
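
To make the scoring concrete, the following Python sketch computes a BM25 score in its classic form: each query term found in the document contributes IDF times f(k1 + 1) / (f + K), where IDF = log((N - n + 0.5) / (n + 0.5)) and K = k1(1 - b + b · dl / avgdl). The function name and argument names are chosen to mirror the configuration options below, but the sketch is an assumption about the general algorithm, not the operator's exact implementation (for example, its handling of unseen terms or negative IDF values may differ).

```python
import math


def bm25_score(query_terms, doc_terms, term_doc_freq, document_number,
               avg_doc_length, k1=1.2, b=0.75):
    """Classic BM25 score of a tokenized document for a tokenized query."""
    doc_length = len(doc_terms)
    # K grows with the document length relative to the average length.
    big_k = k1 * (1.0 - b + b * doc_length / avg_doc_length)
    score = 0.0
    for term in query_terms:
        freq = doc_terms.count(term)          # f_i: term frequency in the document
        if freq == 0:
            continue                          # absent terms contribute nothing
        n = term_doc_freq.get(term, 0)        # number of documents containing the term
        idf = math.log((document_number - n + 0.5) / (n + 0.5))
        score += idf * freq * (k1 + 1.0) / (freq + big_k)
    return score


term_doc_freq = {"this": 30, "example": 10, "document": 15}
doc = "this is an example document".split()
score = bm25_score(["example", "document"], doc, term_doc_freq,
                   document_number=100, avg_doc_length=6)
print(score)
```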

Configuration

{
  "feature_type": "bm25_feature",
  "feature_name": "query_doc_relevance",
  "query": "user:query",
  "document": "item:title",
  "term_doc_freq_file": "term_doc_freq.txt",
  "avg_doc_length": 100.0,
  "k1": 1.2,
  "b": 0.75,
  "separator": "\u001D",
  "default_value": ""
}

Parameter

Required

Description

feature_name

Yes

The name of the output feature.

query

Yes

The source field for the query.

document

Yes

The source field for the document.

term_doc_freq_file

No

The path to the term document frequency file. Each line of the file contains one term followed by its document count.

term_doc_freq_dict

No

An alternative to term_doc_freq_file, provided as a dictionary where each key is a term and its value is the document count.

k1

No

A parameter of the BM25 algorithm, typically between 1.2 and 2.0. Default: 1.2.

b

No

A parameter of the BM25 algorithm. Default: 0.75.

separator

No

A single-character separator for multi-valued input features. Default: \u001D.

normalizer

No

The normalization method. For details, see the raw_feature configuration.

default_value

No

The value to use when the input feature is empty.

stub_type

No

Default: false. If true, the system treats this feature transformation as an intermediate result and excludes it from the final model.

  • The term_doc_freq_file and term_doc_freq_dict parameters are mutually exclusive. If both are specified, term_doc_freq_file takes precedence.

  • When using this feature in an online service, place the term_doc_freq_file in the same directory as fg.json.

kv_dot_product

Overview

Computes the dot product of two key-value vectors or the size of the intersection of two sets.

Configuration

{
  "feature_type": "kv_dot_product",
  "feature_name": "query_doc_sim",
  "query": "user:query",
  "document": "item:title",
  "separator": "|",
  "default_value": "0"
}

Parameter

Required

Description

feature_name

Yes

The name of the output feature.

query

Yes

The source of the query field.

document

Yes

The source of the document field.

separator

No

The separator for multi-value input features. The default is \u001D. This must be a single character.

kv_delimiter

No

The separator between key-value pairs in the input feature. The default is :. This must be a single character.

normalizer

No

Specifies the normalization method. For details, see the configuration of the raw_feature operator.

default_value

No

Specifies the value to use if an input feature is empty.

stub_type

No

Defaults to false. If true, this feature transformation is used only as an intermediate result and is not output to the model.

  • This operator supports complex input types such as arrays and maps. Use complex types for optimal performance.

  • If an input entry does not have a value part, its value defaults to 1.0. This behavior can be used to calculate the size of the intersection between two sets.

  • If you do not configure default_value, the default value is set to 0.

Example

Query

Document

Output

"a:0.5|b:0.5"

"d:0.5|b:0.5"

0.25

["a:0.5", "b:0.5"]

["d:0.5", "b:0.5"]

0.25

{"a":0.5, "b":0.5}

{"d":0.5, "b":0.5}

0.25

["a:0.5", "b:0.5"]

{"d":0.5, "b":0.5}

0.25

["a", "b", "c"]

["a", "b", "d"]

2.0

["a", "b", "c"]

"a|b|d"

2.0

["a", "b", "c"]

{"a":0.5, "b":0.5}

1.0
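
The rows above can be reproduced with a short Python sketch. The helper names are hypothetical, and the parsing rules are a simplified reading of the notes in this section: strings are split on separator, each entry is split on kv_delimiter, and an entry without a value part gets the weight 1.0.

```python
def to_kv(value, separator="\u001d", kv_delimiter=":"):
    """Normalize a string, array, or map input to a {key: weight} dict."""
    if isinstance(value, dict):
        return {str(key): float(weight) for key, weight in value.items()}
    entries = value.split(separator) if isinstance(value, str) else list(value)
    kv = {}
    for entry in entries:
        if not entry:
            continue  # skip empty entries such as those from an empty string
        key, found, weight = entry.partition(kv_delimiter)
        # An entry without a value part defaults to a weight of 1.0.
        kv[key] = float(weight) if found else 1.0
    return kv


def kv_dot_product(query, document, separator="\u001d", kv_delimiter=":"):
    """Sum the products of the weights of keys present in both inputs."""
    q = to_kv(query, separator, kv_delimiter)
    d = to_kv(document, separator, kv_delimiter)
    return sum(weight * d[key] for key, weight in q.items() if key in d)


print(kv_dot_product("a:0.5|b:0.5", "d:0.5|b:0.5", separator="|"))
print(kv_dot_product(["a", "b", "c"], ["a", "b", "d"]))
```

With all weights left at 1.0, the result is the number of shared keys, which is the set-intersection use case mentioned above.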

str_replace_feature

Overview

The str_replace_feature operator replaces all matched substrings in an input string with their specified replacements.

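Note: Overlapping matches are replaced greedily. For example, with the configuration below, "the lazy" is replaced as a whole rather than only its prefix "the", and "aaa" becomes "xX".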

Configuration

{
  "feature_name": "norm_str",
  "feature_type": "str_replace_feature",
  "expression": ["user:query"],
  "default_value": "",
  "replacements": {
    "brown": "box",
    "dogs": "jugs",
    "fox": "with",
    "jumped": "five",
    "over": "dozen",
    "quick": "my",
    "the": "pack",
    "the lazy": "liquor",
    "|": "",
    "aa": "x",
    "a": "X"
  },
  "value_dimension": 1
}

Parameter

Description

feature_name

Required. Specifies the name of the output feature.

expression

Required. Specifies the source field that the feature depends on.

default_value

Optional. The default value for an empty input.

replacements

Optional. Required if replace_file is not set. A dictionary that maps original text to replacement text.

replace_file

Optional. Required if replacements is not set. The path to a dictionary file in which each line contains the original text and the replacement text, separated by a tab character (\t).

is_sequence

Optional. Specifies whether the input is a sequence feature. The default value is false.

sequence_length

Optional. Specifies the maximum length of the sequence. The operator truncates sequences that exceed this length.

sequence_delim

Optional. Specifies the delimiter for sequence elements. This parameter applies only to string inputs.

separator

Optional. This parameter applies only when is_sequence=true. It specifies the single-character separator for multi-value inputs. The default value is \u001D.

value_dimension

Optional. Specifies the dimension of the output feature. In offline tasks, this parameter is used to truncate the output. The default value is 0.

stub_type

Optional. When set to true, the operator uses the configured feature transformation only as an intermediate result in the pipeline and does not output it to the model. The default value is false.

  • You can configure both replace_file and replacements. Their replacement dictionaries are merged, and replacements has a higher priority.

  • This operator supports binning operations. For more information, see the Feature Binning (Discretization) documentation.

    • hash_bucket_size: Hashes the feature transformation result and performs a modulo operation.

    • vocab_list: Bins the input based on a vocabulary and maps the input to an index in the vocabulary.

    • vocab_dict: The binning result is the value in vocab_dict that corresponds to the feature value.

    • vocab_file: Reads the vocab_list or vocab_dict from a file.

  • This operator supports multi-value array inputs.

Example

The following table shows the execution results of the preceding configuration.

user:query

Output feature

the quick brown fox jumped over the lazy dogs

pack my box with five dozen liquor jugs

aaa

xX

Feature|Generation|Tool|is|very|useful

FeatureGenerationToolisveryuseful
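
One way to read "greedy" replacement is a left-to-right scan that, at each position, applies the longest key that matches there. The Python sketch below implements that reading. It is an assumption about the matching strategy, not the operator's actual algorithm, and the function name is hypothetical. Under this reading, the first two rows of the table are reproduced.

```python
def str_replace(text, replacements):
    """Replace substrings left to right, preferring the longest matching key."""
    # Longest keys first, so "the lazy" wins over "the" and "aa" over "a".
    keys = sorted((key for key in replacements if key), key=len, reverse=True)
    output = []
    position = 0
    while position < len(text):
        for key in keys:
            if text.startswith(key, position):
                output.append(replacements[key])
                position += len(key)
                break
        else:
            # No key matches here: copy the character unchanged.
            output.append(text[position])
            position += 1
    return "".join(output)


replacements = {
    "brown": "box", "dogs": "jugs", "fox": "with", "jumped": "five",
    "over": "dozen", "quick": "my", "the": "pack", "the lazy": "liquor",
    "|": "", "aa": "x", "a": "X",
}
print(str_replace("the quick brown fox jumped over the lazy dogs", replacements))
print(str_replace("aaa", replacements))
```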

regex_replace_feature

Overview

The regex_replace_feature operator is a feature transformation that replaces substrings matching a regular expression with a specified replacement string.

You can configure multiple patterns. Substrings that match any of the specified patterns are replaced.

Configuration

{
  "feature_name": "query",
  "feature_type": "regex_replace_feature",
  "expression": ["user:query"],
  "regex_pattern": "\\|",
  "replacement": " ",
  "default_value": ""
}

Parameter

Description

feature_name

Required. Name of the output feature.

expression

Required. The source field this feature depends on.

default_value

Optional. The default value to use when the input feature is empty.

regex_pattern

Required. The regular expression used to match the text to be replaced. To configure multiple patterns, specify an array of regular expressions.

replacement

Optional. The replacement string. If this parameter is left empty, the matched text is removed.

replace_all

Optional. Specifies whether to perform a global replacement. The default value is true. If set to false, only the first match is replaced.

icase

Optional. Specifies whether regular expression matching ignores case. The default value is false, which means matching is case-sensitive.

is_sequence

Optional. Specifies whether the feature is a sequence feature. The default value is false.

sequence_length

Optional. Specifies the maximum length of the sequence. Sequences longer than this value are truncated.

sequence_delim

Optional. Specifies the separator between sequence elements. This parameter applies only to string inputs.

separator

Optional. This parameter applies only when is_sequence=true. It specifies the separator for multi-valued inputs. The default value is \u001D. Only a single character is allowed.

value_dimension

Optional. In offline tasks, this parameter is used to truncate the output. The default value is 0.

stub_type

Optional. The default value is false. When set to true, the pipeline uses the configured feature transformation only as an intermediate result and does not output the result to the model.

  • This feature supports binning operations. For configuration details, see the Feature Binning (discretization) document:

    • hash_bucket_size: Hashes and applies a modulo operation to the feature transformation result.

    • vocab_list: Bins the input based on a vocabulary list and maps the input to an index in the list.

    • vocab_dict: Maps the feature value to a corresponding value in the vocab_dict dictionary.

    • vocab_file: Reads a vocab_list or vocab_dict from a file.

  • This feature supports multi-valued inputs in the form of an array.
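The binning options listed above can be illustrated with a minimal Python sketch. It is not the operator's implementation: the hash function (zlib.crc32), the out-of-vocabulary index of -1, and the helper names hash_bucket, vocab_list_index, and vocab_dict_lookup are all assumptions made for illustration.

```python
import zlib


def hash_bucket(value, hash_bucket_size):
    # Hash the transformed string, then apply a modulo operation.
    # zlib.crc32 is a stand-in hash chosen for this sketch only.
    return zlib.crc32(value.encode("utf-8")) % hash_bucket_size


def vocab_list_index(value, vocab_list, oov_index=-1):
    # Map the value to its position in the vocabulary list.
    # Returning oov_index for unknown values is an assumption.
    try:
        return vocab_list.index(value)
    except ValueError:
        return oov_index


def vocab_dict_lookup(value, vocab_dict, default=-1):
    # Map the value to the entry configured in vocab_dict.
    # The fallback for missing keys is an assumption.
    return vocab_dict.get(value, default)
```

For example, hash_bucket("China People Republic", 1000) always returns the same integer in the range 0 to 999, and vocab_list_index("b", ["a", "b"]) returns 1.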

Example

user:query

Output feature

China|People|Republic

China People Republic

Feature|Generation|Tool|Is great

Feature Generation Tool Is great
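The example rows can be reproduced with a short Python sketch built on the standard re module. It only approximates the operator: the regular expression dialect of the operator may differ from Python's, icase is interpreted here as case-insensitive matching, and the function name regex_replace is hypothetical. Note that the JSON string "\\|" decodes to the pattern \|, which is written as r"\|" in Python.

```python
import re


def regex_replace(text, regex_pattern, replacement="",
                  replace_all=True, icase=False, default_value=""):
    # Fall back to the default value when the input is empty.
    if not text:
        return default_value
    flags = re.IGNORECASE if icase else 0
    # In re.sub, count=0 replaces every match and count=1 only the first.
    count = 0 if replace_all else 1
    return re.sub(regex_pattern, replacement, text, count=count, flags=flags)


print(regex_replace("China|People|Republic", r"\|", " "))
# China People Republic
print(regex_replace("China|People|Republic", r"\|", " ", replace_all=False))
# China People|Republic
```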

bool_mask_feature

Overview

Filters the elements of a sequence using a Boolean mask, similar to tf.boolean_mask(tensor, mask).

The output is essentially a sequence feature.

Configuration

{
  "feature_name": "mask_feature",
  "feature_type": "bool_mask_feature",
  "value_type": "float",
  "expression": [
    "user:click_items",
    "item:is_valid"
  ],
  "sequence_delim": ","
}

Parameter

Description

feature_name

Required. Specifies the prefix for the output feature.

expression

Required. A list of source fields that this feature uses. The first element is the sequence to filter, and the second element is the mask.

default_value

Optional. The default value to use when the input feature is empty. If omitted, the default is 0 when value_type is numeric.

value_type

Required. Specifies the data type of the output feature.

sequence_length

Optional. The maximum sequence length. Longer sequences are truncated.

sequence_delim

Optional. The separator for sequence elements. This parameter is only required for string inputs.

separator

Optional. The separator for multi-value inputs. Default: "\u001D". Must be a single character.

value_dimension

Optional. Default: 0. Used to truncate the output in offline tasks.

normalizer

Optional. Specifies the normalization method. This parameter applies only to numeric features. For more information, see the raw_feature operator.

stub_type

Optional. Default: false. If set to true, the pipeline uses this feature transformation only as an intermediate result and does not output it to the model.

Examples

Input | Mask | Output

"123,456,90,80" | "true,false,true,false" | ["123", "90"]

"123,456,90,80" | [1, 0, 1, 0] | ["123", "90"]

[1, 2, 3, 4] | [1, 0, 1, 0] | [1, 3]

[1, 2, 3, 4] | "true,false,true,false" | [1, 3]
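The four example rows can be reproduced with the following Python sketch. It is an illustration rather than the operator's implementation: the helper names are hypothetical, and both the set of strings treated as true and the handling of inputs with mismatched lengths (extra elements are dropped) are assumptions.

```python
def parse_sequence(raw, sequence_delim=","):
    # String inputs are split on sequence_delim; arrays are used as-is.
    if isinstance(raw, str):
        return raw.split(sequence_delim) if raw else []
    return list(raw)


def to_bool(flag):
    # Accept "true"/"1" strings as well as numeric or Boolean flags.
    if isinstance(flag, str):
        return flag.strip().lower() in ("true", "1")
    return bool(flag)


def bool_mask(values, mask, sequence_delim=","):
    items = parse_sequence(values, sequence_delim)
    flags = [to_bool(f) for f in parse_sequence(mask, sequence_delim)]
    # Keep only the elements whose mask entry is true.
    return [item for item, keep in zip(items, flags) if keep]


print(bool_mask("123,456,90,80", "true,false,true,false"))  # ['123', '90']
print(bool_mask([1, 2, 3, 4], [1, 0, 1, 0]))                # [1, 3]
```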

Usage with expression features

{
  "features": [
    {
      "feature_name": "mask",
      "feature_type": "expr_feature",
      "expression": "price>100",
      "variables": ["item:price"],
      "value_dimension": 3
    },
    {
      "feature_name": "filter_list",
      "feature_type": "bool_mask_feature",
      "expression": [
        "user:click_items",
        "feature:mask"
      ],
      "num_buckets": 10000
    }
  ]
}
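The data flow of this two-step configuration can be sketched in Python as follows. The sample values are invented for illustration, and the sketch assumes that item:price and user:click_items are lists of equal length whose elements correspond by position. Bucketing through num_buckets is not modeled.

```python
def expr_mask(prices, threshold=100):
    # Mirrors the expression "price>100", evaluated element by element.
    return [price > threshold for price in prices]


def apply_mask(items, mask):
    # Mirrors bool_mask_feature: keep the items whose mask entry is true.
    return [item for item, keep in zip(items, mask) if keep]


# Hypothetical sample data, one price per clicked item.
click_items = ["i1", "i2", "i3"]
item_prices = [80.0, 120.0, 250.0]

mask = expr_mask(item_prices)                # intermediate feature "mask"
filter_list = apply_mask(click_items, mask)  # output feature "filter_list"
print(mask)         # [False, True, True]
print(filter_list)  # ['i2', 'i3']
```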

slice_feature

Overview

This operator slices an input array using Python-style syntax or retrieves an element at a specific index.

Essentially, it is a sequence feature.

Configuration

{
  "feature_name": "test_feature",
  "feature_type": "slice_feature",
  "value_type": "float",
  "expression": [
    "user:click_items"
  ],
  "slice": "2:4"
}

Parameter

Required

Description

feature_name

Yes

The name of the output feature.

expression

Yes

The source field for the feature. The input must be a list.

slice

Yes

Either a single integer, which selects the element at that index of the input array, or a slice string in Python syntax, in the format start:stop:step.

default_value

No

The default value to use when the input feature is empty. If this parameter is not configured, the default is 0 when value_type is numeric.

value_type

Yes

The data type of the output feature.

sequence_length

No

The maximum sequence length. Sequences longer than this are truncated.

sequence_delim

No

The separator for sequence elements. Required only if the input is a string.

separator

No

The separator for multi-value inputs. Defaults to \u001D. Only a single character is supported.

value_dimension

No

The output dimension. Defaults to 0. In offline tasks, this parameter can truncate the output.

normalizer

No

The normalization method. Applies only to numeric features. For details, see the raw_feature operator.

stub_type

No

Indicates if the feature is a stub. Defaults to false. If true, the feature acts as an intermediate result and is excluded from the model output.

placeholder

No

A special value in a sequence feature that is used to fill empty slots and pad dimensions. The default value for floating-point numbers is NaN. For integers, the default is the minimum value of the corresponding type. For more information, see the placeholder configuration item of the custom feature operator.

  • This operator supports binning. For configuration details, see Feature Binning (Discretization).

  • This operator supports multi-value inputs, including arrays and nested arrays.
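The placeholder defaults described in the parameter table can be illustrated with a small Python sketch. It assumes that padding fills a sequence up to a fixed target length and that the integer type is 64-bit. The function name pad_sequence and the parameter target_length are hypothetical.

```python
import math

INT64_MIN = -(2 ** 63)  # minimum value of a 64-bit signed integer


def pad_sequence(values, target_length, value_type="float"):
    # Default placeholder: NaN for floats, the type minimum for integers.
    placeholder = math.nan if value_type == "float" else INT64_MIN
    # Truncate long sequences, then fill the empty slots.
    truncated = list(values[:target_length])
    return truncated + [placeholder] * (target_length - len(truncated))


print(pad_sequence([1, 2], 4, value_type="int64"))
# [1, 2, -9223372036854775808, -9223372036854775808]
```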

Example

When you set sequence_delim="," and value_dimension=1, the input and output are as follows:

Input | slice | Output

"123,456,90,80" | 0 | "123"

"123,456,90,80" | 2 | "90"

"123,456,90,80" | 1:3 | ["456", "90"]

[1, 2, 3, 4] | :2 | [1, 2]

[1, 2, 3, 4] | 2: | [3, 4]

[1, 2, 3, 4] | 1:4:2 | [2, 4]

[1, 2, 3, 4] | ::-1 | [4, 3, 2, 1]

[1, 2, 3, 4] | 2:-1:-1 | [3, 2, 1]

[1, 2, 3, 4] | : | [1, 2, 3, 4]
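Most of these rows can be reproduced with Python's built-in slicing, as the following sketch shows. It is an illustration, not the operator's implementation, and the helper names parse_sequence and apply_slice are hypothetical.

```python
def parse_sequence(raw, sequence_delim=","):
    # String inputs are split on sequence_delim; arrays are used as-is.
    if isinstance(raw, str):
        return raw.split(sequence_delim) if raw else []
    return list(raw)


def apply_slice(values, slice_spec, sequence_delim=","):
    items = parse_sequence(values, sequence_delim)
    spec = str(slice_spec).strip()
    if ":" not in spec:
        # A single integer selects one element.
        return items[int(spec)]
    # Empty fields in start:stop:step become None, as in Python.
    parts = [int(p) if p.strip() else None for p in spec.split(":")]
    if len(parts) > 3:
        raise ValueError("expected start:stop:step, got %r" % spec)
    return items[slice(*parts)]


print(apply_slice("123,456,90,80", "1:3"))  # ['456', '90']
print(apply_slice([1, 2, 3, 4], "::-1"))    # [4, 3, 2, 1]
```

One row is not reproduced by this sketch: in Python, [1, 2, 3, 4][2:-1:-1] evaluates to an empty list, and the output [3, 2, 1] corresponds to the Python slice 2::-1.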