Artificial Intelligence Recommendation:Built-in feature operators

Last Updated:Jan 14, 2026

id_feature

Function introduction

The id_feature operator represents a discrete feature. It supports single-value discrete features, such as user IDs and item IDs, along with multi-value discrete features, such as product colors that can have multiple values.

Configuration

{
  "feature_type": "id_feature",
  "feature_name": "item_is_main",
  "expression": "item:is_main",
  "need_prefix": true,
  "separator": "\u001D",
  "default_value": ""
}

Field name

Required

Description

feature_name

Yes

The prefix for the output feature.

expression

Yes

The source field that the feature depends on.

need_prefix

No

Specifies whether to add the feature_name as a prefix. Valid values:

  • true: Adds feature_name as a prefix to each output value.

  • false (default): Does not add the prefix.

value_type

No

The data type of the output feature. The default value is string.

separator

No

The separator for multi-value input features. The default value is \u001D. Only a single character is supported.

default_value

No

The default value to use when the input feature is empty.

weighted

No

Specifies whether the input is in key:value format. If you set this parameter to true, the operator outputs both the feature value and its weight as a Map.

value_dimension

No

This parameter truncates the output if a feature has multiple values. The default value is 0, which specifies no truncation.

If the value is 1, the schema type of the output table is value_type. Otherwise, the schema type is array<value_type>.

stub_type

No

The default value is false. If you set this parameter to true, the configured feature transform is used only as an intermediate result in the pipeline and is not output to the model.

Example

The following examples use the item-side feature is_main to show feature input and output with different configurations:

Type | Value of item:is_main | Output feature
int64_t | 100 | item_is_main_100
double | 5.2 | item_is_main_5.2
string | abc | item_is_main_abc
Multi-value string | abc^]bcd | [item_is_main_abc, item_is_main_bcd]
Multi-value int | 123^]456 | [item_is_main_123, item_is_main_456]

The ^] symbol represents the multi-value separator. This is a single character with the ASCII code "\x1D", which can also be written as "\u001d".
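The input-to-output mapping in the preceding example can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the operator's actual implementation: the function name id_feature and its signature are hypothetical, and dropping empty tokens is an assumption.

```python
SEP = "\x1d"  # default multi-value separator (one character, written "\u001D" in JSON)


def id_feature(value, feature_name, need_prefix=True, separator=SEP, default_value=""):
    """Hypothetical sketch: split a raw value and optionally prefix each token."""
    # Fall back to default_value when the input is empty.
    text = default_value if value is None or value == "" else str(value)
    # Assumption: empty tokens produce no output feature.
    tokens = [t for t in text.split(separator) if t != ""]
    if need_prefix:
        tokens = [f"{feature_name}_{t}" for t in tokens]
    return tokens


print(id_feature(100, "item_is_main"))
print(id_feature("abc\x1dbcd", "item_is_main"))
```

In this sketch a single-value input yields a one-element list, whereas the example table shows a bare value for single-value inputs.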

raw_feature

Function introduction

The raw_feature operator represents a continuous feature. It supports numeric types such as int, float, and double, and handles both single-value and multi-value continuous features.

Configuration

{
 "feature_type" : "raw_feature",
 "feature_name" : "ctr",
 "expression" : "item:ctr",
 "normalizer" : "method=log10"
}

Field name

Required

Description

feature_name

Yes

The feature name.

expression

Yes

The source field that the feature depends on. The source must be user, item, or context.

normalizer

No

The normalization method. For more information, see the following sections.

value_type

No

The data type of the output feature. The default value is float.

separator

No

The separator for multi-value input features. The default value is "\u001D". Only a single character is supported.

default_value

No

The default value to use when the input feature is empty.

value_dimension

No

The dimension of the output field. The default value is 1. You can use this parameter to truncate the output in offline tasks. If the value is 1, the schema type of the output table is value_type. Otherwise, the schema type is array<value_type>.

stub_type

No

The default value is false. If you set this parameter to true, the configured feature transform is used only as an intermediate result in the pipeline and is not output to the model.

Example

The ^] symbol represents the multi-value separator. Note that this is a single character with the ASCII code "\x1D", not two characters.

Type | Value of item:ctr | Output feature
int64_t | 100 | 100
double | 100.1 | 100.1
Multi-value int | 123^]456 | [123, 456] (The input field must have the same dimension as the configured dimension).

Normalizer

The raw_feature and match_feature operators support four normalization methods: minmax, zscore, log10, and expression. The configuration and calculation methods are as follows:

  • minmax

    Example configuration: method=minmax,min=2.1,max=2.2

    Formula: x = (x - min) / (max - min)

  • zscore

    Example configuration: method=zscore,mean=0.0,standard_deviation=10.0

    Formula: x = (x - mean) / standard_deviation

  • log10

    Example configuration: method=log10,threshold=1e-10,default=-10

    Formula: x = x > threshold ? log10(x) : default;

  • expression

    Example configuration: method=expression,expr=sign(x)

    Formula: You can configure any function or expression. The variable name is fixed as x, which represents the input of the expression.
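The minmax, zscore, and log10 formulas above can be checked with a small Python sketch. The helper names parse_normalizer and normalize are hypothetical. The fallback values used for log10 when threshold and default are omitted are assumptions copied from the example configuration, not documented defaults. The expression method is left out because it needs an expression evaluator.

```python
import math


def parse_normalizer(spec):
    """Parse a string such as 'method=minmax,min=2.1,max=2.2' into a dict."""
    return dict(part.split("=", 1) for part in spec.split(","))


def normalize(x, spec):
    """Apply the minmax, zscore, or log10 formula to a single value x."""
    cfg = parse_normalizer(spec)
    method = cfg["method"]
    if method == "minmax":
        lo, hi = float(cfg["min"]), float(cfg["max"])
        return (x - lo) / (hi - lo)
    if method == "zscore":
        return (x - float(cfg["mean"])) / float(cfg["standard_deviation"])
    if method == "log10":
        # Assumed fallbacks, taken from the example configuration.
        threshold = float(cfg.get("threshold", 1e-10))
        default = float(cfg.get("default", -10))
        return math.log10(x) if x > threshold else default
    raise ValueError(f"method not covered by this sketch: {method}")


print(normalize(5.0, "method=zscore,mean=0.0,standard_deviation=10.0"))
print(normalize(100.0, "method=log10,threshold=1e-10,default=-10"))
```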

expr_feature

Function introduction

The expr_feature operator represents an expression feature. It evaluates an expression and outputs a feature value of the float type. It supports batch computing and broadcasting.

Note: When you use this feature operator, all inputs must be convertible to the double type.

Configuration

{
  "feature_type" : "expr_feature",
  "feature_name" : "ctr_sigmoid",
  "expression" : "sigmoid(pv/(1+click))",
  "variables": ["item:pv", "item:click"]
}

For example, if pv = 2 and click = 3, the value of the expression feature is sigmoid(2 / (1 + 3)) = sigmoid(0.5) ≈ 0.6224593312.
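The value above can be reproduced with the standard logistic function. The sketch below assumes that the built-in sigmoid is the usual 1 / (1 + e^(-z)), which is consistent with the documented result.

```python
import math


def sigmoid(z):
    """Standard logistic function, assumed to match the built-in sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


pv, click = 2.0, 3.0
value = sigmoid(pv / (1 + click))
print(f"{value:.10f}")  # 0.6224593312
```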

Field name

Required

Description

feature_name

Yes

The feature name.

expression

Yes

Specifies the content of the expression.

variables

Yes

The variables used in the expression, which are the input fields. The source must be user, item, or context.

separator

No

The separator for multi-value input fields of the string type. The default value is "\u001D". Only a single character is supported.

default_value

No

The default value to use when the input feature is empty.

value_dimension

No

The dimension of the output field. The default value is 0. You can use this to truncate or pad the output. If the value is 1, the schema type of the output table is value_type. Otherwise, the schema type is array<value_type>.

stub_type

No

The default value is false. If you set this parameter to true, the configured feature transform is used only as an intermediate result in the pipeline and is not output to the model.

Configuration example

{
    "feature_name": "expr_feat",
    "feature_type": "expr_feature",
    "expression": "a+b",
    "variables": ["a", "b"],
    "value_dimension": 3
}
  • Scalar and vector calculations (broadcasting)

    • If variable a=1 and variable b=[1, 2, 6], the result is [2, 3, 7].

  • Vector-to-vector element-wise computation

    • If variable a=[3, 2, 1] and variable b=[1, 2, 6], the result is [4, 4, 7].

  • Supports temporary variables and comma expressions

    • For example: x=roundp(a),(a-x)*b. In this example, x is a temporary variable and does not need to be configured in variables.

    • A comma expression is evaluated from left to right. The value of the rightmost sub-expression is returned.

    • To reduce memory overhead, reuse existing variables as temporary variables where semantically appropriate.
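The broadcasting rules for the a+b example can be illustrated with plain Python lists. The function broadcast_add is a hypothetical stand-in for the evaluator, and raising an error on mismatched vector lengths is an assumption.

```python
def broadcast_add(a, b):
    """Evaluate a+b where each operand is a scalar or a list (vector)."""
    a_is_vec = isinstance(a, list)
    b_is_vec = isinstance(b, list)
    if not a_is_vec and not b_is_vec:
        return a + b  # scalar + scalar
    if a_is_vec and b_is_vec and len(a) != len(b):
        # Assumption: vectors of different lengths are rejected.
        raise ValueError("vector operands must have the same length")
    n = len(a) if a_is_vec else len(b)
    # A scalar operand is repeated to match the vector length (broadcasting).
    av = a if a_is_vec else [a] * n
    bv = b if b_is_vec else [b] * n
    return [x + y for x, y in zip(av, bv)]


print(broadcast_add(1, [1, 2, 6]))          # [2, 3, 7]
print(broadcast_add([3, 2, 1], [1, 2, 6]))  # [4, 4, 7]
```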

Combining expression features and sequence features

{
  "features": [
    {
      "feature_name": "sphere_distance",
      "feature_type": "expr_feature",
      "expression": "sphere_dist(click_id_lng,click_id_lat,j_lng,j_lat)",
      "variables": ["user:click_id_lng", "user:click_id_lat", "item:j_lng", "item:j_lat"],
      "default_value": "0",
      "value_dimension": 3,
      "stub_type": true
    },
    {
      "feature_name": "time_diff",
      "feature_type": "expr_feature",
      "variables": ["user:cur_time", "user:clk_time_seq"],
      "expression": "cur_time-clk_time_seq",
      "default_value": "0",
      "separator": ";",
      "value_dimension": 3,
      "stub_type": true
    },
    {
      "sequence_name": "click_seq",
      "sequence_length": 3,
      "sequence_delim": ";",
      "sequence_pk": "user:click_item",
      "features": [
        {
          "feature_name": "spherical_distance",
          "feature_type": "raw_feature",
          "expression": "feature:sphere_distance",
          "default_value": "0.0"
        },
        {
          "feature_name": "time_diff_seq",
          "feature_type": "id_feature",
          "expression": "feature:time_diff",
          "default_value": "0.0",
          "num_buckets": 10000
        }
      ]
    }
  ]
}

Expressions

  • Built-in functions (scalar)

    Function name

    Number of parameters

    Description

    rnd

    0

    Generate a random number between 0 and 1

    sin

    1

    sine function

    cos

    1

    cosine function

    tan

    1

    tangent function

    asin

    1

    arcsine (inverse sine) function

    acos

    1

    arccosine (inverse cosine) function

    atan

    1

    arctangent (inverse tangent) function

    sinh

    1

    hyperbolic sine function

    cosh

    1

    hyperbolic cosine

    tanh

    1

    hyperbolic tangent function

    asinh

    1

    inverse hyperbolic sine function

    acosh

    1

    inverse hyperbolic cosine function

    atanh

    1

    inverse hyperbolic tangent function

    log2

    1

    logarithm to the base 2

    log10

    1

    logarithm to the base 10

    log

    1

    logarithm to base e (2.71828...)

    ln

    1

    logarithm to base e (2.71828...)

    exp

    1

    e raised to the power of x

    sqrt

    1

    square root of a value

    sign

    1

    sign function -1 if x<0; 1 if x>0

    abs

    1

    absolute value

    rint

    1

    round to nearest integer

    round

    1

    Rounds to the nearest integer. It always rounds half away from zero.

    roundp

    1

    Rounds to a custom precision. For example, roundp(3.14159,2)=3.14.

    mod

    2

    Modulo operation

    floor

    1

    Rounds down to the nearest integer.

    ceil

    1

    Rounds up to the nearest integer.

    trunc

    1

    Truncates to an integer by removing the decimal part.

    sigmoid

    1

    sigmoid function

    sphere_dist

    4

    spherical distance between two GPS points; arguments: (lng1, lat1, lng2, lat2)

    haversine

    4

    haversine distance between two GPS points; arguments: (lng1, lat1, lng2, lat2)

    min

    var.

    min of all arguments

    max

    var.

    max of all arguments

    sum

    var.

    sum of all arguments

    avg

    var.

    mean value of all arguments

    Note: The preceding built-in functions support batch computing and broadcasting.

  • Built-in vector operation functions

    Function name

    Number of parameters

    Description

    len

    1

    the length of a vector

    l2_norm

    1

    L2 norm of a vector

    squared_norm

    1

    squared L2 norm of a vector

    dot

    2

    dot product of two vectors

    euclid_dist

    2

    euclidean distance between two vectors

    corr

    2

    Pearson Correlation Coefficient of two vectors

    std_dev

    1

    standard deviation of a vector, divide n

    pop_std_dev

    1

    population standard deviation of a vector, divide n-1

    variance

    1

    sample variance of a vector, divide n

    pop_variance

    1

    population variance of a vector, divide n-1

    reduce_min

    1

    reduce min of a vector

    reduce_max

    1

    reduce max of a vector

    reduce_sum

    1

    reduce sum of a vector

    reduce_mean

    1

    reduce mean of a vector

    reduce_prod

    1

    reduce product of a vector

    Note: If an expression contains the preceding built-in vector operation functions, other variables that are not vector function parameters must be scalars.

  • Built-in binary operators

    Operator

    Description

    Priority

    =

    assignment *

    0

    ||

    logical or

    1

    &&

    logical and

    2

    |

    bitwise or

    3

    &

    bitwise and

    4

    <=

    less or equal

    5

    >=

    greater or equal

    5

    !=

    not equal

    5

    ==

    equal

    5

    >

    greater than

    5

    <

    less than

    5

    +

    addition

    6

    -

    subtraction

    6

    *

    multiplication

    7

    /

    division

    7

    %

    modulo

    7

    ^

    raise x to the power of y

    8

    * The assignment operator is special because it modifies one of its arguments and can only be applied to variables.

  • Built-in ternary operator

    Supports if-else syntax.

    It uses lazy evaluation to ensure only the necessary branch of the expression is evaluated.

    Operator | Description | Remarks
    ?: | if-then-else operator | C++ style syntax

  • Built-in constants

    Constant | Description | Value
    _pi | The constant pi. | 3.141592653589793238462643
    _e | Euler's number. | 2.718281828459045235360287

combo_feature

Function introduction

The combo_feature operator creates a combination (Cartesian product) of multiple fields or expressions. The id_feature operator can be considered a special case of combo_feature where only one field is used for the cross-product. Typically, the fields involved in the cross-product come from different tables, such as crossing a user feature with an item feature.

Configuration

{
  "feature_type" : "combo_feature",
  "feature_name" : "comb_age_item",
  "expression" : ["user:age_class", "item:item_id"],
  "need_prefix": true,
  "separator": "\u001D",
  "default_value": ""
}

Field name

Required

Description

feature_name

Yes

The prefix for the output feature.

expression

Yes

A list that specifies the source fields that the feature depends on.

need_prefix

No

Specifies whether to add the feature_name as a prefix. Valid values:

  • true: Adds feature_name as a prefix to each output value.

  • false (default): Does not add the prefix.

value_type

No

The data type of the output feature. The default value is string.

separator

No

The separator for multi-value input features. The default value is "\u001D". Only a single character is supported.

default_value

No

The default value to use when the input feature is empty.

value_dimension

No

The default value is 0. You can use this parameter to truncate the output in offline tasks. If the value is 1, the schema type of the output table is value_type. Otherwise, the schema type is array<value_type>.

stub_type

No

The default value is false. If you set this parameter to true, the configured feature transform is used only as an intermediate result in the pipeline and is not output to the model.

Example

The ^] symbol represents the multi-value separator. Note that this is a single character with the ASCII code "\x1D", not two characters.

Value of user:age_class | Value of item:item_id | Output feature
123 | 45678 | comb_age_item_123_45678
abc, bcd | 45678 | [comb_age_item_abc_45678, comb_age_item_bcd_45678]
abc, bcd | 12345^]45678 | [comb_age_item_abc_12345, comb_age_item_abc_45678, comb_age_item_bcd_12345, comb_age_item_bcd_45678]

The number of output features is:

|F1| * |F2| * ... * |Fn|

|Fn| is the number of values in the nth dependent field.
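The cross-product and the output count can be sketched with itertools.product. The function combo_feature below is a hypothetical illustration, not the operator's implementation. It assumes that each multi-value input arrives as a string delimited by the separator, and that the crossed values are joined with an underscore as in the example outputs.

```python
from itertools import product

SEP = "\x1d"  # default multi-value separator


def combo_feature(values, feature_name, need_prefix=True, separator=SEP):
    """Cross the values of several fields; one output per combination."""
    fields = [str(v).split(separator) for v in values]
    outputs = []
    for combination in product(*fields):
        joined = "_".join(combination)
        outputs.append(f"{feature_name}_{joined}" if need_prefix else joined)
    return outputs


result = combo_feature(["abc\x1dbcd", "12345\x1d45678"], "comb_age_item")
print(result)
print(len(result))  # 2 * 2 = 4 combinations
```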

lookup_feature

Function introduction

Similar to match_feature, the lookup_feature operator matches and retrieves a required result from a set of key-value pairs.

The lookup_feature operator depends on two fields: map and key.

  • The map is a dictionary or a multi-value string (MultiString) field where each string is in a format such as "k1:v1".

  • The key can be a field of any type. If you have multiple keys, an array-type input is recommended. When a feature is generated, the value of the key is retrieved, transformed into the key-value type of the map, and then matched against the key-value pairs in the map field to obtain the final feature.

Configuration

{
  "feature_type": "lookup_feature",
  "feature_name": "item_match_item",
  "map": "item:item_attr",
  "key": "item:item_value",
  "need_discrete": true,
  "need_key": true
}

Field name

Required

Description

feature_name

Yes

The prefix for the output feature.

map

Yes

The content of the dictionary, which is a set of key-value pairs.

key

Yes

The key to look up in the dictionary.

value_type

No

The data type of the output feature. The default value is string.

separator

No

The separator for the multi-value key field of the string type. The default value is "\u001D". Only a single character is supported.

default_value

No

The default value to use when the input feature is empty.

need_prefix

No

Specifies whether to add the feature_name as a prefix. Valid values:

  • true: Adds feature_name as a prefix to each output value.

  • false (default): Does not add the prefix.

need_key

No

Specifies whether to add the key as a prefix. This parameter takes effect only when value_type is set to string. Valid values:

  • true: Adds the key as a prefix.

  • false (default): Does not add the prefix.

normalizer

No

The normalization method. This parameter has the same meaning as the parameter of the same name for raw_feature.

combiner

No

Specifies how to merge multiple values that are retrieved by multiple keys. Valid values: sum (default), avg/mean, max, and min.

need_discrete

No

Specifies whether to skip the combiner. If you set this parameter to true, the operator does not execute the combiner and directly outputs the multiple retrieved values. The default value is false.

value_dimension

No

The default value is 0. You can use this parameter to truncate the output in offline tasks. If the value is 1, the schema type of the output table is value_type. Otherwise, the schema type is array<value_type>.

stub_type

No

The default value is false. If you set this parameter to true, the configured feature transform is used only as an intermediate result in the pipeline and is not output to the model.

  • This operator supports binning. For more information, see Feature binning (discretization).

  • The dictionary supports map type inputs, and the key supports array type inputs.

Example

For the preceding configuration, assume that for a specific document:

item_attr : "k1:v1^]k2:v2^]k3:v3"

The ^] symbol represents the multi-value separator. This is a single character with the ASCII code "\x1D", not two characters. You can enter this character in emacs by pressing C-q C-5, or in vim by pressing C-v C-5. Here, item_attr is a multi-value string.

When the map represents multiple key-value pairs, it is a multi-value string, not a single string.

item_value : "k2"

The result of the feature transform is item_match_item_k2_v2.

need_prefix == true

feature_name: fg
map: {"k1:123", "k2:234", "k3:3"}
key: {"k1"}
Result: feature={"fg_123"}

need_prefix == false

map: {"k1:123", "k2:234", "k3:3"}
key: {"k1"}
Result: feature={123}

Merging query results

If there are multiple keys, you can configure a combiner to combine the multiple retrieved values. Possible configurations are sum, mean, max, and min.

To use a combiner, set need_discrete to false. In this case, the value must be a numeric type or a string that can be converted to a numeric type.
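The lookup and merge steps described above can be sketched as follows. The names parse_map and lookup_feature are hypothetical, and the sketch makes assumptions that the text does not confirm: keys missing from the map are skipped, and None is returned when nothing matches.

```python
SEP = "\x1d"  # default multi-value separator


def parse_map(map_value, separator=SEP):
    """Turn 'k1:v1<SEP>k2:v2' (or a list of 'k:v' strings) into a dict."""
    if isinstance(map_value, (list, tuple)):
        pairs = map_value
    else:
        pairs = map_value.split(separator)
    table = {}
    for pair in pairs:
        key, _, value = pair.partition(":")
        table[key] = value
    return table


COMBINERS = {
    "sum": sum,
    "max": max,
    "min": min,
    "mean": lambda values: sum(values) / len(values),
}
COMBINERS["avg"] = COMBINERS["mean"]


def lookup_feature(map_value, keys, feature_name="", need_prefix=False,
                   need_discrete=False, combiner="sum"):
    table = parse_map(map_value)
    # Assumption: keys that are absent from the map are skipped.
    hits = [table[k] for k in keys if k in table]
    if need_discrete:
        # No combiner: output every retrieved value, optionally prefixed.
        return [f"{feature_name}_{v}" if need_prefix else v for v in hits]
    if not hits:
        return None  # Assumption: no match yields no value.
    # The combiner requires values that can be converted to numbers.
    return COMBINERS[combiner]([float(v) for v in hits])


pairs = ["k1:123", "k2:234", "k3:3"]
print(lookup_feature(pairs, ["k1"], "fg", need_prefix=True, need_discrete=True))
print(lookup_feature(pairs, ["k1", "k2"], combiner="sum"))
```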

match_feature

Function introduction

The match_feature operator is typically used to define matching relationships between features. It implements a two-level map matching process.

Configuration

The configuration file uses the JSON format.

{
  "feature_name": "user__l1_ctr_1",
  "feature_type": "match_feature",
  "category": "ALL",
  "need_discrete": false,
  "item": "item:category_level1",
  "user": "user:l1_ctr_1",
  "match_type": "hit"
}
  • user: A nested dictionary (nested_dict), which is a dict of dicts.

    • The user field uses a string to describe a two-level map.

    • The | character is the separator between items in the first-level map. The ^ character is the separator between the key and value in the first-level map.

    • The , character is the separator between items in the second-level map. The : character is the separator between the key and value in the second-level map.

  • category: The primary key, which is the key for looking up in the first-level map.

    ALL is a wildcard character that indicates that all keys at this level can be matched.

  • item: The secondary key, which is the key for looking up in the second-level map.

    ALL is a wildcard character that indicates that all keys at this level can be matched.

  • need_discrete

    • true: The model uses the feature name output by match_feature and ignores the feature value.

    • false (default): The model uses the feature value output by match_feature and ignores the feature name.

  • match_type

    • hit: Outputs the matched feature. The operator uses the value of category to search in the first-level map, and then uses the value of item to search in the second-level map to get a result. If you need only one level of matching instead of two, you can set the key in the first level of the map to ALL and also set the category parameter in the feature generation configuration to "ALL".

    • multihit: Allows the category and item fields to be set to the MATCH_WILDCARD option, which is "ALL", to match multiple values.

  • normalizer

    Optional. The normalization method. This parameter has the same meaning as the parameter of the same name for raw_feature. This parameter takes effect only when need_discrete=false.

  • show_category

    Specifies whether to add the category prefix to the query result. The default value is true when need_discrete=true and match_type=hit. Otherwise, the default value is false.

  • show_item

    Specifies whether to add the item prefix to the query result. The default value is true when need_discrete=true and match_type=hit. Otherwise, the default value is false.

  • value_type

    Optional. The data type of the output feature. The default value is string.

  • separator

    Optional. The separator for the multi-value key field of the string type. The default value is "\u001D". Only a single character is supported.

  • default_value

    Optional. The default value to use when the input feature is empty.

  • value_dimension

    Optional. The default value is 0. You can use this parameter to truncate the output in offline tasks. If the value is 1, the schema type of the output table is value_type. Otherwise, the schema type is array<value_type>.

  • stub_type

    Optional. The default value is false. If you set this parameter to true, the configured feature transform is used only as an intermediate result in the pipeline and is not output to the model.

Example

Example of a user-side feature (nested dict)

For example, a string such as 50011740^50011740:0.2,36806676:0.3,122572685:0.5|50006842^16788:0.1 is converted to a two-level map as follows:

{
  "50011740": {
    "50011740": 0.2,
    "36806676": 0.3,
    "122572685": 0.5
  },
  "50006842": {
    "16788": 0.1
  }
}

hit

Example configuration for a hit match type.

{
  "feature_name": "brand_hit",
  "feature_type": "match_feature",
  "category": "item:auction_root_category",
  "need_discrete": true,
  "item": "item:brand_id",
  "user": "user:user_brand_tags_hit",
  "match_type": "hit"
}

Assume the field values are as follows:

Field | Value
user_brand_tags_hit | 50011740^107287172:0.2,36806676:0.3,122572685:0.5|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19
auction_root_category | 50006842
brand_id | 30068

  • If need_discrete=true, the operator first uses the auction_root_category value 50006842 to query user_brand_tags_hit, which returns 16788816:0.1,10122:0.2,29889:0.3,30068:19. Then, it uses 30068 to query this result, which returns the value 19. The final result is: brand_hit_50006842_30068_19.

  • If need_discrete=false, the result is: 19.0.

If you use only one level of matching, you need to change the value of category in the preceding configuration to ALL. Assume the field values are as follows:

Field | Value
user_brand_tags_hit | ALL^16788816:40,10122:40,29889:20,30068:20
brand_id | 30068

  • If need_discrete=true, the result is: brand_hit_ALL_30068_20.

  • If need_discrete=false, the result is: 20.0.

In this case, you can also use the lookup_feature operator. The value format in user_brand_tags_hit needs to be changed to: "16788816:40^]10122:40^]29889:20^]30068:20". The '^]' symbol is the multi-value separator \u001d, which is a non-printable character.

The lookup_feature operator supports complex input types such as map and array, and therefore provides better performance.
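The nested-dictionary parsing and the hit examples in this section can be sketched in Python. The function names parse_nested and match_hit are hypothetical, and returning None when no entry matches is an assumption. The discrete output keeps the matched value exactly as it appears in the input string.

```python
def parse_nested(text):
    """Parse 'cat^k:v,k:v|cat^k:v' into a two-level dict of strings."""
    nested = {}
    for entry in text.split("|"):          # first-level items
        category, _, items = entry.partition("^")
        inner = {}
        for pair in items.split(","):      # second-level items
            key, _, value = pair.partition(":")
            inner[key] = value
        nested[category] = inner
    return nested


def match_hit(user_value, category, item, feature_name, need_discrete):
    """Two-level lookup for match_type=hit."""
    value = parse_nested(user_value).get(category, {}).get(item)
    if value is None:
        return None  # Assumption: no match yields no feature.
    if need_discrete:
        return f"{feature_name}_{category}_{item}_{value}"
    return float(value)


tags = ("50011740^107287172:0.2,36806676:0.3,122572685:0.5"
        "|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19")
print(match_hit(tags, "50006842", "30068", "brand_hit", need_discrete=True))
print(match_hit(tags, "50006842", "30068", "brand_hit", need_discrete=False))
```

For one-level matching, the same call works with category set to "ALL" and a map whose only first-level key is ALL.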

overlap_feature

Function introduction

The overlap_feature operator outputs features that describe string and term matching. For example, in a search scenario, it can indicate whether the search query is contained in a product title.

Method

Description

query_common_ratio

Calculates the ratio of common terms between the query and the title to the total number of terms in the query.

The value is in the range [0, 1].

title_common_ratio

Calculates the ratio of common terms between the query and the title to the total number of terms in the title.

The value is in the range [0, 1].

is_contain

Calculates whether the entire query is contained in the title, preserving the order. Valid values:

  • 0: Not contained.

  • 1: Contained.

is_equal

Calculates whether the query is identical to the title. Valid values:

  • 0: Not identical.

  • 1: Identical.

index_of

Calculates the position of the first occurrence of the entire query in the title. Returns -1.0 if not found.

proximity_min_cover

Calculates the proximity of query terms in the title.

The value is in the range [0, length(title)]. A value of 0 indicates that there are terms that cannot be matched.

proximity_min_dist

Calculates the proximity of query terms in the title (minimum pairwise distance).

The value is in the range [0, length(title)+1]. A value of length(title)+1 indicates that no terms are matched.

proximity_max_dist

Calculates the proximity of query terms in the title (maximum pairwise distance).

The value is in the range [0, length(title)+1]. A value of length(title)+1 indicates that no terms are matched.

proximity_avg_dist

Calculates the proximity of query terms in the title (average pairwise distance).

The value is in the range [0, length(title)+1]. A value of length(title)+1 indicates that no terms are matched.

The calculation method for Term Proximity Measures features is described in the paper "An Exploration of Proximity Measures in Information Retrieval".

Assume the term sequence of the title (document) is: t1,t2,t1,t3,t5,t4,t2,t3,t4

  • MinCover is defined as the length of the shortest document segment that covers each query term at least once in a document.

  • MinDist (Minimum pair distance): Calculates the minimum of all pairwise minimum distances. For example, if Q=t1,t2,t3, then MinDist=min(1,2,3)=1.

  • MaxDist (Maximum pair distance): The opposite of MinDist. It finds the maximum value. For example, if Q=t1,t2,t3, then MaxDist=max(1,2,3)=3.

  • AveDist (Average pair distance): Calculates the average of all pairwise minimum distances. For example, if Q=t1,t2,t3, then AveDist=(1+2+3)/3=2.

Note that all aggregate operators (MinDist, MaxDist, and AveDist) are defined based on the pairwise distances between matching search query terms. When a document matches only one search query term, MinDist, AveDist, and MaxDist are all defined as the length of the document.

Configuration

{
  "feature_type" : "overlap_feature",
  "feature_name" : "is_contain",
  "query" : "user:attr1",
  "title" : "item:attr2",
  "method" : "is_contain",
  "separator" : " ",
  "normalizer" : ""
}

Field name

Required

Description

feature_type

Yes

The type of the feature.

feature_name

Yes

The prefix for the output feature.

query

Yes

The table that the query depends on. attr1 is a multi-value string.

title

Yes

The table that the title depends on. attr2 is a multi-value string.

method

Yes

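The matching method. Valid values include query_common_ratio, title_common_ratio, is_contain, is_equal, and index_of. For the full list, see the method table in the function introduction.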

separator

No

The separator character in the input. If not specified, the default is chr(29).

normalizer

No

The normalization method. This parameter has the same meaning as the parameter of the same name for raw_feature.

stub_type

No

The default value is false. If you set this parameter to true, the configured feature transform is used only as an intermediate result in the pipeline and is not output to the model.

The output of an overlap feature is of the float type.

Example 1

The query is "high,high2,fiberglass,abc", and the title is "high,quality,fiberglass,tube,for,golf,bag".

method | feature
query_common_ratio | 0.5
title_common_ratio | 0.28
is_contain | 0
is_equal | 0
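The ratio values in Example 1 can be reproduced with the sketch below. The function names mirror the method names, but the implementation is an assumption based on the method descriptions. A term counts as common if it appears anywhere in the other sequence, and an empty input yields 0.0. Note that 2/7 ≈ 0.2857, which the example lists as 0.28.

```python
def split_terms(text, separator=","):
    """Split a multi-value string into terms."""
    return text.split(separator) if text else []


def query_common_ratio(query, title, separator=","):
    q, t = split_terms(query, separator), set(split_terms(title, separator))
    if not q:
        return 0.0  # Assumption: an empty query yields 0.
    return sum(1 for term in q if term in t) / len(q)


def title_common_ratio(query, title, separator=","):
    q, t = set(split_terms(query, separator)), split_terms(title, separator)
    if not t:
        return 0.0  # Assumption: an empty title yields 0.
    return sum(1 for term in t if term in q) / len(t)


def is_equal(query, title, separator=","):
    same = split_terms(query, separator) == split_terms(title, separator)
    return 1.0 if same else 0.0


query = "high,high2,fiberglass,abc"
title = "high,quality,fiberglass,tube,for,golf,bag"
print(query_common_ratio(query, title))  # 2 of 4 query terms are in the title
print(title_common_ratio(query, title))  # 2 of 7 title terms are in the query
print(is_equal(query, title))
```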

Example 2

The method is index_of, and the title is "the cat sat on the mat".

query | feature
the cat | 0
sat | 2
the mat | 4
cap | -1
gap | -1
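The positions in Example 2 are term offsets, not character offsets. A minimal sketch of a term-level search that reproduces them is shown below. The functions index_of and is_contain are hypothetical re-implementations, and deriving is_contain from index_of is an assumption based on the method descriptions.

```python
def index_of(query, title, separator=" "):
    """Return the term offset of the first occurrence of query in title."""
    q, t = query.split(separator), title.split(separator)
    for start in range(len(t) - len(q) + 1):
        # Compare the whole query against a window of the same length.
        if t[start:start + len(q)] == q:
            return float(start)
    return -1.0


def is_contain(query, title, separator=" "):
    """1.0 if the whole query occurs in the title in order, else 0.0."""
    return 1.0 if index_of(query, title, separator) >= 0 else 0.0


title = "the cat sat on the mat"
for query in ["the cat", "sat", "the mat", "cap", "gap"]:
    print(query, index_of(query, title))
```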

sequence_feature

Function introduction

User historical behavior is an important feature. Historical behavior is typically a sequence, such as a click sequence or a purchase sequence. The entities that make up the sequence can be the items themselves or the properties of the items.

Configuration

For example, to process a user's click sequence with a length of 50, you can extract the item_id, price, and ts features for each item in the sequence. Here, ts = request_time - event_time. The configuration is as follows:

{
  "sequence_name": "click_50_seq",
  "sequence_length": 50,
  "sequence_delim": ";",
  "sequence_pk": "user:click_50_seq",
  "features": [
    {
        "feature_name": "item_id",
        "feature_type": "id_feature",
        "value_type": "string",
        "expression": "item:item_id"
    },
    {
        "feature_name": "price",
        "feature_type": "raw_feature",
        "expression": "item:price"
    },
    {
        "feature_name": "ts",
        "feature_type": "raw_feature",
        "expression": "user:ts"
    },
    {
      "feature_name": "time_diff_seq",
      "feature_type": "custom_feature",
      "operator_name": "SeqExpr",
      "operator_lib_file": "3rdparty/lib64/libseq_expr.so",
      "expression": ["user:cur_time", "user:clk_time_seq"],
      "formula": "cur_time - clk_time_seq",
      "sequence_fields": ["clk_time_seq"],
      "default_value": "0",
      "value_type": "double",
      "is_op_thread_safe": false,
      "value_dimension": 1
    }
  ]
}
  • sequence_name: The sequence name.

  • sequence_length: The maximum length of the sequence.

  • sequence_delim: The separator between elements in the sequence.

  • sequence_pk: The primary key of the sequence, such as user:click_50_seq. It stores the 50 most recent item IDs that the user clicked. The model inference service uses this field as a key to query side info.

    • The request parameters for the online inference service (EAS Processor) must include a feature whose key is the value of sequence_pk.

      • For example: click_50_seq: 5410233389955966;1832586 (The separator is the value configured for sequence_delim).

        • In the preceding example, the value of the click_50_seq feature is 5410233389955966;1832586.

    • Item-side sub-features of the sequence do not need to be passed to the model inference service in the request parameters.

      • The model inference service uses this field as a key to query the item's side info.

      • For example, in this configuration, the item_id, price features in the sequence feature do not need to be passed to the inference service in the request. Instead, they are read from the Processor's item cache and concatenated by the feature generation (FG) SDK within the Processor to ensure the format is consistent with that used during offline training.

    • User-side sub-features of the sequence must be passed to the model inference service in the request parameters.

      • The feature name is ${sequence_name}__${input_name}, such as click_50_seq__ts.

      • ${input_name} is generally configured using the expression configuration item, but the configuration may vary for different sub-feature types, and ${input_name} does not include an input domain prefix (item: or user:).

  • features: The side info of the sequence, which includes static property values of the item and behavior time information.

    • sequence_fields: Specifies the field names of the input sequence. The value is a string or a [string] array.

      • When the feature operator has only one input field, the content of that field must be a sequence. In this case, you do not need to configure sequence_fields.

      • When the feature operator has multiple input fields, if you do not configure sequence_fields, all item-side features (item:XXX) are assumed to be sequence input fields.

    • The FG input table used in offline tasks must contain columns corresponding to all sub-features.

      • When a column is a sequence (based on the sequence_fields rules), name it ${sequence_name}__${input_name}.

        • For example, in this configuration, the offline table requires four columns: click_50_seq__item_id, click_50_seq__price, click_50_seq__ts, and click_50_seq__clk_time_seq.

        • We recommend that the column type in the offline table be an array for better performance. A string type with sequence_delim as the element separator is also supported.

      • When a column is not a sequence, name it ${input_name} without a prefix.

        • For example, in this configuration, the offline table requires one non-sequence column: cur_time.

      • You can configure input_alias globally to set a shorter alias for a long column name (see the example below).

    • This operator supports binning. For more information, see Feature binning (discretization). When binning is configured, the output element type is int64, and the shape is determined by the value_dimension configuration below.

    • value_dimension (can be abbreviated as value_dim): The dimension of each element in the sequence. For sequence_raw_feature, if this is set to 1, the output type is array<float>. If set to other values, the output type is array<array<float>>. For sequence_id_feature, if this is set to 1, the output type is array<string>. If set to other values, the output type is array<array<string>>. The default value is 0.
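
The column naming rules above can be sketched in Python. This is only an illustration: the helper offline_columns is hypothetical (it is not part of the FG SDK), and it handles only sub-features whose inputs are listed in expression. Operators that declare inputs through other keys (for example query, document, or map) would need extra handling.

```python
def offline_columns(seq_cfg):
    """Return (sequence_columns, plain_columns) expected in the offline FG input table.

    Hypothetical helper: applies the sequence_fields rules to `expression` inputs only.
    """
    seq_name = seq_cfg["sequence_name"]
    seq_cols, plain_cols = [], []
    for feat in seq_cfg["features"]:
        expr = feat["expression"]
        inputs = [expr] if isinstance(expr, str) else list(expr)
        fields = feat.get("sequence_fields")
        if isinstance(fields, str):
            fields = [fields]
        for full_name in inputs:
            domain, sep, name = full_name.partition(":")
            if not sep:  # no domain prefix present
                domain, name = "", full_name
            if len(inputs) == 1:
                is_seq = True                 # a single input must be a sequence
            elif fields is None:
                is_seq = domain == "item"     # default: item-side inputs are sequences
            else:
                is_seq = name in fields       # explicit sequence_fields
            column = f"{seq_name}__{name}" if is_seq else name
            target = seq_cols if is_seq else plain_cols
            if column not in target:
                target.append(column)
    return seq_cols, plain_cols


config = {
    "sequence_name": "click_50_seq",
    "features": [
        {"feature_name": "item_id", "expression": "item:item_id"},
        {"feature_name": "price", "expression": "item:price"},
        {"feature_name": "ts", "expression": "user:ts"},
        {
            "feature_name": "time_diff_seq",
            "expression": ["user:cur_time", "user:clk_time_seq"],
            "sequence_fields": ["clk_time_seq"],
        },
    ],
}
seq_cols, plain_cols = offline_columns(config)
print(seq_cols)
print(plain_cols)
```

For this reduced configuration, the helper yields the four prefixed sequence columns and the single unprefixed cur_time column described above.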

Any feature can be configured as a sub-feature of a sequence feature, as shown in the following example:

{
  "features": [
    {
      "sequence_name": "common_seq",
      "sequence_length": 50,
      "sequence_delim": ";",
      "sequence_pk": "user:click_50_seq",
      "features": [
        {
          "feature_name": "item_id",
          "feature_type": "id_feature",
          "value_type": "String",
          "expression": "item:item_id",
          "value_dimension": 1
        },
        {
          "feature_name": "price",
          "feature_type": "raw_feature",
          "expression": "item:price"
        },
        {
          "feature_name": "ts",
          "feature_type": "raw_feature",
          "expression": "user:ts"
        },
        {
          "feature_name": "expr_feat",
          "feature_type": "expr_feature",
          "expression": "a > b",
          "variables": ["item:a", "item:b"],
          "sequence_fields": "a",
          "default_value": "0",
          "value_dimension": 1
        },
        {
          "feature_name": "lookup_feat",
          "feature_type": "lookup_feature",
          "map": "user:dict",
          "key": "item:prop",
          "separator": ",",
          "default_value": "0",
          "value_type": "float",
          "combiner": "sum",
          "boundaries": [0.0, 0.15, 0.5]
        },
        {
          "feature_name": "match_feat",
          "feature_type": "match_feature",
          "user": "user:nested_dict",
          "category": "item:pkey",
          "item": "item:skey",
          "separator": "\u001D",
          "default_value": "0",
          "matchType": "hit",
          "value_type": "float",
          "value_dimension": 1
        },
        {
          "feature_name": "bm25_score",
          "feature_type": "bm25_feature",
          "separator": " ",
          "default_value": "0",
          "query": "user:query",
          "document": "item:document",
          "sequence_fields": "query",
          "document_number": 100,
          "avg_doc_length": 6,
          "term_doc_freq_dict": {
            "this": 30,
            "example": 10,
            "document": 15
          }
        },
        {
          "feature_name": "overlap_feat",
          "feature_type": "overlap_feature",
          "query": "user:query2",
          "title": "item:title2",
          "sequence_fields": "query2",
          "method": "index_of",
          "separator": " ",
          "default_value": "-1"
        },
        {
          "feature_type": "kv_dot_product",
          "feature_name": "query_doc_sim",
          "query": "user:query3",
          "document": "item:title",
          "sequence_fields": "query3",
          "separator": "|",
          "default_value": "0"
        },
        {
          "feature_name": "seg_feat",
          "feature_type": "tokenize_feature",
          "expression": "input_a",
          "default_value": "0",
          "output_type": "word",
          "tokenizer_type": "sentencepiece",
          "vocab_file": "spmodel.model"
        },
        {
          "feature_name": "txt_norm",
          "feature_type": "text_normalizer",
          "expression": "input",
          "default_value": "<oov>",
          "parameter": 28
        },
        {
          "feature_name": "seq_combo_feat",
          "feature_type": "combo_feature",
          "expression": ["user:tags", "item:cat"],
          "sequence_fields": ["tags"],
          "separator": "_",
          "default_value": "0",
          "value_dimension": 1
        },
        {
          "feature_name": "norm_str",
          "feature_type": "str_replace_feature",
          "expression": ["user:profile"],
          "default_value": "",
          "replace_file": "synonyms.txt",
          "replacements": {
            "|": "",
            "aa": "x",
            "a": "X"
          },
          "value_dimension": 1
        },
        {
          "feature_name": "query_tokens",
          "feature_type": "regex_replace_feature",
          "expression": ["user:query_tokens"],
          "default_value": "",
          "value_type": "string",
          "regex_pattern": [ "\\|", "#", "\\(.*\\)" ],
          "replacement": "",
          "value_dimension": 1
        },
        {
          "feature_name": "slice",
          "feature_type": "slice_feature",
          "value_type": "int32",
          "expression": ["context:array"],
          "slice": "0:3",
          "value_dimension": 3,
          "num_buckets": 100000
        },
        {
          "feature_name": "mask_feature",
          "feature_type": "bool_mask_feature",
          "value_type": "float",
          "expression": [
            "user:click_items",
            "item:is_valid"
          ]
        },
        {
          "feature_name": "time_diff_seq",
          "feature_type": "custom_feature",
          "operator_name": "SeqExpr",
          "operator_lib_file": "3rdparty/lib64/libseq_expr.so",
          "expression": ["user:cur_time", "user:clk_time_seq"],
          "formula": "cur_time - clk_time_seq",
          "sequence_fields": ["clk_time_seq"],
          "default_value": "0",
          "value_type": "double",
          "is_op_thread_safe": false,
          "value_dimension": 1
        }
      ]
    }
  ],
  "input_alias": {
    "common_seq__clk_time_seq": "clk_time_seq"
  }
}

Note: The input_alias parameter configures aliases for input fields. The format is "origin_field": "alias_field". You can use a shorter name to replace the original input field name.

Tiled configuration

In most cases, you can create a sequence version of a non-sequence feature by adding the sequence_ prefix to its feature_type. Note that you must typically configure a default_value for sequence versions of features.

Examples:

Special case 1: Some feature transform types have both sequence and non-sequence versions.

In this case, set is_sequence to true or false to activate the corresponding version. The feature_type configuration item does not need the sequence_ prefix.

Examples:

Special case 2: Some feature transform types only have a sequence version, not a non-sequence version.

In this case, the feature_type configuration item does not need the sequence_ prefix.

Examples:

For these two special cases, you can add the following optional configurations:

  • sequence_length: The maximum length of the sequence. Any excess will be truncated. The default value is -1, which means no truncation.

  • sequence_delim: The separator between elements in the sequence. The default value is ;.

Configuration example:

{
  "feature_name": "clk_seq__item_id",
  "feature_type": "sequence_id_feature",
  "sequence_name": "clk_seq",
  "sequence_length": 50,
  "sequence_delim": ";",
  "expression": "item:clk_item_seq",
  "separator": "\u001D",
  "default_value": ""
},
{
  "feature_name": "clk_seq__item_price",
  "feature_type": "sequence_raw_feature",
  "sequence_name": "clk_seq",
  "sequence_length": 50,
  "sequence_delim": ";",
  "expression": "item:clk_item_prices",
  "separator": "\u001D",
  "default_value": "0"
},
{
  "feature_name": "test",
  "feature_type": "sequence_lookup_feature",
  "map": "user:prefer_tags",
  "key": "item:tags",
  "sequence_length": 2,
  "separator": ",",
  "default_value": "-1024",
  "value_type": "int32",
  "normalizer": "method=expression,expr=x+1",
  "combiner": "sum",
  "default_bucketize_value": 50,
  "num_buckets": 10000
},
{
  "feature_name": "test",
  "feature_type": "sequence_combo_feature",
  "separator": "_",
  "default_value": "0",
  "expression": ["user:f1", "item:f2"],
  "hash_bucket_size": 10000
}

In the preceding example, the input fields clk_item_seq and clk_item_prices must be sequences. They can be of the array type or the string type, with element values separated by the character configured in sequence_delim.

  • With this configuration method, the online service (Processor) does not query side info. The caller must provide the complete input in the request.

  • The input field names for tiled sequence features remain the same as configured and are not prefixed with ${sequence_name}__.
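
As a rough illustration of how a tiled sequence input is structured, the following Python sketch splits a string input first by sequence_delim and then by separator. The function parse_tiled_sequence is hypothetical, and keeping the leading elements when truncating to sequence_length is an assumption; the documentation does not state which end is dropped.

```python
def parse_tiled_sequence(raw, sequence_delim=";", separator="\u001d", sequence_length=-1):
    """Split a tiled sequence input into elements, each a list of values.

    `raw` may be a string (elements joined by sequence_delim) or an array.
    """
    if isinstance(raw, str):
        elements = raw.split(sequence_delim) if raw else []
    else:
        elements = list(raw)
    if sequence_length >= 0:
        # Assumption: truncation keeps the first sequence_length elements.
        elements = elements[:sequence_length]
    # Each element may itself be multi-valued, joined by `separator`.
    return [str(e).split(separator) for e in elements]


print(parse_tiled_sequence("red\u001dblue;green;black", sequence_length=2))
```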

Online feature generation

There are two ways to obtain behavior side info. One way is to retrieve side info from the item cache of the EasyRec Processor. It uses the field configured in sequence_pk as the primary key to find item property information from the item cache. The other way is for the user to provide the corresponding field values in the request. For example, the "ts" field in the preceding configuration means (request_time - event_time). Because this value changes with the request time, it must be obtained from the request:

user_features {
  key: "click_50_seq"
  value {
    string_feature: "9008721;34926279;22487529;73379;840804;911247;31999202;7421440;4911004;40866551"
  }
}

user_features {
  key: "click__ts"
  value {
    string_feature: "23;113;401363;401369;401375;401405;486678;486803;486922;486969"
  }
}
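
The following Python sketch shows one way a caller might assemble these two request features. It is an assumption-laden illustration: build_sequence_request is a hypothetical helper, the time unit is whatever the training data used, and the user-side key follows the ${sequence_name}__${input_name} rule described earlier.

```python
def build_sequence_request(pk_name, sequence_name, item_ids, event_times,
                           request_time, sequence_delim=";"):
    """Build the sequence_pk feature and the user-side `ts` sub-feature as strings."""
    # ts is (request_time - event_time), so it must be computed per request.
    ts_values = [request_time - t for t in event_times]
    return {
        pk_name: sequence_delim.join(str(i) for i in item_ids),
        f"{sequence_name}__ts": sequence_delim.join(str(v) for v in ts_values),
    }


features = build_sequence_request(
    pk_name="click_50_seq",
    sequence_name="click_50_seq",
    item_ids=[9008721, 34926279],
    event_times=[977, 887],
    request_time=1000,
)
print(features)
```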

tokenize_feature

Function introduction

The tokenize_feature operator tokenizes an input string and returns either the tokenized string or the token IDs. It supports tokenizer.json files as used by tokenizers-cpp.

Tokenization dictionary format:

1. https://github.com/huggingface/tokenizers

2. https://github.com/mlc-ai/tokenizers-cpp

Configuration

{
    "feature_name": "title_token",
    "feature_type": "tokenize_feature",
    "expression": "item:title",
    "default_value": "",
    "vocab_file": "tokenizer.json",
    "tokenizer_type": "sentencepiece",
    "output_type": "word_id",
    "output_delim": ","
}

Field name

Required

Description

feature_name

Yes

The feature name.

expression

Yes

The source field that the feature depends on. The source must be user, item, or context.

vocab_file

Yes

The path to the vocabulary file.

default_value

-

The default value for the input string.

tokenizer_type

-

Optional. The tokenizer type. Set this to sentencepiece, or leave it unset. If it is unset, the JSON content of the vocab_file determines which Hugging Face tokenizer to use.

output_type

-

  • word_id: Outputs the ID.

  • word: Outputs the tokenized string.

output_delim

-

The separator for the output word_id or word. This is used only in offline tasks.

stub_type

No

Optional. The default value is false. If you set this parameter to true, the configured feature transform is used only as an intermediate result in the pipeline and is not output to the model.

Example

If output_type=word_id, the operator takes a string as input and outputs a comma-separated string of token IDs.

| Type | item:title | Output feature |
| --- | --- | --- |
| string | It is good today! | 1147,310,1175,3063,2 |

Configuration file examples

| File name | Tokenizer type | Download link |
| --- | --- | --- |
| bert-base-chinese-vocab.json | WordPiece | Download link |
| tokenizer.json | BPE | Download link |
| spiece.model | sentencepiece | Download link |

text_normalizer

Function introduction

Performs text normalization. Functions include case conversion, traditional to simplified Chinese conversion, full-width to half-width character conversion, special character filtering, GBK/UTF8 encoding conversion, and Chinese character splitting.

Configuration

{
    "feature_name": "txt_norm",
    "feature_type": "text_normalizer",
    "expression": "item:title",
    "stop_char_file": "stop_char.txt",
    "max_length": 256,
    "parameter": 0,
    "remove_space": false,
    "is_gbk_input": false,
    "is_gbk_output": false
}

Field name

Required

Description

feature_name

Yes

The feature name.

expression

Yes

The source field that the feature depends on. The source must be user, item, or context.

stop_char_file

No

The path to a file that stores the special characters to be deleted. This file must use GBK encoding. If not configured, the system's built-in list of special characters is used.

max_length

-

If the input text length exceeds this value, text normalization is not performed, and the input value is output as is.

remove_space

-

Specifies whether to remove spaces.

is_gbk_input

No

Specifies whether the input is GBK encoded. false indicates that the input is UTF-8 encoded.

is_gbk_output

No

Specifies whether to use GBK encoding for the output. false indicates that the output uses UTF-8 encoding.

parameter

-

Text normalization options.

default_value

No

The default value to use when the input feature is empty.

Note:

  • The stop_char_file file must use GBK encoding.

  • Each line in the stop_char_file file can contain only one character. Otherwise, filtering will fail.

Text normalization options

For the parameter, select one or more of the following numeric values and add them together.

For example, to convert uppercase to lowercase, full-width to half-width, traditional to simplified Chinese, and filter special characters, set parameter=4+8+16+32=60.

The default value of the parameter is 60.

#define __NORMALIZED_LOWER2UPPER__   2     /* Convert lowercase to uppercase */
#define __NORMALIZED_UPPER2LOWER__   4     /* Convert uppercase to lowercase */
#define __NORMALIZED_SBC2DBC__       8     /* Convert full-width to half-width */
#define __NORMALIZED_BIG52GBK__      16    /* Convert traditional to simplified Chinese */
#define __NORMALIZED_FILTER__        32    /* Filter special characters */
#define __NORMALIZED_SPLITCHARS__    512   /* Split Chinese characters into single characters (space-separated) */
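
Because the options are bit flags, a parameter value can be composed and decoded with bitwise operations. The following Python sketch only illustrates the arithmetic; the names NORMALIZE_FLAGS, encode_parameter, and decode_parameter are hypothetical and are not part of the operator.

```python
# Flag values copied from the definitions above; descriptions are for display only.
NORMALIZE_FLAGS = {
    2: "lowercase to uppercase",
    4: "uppercase to lowercase",
    8: "full-width to half-width",
    16: "traditional to simplified Chinese",
    32: "filter special characters",
    512: "split Chinese characters",
}


def encode_parameter(*bits):
    """Combine individual option values into one parameter value."""
    value = 0
    for bit in bits:
        if bit not in NORMALIZE_FLAGS:
            raise ValueError(f"unknown normalization option: {bit}")
        value |= bit
    return value


def decode_parameter(parameter):
    """List the options that are enabled in a parameter value."""
    return [name for bit, name in NORMALIZE_FLAGS.items() if parameter & bit]


print(encode_parameter(4, 8, 16, 32))  # 60, the default
print(decode_parameter(28))
```

Decoding 28 gives uppercase-to-lowercase, full-width-to-half-width, and traditional-to-simplified conversion, without special character filtering.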

Example

{
  "feature_name": "txt_norm",
  "feature_type": "text_normalizer",
  "expression": "input_a",
  "parameter": 28
}
  • inputs=["Regular expression code generator", "HTML filtering tool", "Regular expression syntax cheat sheet", "The Cat/"]

  • outputs=["regular expression code generator", "html filtering tool", "regular expression syntax cheat sheet", "the cat/"]

bm25_feature

Function introduction

The BM25 (Best Matching) algorithm is a mainstream text matching algorithm in the field of information retrieval, typically used for search relevance scoring. It parses a query $Q$ into morphemes $q_i$. Then, for each search result $d$, it calculates the relevance score of each morpheme $q_i$ with $d$. Finally, it calculates a weighted sum of the relevance scores of $q_i$ relative to $d$ to obtain the relevance score between the query and $d$.

For Chinese text, you can use tokenization of the query as the morpheme analysis, treating each word (term) as a morpheme $q_i$.

The general formula for the BM25 algorithm is as follows:

$$\mathrm{Score}(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d)$$

Here, $Q$ represents a query, $q_i$ represents the $i$-th term in the query, $d$ represents a document, $W_i$ represents the weight of $q_i$, and $R(q_i, d)$ represents the relevance score of $q_i$ to the document $d$.

Term importance

There are multiple methods to determine the weight of a term's relevance to a document. A commonly used method is Inverse Document Frequency (IDF). The IDF formula is as follows:

$$\mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$

Here, $N$ represents the total number of documents in the corpus, and $n(q_i)$ represents the number of documents in the corpus that contain $q_i$.

The definition of IDF shows that for a given document collection, the more documents that contain $q_i$, the lower its weight. In other words, when many documents contain $q_i$, the term does not distinguish well between them. Therefore, $q_i$ is less important for determining relevance.

Term relevance

The relevance score between the term $q_i$ and the document $d$ is $R(q_i, d)$. The general form of the relevance score in BM25 is as follows:

$$R(q_i, d) = \frac{f_i \cdot (k_1 + 1)}{f_i + K} \cdot \frac{qf_i \cdot (k_2 + 1)}{qf_i + k_2}$$

$$K = k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)$$

In this formula, $k_1$, $k_2$, and $b$ are adjustment factors. They are typically set based on experience, with common values of $k_1$ between 1.2 and 2.0 and $b = 0.75$. $f_i$ is the frequency of $q_i$ in $d$, and $qf_i$ is the frequency of $q_i$ in the query. $dl$ is the length of document $d$, and $avgdl$ is the average length of all documents. Because $q_i$ appears only once in most queries, $qf_i = 1$. The formula can therefore be simplified as follows:

$$R(q_i, d) = \frac{f_i \cdot (k_1 + 1)}{f_i + K}$$

The definition of $K$ shows that the parameter $b$ adjusts the effect of document length on relevance. The larger the value of $b$, the greater the effect of document length on the relevance score, and vice versa. A greater document length results in a larger $K$ value and a smaller relevance score. A longer document has a greater chance of containing $q_i$. Therefore, for the same $f_i$ value, a long document is considered less relevant to $q_i$ than a short document is.

In summary, the relevance score formula for the BM25 algorithm is as follows:

$$\mathrm{Score}(Q, d) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f_i \cdot (k_1 + 1)}{f_i + k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}$$

The BM25 formula shows that different search relevance score calculation methods can be derived using various methods for tokenization, term weighting, and determining the relevance between a term and a document. This provides significant flexibility for algorithm design.
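
The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the operator's implementation: it assumes the classic IDF form log((N - n + 0.5) / (n + 0.5)), assumes each query term is counted once, and treats terms missing from the document-frequency dictionary as appearing in zero documents. The function name bm25_score and its argument names are hypothetical.

```python
import math


def bm25_score(query_terms, doc_terms, term_doc_freq, document_number,
               avg_doc_length, k1=1.2, b=0.75):
    """Score one document against one query with the simplified BM25 formula."""
    doc_len = len(doc_terms)
    # K grows with the document length relative to the average length.
    big_k = k1 * (1.0 - b + b * doc_len / avg_doc_length)
    score = 0.0
    for term in set(query_terms):          # assumption: each query term counted once
        f = doc_terms.count(term)          # term frequency in the document
        if f == 0:
            continue
        n = term_doc_freq.get(term, 0)     # number of documents containing the term
        idf = math.log((document_number - n + 0.5) / (n + 0.5))
        score += idf * f * (k1 + 1.0) / (f + big_k)
    return score


term_doc_freq = {"this": 30, "example": 10, "document": 15}
score = bm25_score(
    query_terms="this example".split(),
    doc_terms="this is an example document".split(),
    term_doc_freq=term_doc_freq,
    document_number=100,
    avg_doc_length=6,
)
print(score)
```

With this IDF form, a term that occurs in more than half of the documents receives a negative weight, which is one reason some implementations use a modified IDF.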

Configuration method

{
  "feature_type": "bm25_feature",
  "feature_name": "query_doc_relevance",
  "query": "user:query",
  "document": "item:title",
  "term_doc_freq_file": "term_doc_freq.txt",
  "avg_doc_length": 100.0,
  "k1": 1.2,
  "b": 0.75,
  "separator": "\u001D",
  "default_value": ""
}

Field name

Required

Description

feature_name

Yes

The name of the final output feature.

query

Yes

The source of the query field that the feature uses.

document

Yes

The source of the document field that the feature uses.

term_doc_freq_file

No

The path to a file that contains terms and the number of documents that contain each term. Each line contains one record, with the term and its document count separated by a whitespace character.

term_doc_freq_dict

No

The content is the same as term_doc_freq_file but in a dictionary format. The key is the term and the value is the number of documents that contain the term.

k1

No

A parameter for the BM25 algorithm. The default value is 1.2. Typical values include 1.2 and 2.0.

b

No

A parameter for the BM25 algorithm. The default value is 0.75.

separator

No

The separator for a multi-valued input feature. The default is \u001D. The separator must be a single character.

normalizer

No

The normalization method. For more information, see the raw_feature configuration.

default_value

No

The default value to use if the input feature is empty.

stub_type

No

The default value is false. If set to true, the configured feature transformation is used only as an intermediate result in the pipeline and is not included in the final output to the model.

  • Use either term_doc_freq_file or term_doc_freq_dict. The former has priority, and the system uses it if both are configured.

  • To use this feature for an online service, place the term_doc_freq_file file and fg.json in the same folder.

kv_dot_product

Function introduction

Calculates the dot product of two key-value index vectors or the size of the intersection of two sets.

Configuration method

{
  "feature_type": "kv_dot_product",
  "feature_name": "query_doc_sim",
  "query": "user:query",
  "document": "item:title",
  "separator": "|",
  "default_value": "0"
}

Field name

Required

Description

feature_name

Yes

The name of the output feature.

query

Yes

The source of the query field that the feature depends on.

document

Yes

The source of the document field that the feature depends on.

separator

No

Specifies the separator for multi-value input features. The default value is "\u001D". The separator must be a single character.

kv_delimiter

No

Specifies the separator between key-value pairs in the input feature. The default value is ":". The separator must be a single character.

normalizer

No

The normalization method. For more information, see the configuration of raw_feature.

default_value

No

The default value to use if the input feature is empty. The default value is 0.

stub_type

No

The default value is `false`. If this parameter is set to `true`, the configured feature transformation is used only as an intermediate result in the pipeline and is not output to the model.

  • This feature supports complex input types, such as array and map. Use complex types when possible.

  • If an input does not have a value part, the default value is 1.0. Use this behavior to calculate the size of the intersection of two sets.

  • If default_value is not configured, the default value is set to 0.

Examples

query

document

output

"a:0.5|b:0.5"

"d:0.5|b:0.5"

0.25

["a:0.5", "b:0.5"]

["d:0.5", "b:0.5"]

0.25

{"a":0.5, "b":0.5}

{"d":0.5, "b":0.5}

0.25

["a:0.5", "b:0.5"]

{"d":0.5, "b":0.5}

0.25

["a", "b", "c"]

["a", "b", "d"]

2.0

["a", "b", "c"]

"a|b|d"

2.0

["a", "b", "c"]

{"a":0.5, "b":0.5}

1.0
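
The rows above can be reproduced with a short Python sketch. It is an illustration under assumptions, not the operator's code: the helper names are hypothetical, and the separator defaults to "|" to match the examples rather than the operator default of \u001D.

```python
def parse_kv(value, separator="|", kv_delimiter=":"):
    """Turn a string, array, or map input into a {key: weight} dictionary."""
    if isinstance(value, dict):
        return {str(k): float(v) for k, v in value.items()}
    items = value.split(separator) if isinstance(value, str) else value
    result = {}
    for item in items:
        if not item:
            continue
        key, _, weight = item.partition(kv_delimiter)
        # A key without a value part gets the weight 1.0.
        result[key] = float(weight) if weight else 1.0
    return result


def kv_dot_product(query, document, separator="|", kv_delimiter=":"):
    """Sum the products of the weights of keys present in both inputs."""
    q = parse_kv(query, separator, kv_delimiter)
    d = parse_kv(document, separator, kv_delimiter)
    return sum(w * d[k] for k, w in q.items() if k in d)


print(kv_dot_product("a:0.5|b:0.5", "d:0.5|b:0.5"))           # 0.25
print(kv_dot_product(["a", "b", "c"], ["a", "b", "d"]))       # 2.0
print(kv_dot_product(["a", "b", "c"], {"a": 0.5, "b": 0.5}))  # 1.0
```

When neither input carries weights, every shared key contributes 1.0, so the result equals the size of the intersection of the two sets.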

str_replace_feature

Function introduction

The str_replace_feature is a string replacement feature. It replaces all matched substrings with specified substrings.

Overlapping matches are replaced greedily.

Configuration method

{
  "feature_name": "norm_str",
  "feature_type": "str_replace_feature",
  "expression": ["user:query"],
  "default_value": "",
  "replacements": {
    "brown": "box",
    "dogs": "jugs",
    "fox": "with",
    "jumped": "five",
    "over": "dozen",
    "quick": "my",
    "the": "pack",
    "the lazy": "liquor",
    "|": "",
    "aa": "x",
    "a": "X"
  },
  "value_dimension": 1
}

Field name

Description

feature_name

Required. The name of the output feature.

expression

Required. The expression describes the source field that the feature depends on.

default_value

Optional. The default value to use if the input feature is empty.

replacements

Optional (required if replace_file is not set). A dictionary that maps original text to replacement text.

replace_file

Optional (required if replacements is not set). A dictionary file in which each line contains one pair: the original text, a tab character (\t), and the replacement text.

is_sequence

Optional. Marks whether the feature is a sequence feature. The default value is false.

sequence_length

Optional. The maximum length of the sequence. The sequence is truncated if it exceeds this length.

sequence_delim

Optional. The separator between sequence elements. Set this only when the input is a string.

separator

Optional. This parameter is valid only when is_sequence=true. It specifies the separator for multi-value inputs. The default value is "\u001D". The separator must be a single character.

value_dimension

Optional. The default value is 0. This can be used in offline tasks to truncate the output.

stub_type

Optional. The default value is false. If set to true, the configured feature transformation is used only as an intermediate result in the pipeline and is not output to the model.

  • You can configure both replace_file and replacements. The replacement dictionaries are merged. The replacements dictionary has a higher priority.

  • Binning operations are supported. For configuration methods, see the Feature Binning (Discretization) operation documentation:

    • hash_bucket_size: Hashes the feature transformation result and performs a modulo operation.

    • vocab_list: Bins the input based on a vocabulary list and maps the input to an index in the vocabulary.

    • vocab_dict: The binning result is the value from the vocab_dict dictionary that corresponds to the feature value.

    • vocab_file: Reads the vocab_list or vocab_dict from a file.

  • Multi-value inputs of the array type are supported.

Example

The execution results of the preceding configuration are as follows:

Value of user:query

Output feature

the quick brown fox jumped over the lazy dogs

pack my box with five dozen liquor jugs

aaa

xX

Feature|Generation|Tool|Very useful

FeXtureGenerXtionToolVery useful
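
One reading of "replaced greedily" that is consistent with the first two rows above is leftmost-longest matching: at each position the longest matching key wins, and replaced text is not scanned again. The Python sketch below implements that reading. It is an assumption about the behavior, not the operator's implementation, and the name str_replace is hypothetical.

```python
import re


def str_replace(text, replacements):
    """Replace dictionary keys in one left-to-right pass, preferring longer keys."""
    # Longer keys first, so "the lazy" wins over "the" and "aa" over "a".
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: replacements[m.group(0)], text)


replacements = {
    "brown": "box", "dogs": "jugs", "fox": "with", "jumped": "five",
    "over": "dozen", "quick": "my", "the": "pack", "the lazy": "liquor",
    "|": "", "aa": "x", "a": "X",
}
print(str_replace("the quick brown fox jumped over the lazy dogs", replacements))
# pack my box with five dozen liquor jugs
print(str_replace("aaa", replacements))
# xX
```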

regex_replace_feature

Function introduction

The regex_replace_feature operator replaces substrings that match a regular expression with a specified substring.

You can configure multiple patterns. Any substring that matches one of the patterns is replaced.

Configuration method

{
  "feature_name": "query",
  "feature_type": "regex_replace_feature",
  "expression": ["user:query"],
  "regex_pattern": "\\|",
  "replacement": " ",
  "default_value": ""
}

Field name

Description

feature_name

Required. The name of the output feature.

expression

Required. The expression describes the source field that the feature depends on.

default_value

Optional. The default value to use if the input feature is empty.

regex_pattern

Required. The regular expression. Matched text segments are replaced.

replacement

Optional. The replacement text. If this parameter is empty, the matched text segments are deleted.

replace_all

Optional. Specifies whether to perform a global replacement. The default value is true. If you set this parameter to false, only the first occurrence of the pattern is replaced.

icase

Optional. Specifies whether regular expression matching ignores case. The default value is false, which means that matching is case-sensitive.

is_sequence

Optional. Marks whether the feature is a sequence feature. The default value is false.

sequence_length

Optional. The maximum length of the sequence. The sequence is truncated if it exceeds this value.

sequence_delim

Optional. The separator between sequence elements. Set this parameter only when the input is a string.

separator

Optional. This parameter is valid only when is_sequence=true. It specifies the separator for multi-valued inputs. The default value is "\u001D". Only a single character is supported.

value_dimension

Optional. The default value is 0. You can use this parameter in an offline task to truncate the output.

stub_type

Optional. The default value is false. If this parameter is set to true, the configured feature transformation is used only as an intermediate result in the pipeline and is not included in the final output to the model.

  • This feature supports binning operations. For more information about the configuration, see the Feature binning (discretization) document:

    • hash_bucket_size: Performs a hash and a modulo operation on the feature transformation result.

    • vocab_list: Bins data based on a vocabulary list and maps the input to an index in the list.

    • vocab_dict: The binning result is the value from the vocab_dict dictionary that corresponds to the feature value.

    • vocab_file: Reads a vocab_list or vocab_dict from a file.

  • This feature supports multi-valued input of the array type.

Examples

Value of user:query

Output feature

China|People|Republic

China People Republic

Feature|Generation|Tool|Is great

Feature Generation Tool Is great
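
The behavior of the configuration above can be approximated with Python's re module. This is a sketch under assumptions: the operator's regular expression dialect may differ from Python's, and the function name regex_replace is hypothetical. The replace_all and icase arguments mirror the parameters in the table.

```python
import re


def regex_replace(text, patterns, replacement="", replace_all=True, icase=False):
    """Replace every substring that matches any of the patterns."""
    if isinstance(patterns, str):
        patterns = [patterns]
    flags = re.IGNORECASE if icase else 0
    # A substring that matches any one of the patterns is replaced.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns), flags)
    count = 0 if replace_all else 1  # count=0 means "replace all" in re.sub
    # A function replacement keeps the replacement text literal.
    return combined.sub(lambda m: replacement, text, count=count)


print(regex_replace("Feature|Generation|Tool|Is great", r"\|", " "))
# Feature Generation Tool Is great
print(regex_replace("a|b|c", r"\|", " ", replace_all=False))
# a b|c
```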

bool_mask_feature

Function introduction

Filters elements using a Boolean value. This is similar to tf.boolean_mask(tensor, mask).

This is a sequence feature.

Configuration

{
  "feature_name": "mask_feature",
  "feature_type": "bool_mask_feature",
  "value_type": "float",
  "expression": [
    "user:click_items",
    "item:is_valid"
  ],
  "sequence_delim": ","
}

Field name

Meaning

feature_name

Required. The `feature_name` is the prefix for the final output feature.

expression

Required. A list. The `expression` describes the source fields for the feature. The second field is the mask.

default_value

Optional. The default value to use when the input feature is empty. If this parameter is not set, the default value is 0 when value_type is a numeric type.

value_type

Required. The data type of the output feature.

sequence_length

Optional. The maximum length of the sequence. The sequence is truncated if it exceeds this length.

sequence_delim

Optional. The separator between sequence elements. This parameter is required only if the input is a string.

separator

Optional. The multi-value separator for the input. The default value is "\u001D". The separator must be a single character.

value_dimension

Optional. The default value is 0. This parameter is used in an offline task to truncate the output.

normalizer

Optional. The normalization method. This parameter applies only to numerical features. For more information, see RawFeature.

stub_type

Optional. The default value is false. If set to true, the feature transformation is used only as an intermediate result in the pipeline. It is not included in the final output to the model.

  • Supports binning operations. For more information about configuration, see Feature binning (discretization).

  • Supports multi-value inputs represented by array types and nested array types.

Examples

| Input | Mask | Output |
| --- | --- | --- |
| "123,456,90,80" | "true,false,true,false" | ["123", "90"] |
| "123,456,90,80" | [1, 0, 1, 0] | ["123", "90"] |
| [1, 2, 3, 4] | [1, 0, 1, 0] | [1, 3] |
| [1, 2, 3, 4] | "true,false,true,false" | [1, 3] |
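
The filtering shown in these examples can be sketched in Python as follows. The function bool_mask is a hypothetical illustration, and treating the strings "true" and "1" as true is an assumption about how string masks are parsed.

```python
def bool_mask(values, mask, sequence_delim=","):
    """Keep the elements of `values` whose mask entry is true."""
    def to_list(x):
        # String inputs are split by sequence_delim; arrays are used as-is.
        return x.split(sequence_delim) if isinstance(x, str) else list(x)

    def to_bool(m):
        if isinstance(m, str):
            return m.strip().lower() in ("true", "1")  # assumption
        return bool(m)

    flags = [to_bool(m) for m in to_list(mask)]
    return [v for v, keep in zip(to_list(values), flags) if keep]


print(bool_mask("123,456,90,80", "true,false,true,false"))  # ['123', '90']
print(bool_mask([1, 2, 3, 4], [1, 0, 1, 0]))                # [1, 3]
```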

Use with expression features

{
  "features": [
    {
      "feature_name": "mask",
      "feature_type": "expr_feature",
      "expression": "price>100",
      "variables": ["item:price"],
      "value_dimension": 3
    },
    {
      "feature_name": "filter_list",
      "feature_type": "bool_mask_feature",
      "expression": [
        "user:click_items",
        "feature:mask"
      ],
      "num_buckets": 10000
    }
  ]
}

slice_feature

Function introduction

Slices an input array using Python-like syntax or gets an element from a specific index.

This is a type of sequence feature.

Configuration

{
  "feature_name": "test_feature",
  "feature_type": "slice_feature",
  "value_type": "float",
  "expression": [
    "user:click_items"
  ],
  "slice": "2:4"
}

Field name

Description

feature_name

Required. The feature_name is used as the prefix for the final output feature.

expression

Required. A list. The expression describes the source field that the feature depends on.

default_value

Optional. The default value to use if the input feature is empty. If not specified, the default is 0 when value_type is a numeric type.

value_type

Required. Specifies the type of the output feature.

sequence_length

Optional. The maximum length of the sequence. If the length exceeds this value, the sequence is truncated.

sequence_delim

Optional. The separator between sequence elements. Set this parameter only if the input is a string.

separator

Optional. The separator for multi-value inputs. The default is "\u001D". Only a single character is allowed.

value_dimension

Optional. The default value is 0. This parameter can be used in an offline task to truncate the output.

normalizer

Optional. The normalization method. This is valid only for numerical features. For more information, see RawFeature.

stub_type

Optional. The default value is false. If set to true, the configured feature transformation is used only as an intermediate result in the pipeline and is not included in the final output to the model.

placeholder

A special value used in sequence features to fill empty positions and pad dimensions. The default value for floating-point numbers is NaN. The default for integers is the minimum value of the corresponding type. For more information, see the placeholder configuration item for custom feature operators.

Example

When sequence_delim="," and value_dimension=1, the inputs and outputs are as follows:

| Input | slice | Output |
| --- | --- | --- |
| "123,456,90,80" | 0 | "123" |
| "123,456,90,80" | 2 | "90" |
| "123,456,90,80" | 1:3 | ["456", "90"] |
| [1, 2, 3, 4] | :2 | [1, 2] |
| [1, 2, 3, 4] | 2: | [3, 4] |
| [1, 2, 3, 4] | 1:4:2 | [2, 4] |
| [1, 2, 3, 4] | ::-1 | [4, 3, 2, 1] |
| [1, 2, 3, 4] | 2:-1:-1 | [3, 2, 1] |
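
Most of the rows above match Python's own slicing, which the following sketch uses directly. The helper apply_slice is hypothetical. Note that Python evaluates the slice 2:-1:-1 to an empty list, so the last documented row suggests that the operator treats a stop of -1 with a negative step differently; the sketch does not attempt to reproduce that case.

```python
def apply_slice(values, spec, sequence_delim=","):
    """Apply an index ("2") or a slice ("1:4:2") written as a string."""
    if isinstance(values, str):
        values = values.split(sequence_delim)
    if ":" not in spec:
        return values[int(spec)]  # a single index returns one element
    # Empty parts map to None, exactly as in Python's slice syntax.
    parts = [int(p) if p.strip() else None for p in spec.split(":")]
    return values[slice(*parts)]


print(apply_slice("123,456,90,80", "0"))    # 123
print(apply_slice("123,456,90,80", "1:3"))  # ['456', '90']
print(apply_slice([1, 2, 3, 4], "1:4:2"))   # [2, 4]
print(apply_slice([1, 2, 3, 4], "::-1"))    # [4, 3, 2, 1]
```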