id_feature
Overview
The id_feature operator processes a discrete feature. It handles both a single-value discrete feature, such as a user ID, and a multi-value discrete feature, such as the colors available for an item.
Configuration
{
"feature_type": "id_feature",
"feature_name": "item_is_main",
"expression": "item:is_main",
"need_prefix": true,
"separator": "\u001D",
"default_value": ""
}
Parameter | Required | Description |
feature_name | Yes | The name of the output feature. This name is also used as a prefix in the generated feature value. |
expression | Yes | The source field used to generate the feature. |
need_prefix | No | Specifies whether to prepend the
|
value_type | No | The data type of the output feature. The default is |
separator | No | The multi-value separator for the input feature. The default is |
default_value | No | The default value to use when the input feature is empty. |
weighted | No | Marks whether the input is in the key:value format. If set to |
value_dimension | No | Truncates the output when a feature has multiple values. The default value is If the value is |
stub_type | No | If set to |
This operator supports feature binning. For configuration details, see Feature binning (discretization).
This operator supports multi-value inputs of type array.
Example
The following example shows the input and output for the item:is_main feature with different configurations.
Type | Value | Output feature |
int64_t | 100 | item_is_main_100 |
double | 5.2 | item_is_main_5.2 |
string | abc | item_is_main_abc |
Multi-value string | abc^]bcd | [item_is_main_abc, item_is_main_bcd] |
Multi-value int | 123^]456 | [item_is_main_123, item_is_main_456] |
The ^] symbol represents the multi-value separator. This is a single character with the ASCII code "\x1D", which can also be written as "\u001d".
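The input/output table above can be reproduced with a minimal Python sketch. This is only an illustration, not the operator's implementation: the function name `id_feature` and its keyword arguments are hypothetical, and the behavior for `need_prefix=False` (returning the bare values) is an assumption.

```python
SEP = "\x1d"  # the single-character multi-value separator (shown as ^] above)

def id_feature(raw, feature_name="item_is_main", need_prefix=True,
               separator=SEP, default_value=""):
    """Illustrative stand-in for id_feature; not the real operator."""
    # Fall back to default_value when the input is empty.
    text = default_value if raw is None or raw == "" else str(raw)
    values = text.split(separator) if text else []
    if need_prefix:
        # Assumption: the prefix is joined to each value with an underscore,
        # as in the Output feature column of the table.
        values = [f"{feature_name}_{v}" for v in values]
    return values

print(id_feature(100))            # ['item_is_main_100']
print(id_feature("abc\x1dbcd"))   # ['item_is_main_abc', 'item_is_main_bcd']
```

Single-value inputs come back as a one-element list here only to keep the sketch uniform; the real output shape depends on value_dimension.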
raw_feature
Overview
The raw_feature operator processes a continuous feature. It supports numeric types such as int, float, and double, and handles both single-value and multi-value continuous features.
Configuration
{
"feature_type" : "raw_feature",
"feature_name" : "ctr",
"expression" : "item:ctr",
"normalizer" : "method=log10"
}
Parameter | Required | Description |
feature_name | Yes | Specifies the feature name. |
expression | Yes | The source field the feature depends on. Valid sources are |
normalizer | No | The normalization method. For details, see the Normalizer section. |
value_type | No | The data type of the output feature. Default: |
separator | No | The separator for a multi-value input feature. The default is |
default_value | No | The default value for an empty input feature. |
value_dimension | No | The dimension of the output field. The default is |
stub_type | No | The default is |
This operator supports feature binning. For configuration details, see Feature binning (discretization).
This operator supports multi-value array inputs.
Example
^] represents the multi-value separator, which is a single character with the ASCII encoding "\x1D", not two characters.
Type | Value | Output feature |
int64_t | 100 | 100 |
double | 100.1 | 100.1 |
Multi-value int | 123^]456 | [123, 456] (The input field's dimension must match the one specified in the |
Normalizer
The normalizer parameter of raw_feature and match_feature supports four normalizer types: minmax, zscore, log10, and expression. Each is configured and calculated as follows:
minmax
Configuration example:
method=minmax,min=2.1,max=2.2
Formula:
x' = (x - min) / (max - min)
zscore
Configuration example:
method=zscore,mean=0.0,standard_deviation=10.0
Formula:
x' = (x - mean) / standard_deviation
log10
Configuration example:
method=log10,threshold=1e-10,default=-10
Formula:
x' = log10(x) if x > threshold; otherwise, x' = default
expression
Configuration example:
method=expression,expr=sign(x)
Formula: Lets you define a custom function or expression. The input value is represented by the variable x.
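The three closed-form normalizers can be sketched in Python as follows. This is a rough illustration under stated assumptions: `parse_normalizer` and `normalize` are hypothetical helper names, the fallback values for `threshold` and `default` are taken from the configuration example rather than from a documented default, and the `expression` method is left out because it requires an expression parser.

```python
import math

def parse_normalizer(spec):
    """Split 'method=log10,threshold=1e-10' into a dict of strings."""
    return dict(part.split("=", 1) for part in spec.split(","))

def normalize(x, spec):
    conf = parse_normalizer(spec)
    method = conf["method"]
    if method == "minmax":
        lo, hi = float(conf["min"]), float(conf["max"])
        return (x - lo) / (hi - lo)
    if method == "zscore":
        return (x - float(conf["mean"])) / float(conf["standard_deviation"])
    if method == "log10":
        # Assumed fallbacks, copied from the configuration example above.
        threshold = float(conf.get("threshold", 1e-10))
        default = float(conf.get("default", -10))
        return math.log10(x) if x > threshold else default
    raise ValueError(f"unsupported method in this sketch: {method}")

print(normalize(5.0, "method=zscore,mean=0.0,standard_deviation=10.0"))  # 0.5
print(normalize(100, "method=log10"))                                     # 2.0
print(normalize(0, "method=log10,threshold=1e-10,default=-10"))           # -10.0
```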
expr_feature
Overview
The expr_feature operator evaluates a mathematical expression and returns the result as a feature value. This operator supports Batch Computing and Broadcasting.
Important: All inputs must be convertible to the double data type.
Configuration
{
"feature_type" : "expr_feature",
"feature_name" : "ctr_sigmoid",
"value_type": "float",
"expression" : "sigmoid(pv/(1+click))",
"variables": ["item:pv", "item:click"]
}
When pv = 2 and click = 3, the value of the preceding expression feature is sigmoid(2/(1+3)) = sigmoid(0.5) ≈ 0.6224593312.
Parameter | Required | Description |
feature_name | Yes | Specifies the name of the output feature. |
expression | Yes | Specifies the mathematical expression to evaluate. |
variables | Yes | Specifies the variables, or input fields, used in the expression. The source for each variable must be |
value_type | No | Optional. Specifies the data type of the output feature. Valid values are |
separator | No | Optional. Specifies the separator for multi-valued |
default_value | No | Optional. Specifies the default value to use when an input feature is empty. |
value_dimension | No | The default value is 0, which represents the dimension of the output field and can be used to truncate or pad the output. The schema type of the output table is |
stub_type | No | Optional. If set to |
Examples
{
"feature_name": "expr_feat",
"feature_type": "expr_feature",
"value_type": "float",
"expression": "a+b",
"variables": ["a", "b"],
"value_dimension": 3
}
Scalar and vector computation (Broadcasting)
When a=1 and b=[1, 2, 6], the result is [2, 3, 7].
Vector-to-vector element-wise computation
When a=[3, 2, 1] and b=[1, 2, 6], the result is [4, 4, 7].
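The broadcasting behavior described above can be mimicked in plain Python. The sketch below is only an illustration: the `broadcast` helper is hypothetical and does not model value_dimension truncation or padding.

```python
import math

def broadcast(fn, *args):
    """Apply fn element-wise; scalar arguments are repeated to the vector length."""
    lengths = {len(a) for a in args if isinstance(a, list)}
    if not lengths:
        return fn(*args)  # all inputs are scalars
    if len(lengths) > 1:
        raise ValueError("vector inputs must share one length")
    n = lengths.pop()
    columns = [a if isinstance(a, list) else [a] * n for a in args]
    return [fn(*row) for row in zip(*columns)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

add = lambda a, b: a + b
print(broadcast(add, 1, [1, 2, 6]))          # [2, 3, 7]
print(broadcast(add, [3, 2, 1], [1, 2, 6]))  # [4, 4, 7]
# The ctr_sigmoid configuration with pv=2, click=3:
print(round(broadcast(lambda pv, click: sigmoid(pv / (1 + click)), 2, 3), 10))
```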
Temporary Variables and Comma Expressions
For example: x=roundp(a),(a-x)*b. In this example, x is a temporary variable and does not need to be configured in variables.
A comma expression is evaluated from left to right, and it returns the value of the rightmost sub-expression as the final result.
To reduce memory overhead, you can reuse existing variables as temporary variables when semantically appropriate.
Combine expression and sequence features
{
"features": [
{
"feature_name": "sphere_distance",
"feature_type": "expr_feature",
"expression": "sphere_dist(click_id_lng,click_id_lat,j_lng,j_lat)",
"variables": ["user:click_id_lng", "user:click_id_lat", "item:j_lng", "item:j_lat"],
"default_value": "0",
"value_dimension": 3,
"stub_type": true
},
{
"feature_name": "time_diff",
"feature_type": "expr_feature",
"variables": ["user:cur_time", "user:clk_time_seq"],
"expression": "cur_time-clk_time_seq",
"default_value": "0",
"separator": ";",
"value_dimension": 3,
"stub_type": true
},
{
"sequence_name": "click_seq",
"sequence_length": 3,
"sequence_delim": ";",
"sequence_pk": "user:click_item",
"features": [
{
"feature_name": "spherical_distance",
"feature_type": "raw_feature",
"expression": "feature:sphere_distance",
"default_value": "0.0"
},
{
"feature_name": "time_diff_seq",
"feature_type": "id_feature",
"expression": "feature:time_diff",
"default_value": "0.0",
"num_buckets": 10000
}
]
}
]
}
Expressions
Built-in functions (scalar)
Function name | Number of parameters | Description |
rnd | 0 | Generates a random number in the range [0, 1). |
sin | 1 | Calculates the sine of a number. |
cos | 1 | Calculates the cosine of a number. |
tan | 1 | Calculates the tangent of a number. |
asin | 1 | Calculates the arcsine of a number. |
acos | 1 | Calculates the arccosine of a number. |
atan | 1 | Calculates the arctangent of a number. |
sinh | 1 | Calculates the hyperbolic sine of a number. |
cosh | 1 | Calculates the hyperbolic cosine of a number. |
tanh | 1 | Calculates the hyperbolic tangent of a number. |
asinh | 1 | Calculates the inverse hyperbolic sine of a number. |
acosh | 1 | Calculates the inverse hyperbolic cosine of a number. |
atanh | 1 | Calculates the inverse hyperbolic tangent of a number. |
log2 | 1 | Calculates the base-2 logarithm of a number. |
log10 | 1 | Calculates the base-10 logarithm of a number. |
log | 1 | Calculates the natural logarithm (base e) of a number. |
ln | 1 | Calculates the natural logarithm (base e) of a number. |
exp | 1 | Raises Euler's number (e) to the power of a number. |
sqrt | 1 | Calculates the square root of a number. |
sign | 1 | Returns the sign of a number: -1 for negative, 1 for positive, or 0 for zero. |
abs | 1 | Calculates the absolute value of a number. |
rint | 1 | Rounds a number to the nearest integer. |
round | 1 | Rounds a number to the nearest integer using the 'round half away from zero' method. |
roundp | 2 | Rounds a number to a specified precision. For example, roundp(3.14159, 2) returns 3.14. |
mod | 2 | Calculates the remainder of a division. |
floor | 1 | Rounds a number down to the nearest integer. |
ceil | 1 | Rounds a number up to the nearest integer. |
trunc | 1 | Truncates a number to an integer by removing its fractional part. |
sigmoid | 1 | Calculates the sigmoid of a number. |
sphere_dist | 4 | Calculates the spherical distance between two GPS points. Arguments: lng1, lat1, lng2, lat2. |
haversine | 4 | Calculates the Haversine distance between two GPS points. Arguments: lng1, lat1, lng2, lat2. |
min | Variable | Returns the minimum value from a list of arguments. |
max | Variable | Returns the maximum value from a list of arguments. |
sum | Variable | Returns the sum of all arguments. |
avg | Variable | Returns the average value of all arguments. |
Note: These built-in functions support Batch Computing and Broadcasting.
Built-in vector operation functions
Function name | Number of parameters | Description |
len | 1 | Returns the length (number of elements) of a vector. |
l2_norm | 1 | Performs L2 normalization on a vector. |
squared_norm | 1 | Calculates the squared L2 norm of a vector. |
dot | 2 | Calculates the dot product of two vectors. |
euclid_dist | 2 | Calculates the Euclidean distance between two vectors. |
corr | 2 | Calculates the Pearson correlation coefficient between two vectors. |
std_dev | 1 | Calculates the sample standard deviation of a vector (dividing by n-1). |
pop_std_dev | 1 | Calculates the population standard deviation of a vector (dividing by n). |
variance | 1 | Calculates the sample variance of a vector (dividing by n-1). |
pop_variance | 1 | Calculates the population variance of a vector (dividing by n). |
reduce_min | 1 | Returns the minimum value in a vector. |
reduce_max | 1 | Returns the maximum value in a vector. |
reduce_sum | 1 | Returns the sum of all elements in a vector. |
reduce_mean | 1 | Returns the average value of all elements in a vector. |
reduce_prod | 1 | Returns the product of all elements in a vector. |
Note: If an expression includes a built-in vector operation function, all other variables in the expression must be scalars.
Built-in binary operators
Operator | Description | Priority |
= | Assignment. This special operator modifies one of its arguments and applies only to variables. | 0 |
|| | Logical OR | 1 |
&& | Logical AND | 2 |
| | Bitwise OR | 3 |
& | Bitwise AND | 4 |
<= | Less than or equal to | 5 |
>= | Greater than or equal to | 5 |
!= | Not equal to | 5 |
== | Equal to | 5 |
> | Greater than | 5 |
< | Less than | 5 |
+ | Addition | 6 |
- | Subtraction | 6 |
* | Multiplication | 7 |
/ | Division | 7 |
% | Modulo | 7 |
^ | Raises x to the power of y | 8 |
Built-in ternary operator
Supports if-then-else logic using C-style syntax.
It uses lazy evaluation, which means it evaluates only the necessary branch of the expression.
Operator | Description | Syntax |
?: | If-then-else operator | condition ? value_if_true : value_if_false |
Built-in constants
Constant | Description | Value |
_pi | The mathematical constant pi (π). | 3.141592653589793 |
_e | The mathematical constant e, also known as Euler's number. | 2.718281828459045 |
combo_feature
Overview
The combo_feature operator creates a feature combination (a Cartesian product) from multiple input fields or expressions. This process is also known as feature crossing. The id_feature operator can be viewed as a special case of combo_feature in which only one field takes part in the crossing. Typically, the fields being crossed come from different data sources, for example a user feature crossed with an item feature.
Configuration
{
"feature_type" : "combo_feature",
"feature_name" : "comb_age_item",
"expression" : ["user:age_class", "item:item_id"],
"need_prefix": true,
"separator": "\u001D",
"default_value": ""
}
Parameter | Required | Description |
feature_name | Yes | Specifies the prefix for the output feature. |
expression | Yes | An array that specifies the source Fields the feature depends on. |
need_prefix | No | Indicates whether to prepend the
|
value_type | No | Specifies the data type of the output feature. The default value is |
separator | No | Specifies the multi-value separator for input features. The default value is |
default_value | No | Specifies the default value to use when an input feature is empty. |
value_dimension | No | The default value is 0, which can be used in offline tasks to truncate the output. If the value is 1, the schema type of the output table is |
stub_type | No | The default value is |
This operator supports Feature binning. For more information, see Feature binning (discretization).
This operator supports multi-value inputs of the array type.
Example
The ^] symbol represents the multi-value separator. This symbol is a single character with the ASCII code \x1D, not two separate characters.
user:age_class | item:item_id | Output feature |
123 | 45678 | comb_age_item_123_45678 |
abc, bcd | 45678 | [comb_age_item_abc_45678, comb_age_item_bcd_45678] |
abc, bcd | 12345^]45678 | [comb_age_item_abc_12345, comb_age_item_abc_45678, comb_age_item_bcd_12345, comb_age_item_bcd_45678] |
The number of output features is calculated as:
|F1| * |F2| * ... * |Fn|
where |Fn| represents the number of values in the nth input field.
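The crossing in the example table can be sketched with `itertools.product`. This is a minimal illustration, not the operator itself; the function name `combo_feature` and its arguments are hypothetical, and all inputs are treated as strings split on the multi-value separator.

```python
from itertools import product

SEP = "\x1d"  # the single-character multi-value separator (shown as ^] above)

def combo_feature(fields, feature_name="comb_age_item", separator=SEP):
    """Cross every value of every input field and prefix the feature name."""
    value_lists = [str(f).split(separator) for f in fields]
    return ["_".join([feature_name, *combo]) for combo in product(*value_lists)]

crossed = combo_feature(["abc\x1dbcd", "12345\x1d45678"])
print(crossed)
# The output size is |F1| * |F2| = 2 * 2 = 4.
print(len(crossed))
```

`product` iterates the rightmost field fastest, which happens to match the order of the last row in the table.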
lookup_feature
Overview
The lookup_feature operator is similar to match_feature. It retrieves a value from a set of key-value pairs.
This operator requires the map and key parameters:
map is a dictionary type or a field of the MultiString type, where each string has the format "k1:v1".
The key can be a field of any type. An array-type input is recommended for multiple keys.
To generate a feature, the operator retrieves the value of the key field, converts it to the key type of the map, and then matches it against the key-value pairs in the map field to obtain the final feature.
Configuration
{
"feature_type": "lookup_feature",
"feature_name": "item_match_item",
"map": "item:item_attr",
"key": "item:item_value",
"need_discrete": true,
"need_key": true
}
Parameter | Required | Description |
feature_name | Yes | Specifies the prefix for the output feature. |
map | Yes | Specifies the dictionary that contains the set of key-value pairs. |
key | Yes | Specifies the key to look up in the dictionary. |
value_type | No | Specifies the data type of the output feature. The default is |
separator | No | Specifies the multi-value separator for the |
default_value | No | Specifies the default value to use when the input |
need_prefix | No | Controls whether to prepend the
|
need_key | No | Controls whether to prepend the
|
normalizer | No | Specifies the normalization method. This parameter works like the |
combiner | No | Specifies the aggregation method to merge values retrieved from multiple keys. Valid values: |
need_discrete | No | Controls whether to return multiple values as a discrete array. If set to |
value_dimension | No | Specifies the dimension of the output feature. This parameter can be used to truncate the output in an offline task.
|
stub_type | No | If set to |
This operator supports Binning. For configuration instructions, see Feature Binning (Discretization).
The map parameter accepts a dictionary object, and the key parameter accepts an array.
Example
Based on the configuration above, assume the following input data:
item_attr : "k1:v1^]k2:v2^]k3:v3"
^] represents the multi-value separator. It is a single character with the ASCII encoding "\x1D", not two characters. You can enter this character in emacs by pressing C-q C-5, and in vim by pressing C-v C-5. Here, item_attr is a multi-value string.
When using a string for the map parameter, multiple key-value pairs must be provided as a multi-value string, not a single string.
item_value : "k2"
The feature transformation result is item_match_item_k2_v2.
Example with need_prefix set to true
feature_name: fg
map: {"k1:123", "k2:234", "k3:3"}
key: {"k1"}
Result: feature={"fg_123"}
Example with need_prefix set to false
map: {"k1:123", "k2:234", "k3:3"}
key: {"k1"}
Result: feature={123}
Combining results
If you provide multiple keys, you can configure the combiner parameter to merge the retrieved values. Valid aggregation methods include sum, mean, max, and min.
If you want to use a combiner, you must set need_discrete to false. In this case, the value must be a numeric type or a string that can be converted to a numeric value.
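The lookup, prefixing, and combiner behavior described in this section can be sketched as follows. This is only an illustration under stated assumptions: `lookup_feature` here is a hypothetical Python function, the defaults chosen for `need_prefix` and `need_key` are guesses made to match the examples, and passing a `combiner` is treated as implying need_discrete=false.

```python
SEP = "\x1d"  # the single-character multi-value separator (shown as ^] above)

def lookup_feature(map_field, keys, feature_name="item_match_item",
                   need_prefix=True, need_key=False, combiner=None):
    """Illustrative stand-in for lookup_feature; not the real operator."""
    table = dict(kv.split(":", 1) for kv in map_field.split(SEP))
    hits = [(k, table[k]) for k in keys if k in table]
    if combiner is not None:
        # Values must be numeric (or numeric strings) to be combined.
        nums = [float(v) for _, v in hits]
        if not nums:
            return None
        ops = {"sum": sum, "max": max, "min": min,
               "mean": lambda xs: sum(xs) / len(xs)}
        return ops[combiner](nums)
    out = []
    for k, v in hits:
        parts = ([feature_name] if need_prefix else []) + ([k] if need_key else []) + [v]
        out.append("_".join(parts))
    return out

attrs = "k1:v1\x1dk2:v2\x1dk3:v3"
print(lookup_feature(attrs, ["k2"], need_key=True))  # ['item_match_item_k2_v2']

nums = "k1:123\x1dk2:234\x1dk3:3"
print(lookup_feature(nums, ["k1"], feature_name="fg"))   # ['fg_123']
print(lookup_feature(nums, ["k1", "k3"], combiner="sum"))  # 126.0
```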
match_feature
Overview
The match_feature operator transforms features by looking up values in a two-level nested map.
Configuration
Configure this operator in JSON format.
{
"feature_name": "user__l1_ctr_1",
"feature_type": "match_feature",
"category": "ALL",
"need_discrete": false,
"item": "item:category_level1",
"user": "user:l1_ctr_1",
"match_type": "hit"
}
user: The data source, which is a two-level nested map encoded as a string. | is the separator between items in the first-level map, and ^ is the separator between the key and value in the first-level map. , is the separator between items in the second-level map, and : is the separator between a key and its value.
category: The primary key for the first-level map lookup. ALL is a wildcard character that matches all key values at this level.
item: The secondary key for the second-level map lookup. ALL is a wildcard character that matches all key values at this level.
need_discrete
true: The operator returns a composite string of the feature name and keys. The model uses this string as the feature and ignores the matched value.
false (default): The operator returns only the matched feature value. The model uses this value directly.
match_type
hit: Returns a single matched feature. The operator queries the first-level map with the category value, and then queries the resulting second-level map with the item value to get a single result. For single-level matching, you can set the key in the first-level map to ALL and also set the category parameter to ALL.
multihit: Allows the category and item fields to use the ALL wildcard, which can return multiple matched values.
normalizer: Optional. The normalization method. It has the same meaning as the configuration with the same name in raw_feature and takes effect only when need_discrete=false.
show_category: Specifies whether to prepend the category prefix to the query result. Defaults to true when need_discrete=true and match_type=hit, and false otherwise.
show_item: Specifies whether to add the item prefix to the query result. The default value is true when need_discrete=true and match_type=hit. Otherwise, the default value is false.
value_type: Optional. Specifies the data type of the output feature. The default value is string.
separator: Optional. Specifies the multi-value separator for the key field of the string type, which defaults to "\u001D" and must be a single character.
default_value: Optional. Specifies the default value to use when the input feature is empty.
value_dimension: Optional, with a default value of 0. This parameter can be used in offline tasks to truncate the output. If the value is 1, the schema type of the output table is value_type. Otherwise, the schema type is array<value_type>.
stub_type: Optional. The default value is false. If you set this parameter to true, the pipeline uses the configured feature transformation only as an intermediate result and does not pass it to the model.
Examples
User feature: Nested dictionary
For example, the string 50011740^50011740:0.2,36806676:0.3,122572685:0.5|50006842^16788:0.1 is converted into a two-level map as follows:
{
"50011740": {
"50011740": 0.2,
"36806676": 0.3,
"122572685": 0.5
},
"50006842": {
"16788": 0.1
}
}
hit match type
{
"feature_name": "brand_hit",
"feature_type": "match_feature",
"category": "item:auction_root_category",
"need_discrete": true,
"item": "item:brand_id",
"user": "user:user_brand_tags_hit",
"match_type": "hit"
}
Assume the field values are as follows:
Parameter | Value |
user_brand_tags_hit | 50011740^107287172:0.2,36806676:0.3,122572685:0.5|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19 |
auction_root_category | 50006842 |
brand_id | 30068 |
When need_discrete is true, the operator first queries user_brand_tags_hit with the auction_root_category value (50006842), which returns 16788816:0.1,10122:0.2,29889:0.3,30068:19. It then queries that result with the brand_id (30068) to get the value 19. The final result is brand_hit_50006842_30068_19.
When need_discrete is false, the result is 19.0.
If you use only single-layer matching, you must change the value of category in the configuration above to ALL. Assume that the fields have the following values:
Parameter | Value |
user_brand_tags_hit | ALL^16788816:40,10122:40,29889:20,30068:20 |
brand_id | 30068 |
When need_discrete is true, the result is brand_hit_ALL_30068_20.
When need_discrete is false, the result is 20.0.
In this case, you can also use lookup_feature. The value of user_brand_tags_hit must then be in the format "16788816:40^]10122:40^]29889:20^]30068:20", where ^] is the multi-value separator, the non-printable character \u001d.
Because the lookup_feature operator supports complex input types like maps and arrays, it offers better performance.
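The two-level parsing and the hit match walked through above can be sketched in Python. This is an illustration only: `parse_nested_map` and `match_hit` are hypothetical names, the composite string format is copied from the worked examples, and multihit wildcard matching is not modeled.

```python
def parse_nested_map(text):
    """'c1^k1:v1,k2:v2|c2^k3:v3' -> {'c1': {'k1': 'v1', 'k2': 'v2'}, 'c2': {'k3': 'v3'}}"""
    nested = {}
    for group in text.split("|"):             # first-level items
        category, _, items = group.partition("^")
        nested[category] = dict(kv.split(":", 1) for kv in items.split(","))
    return nested

def match_hit(user, category, item, feature_name="brand_hit", need_discrete=True):
    """Look up category, then item; return None when nothing matches."""
    value = parse_nested_map(user).get(category, {}).get(item)
    if value is None:
        return None
    if need_discrete:
        return f"{feature_name}_{category}_{item}_{value}"
    return float(value)

tags = ("50011740^107287172:0.2,36806676:0.3,122572685:0.5"
        "|50006842^16788816:0.1,10122:0.2,29889:0.3,30068:19")
print(match_hit(tags, "50006842", "30068"))                       # brand_hit_50006842_30068_19
print(match_hit(tags, "50006842", "30068", need_discrete=False))  # 19.0

single = "ALL^16788816:40,10122:40,29889:20,30068:20"
print(match_hit(single, "ALL", "30068"))                          # brand_hit_ALL_30068_20
```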
overlap_feature
Overview
The overlap_feature operator calculates string matching metrics between two text inputs. For example, in search applications, you can use it to determine if a query is contained within a title.
Method | Description |
query_common_ratio | Calculates the ratio of common terms between the Returns a value in the range [0, 1]. |
title_common_ratio | Calculates the ratio of common terms between the Returns a value in the range [0, 1]. |
is_contain | Checks if the
|
is_equal | Checks if the
|
index_of | Returns the starting position of the entire |
proximity_min_cover | Calculates the proximity of The returned value is in the range [0, length(title)]. A value of 0 indicates that at least one term cannot be matched. |
proximity_min_dist | Calculates the proximity of The returned value is in the range [0, length(title) + 1]. A value of length(title) + 1 indicates that no matching term pairs were found. |
proximity_max_dist | Calculates the proximity of The returned value is in the range [0, length(title) + 1]. A value of length(title) + 1 indicates that no matching term pairs were found. |
proximity_avg_dist | Calculates the proximity of The returned value is in the range [0, length(title) + 1]. A value of length(title) + 1 indicates that no matching term pairs were found. |
The calculation methods for these term proximity measures are based on the paper "An Exploration of Proximity Measures in Information Retrieval".
Assume that the term sequence of the title (document) is: t1,t2,t1,t3,t5,t4,t2,t3,t4
MinCover is defined as the length of the shortest document segment that covers each query term at least once.
MinDist (Minimum pair distance): Calculates the minimum of all pairwise distances. For example, if the pairwise distances are 1, 2, and 3, then MinDist = min(1, 2, 3) = 1.
MaxDist (Maximum pair distance): The opposite of MinDist. It calculates the maximum of all pairwise distances. For example, if the pairwise distances are 1, 2, and 3, then MaxDist = max(1, 2, 3) = 3.
AveDist (Average pair distance): Calculates the average of all pairwise distances. For example, if the pairwise distances are 1, 2, and 3, then AveDist = (1 + 2 + 3) / 3 = 2.
Note that all aggregate operators (MinDist, MaxDist, and AveDist) are defined based on the pairwise distances between matching query terms. When a document matches only one query term, MinDist, AveDist, and MaxDist are all defined as the length of the document.
Configuration
{
"feature_type" : "overlap_feature",
"feature_name" : "is_contain",
"query" : "user:attr1",
"title" : "item:attr2",
"method" : "is_contain",
"separator" : " ",
"normalizer" : ""
}
Parameter | Required | Description |
feature_type | Yes | The type of the feature. Must be |
feature_name | Yes | The prefix for the output feature name. |
query | Yes | The source field for the |
title | Yes | The source field for the |
method | Yes | The calculation method. Valid values include |
separator | No | The delimiter for the input. If you do not specify a value, the default is |
normalizer | No | The normalization method. This parameter has the same function as the |
stub_type | No | Defaults to |
The overlap_feature operator returns a value of type float.
Example 1
Given a query of "high,high2,fiberglass,abc" and a title of "high,quality,fiberglass,tube,for,golf,bag", the operator returns the following results:
Method | Value |
query_common_ratio | 0.5 |
title_common_ratio | 0.28 |
is_contain | 0.0 |
is_equal | 0.0 |
Example 2
With method=index_of and a title of "the cat sat on the mat", the operator returns the following results:
Query | Value |
the cat | 0.0 |
sat | 2.0 |
the mat | 4.0 |
cap | -1.0 |
gap | -1.0 |
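The values in Example 1 and Example 2 are consistent with the following Python sketch. It is a reading of the examples, not the operator's implementation: the function name `overlap_feature` is reused only for illustration, the ratios are assumed to be the share of query (or title) terms that also occur in the other input, and index_of is assumed to return the term position at which the whole query first appears in the title.

```python
def overlap_feature(query, title, method, separator=" "):
    """Illustrative stand-in covering three of the methods listed above."""
    q, t = query.split(separator), title.split(separator)
    if method == "query_common_ratio":
        return sum(term in t for term in q) / len(q)
    if method == "title_common_ratio":
        return sum(term in q for term in t) / len(t)
    if method == "index_of":
        # Slide the query over the title, term by term.
        for start in range(len(t) - len(q) + 1):
            if t[start:start + len(q)] == q:
                return float(start)
        return -1.0
    raise ValueError(f"method not covered by this sketch: {method}")

query = "high,high2,fiberglass,abc"
title = "high,quality,fiberglass,tube,for,golf,bag"
print(overlap_feature(query, title, "query_common_ratio", ","))  # 0.5
print(overlap_feature(query, title, "title_common_ratio", ","))  # 2/7, listed as 0.28 above

doc = "the cat sat on the mat"
print(overlap_feature("the mat", doc, "index_of"))  # 4.0
print(overlap_feature("cap", doc, "index_of"))      # -1.0
```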
sequence_feature
Overview
A user's behavior history is a critical feature. This history is typically represented as a Sequence, such as a click Sequence or purchase Sequence. The entities that form a Sequence can be the items themselves or their properties.
How to configure
For example, to process a user's click Sequence with a length of 50, you can extract the item_id, price, and ts features for each item in the Sequence. In this case, ts is calculated as request_time - event_time. The following example shows the configuration:
{
"sequence_name": "click_50_seq",
"sequence_length": 50,
"sequence_delim": ";",
"sequence_pk": "user:click_50_seq",
"features": [
{
"feature_name": "item_id",
"feature_type": "id_feature",
"value_type": "string",
"expression": "item:item_id"
},
{
"feature_name": "price",
"feature_type": "raw_feature",
"expression": "item:price"
},
{
"feature_name": "ts",
"feature_type": "raw_feature",
"expression": "user:ts"
},
{
"feature_name": "time_diff_seq",
"feature_type": "custom_feature",
"operator_name": "SeqExpr",
"operator_lib_file": "3rdparty/lib64/libseq_expr.so",
"expression": ["user:cur_time", "user:clk_time_seq"],
"formula": "cur_time - clk_time_seq",
"sequence_fields": ["clk_time_seq"],
"default_value": "0",
"value_type": "double",
"is_op_thread_safe": false,
"value_dimension": 1
}
]
}
sequence_name: The name of the Sequence.
sequence_length: The maximum length of the Sequence.
sequence_delim: The separator between elements in the Sequence.
sequence_pk: The sequence primary key. For example, user:click_50_seq stores the 50 most recent item IDs that a user clicked. The Model Inference Service uses this field as a key to query side info.
The request parameters for the Online Inference Service (EAS Processor) must include a feature whose key is the value of sequence_pk.
For example: click_50_seq: 5410233389955966;1832586 (the separator is the value of the sequence_delim configuration)
In the example above, the value of the click_50_seq feature is 5410233389955966;1832586.
Item-side sub-features of the Sequence are not required in the request to the Model Inference Service.
The Model Inference Service uses this field as a key to query the item's side info.
For example, in this configuration, the item_id and price features in the sequence feature are not passed to the inference service in the request. Instead, the Processor uses the fg SDK to retrieve and concatenate these features from its item cache. This ensures that the data format is consistent with the format used during offline training.
User-side sub-features of the Sequence are required in the request to the Model Inference Service.
The feature name is ${sequence_name}__${input_name}, for example: click_50_seq__ts.
${input_name} is typically configured with the expression option, but this may vary for different sub-feature types.
${input_name} does not include an input domain prefix, such as item: or user:.
features: The side info of a sequence, including information such as the static attribute values of an item and behavioral time information.
sequence_fields: Specifies the field name of the input sequence. The value is a string or a [string] array.
When the feature operator has only one input field, the content of that field must be a sequence. In this case, you do not need to configure sequence_fields.
If a feature operator has multiple input fields and you do not configure sequence_fields, all item-side features (such as item:XXX) are assumed to be sequence input fields.
The input table for offline training must contain all columns corresponding to the sub-features.
When a column is a sequence (refer to the rules for sequence_fields), it is named ${sequence_name}__${input_name}.
For example, in this sample configuration, the offline table requires four columns: click_50_seq__item_id, click_50_seq__price, click_50_seq__ts, and click_50_seq__clk_time_seq.
The recommended type for a column in an offline table is the array type for better performance. The string type that uses sequence_delim as an element separator is also supported.
When the column is not a sequence, it is named ${input_name} without a prefix.
For example, in this configuration, the offline table requires one non-sequence column: cur_time
You can use the global configuration input_alias to set a shorter alias for a long column name (see the example below).
Supports binning operations. For the configuration method, see Feature Binning (Discretization). When binning is configured, the output element type is int64, and the shape is determined by the value_dimension configuration.
value_dimension (also abbreviated as value_dim): Specifies the dimension of each element in the Sequence. For a sequence_raw_feature, the output type is array<float> when this parameter is set to 1, and array<array<float>> for other values. For a sequence_id_feature, the output type is array<string> when this parameter is set to 1, and array<array<string>> for other values. The default value is 0.
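The column-naming rules for the offline table can be checked with a small Python sketch. It is an illustration under stated assumptions: `offline_columns` and `strip_domain` are hypothetical helpers, and the inputs of each sub-feature are read only from its expression option, which, as noted above, may differ for other sub-feature types.

```python
def strip_domain(field):
    """Drop an input-domain prefix such as 'user:' or 'item:'."""
    return field.split(":", 1)[1] if ":" in field else field

def offline_columns(seq_conf):
    """Return (sequence columns, non-sequence columns) for one sequence config."""
    seq_name = seq_conf["sequence_name"]
    seq_cols, plain_cols = [], []
    for feat in seq_conf["features"]:
        expr = feat["expression"]
        inputs = [expr] if isinstance(expr, str) else list(expr)
        declared = feat.get("sequence_fields")
        if isinstance(declared, str):
            declared = [declared]
        for field in inputs:
            name = strip_domain(field)
            if declared is not None:
                is_seq = name in declared            # explicit sequence_fields
            elif len(inputs) == 1:
                is_seq = True                        # a single input must be a sequence
            else:
                is_seq = field.startswith("item:")   # item-side inputs by default
            col = f"{seq_name}__{name}" if is_seq else name
            target = seq_cols if is_seq else plain_cols
            if col not in target:
                target.append(col)
    return seq_cols, plain_cols

conf = {
    "sequence_name": "click_50_seq",
    "features": [
        {"feature_name": "item_id", "expression": "item:item_id"},
        {"feature_name": "price", "expression": "item:price"},
        {"feature_name": "ts", "expression": "user:ts"},
        {"feature_name": "time_diff_seq",
         "expression": ["user:cur_time", "user:clk_time_seq"],
         "sequence_fields": ["clk_time_seq"]},
    ],
}
seq_cols, plain_cols = offline_columns(conf)
print(seq_cols)
print(plain_cols)
```

Run against the click_50_seq sample, the sketch yields the four prefixed sequence columns and the single unprefixed cur_time column.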
You can configure any feature as a sub-feature of a Sequence Feature. The following example shows the configuration:
{
"features": [
{
"sequence_name": "common_seq",
"sequence_length": 50,
"sequence_delim": ";",
"sequence_pk": "user:click_50_seq",
"features": [
{
"feature_name": "item_id",
"feature_type": "id_feature",
"value_type": "String",
"expression": "item:item_id",
"value_dimension": 1
},
{
"feature_name": "price",
"feature_type": "raw_feature",
"expression": "item:price"
},
{
"feature_name": "ts",
"feature_type": "raw_feature",
"expression": "user:ts"
},
{
"feature_name": "expr_feat",
"feature_type": "expr_feature",
"expression": "a > b",
"variables": ["item:a", "item:b"],
"sequence_fields": "a",
"default_value": "0",
"value_dimension": 1
},
{
"feature_name": "lookup_feat",
"feature_type": "lookup_feature",
"map": "user:dict",
"key": "item:prop",
"separator": ",",
"default_value": "0",
"value_type": "float",
"combiner": "sum",
"boundaries": [0.0, 0.15, 0.5]
},
{
"feature_name": "match_feat",
"feature_type": "match_feature",
"user": "user:nested_dict",
"category": "item:pkey",
"item": "item:skey",
"separator": "\u001D",
"default_value": "0",
"matchType": "hit",
"value_type": "float",
"value_dimension": 1
},
{
"feature_name": "bm25_score",
"feature_type": "bm25_feature",
"separator": " ",
"default_value": "0",
"query": "user:query",
"document": "item:document",
"sequence_fields": "query",
"document_number": 100,
"avg_doc_length": 6,
"term_doc_freq_dict": {
"this": 30,
"example": 10,
"document": 15
}
},
{
"feature_name": "overlap_feat",
"feature_type": "overlap_feature",
"query": "user:query2",
"title": "item:title2",
"sequence_fields": "query2",
"method": "index_of",
"separator": " ",
"default_value": "-1"
},
{
"feature_type": "kv_dot_product",
"feature_name": "query_doc_sim",
"query": "user:query3",
"document": "item:title",
"sequence_fields": "query3",
"separator": "|",
"default_value": "0"
},
{
"feature_name": "seg_feat",
"feature_type": "tokenize_feature",
"expression": "input_a",
"default_value": "0",
"output_type": "word",
"tokenizer_type": "sentencepiece",
"vocab_file": "spmodel.model"
},
{
"feature_name": "txt_norm",
"feature_type": "text_normalizer",
"expression": "input",
"default_value": "",
"parameter": 28
},
{
"feature_name": "seq_combo_feat",
"feature_type": "combo_feature",
"expression": ["user:tags", "item:cat"],
"sequence_fields": ["tags"],
"separator": "_",
"default_value": "0",
"value_dimension": 1
},
{
"feature_name": "norm_str",
"feature_type": "str_replace_feature",
"expression": ["user:profile"],
"default_value": "",
"replace_file": "synonyms.txt",
"replacements": {
"|": "",
"aa": "x",
"a": "X"
},
"value_dimension": 1
},
{
"feature_name": "query_tokens",
"feature_type": "regex_replace_feature",
"expression": ["user:query_tokens"],
"default_value": "",
"value_type": "string",
"regex_pattern": [ "\\|", "#", "\\(.*\\)" ],
"replacement": "",
"value_dimension": 1
},
{
"feature_name": "slice",
"feature_type": "slice_feature",
"value_type": "int32",
"expression": ["context:array"],
"slice": "0:3",
"value_dimension": 3,
"num_buckets": 100000
},
{
"feature_name": "mask_feature",
"feature_type": "bool_mask_feature",
"value_type": "float",
"expression": [
"user:click_items",
"item:is_valid"
]
},
{
"feature_name": "time_diff_seq",
"feature_type": "custom_feature",
"operator_name": "SeqExpr",
"operator_lib_file": "3rdparty/lib64/libseq_expr.so",
"expression": ["user:cur_time", "user:clk_time_seq"],
"formula": "cur_time - clk_time_seq",
"sequence_fields": ["clk_time_seq"],
"default_value": "0",
"value_type": "double",
"is_op_thread_safe": false,
"value_dimension": 1
}
]
}
],
"input_alias": {
"common_seq__clk_time_seq": "clk_time_seq"
}
}
Note: The input_alias parameter is used to configure an alias for an input field in the format "origin_field": "alias_field". This allows you to replace the original input field name with a shorter one.
Flattened configuration
Generally, you can create the sequence version by adding the sequence_ prefix to a non-sequence feature type (feature_type). Note that you must generally configure a default_value for sequence features.
Examples:
sequence_id_feature: The output value is of the string type. If you need a different type, use slice_feature instead.
sequence_raw_feature: The output value is of the float type. If you need a different type, use slice_feature instead.
sequence_combine_feature: This feature operator only has a sequence version.
Special case 1: Some feature transformation types have both Sequence and non-sequence versions.
You can activate the corresponding version by configuring is_sequence: true/false.
In this case, you do not need to add the sequence_ prefix to the feature_type parameter.
Examples:
Special case 2: Some feature transformation types only have a Sequence version.
In this case, the feature_type parameter does not require the sequence_ prefix.
Examples:
For these two special cases, you can add the following optional parameters:
sequence_length: The maximum length of the sequence. Any excess elements are truncated. The default value is -1, which indicates no truncation.
sequence_delim: The separator between sequence elements. The default value is ;.
The following example shows the configuration:
{
"feature_name": "clk_seq__item_id",
"feature_type": "sequence_id_feature",
"sequence_name": "clk_seq",
"sequence_length": 50,
"sequence_delim": ";",
"expression": "item:clk_item_seq",
"separator": "\u001D",
"default_value": ""
},
{
"feature_name": "clk_seq__item_price",
"feature_type": "sequence_raw_feature",
"sequence_name": "clk_seq",
"sequence_length": 50,
"sequence_delim": ";",
"expression": "item:clk_item_prices",
"separator": "\u001D",
"default_value": "0"
},
{
"feature_name": "test",
"feature_type": "sequence_lookup_feature",
"map": "user:prefer_tags",
"key": "item:tags",
"sequence_length": 2,
"separator": ",",
"default_value": "-1024",
"value_type": "int32",
"normalizer": "method=expression,expr=x+1",
"combiner": "sum",
"default_bucketize_value": 50,
"num_buckets": 10000
},
{
"feature_name": "test",
"feature_type": "sequence_combo_feature",
"separator": "_",
"default_value": "0",
"expression": ["user:f1", "item:f2"],
"hash_bucket_size": 10000
}
In the example above, the input fields clk_item_seq and clk_item_prices must be a sequence. This can be an array or a string whose elements are separated by the character specified by sequence_delim.
With this configuration, the Online Inference Service does not query side info. You must provide the complete input in the request.
The input field names for sequence features in a flat format remain the same as configured and are not prefixed with ${sequence_name}__.
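As a rough illustration of the flat input format, the following Python sketch splits a sequence string on sequence_delim, truncates it to sequence_length, and then splits each element on separator. The helper parse_flat_sequence is hypothetical and is not part of the feature generator; it only mirrors the parameter semantics described above.

```python
from typing import List


def parse_flat_sequence(raw: str,
                        sequence_delim: str = ";",
                        separator: str = "\x1d",
                        sequence_length: int = -1) -> List[List[str]]:
    """Split a flat sequence string into elements, then each element into values.

    Hypothetical helper for illustration only.
    """
    if not raw:
        return []
    elements = raw.split(sequence_delim)
    if sequence_length >= 0:
        # Any excess elements are dropped; -1 means no truncation.
        elements = elements[:sequence_length]
    return [element.split(separator) for element in elements]


# Three sequence elements; the second one is multi-valued.
clk_item_seq = "101;102\x1d103;104"
print(parse_flat_sequence(clk_item_seq, sequence_length=2))
```

With sequence_length set to 2, the third element is dropped and the second element keeps both of its values.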
Online feature generation
You can obtain behavior side info in two ways. The first way is to retrieve it from the item cache of the EasyRec Processor, using the field specified in sequence_pk as the primary key to look up item properties. The second way is to provide the corresponding field values in the request. For example, the "ts" field in the preceding configuration is calculated as request_time - event_time (the recommendation request time minus the user behavior time). Because this value changes with the request time, it must be obtained from the request.
user_features {
key: "click_50_seq"
value {
string_feature: "9008721;34926279;22487529;73379;840804;911247;31999202;7421440;4911004;40866551"
}
}
user_features {
key: "click__ts"
value {
string_feature: "23;113;401363;401369;401375;401405;486678;486803;486922;486969"
}
}
sequence_combine_feature
Introduction
The sequence_combine_feature operator combines the multiple values for each element in a sequence feature. It transforms a multi-value sequence into a single-value sequence by aggregating the multiple values of each element into a single value using a specified combiner.
Key capabilities
Multi-value combination: Combines the multiple values of each element in a sequence into a single value.
Flexible combination strategies: Supports multiple combination strategies, including sum, mean, max, min, and count.
Value map: Supports a value map to convert string identifiers to numeric values, which is useful for processing behavioral event sequences.
Dual separator support: Supports separate configurations for the sequence delimiter and the multi-value separator.
Configuration
Basic configuration (numeric combination)
{
"feature_name": "seq_combine_feat",
"feature_type": "sequence_combine_feature",
"expression": "user:behavior_seq",
"combiner": "sum",
"separator": "|",
"sequence_delim": ";"
}
Configuration with value map (behavioral events)
{
"feature_name": "behavior_score",
"feature_type": "sequence_combine_feature",
"expression": "user:action_events",
"combiner": "sum",
"separator": "|",
"sequence_delim": ";",
"value_map": {
"expo": 1,
"click": 2,
"buy": 4
}
}
The value map is applied first, followed by the combine operation.
Parameters
Parameter | Required | Description |
feature_name | Yes | The name of the output feature. |
feature_type | Yes | Specifies the feature type. Must be set to sequence_combine_feature. |
expression | Yes | The source of the input feature. |
combiner | No | The combination strategy. Possible values: sum, mean, max, min, and count. |
value_map | No | A map for converting strings to numeric values. The value map is applied first, followed by the combine operation. |
separator | No | The multi-value separator. Default: |
sequence_delim | No | The sequence delimiter for string inputs. This parameter is not required for array inputs and defaults to an empty string. Only a single character is supported. |
default_value | No | The default value to use when the input is empty. |
stub_type | No | Default: |
Examples
Example 1: Basic numeric combination (sum)
Configuration:
{
"feature_name": "score_sum",
"feature_type": "sequence_combine_feature",
"expression": "user:scores",
"combiner": "sum",
"separator": ",",
"sequence_delim": ";"
}
Input and output:
Input | Output | Description |
|
| The operator calculates |
|
| The operator calculates |
|
| The input is an array of strings. |
|
| The input is an array of arrays. |
Example 2: Behavioral Event Sequence (with Value Map)
Configuration:
{
"feature_name": "behavior_weight",
"feature_type": "sequence_combine_feature",
"expression": "user:actions",
"combiner": "sum",
"separator": "|",
"sequence_delim": ";",
"value_map": {
"expo": 1,
"click": 2,
"buy": 4
}
}
Input and output:
Input | Output | Description |
|
| The operator calculates |
|
| The mapped value is |
|
| The operator calculates |
|
| The input string contains multiple records separated by ;. |
|
| The input array contains multiple records. |
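The two-step behavior described in this section (apply the value map, then combine the values of each element) can be sketched in Python as follows. This is a minimal illustration, not the operator's implementation: the function sequence_combine and its input strings are hypothetical, and the sketch does not handle empty inputs or tokens that are missing from the value map.

```python
from typing import Dict, List, Optional


def combine(values: List[float], combiner: str) -> float:
    """Aggregate the values of one sequence element into a single number."""
    if combiner == "sum":
        return float(sum(values))
    if combiner == "mean":
        return float(sum(values) / len(values))
    if combiner == "max":
        return float(max(values))
    if combiner == "min":
        return float(min(values))
    if combiner == "count":
        return float(len(values))
    raise ValueError(f"unsupported combiner: {combiner}")


def sequence_combine(raw: str,
                     combiner: str,
                     separator: str = "|",
                     sequence_delim: str = ";",
                     value_map: Optional[Dict[str, float]] = None) -> List[float]:
    """Turn a multi-value sequence string into a single-value sequence."""
    result = []
    for element in raw.split(sequence_delim):
        tokens = element.split(separator)
        if value_map is not None:
            # The value map is applied first...
            values = [float(value_map[token]) for token in tokens]
        else:
            values = [float(token) for token in tokens]
        # ...followed by the combine operation.
        result.append(combine(values, combiner))
    return result


event_weights = {"expo": 1, "click": 2, "buy": 4}
print(sequence_combine("expo|click;buy;expo|click|buy", "sum", value_map=event_weights))
print(sequence_combine("1,2,3;4,5", "sum", separator=","))
```

In the first call, each element is mapped to numeric weights and summed, so the element expo|click contributes 1 + 2. In the second call, no value map is used and the numeric strings are summed directly.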
tokenize_feature
Overview
The tokenize_feature operator tokenizes an input string. It returns either the tokenized string or the corresponding token IDs. This operator supports tokenizer.json files from the tokenizers-cpp library.
For more information about the vocabulary file format, see these resources:
1. https://github.com/huggingface/tokenizers
2. https://github.com/mlc-ai/tokenizers-cpp
Configuration
{
"feature_name": "title_token",
"feature_type": "tokenize_feature",
"expression": "item:title",
"default_value": "",
"vocab_file": "tokenizer.json",
"tokenizer_type": "sentencepiece",
"output_type": "word_id",
"output_delim": ","
}
Parameter | Required | Description |
feature_name | Yes | The unique name for the output feature. |
expression | Yes | Specifies the source field that the feature depends on. The source must be user, item, or context. |
vocab_file | Yes | The path to the vocabulary file. |
default_value | No | The default value for the input string. |
tokenizer_type | No | The tokenizer type. Set this to 'sentencepiece' to use the SentencePiece tokenizer. If unspecified, the system determines the appropriate Hugging Face tokenizer based on the 'vocab_file' content. |
output_type | No |
|
output_delim | No | The separator for the |
stub_type | No | Defaults to |
Example
When output_type is word_id, the operator converts an input string into a comma-separated string of token IDs.
Type | item:title | Output feature |
string | It is good today! | 1147,310,1175,3063,2 |
Vocabulary file examples
File name | Tokenizer type | Download link |
bert-base-chinese-vocab.json | WordPiece | |
tokenizer.json | BPE | |
spiece.model | sentencepiece |
text_normalizer
Overview
The text_normalizer operator performs text normalization, including case conversion, Traditional-to-Simplified Chinese conversion, full-width to half-width character conversion, special character filtering, GBK and UTF-8 encoding conversion, and Chinese character splitting.
Configuration
{
"feature_name": "txt_norm",
"feature_type": "text_normalizer",
"expression": "item:title",
"stop_char_file": "stop_char.txt",
"max_length": 256,
"parameter": 0,
"remove_space": false,
"is_gbk_input": false,
"is_gbk_output": false
}
Parameter | Required | Description |
feature_name | Yes | The feature name. |
expression | Yes | The source field that the feature depends on. The source must be |
stop_char_file | No | Specifies the path to a file of special characters to remove. If omitted, the system uses its built-in list. |
max_length | No | If the input text length exceeds this value, the operator skips normalization and returns the original text. |
remove_space | No | Specifies whether to remove spaces. |
is_gbk_input | No | Specifies whether the input is GBK-encoded. If false, the operator assumes the input is UTF-8. |
is_gbk_output | No | Specifies whether the output is GBK-encoded. If false, the operator encodes the output as UTF-8. |
parameter | No | Text normalization options. |
default_value | No | The default value to use when the input feature is empty. |
Note:
The stop_char_file must use GBK encoding.
Each line in the stop_char_file must contain only one character to ensure successful filtering.
Text normalization options
To configure the parameter field, sum the numeric values of the desired options from the list below.
For example, to convert uppercase to lowercase, full-width to half-width, Traditional to Simplified Chinese, and filter special characters, set parameter = 4 + 8 + 16 + 32 = 60.
The default value for the parameter is 60.
#define __NORMALIZED_LOWER2UPPER__ 2 /* Convert lowercase to uppercase. */
#define __NORMALIZED_UPPER2LOWER__ 4 /* Convert uppercase to lowercase. */
#define __NORMALIZED_SBC2DBC__ 8 /* Convert full-width to half-width characters. */
#define __NORMALIZED_BIG52GBK__ 16 /* Convert Traditional Chinese to Simplified Chinese. */
#define __NORMALIZED_FILTER__ 32 /* Filter special characters. */
#define __NORMALIZED_SPLITCHARS__ 512 /* Split Chinese characters into single characters, separated by spaces. */
Example
{
"feature_name": "txt_norm",
"feature_type": "text_normalizer",
"expression": "input_a",
"parameter": 28
}
Input: ["正則生成代碼", "Html過濾工具", "正則表達式語法速查", "The Cat/"]
Output: ["正则生成代码", "html过滤工具", "正则表达式语法速查", "the cat/"]
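Because every option value is a distinct power of two, summing the selected options is the same as combining them with a bitwise OR, and a parameter value can be decoded back into its options. The following Python sketch shows both directions. The dictionary and the helper names encode_parameter and decode_parameter are hypothetical and exist only for this illustration; the numeric values are taken from the definitions above.

```python
from functools import reduce
from typing import Iterable, List

# Option values copied from the #define list above.
NORMALIZE_OPTIONS = {
    "LOWER2UPPER": 2,
    "UPPER2LOWER": 4,
    "SBC2DBC": 8,
    "BIG52GBK": 16,
    "FILTER": 32,
    "SPLITCHARS": 512,
}


def encode_parameter(names: Iterable[str]) -> int:
    """Combine the selected options into a single parameter value."""
    return reduce(lambda acc, name: acc | NORMALIZE_OPTIONS[name], names, 0)


def decode_parameter(parameter: int) -> List[str]:
    """List the options that are enabled in a parameter value."""
    return [name for name, bit in NORMALIZE_OPTIONS.items() if parameter & bit]


print(encode_parameter(["UPPER2LOWER", "SBC2DBC", "BIG52GBK", "FILTER"]))  # 60
print(decode_parameter(28))  # ['UPPER2LOWER', 'SBC2DBC', 'BIG52GBK']
```

Decoding 28 shows why the example above lowercases the text and converts Traditional Chinese to Simplified Chinese but keeps the trailing slash: the special character filter (32) is not enabled.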
bm25_feature
Overview
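The BM25 (Best Matching) algorithm is a mainstream text matching algorithm in information retrieval, typically used for search relevance scoring. It first parses a query $Q$ into terms $q_i$, scores each term against a document $d$, and then sums the weighted per-term scores to obtain the relevance score between $Q$ and $d$.
For Chinese, query tokenization serves as morpheme analysis, treating each word (term) as a morpheme $q_i$.
The general formula for the BM25 algorithm is:
$$\mathrm{Score}(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d)$$
In this formula, $Q$ is the query, $q_i$ is the $i$-th term parsed from $Q$, $d$ is a document, $W_i$ is the weight of $q_i$, and $R(q_i, d)$ is the relevance score between $q_i$ and $d$.
Term importance
There are several methods for weighting a term's relevance to a document. A common method is Inverse Document Frequency (IDF). In its standard BM25 form, the formula is:
$$\mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$
Where $N$ is the total number of documents in the collection and $n(q_i)$ is the number of documents that contain $q_i$.
The definition of IDF shows that for a given document collection, the more documents that contain $q_i$, the lower the weight of $q_i$.
Term relevance
The relevance score between a term $q_i$ and a document $d$ is typically calculated as:
$$R(q_i, d) = \frac{f_i \cdot (k_1 + 1)}{f_i + K}, \qquad K = k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)$$
In this formula, $k_1$ and $b$ are tuning parameters, $f_i$ is the frequency of $q_i$ in $d$, $dl$ is the length of $d$, and $avgdl$ is the average length of all documents.
The definition of $K$ shows that $b$ controls how much the document length affects the relevance score: the larger $b$ is, the greater the effect.
In summary, the relevance score formula for the BM25 algorithm is as follows:
$$\mathrm{Score}(Q, d) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f_i \cdot (k_1 + 1)}{f_i + k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}$$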
The BM25 formula provides significant flexibility in algorithm design, allowing for various methods of calculating search relevance scores based on different approaches to tokenization, term weighting, and term-document relevance.
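To make the scoring concrete, the following Python sketch computes BM25 in its commonly cited textbook form: each query term contributes IDF(q) * f * (k1 + 1) / (f + K), where f is the term frequency in the document and K = k1 * (1 - b + b * dl / avgdl). This is an assumption about the general algorithm, not a statement of how the bm25_feature operator is implemented internally. The function bm25_score is hypothetical; its argument names reuse the configuration keys term_doc_freq_dict, document_number, avg_doc_length, k1, and b for readability.

```python
import math
from typing import Dict, List


def bm25_score(query_terms: List[str],
               doc_terms: List[str],
               term_doc_freq_dict: Dict[str, int],
               document_number: int,
               avg_doc_length: float,
               k1: float = 1.2,
               b: float = 0.75) -> float:
    """Textbook BM25 relevance of a tokenized document to a tokenized query."""
    doc_length = len(doc_terms)
    # Length normalization: longer-than-average documents get a larger K.
    big_k = k1 * (1.0 - b + b * doc_length / avg_doc_length)
    score = 0.0
    for term in query_terms:
        freq = doc_terms.count(term)
        if freq == 0:
            continue  # terms absent from the document contribute nothing
        n = term_doc_freq_dict.get(term, 0)
        idf = math.log((document_number - n + 0.5) / (n + 0.5))
        score += idf * freq * (k1 + 1.0) / (freq + big_k)
    return score


term_doc_freq = {"this": 30, "example": 10, "document": 15}
score = bm25_score(
    query_terms=["example", "document"],
    doc_terms=["this", "example", "document"],
    term_doc_freq_dict=term_doc_freq,
    document_number=100,
    avg_doc_length=6,
)
print(round(score, 4))
```

Rarer terms receive a larger IDF weight, so in this sample "example" (10 documents) contributes more to the score than "document" (15 documents).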
Configuration
{
"feature_type": "bm25_feature",
"feature_name": "query_doc_relevance",
"query": "user:query",
"document": "item:title",
"term_doc_freq_file": "term_doc_freq.txt",
"avg_doc_length": 100.0,
"k1": 1.2,
"b": 0.75,
"separator": "\u001D",
"default_value": ""
}
Parameter | Required | Description |
feature_name | Yes | The name of the output feature. |
query | Yes | The source field for the query. |
document | Yes | The source field for the document. |
term_doc_freq_file | No | The file path to the term document frequency data. The file contains one term and its document count per line, in the format |
term_doc_freq_dict | No | An alternative to |
k1 | No | A parameter of the BM25 algorithm, typically between 1.2 and 2.0. Default: 1.2. |
b | No | A parameter of the BM25 algorithm. Default: 0.75. |
separator | No | A single-character separator for multi-valued input features. Default: |
normalizer | No | The normalization method. For details, see the raw_feature configuration. |
default_value | No | The value to use when the input feature is empty. |
stub_type | No | Default: false. If |
The term_doc_freq_file and term_doc_freq_dict parameters are mutually exclusive. If both are specified, term_doc_freq_file takes precedence.
When using this feature in an online service, place the term_doc_freq_file in the same directory as fg.json.
kv_dot_product
Overview
Computes the dot product of two key-value vectors or the size of the intersection of two sets.
Configuration
{
"feature_type": "kv_dot_product",
"feature_name": "query_doc_sim",
"query": "user:query",
"document": "item:title",
"separator": "|",
"default_value": "0"
}
Parameter | Required | Description |
feature_name | Yes | The name of the output feature. |
query | Yes | The source of the query field. |
document | Yes | The source of the document field. |
separator | No | The separator for multi-value input features. The default is |
kv_delimiter | No | The separator between key-value pairs in the input feature. The default is |
normalizer | No | Specifies the normalization method. For details, see the configuration of the raw_feature operator. |
default_value | No | Specifies the value to use if an input feature is empty. |
stub_type | No | Defaults to |
This operator supports complex input types such as arrays and maps. Use complex types for optimal performance.
If an input entry does not have a value part, its value defaults to 1.0. This behavior can be used to calculate the size of the intersection between two sets.
If you do not configure default_value, the default value is set to 0.
Example
Query | Document | Output |
"a:0.5|b:0.5" | "d:0.5|b:0.5" | 0.25 |
["a:0.5", "b:0.5"] | ["d:0.5", "b:0.5"] | 0.25 |
{"a":0.5, "b":0.5} | {"d":0.5, "b":0.5} | 0.25 |
["a:0.5", "b:0.5"] | {"d":0.5, "b":0.5} | 0.25 |
["a", "b", "c"] | ["a", "b", "d"] | 2.0 |
["a", "b", "c"] | "a|b|d" | 2.0 |
["a", "b", "c"] | {"a":0.5, "b":0.5} | 1.0 |
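The rows above can be reproduced with a short Python sketch that first converts each input (string, array, or map) into a key-to-weight map and then sums the products of the weights for shared keys. The functions to_weight_map and kv_dot_product are hypothetical, and the sketch assumes ":" as the key-value delimiter because that is what the sample inputs use.

```python
from typing import Dict, List, Union

KVInput = Union[str, List[str], Dict[str, float]]


def to_weight_map(field: KVInput, separator: str = "|",
                  kv_delimiter: str = ":") -> Dict[str, float]:
    """Normalize a string, array, or map input into a key -> weight map."""
    if isinstance(field, dict):
        return {key: float(value) for key, value in field.items()}
    entries = field.split(separator) if isinstance(field, str) else field
    weights = {}
    for entry in entries:
        if not entry:
            continue  # skip empty entries such as those from an empty string
        key, found, value = entry.partition(kv_delimiter)
        # An entry without a value part counts as weight 1.0.
        weights[key] = float(value) if found else 1.0
    return weights


def kv_dot_product(query: KVInput, document: KVInput,
                   separator: str = "|", kv_delimiter: str = ":") -> float:
    """Sum the products of the weights for keys present in both inputs."""
    q = to_weight_map(query, separator, kv_delimiter)
    d = to_weight_map(document, separator, kv_delimiter)
    return float(sum(weight * d[key] for key, weight in q.items() if key in d))


print(kv_dot_product("a:0.5|b:0.5", "d:0.5|b:0.5"))         # 0.25
print(kv_dot_product(["a", "b", "c"], "a|b|d"))             # 2.0
print(kv_dot_product(["a", "b", "c"], {"a": 0.5, "b": 0.5}))  # 1.0
```

When neither side carries weights, every shared key contributes 1.0 * 1.0, which is why the result equals the size of the set intersection.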
str_replace_feature
Overview
The str_replace_feature operator replaces all matched substrings in an input string with their specified replacements.
Note: Overlapping matches are replaced greedily.
Configuration
{
"feature_name": "norm_str",
"feature_type": "str_replace_feature",
"expression": ["user:query"],
"default_value": "",
"replacements": {
"brown": "box",
"dogs": "jugs",
"fox": "with",
"jumped": "five",
"over": "dozen",
"quick": "my",
"the": "pack",
"the lazy": "liquor",
"|": "",
"aa": "x",
"a": "X"
},
"value_dimension": 1
}
Parameter | Description |
feature_name | Required. Specifies the name of the output feature. |
expression | Required. Specifies the source field that the feature depends on. |
default_value | Optional. The default value for an empty input. |
replacements | Optional. Required if |
replace_file | Optional. This parameter is required if |
is_sequence | Optional. Specifies whether the input is a sequence feature. The default value is |
sequence_length | Optional. Specifies the maximum length of the sequence. The operator truncates sequences that exceed this length. |
sequence_delim | Optional. Specifies the delimiter for sequence elements. This parameter applies only to string inputs. |
separator | Optional. This parameter applies only when |
value_dimension | Optional. Specifies the dimension of the output feature. In offline tasks, this parameter is used to truncate the output. The default value is |
stub_type | Optional. When set to |
You can configure both replace_file and replacements. Their replacement dictionaries are merged, and replacements has a higher priority.
This operator supports binning operations. For more information, see the Feature Binning (Discretization) documentation.
hash_bucket_size: Hashes the feature transformation result and performs a modulo operation.
vocab_list: Bins the input based on a vocabulary and maps the input to an index in the vocabulary.
vocab_dict: The binning result is the value in vocab_dict that corresponds to the feature value.
vocab_file: Reads the vocab_list or vocab_dict from a file.
This operator supports multi-value array inputs.
Example
The following table shows the execution results of the preceding configuration.
user:query | Output feature |
the quick brown fox jumped over the lazy dogs | pack my box with five dozen liquor jugs |
aaa | xX |
Feature|Generation|Tool|is|very|useful | FeatureGenerationToolisveryuseful |
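One way to picture greedy replacement of overlapping matches is a single left-to-right pass that always prefers the longest key at each position. The Python sketch below takes that approach; it assumes leftmost-longest matching, which may not be exactly how the operator resolves overlaps. The function str_replace is hypothetical, and each call uses only the subset of the replacements that is relevant to its input.

```python
import re
from typing import Dict


def str_replace(text: str, replacements: Dict[str, str]) -> str:
    """Replace every key in a single pass, preferring longer keys on overlap."""
    if not replacements:
        return text
    # Longer keys come first so that "the lazy" wins over "the" and "aa" over "a".
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # Replaced text is not scanned again, so outputs are never re-replaced.
    return pattern.sub(lambda match: replacements[match.group(0)], text)


pangram = {
    "brown": "box", "dogs": "jugs", "fox": "with", "jumped": "five",
    "over": "dozen", "quick": "my", "the": "pack", "the lazy": "liquor",
}
print(str_replace("the quick brown fox jumped over the lazy dogs", pangram))
print(str_replace("aaa", {"aa": "x", "a": "X"}))
print(str_replace("Feature|Generation|Tool|is|very|useful", {"|": ""}))
```

Escaping each key with re.escape keeps characters such as | literal, so the keys are matched as plain substrings rather than as regular expressions.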
regex_replace_feature
Overview
The regex_replace_feature operator is a feature transformation that replaces substrings matching a regular expression with a specified replacement string.
You can configure multiple patterns. Substrings that match any of the specified patterns are replaced.
Configuration
{
"feature_name": "query",
"feature_type": "regex_replace_feature",
"expression": ["user:query"],
"regex_pattern": "\\|",
"replacement": " ",
"default_value": ""
}
Parameter | Description |
feature_name | Required. Name of the output feature. |
expression | Required. The source field this feature depends on. |
default_value | Optional. The default value to use when the input feature is empty. |
regex_pattern | Required. The regular expression for matching the text to be replaced. |
replacement | Optional. The replacement string. If this parameter is left empty, the matched text is removed. |
replace_all | Optional. Specifies whether to perform a global replacement. The default value is |
icase | Optional. Specifies whether regular expression matching is case-sensitive. The default value is |
is_sequence | Optional. Specifies whether the feature is a sequence feature. The default value is |
sequence_length | Optional. Specifies the maximum length of the sequence. Sequences longer than this value are truncated. |
sequence_delim | Optional. Specifies the separator between sequence elements. This parameter applies only to string inputs. |
separator | Optional. This parameter applies only when |
value_dimension | Optional. In offline tasks, this parameter is used to truncate the output. The default value is |
stub_type | Optional. The default value is |
This feature supports binning operations. For configuration details, see the Feature Binning (discretization) document:
hash_bucket_size: Hashes and applies a modulo operation to the feature transformation result.
vocab_list: Bins the input based on a vocabulary list and maps the input to an index in the list.
vocab_dict: Maps the feature value to a corresponding value in the vocab_dict dictionary.
vocab_file: Reads a vocab_list or vocab_dict from a file.
This feature supports multi-valued inputs in the form of an array.
Example
user:query | Output feature |
China|People|Republic | China People Republic |
Feature|Generation|Tool|Is great | Feature Generation Tool Is great |
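The replacement behavior can be approximated in Python with re.sub. In this sketch, multiple patterns are joined into one alternation so that a substring matching any of them is replaced. The function regex_replace is hypothetical, the default values chosen for replace_all and icase are assumptions made for this example, and Python's regular expression syntax may differ in details from the engine used by the operator.

```python
import re
from typing import List, Union


def regex_replace(text: str,
                  regex_pattern: Union[str, List[str]],
                  replacement: str = "",
                  replace_all: bool = True,
                  icase: bool = False) -> str:
    """Replace substrings that match any of the given patterns."""
    patterns = [regex_pattern] if isinstance(regex_pattern, str) else list(regex_pattern)
    # Wrap each pattern in a non-capturing group before joining them.
    combined = "|".join(f"(?:{pattern})" for pattern in patterns)
    flags = re.IGNORECASE if icase else 0
    count = 0 if replace_all else 1  # count=0 means "replace every match"
    return re.sub(combined, replacement, text, count=count, flags=flags)


print(regex_replace("China|People|Republic", r"\|", " "))
print(regex_replace("a|b#c(d)", [r"\|", "#", r"\(.*\)"], ""))
```

The first call reproduces the first row of the table above. The second call uses a list of patterns and an empty replacement, which removes every match.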
bool_mask_feature
Overview
Filters elements using a boolean value, similar to tf.boolean_mask(tensor, mask).
It is essentially a sequence feature.
Configuration
{
"feature_name": "mask_feature",
"feature_type": "bool_mask_feature",
"value_type": "float",
"expression": [
"user:click_items",
"item:is_valid"
],
"sequence_delim": ","
}
Parameter | Description |
feature_name | Required. Specifies the prefix for the output feature. |
expression | Required. A list of source fields that this feature uses. The second element in the list is the mask. |
default_value | Optional. The default value to use when the input feature is empty. If omitted, the default is |
value_type | Required. Specifies the data type of the output feature. |
sequence_length | Optional. The maximum sequence length. Longer sequences are truncated. |
sequence_delim | Optional. The separator for sequence elements. This parameter is only required for string inputs. |
separator | Optional. The separator for multi-value inputs. Default: "\u001D". Must be a single character. |
value_dimension | Optional. Default: 0. Used to truncate the output in offline tasks. |
normalizer | Optional. Specifies the normalization method. This parameter applies only to numeric features. For more information, see RawFeature. |
stub_type | Optional. Default: false. If set to true, the pipeline uses this feature transformation only as an intermediate result and does not output it to the model. |
Supports binning. For configuration, see Feature binning (discretization).
Supports multi-value inputs that are arrays or nested arrays.
Examples
Input | Mask | Output |
"123,456,90,80" | "true,false,true,false" | ["123", "90"] |
"123,456,90,80" | [1, 0, 1, 0] | ["123", "90"] |
[1, 2, 3, 4] | [1, 0, 1, 0] | [1, 3] |
[1, 2, 3, 4] | "true,false,true,false" | [1, 3] |
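The masking logic in the table can be sketched in a few lines of Python: both inputs are turned into lists, the mask entries are interpreted as booleans, and only the elements whose mask entry is true are kept. The helpers below are hypothetical, and treating the strings "true" and "1" as true is an assumption made for this illustration.

```python
from typing import Any, List, Sequence, Union


def to_list(field: Union[str, Sequence[Any]], sequence_delim: str = ",") -> List[Any]:
    """Split string inputs on the sequence delimiter; copy array inputs."""
    return field.split(sequence_delim) if isinstance(field, str) else list(field)


def to_bool(flag: Any) -> bool:
    """Interpret a mask entry given as a string, number, or boolean."""
    if isinstance(flag, str):
        return flag.strip().lower() in ("true", "1")
    return bool(flag)


def bool_mask(values: Union[str, Sequence[Any]],
              mask: Union[str, Sequence[Any]],
              sequence_delim: str = ",") -> List[Any]:
    """Keep the elements of values whose mask entry is true."""
    items = to_list(values, sequence_delim)
    flags = [to_bool(flag) for flag in to_list(mask, sequence_delim)]
    return [item for item, keep in zip(items, flags) if keep]


print(bool_mask("123,456,90,80", "true,false,true,false"))  # ['123', '90']
print(bool_mask([1, 2, 3, 4], [1, 0, 1, 0]))                # [1, 3]
```

String inputs yield string elements and array inputs keep their element type, which matches the difference between the first and third rows of the table.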
Usage with expression features
{
"features": [
{
"feature_name": "mask",
"feature_type": "expr_feature",
"expression": "price>100",
"variables": ["item:price"],
"value_dimension": 3
},
{
"feature_name": "filter_list",
"feature_type": "bool_mask_feature",
"expression": [
"user:click_items",
"feature:mask"
],
"num_buckets": 10000
}
]
}
slice_feature
Overview
This operator slices an input array using Python-style syntax or retrieves an element at a specific index.
Essentially, it is a sequence feature.
Configuration
{
"feature_name": "test_feature",
"feature_type": "slice_feature",
"value_type": "float",
"expression": [
"user:click_items"
],
"slice": "2:4"
}
Parameter | Required | Description |
feature_name | Yes | The name of the output feature. |
expression | Yes | The source field for the feature. The input must be a list. |
slice | Yes | Either a single integer, which selects the element at that index of the input array, or a Python-style slice string in the format start:stop:step. |
default_value | No | If an input feature is empty, the default value is used. If you do not explicitly provide a configuration, the default value is |
value_type | Yes | The data type of the output feature. |
sequence_length | No | The maximum sequence length. Sequences longer than this are truncated. |
sequence_delim | No | The separator for sequence elements. Required only if the input is a string. |
separator | No | The separator for multi-value inputs. Defaults to |
value_dimension | No | The output dimension. Defaults to |
normalizer | No | The normalization method. Applies only to numeric features. For details, see the |
stub_type | No | Indicates if the feature is a stub. Defaults to |
placeholder | No | A special value in a sequence feature that is used to fill empty slots and pad dimensions. The default value for floating-point numbers is |
This operator supports binning. For configuration details, see Feature Binning (Discretization).
This operator supports multi-value inputs, including arrays and nested arrays.
Example
When you set sequence_delim="," and value_dimension=1, the input and output are as follows:
Input | slice | Output |
"123,456,90,80" | 0 | "123" |
"123,456,90,80" | 2 | "90" |
"123,456,90,80" | 1:3 | ["456", "90"] |
[1, 2, 3, 4] | :2 | [1, 2] |
[1, 2, 3, 4] | 2: | [3, 4] |
[1, 2, 3, 4] | 1:4:2 | [2, 4] |
[1, 2, 3, 4] | ::-1 | [4, 3, 2, 1] |
[1, 2, 3, 4] | 2:-1:-1 | [3, 2, 1] |
[1, 2, 3, 4] | : | [1, 2, 3, 4] |
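Because the slice string follows Python syntax, its meaning can be checked directly in Python. The sketch below parses a slice string into a built-in slice object and applies it to a list. The function apply_slice is hypothetical and only illustrates the syntax.

```python
from typing import Any, List, Union


def apply_slice(values: List[Any], spec: str) -> Union[Any, List[Any]]:
    """Apply an index ("2") or a slice string ("1:4:2") to a list."""
    if ":" not in spec:
        return values[int(spec)]  # a single number selects one element
    # Empty parts such as the start in ":2" become None, as in Python itself.
    parts = [int(part) if part.strip() else None for part in spec.split(":")]
    return values[slice(*parts)]


items = "123,456,90,80".split(",")
print(apply_slice(items, "0"))               # 123
print(apply_slice(items, "1:3"))             # ['456', '90']
print(apply_slice([1, 2, 3, 4], "1:4:2"))    # [2, 4]
print(apply_slice([1, 2, 3, 4], "::-1"))     # [4, 3, 2, 1]
```

This sketch uses plain Python semantics, where a negative start or stop counts from the end of the list. The operator's handling of negative bounds may differ.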