The expression configuration item is required for all features described in this topic except lookup features. It specifies the source of the field on which a feature depends. The prefix user: or arm: indicates that the field comes from a user feature or an item feature, respectively. For example, user:is_member indicates that the value of the is_member field is obtained from the user feature table specified by the user_feature parameter, and arm:author_id indicates that the value of the author_id field is obtained from the loaded item feature table. The default prefix is user:.
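The prefix rule can be illustrated with a short Python sketch. The function name resolve_field and the two dictionaries are hypothetical stand-ins for the user feature table and the loaded item feature table. This is not how the service itself resolves fields.

```python
def resolve_field(expression, user_features, arm_features):
    """Return the value that an expression such as 'arm:author_id' refers to."""
    prefix, sep, field = expression.partition(":")
    if not sep:
        # No prefix was given, so fall back to the documented default, user:.
        prefix, field = "user", expression
    if prefix == "user":
        return user_features.get(field)
    if prefix == "arm":
        return arm_features.get(field)
    raise ValueError(f"unsupported prefix: {prefix!r}")


# Hypothetical feature rows used only for illustration.
user = {"is_member": 1, "gender": "M"}
arm = {"author_id": 42}

member_flag = resolve_field("user:is_member", user, arm)
gender = resolve_field("gender", user, arm)  # uses the default prefix
author = resolve_field("arm:author_id", user, arm)
```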
The share_weight configuration item is optional. The default value is false. When the hybrid LinUCB algorithm is used, features marked with share_weight share parameters between arms, and you must set share_weight to true for cross features.
ID feature
An ID feature is a sparse feature. The feature vector is generated in multi-hot encoding mode.
ID features support four configuration items: vocab_list, num_buckets, hash_bucket_size, and boundaries.
Sample code:
{
  "FeatureConf": [
    {
      "feature_type": "id_feature",
      "expression": "gender",
      "vocab_list": ["M", "F"]
    },
    {
      "feature_type": "id_feature",
      "expression": "level",
      "num_buckets": 51
    },
    {
      "feature_type": "id_feature",
      "expression": "familyid",
      "hash_bucket_size": 200
    },
    {
      "feature_type": "id_feature",
      "expression": "is_member",
      "num_buckets": 2
    },
    {
      "feature_type": "id_feature",
      "expression": "fans_num",
      "boundaries": [1, 2, 3, 4, 7, 15, 30, 50, 120]
    }
  ]
}
Example of inputs and outputs. Note: ^] is the multi-value delimiter. It is a single character whose ASCII code is "\x1D". You can change the default delimiter by setting the separator configuration item.
Type | Feature value | Intermediate result |
int64_t | 100 | 100 |
double | 5.2 | 5 (The fractional part is truncated.) |
string | abc | abc |
Multi-value string | abc^]bcd | (abc, bcd) |
Multi-value integer | 123^]456 | (123, 456) |
The output feature is transformed into a real-valued multi-hot vector. The transformation method is determined by the following configuration items: vocab_list, num_buckets, hash_bucket_size, and boundaries.
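The intermediate results in the table above can be reproduced with a minimal Python sketch. The helper names are hypothetical. The vocab_list encoding shown here (one slot per vocabulary entry) is an assumption about how the multi-hot vector is laid out, since the text does not spell out the exact mapping for each configuration item.

```python
SEPARATOR = "\x1d"  # default multi-value delimiter, written as ^] in the text


def intermediate_values(value, separator=SEPARATOR):
    """Mimic the 'Intermediate result' column of the table above."""
    if isinstance(value, float):
        return [int(value)]  # the fractional part is truncated: 5.2 -> 5
    if isinstance(value, int):
        return [value]
    # Strings are split on the delimiter; a single value yields a one-element list.
    return value.split(separator)


def multi_hot_from_vocab(values, vocab_list):
    """Assumed vocab_list behaviour: 1.0 in every slot whose term is present."""
    present = {str(v) for v in values}
    return [1.0 if term in present else 0.0 for term in vocab_list]


truncated = intermediate_values(5.2)
multi = intermediate_values("abc" + SEPARATOR + "bcd")
gender_vector = multi_hot_from_vocab(intermediate_values("F"), ["M", "F"])
```

In this sketch, multi-value integers such as 123^]456 stay as strings after splitting. Whether the real pipeline converts them back to integers is not stated in the text.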
Raw feature
A raw feature is a dense feature. Raw features support only the int, float, and double data types. For features of other data types, you can treat them as ID features.
Parameter | Description |
expression | The source of the field on which the feature depends. This parameter is required. |
separator | A multi-value delimiter. |
value_dimension | The output dimension. This parameter is optional. The default value is 1. |
normalizer | The normalization method. This parameter is optional. |
Sample code:
{
  "FeatureConf": [
    {
      "feature_type": "raw_feature",
      "expression": "userid_avg_hot_15",
      "normalizer": "method=minmax,min=0,max=60"
    },
    {
      "feature_type": "raw_feature",
      "expression": "userid_avg_duration_15",
      "normalizer": "method=log10"
    }
  ]
}
Configurations of normalizer
Raw features and lookup features support three normalization methods: min-max normalization (minmax), z-score normalization (zscore), and log10 normalization (log10). The following section provides configuration examples and calculation formulas.
log10
Configuration example: method=log10,threshold=1e-10,default=1e-10. Calculation formula: x = x > threshold ? log10(x) : default;
zscore
Configuration example: method=zscore,mean=0.0,standard_deviation=10.0. Calculation formula: x = (x - mean) / standard_deviation
minmax
Configuration example: method=minmax,min=2.1,max=2.2. Calculation formula: x = (x - min) / (max - min)
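The three formulas can be checked with a small Python sketch that parses a normalizer string and applies the matching formula. The function names are hypothetical. The fallback values used for threshold and default when they are omitted are assumptions copied from the log10 configuration example, not documented defaults.

```python
import math


def parse_normalizer(spec):
    """Split 'method=minmax,min=0,max=60' into a method name and float parameters."""
    pairs = dict(item.split("=", 1) for item in spec.split(","))
    method = pairs.pop("method")
    return method, {name: float(value) for name, value in pairs.items()}


def normalize(x, spec):
    method, params = parse_normalizer(spec)
    if method == "log10":
        # Assumed fallbacks, taken from the configuration example above.
        threshold = params.get("threshold", 1e-10)
        default = params.get("default", 1e-10)
        return math.log10(x) if x > threshold else default
    if method == "zscore":
        return (x - params["mean"]) / params["standard_deviation"]
    if method == "minmax":
        return (x - params["min"]) / (params["max"] - params["min"])
    raise ValueError(f"unknown normalization method: {method!r}")


scaled = normalize(30, "method=minmax,min=0,max=60")
standardized = normalize(5.0, "method=zscore,mean=0.0,standard_deviation=10.0")
logged = normalize(100, "method=log10")
clamped = normalize(0, "method=log10,threshold=1e-10,default=1e-10")
```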
Combo feature
A combo feature is a combination generated from the Cartesian product of multiple fields or expressions. In most cases, fields that are involved in a cross are from different tables. For example, the fields in a user feature table and the fields in an item feature table are involved in a cross.
The combo feature vector is generated in one-hot encoding mode after feature values are combined.
Sample code:
{
  "FeatureConf": [
    {
      "feature_type": "combo_feature",
      "expression": ["user:age_class", "arm:item_id"],
      "hash_bucket_size": 200,
      "share_weight": true
    },
    {
      "feature_type": "combo_feature",
      "expression": ["user:age_class", "arm:level"],
      "num_buckets": [5, 8],
      "share_weight": true
    }
  ]
}
Number of output feature values: |F1| * |F2| * ... * |Fn|, where |Fn| indicates the number of values of the nth field on which the feature depends.
If the hash_bucket_size parameter is specified, the combined feature values are hashed to buckets. The number of buckets is determined by the hash_bucket_size parameter.
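The Cartesian product and the bucketing step can be sketched in Python as follows. The "_" joiner, the sample values, and the use of zlib.crc32 are assumptions made for illustration. The text does not say how combined values are joined or which hash function is used.

```python
import zlib
from itertools import product


def combo_values(*fields):
    """Build one combined string per element of the Cartesian product of the fields."""
    return ["_".join(str(v) for v in combo) for combo in product(*fields)]


def hash_to_bucket(value, hash_bucket_size):
    # zlib.crc32 is a stand-in hash; the real hash function is not stated in the text.
    return zlib.crc32(value.encode("utf-8")) % hash_bucket_size


# Hypothetical values for user:age_class and arm:item_id.
age_class = ["18-24"]
item_ids = ["i1", "i2", "i3"]

crossed = combo_values(age_class, item_ids)  # |F1| * |F2| = 1 * 3 combinations
buckets = [hash_to_bucket(v, 200) for v in crossed]
```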
Lookup feature
A lookup feature is generated by looking up a key in a set of key-value pairs and retrieving the matching value.
This type of feature depends on two fields: map and key. The map field is a multi-value field of the STRING type, in which each string is in the k1:v1 format. The key field can be of any type. To generate a lookup feature, the system extracts the value of the key field, converts it to a string, and matches it against the keys in the key-value pairs of the map field. The matched value becomes the final feature.
The map and key fields can come from any combination of items, users, and context. A lookup feature supports only JSON-formatted configurations.
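The matching step described above can be sketched in Python. The function name, the sample data, and the choice to return None when no key matches are assumptions. The sketch also assumes that the map field uses the same \x1D multi-value delimiter as other multi-value fields and that values are numeric.

```python
SEPARATOR = "\x1d"  # assumed multi-value delimiter for the map field


def lookup(map_value, key, separator=SEPARATOR):
    """Return the value stored under key in a 'k:v' multi-value string, or None."""
    wanted = str(key)  # the key is converted to a string before matching
    for entry in map_value.split(separator):
        k, sep, v = entry.partition(":")
        if sep and k == wanted:
            return float(v)
    return None


# Hypothetical per-author click counts for one user.
click_counts = SEPARATOR.join(["1001:3", "1002:12"])

hit = lookup(click_counts, 1002)
miss = lookup(click_counts, 9)
```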
Sample code:
{
  "FeatureConf": [
    {
      "feature_type": "lookup_feature",
      "map": "user:userid_kv__author__click_cnt_15",
      "key": "arm:userId",
      "normalizer": "method=log10",
      "share_weight": true
    }
  ]
}
Geohash feature
Geohash features are generated after the system converts latitudes and longitudes into strings of a specified length and then performs hash operations. The geohash algorithm divides a geographical location into several grids and assigns an encoded hash value to each grid.
Sample code:
{
  "FeatureConf": [
    {
      "feature_type": "geohash_feature",
      "expression": ["latitude", "longitude"],
      "geohash_precision": 4,
      "hash_bucket_size": 128
    }
  ]
}
The final geohash values are hashed to buckets. The number of buckets is determined by the hash_bucket_size parameter.
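The conversion from coordinates to a geohash string, followed by bucketing, can be sketched in Python. The encoder below follows the standard public geohash scheme (interleaved longitude and latitude bits, base-32 alphabet). Whether the system uses exactly this encoding is an assumption, and zlib.crc32 is only a stand-in for the unspecified hash function.

```python
import zlib

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(latitude, longitude, precision):
    """Encode a coordinate pair as a standard geohash string of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits are interleaved, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, longitude) if use_lon else (lat_range, latitude)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits form one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def hash_to_bucket(value, hash_bucket_size):
    # Stand-in hash; the hash function used by the system is not stated in the text.
    return zlib.crc32(value.encode("utf-8")) % hash_bucket_size


cell = geohash_encode(57.64911, 10.40744, 4)
bucket = hash_to_bucket(cell, 128)
```

A geohash_precision of 4 yields a four-character string, so nearby coordinates that fall into the same grid cell map to the same bucket.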
Binary feature
The feature value of a binary feature is 0 or 1. Binary features are suitable for describing two-valued user attributes such as gender.
The feature value of a binary feature is generated by determining whether the input value is in the set specified by the vocab_list parameter. If the input value matches an element in the specified set, the feature value is 1. Otherwise, the feature value is 0.
Sample code:
{
  "FeatureConf": [
    {
      "feature_type": "binary_feature",
      "expression": "gender",
      "vocab_list": ["M"]
    }
  ]
}
If the input value is 0 or 1, use a raw feature rather than a binary feature in the feature configuration.
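The membership test behind a binary feature is small enough to sketch in a few lines of Python. The function name is hypothetical, and converting the input to a string before the comparison is an assumption.

```python
def binary_feature(value, vocab_list):
    """Return 1 if the input value is in the vocab_list set, otherwise 0."""
    return 1 if str(value) in set(vocab_list) else 0


male = binary_feature("M", ["M"])
female = binary_feature("F", ["M"])
```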