PAI-Rec Feature Types & Encoding Parameters Explained - Artificial Intelligence Recommendation

The expression configuration item is required for all feature types described in this topic except lookup features. It specifies the source of the field on which a feature depends. The prefix user: indicates that the field comes from a user feature, and the prefix arm: indicates that it comes from an item feature. For example, user:is_member indicates that the value of is_member is obtained from the user feature table specified by the user_feature parameter, and arm:author_id indicates that the value of author_id is obtained from the loaded item feature table. The default prefix is user:.
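The prefix rule can be illustrated with a minimal Python sketch. The helper names split_expression and resolve, and the use of plain dictionaries for the user and item feature tables, are assumptions made for illustration; they are not part of PAI-Rec.

```python
def split_expression(expression):
    """Split 'prefix:field' into (source, field).

    An expression without a prefix is treated as a user feature,
    matching the documented default prefix 'user:'.
    """
    prefix, sep, field = expression.partition(":")
    if not sep:
        return "user", expression
    if prefix not in ("user", "arm"):
        raise ValueError(f"unknown prefix: {prefix!r}")
    return prefix, field


def resolve(expression, user_features, arm_features):
    """Fetch the raw value from the table selected by the prefix.

    user_features and arm_features are hypothetical dicts standing in
    for the user feature table and the loaded item feature table.
    """
    source, field = split_expression(expression)
    table = user_features if source == "user" else arm_features
    return table.get(field)


if __name__ == "__main__":
    user = {"is_member": 1, "gender": "F"}
    arm = {"author_id": "a42"}
    print(resolve("user:is_member", user, arm))
    print(resolve("arm:author_id", user, arm))
    print(resolve("gender", user, arm))  # no prefix, read from the user table
```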

The share_weight configuration item is optional and defaults to false. When the hybrid LinUCB algorithm is used, features marked with share_weight share parameters between arms, and you must set share_weight to true for cross features.

ID feature

An ID feature is a sparse feature. The feature vector is generated in multi-hot encoding mode.

ID features support four configuration items: vocab_list, num_buckets, hash_bucket_size, and boundaries.

Sample code:

{
  "FeatureConf": [
    {
      "feature_type": "id_feature",
      "expression": "gender",
      "vocab_list": ["M", "F"]
    },
    {
      "feature_type": "id_feature",
      "expression": "level",
      "num_buckets": 51
    },
    {
      "feature_type": "id_feature",
      "expression": "familyid",
      "hash_bucket_size": 200
    },
    {
      "feature_type": "id_feature",
      "expression": "is_member",
      "num_buckets": 2
    },
    {
      "feature_type": "id_feature",
      "expression": "fans_num",
      "boundaries": [1, 2, 3, 4, 7, 15, 30, 50, 120]
    }
  ]
}

Example of inputs and outputs. Note: ^] represents a single character, the default multi-value delimiter, whose ASCII code is \x1D. You can change the delimiter by setting the separator configuration item.

Type	Feature value	Intermediate result
int64_t	100	100
double	5.2	5 (The fractional part is truncated.)
string	abc	abc
Multi-value string	abc^]bcd	(abc, bcd)
Multi-value integer	123^]456	(123, 456)
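The rows of this table can be reproduced with a short Python sketch. The function name to_intermediate is hypothetical, and keeping the elements of a multi-value string as strings is an assumption, because the table does not state whether integer tokens are converted.

```python
DEFAULT_SEPARATOR = "\x1d"  # the ^] control character


def to_intermediate(value, separator=DEFAULT_SEPARATOR):
    """Return the intermediate result of an ID feature as a list.

    Integers pass through, doubles lose their fractional part, and
    strings are split on the multi-value delimiter.
    """
    if isinstance(value, int):
        return [value]
    if isinstance(value, float):
        return [int(value)]  # int() truncates toward zero: 5.2 -> 5
    return value.split(separator)


if __name__ == "__main__":
    for raw in (100, 5.2, "abc", "abc\x1dbcd", "123\x1d456"):
        print(repr(raw), "->", to_intermediate(raw))
```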

The output feature is transformed into a real-valued multi-hot vector. The transformation method is determined by the following configuration items: vocab_list, num_buckets, hash_bucket_size, and boundaries.
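The four configuration items can be read as four ways of mapping an intermediate value to a vector position. The sketch below is one plausible reading, not the PAI-Rec implementation: the CRC32 hash, the right-closed boundary convention of bisect_right, and the handling of out-of-vocabulary values (ignored here) are all assumptions, and every function name is hypothetical.

```python
import bisect
import zlib


def index_from_vocab(value, vocab_list):
    """Position of the value in vocab_list, or None if it is absent."""
    return vocab_list.index(value) if value in vocab_list else None


def index_from_num_buckets(value, num_buckets):
    """Use the integer value itself as the index when it is in range."""
    idx = int(value)
    return idx if 0 <= idx < num_buckets else None


def index_from_hash(value, hash_bucket_size):
    """Hash the string form of the value into a fixed number of buckets.

    CRC32 is a stand-in; the hash used by PAI-Rec is not documented here.
    """
    return zlib.crc32(str(value).encode("utf-8")) % hash_bucket_size


def index_from_boundaries(value, boundaries):
    """Bucket a number by sorted boundaries (len(boundaries) + 1 buckets)."""
    return bisect.bisect_right(boundaries, value)


def multi_hot(indices, size):
    """Set a 1.0 at every valid index; unmatched values (None) are skipped."""
    vec = [0.0] * size
    for i in indices:
        if i is not None:
            vec[i] = 1.0
    return vec


if __name__ == "__main__":
    print(index_from_vocab("F", ["M", "F"]))
    print(index_from_boundaries(5, [1, 2, 3, 4, 7, 15, 30, 50, 120]))
    print(multi_hot([index_from_vocab(v, ["M", "F"]) for v in ["F"]], 2))
```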

Raw feature

A raw feature is a dense feature. Raw features support only the int, float, and double data types. For features of other data types, you can treat them as ID features.

Parameter	Description
expression	The source of the field on which the feature depends. This parameter is required.
separator	A multi-value delimiter.
value_dimension	The output dimension. This parameter is optional. The default value is 1.
normalizer	The normalization method. This parameter is optional.

Sample code:

{
  "FeatureConf": [
    {
      "feature_type": "raw_feature",
      "expression": "userid_avg_hot_15",
      "normalizer": "method=minmax,min=0,max=60"
    },
    {
      "feature_type": "raw_feature",
      "expression": "userid_avg_duration_15",
      "normalizer": "method=log10"
    }
  ]
}

Configurations of normalizer

Raw features and lookup features support three normalization methods: min-max normalization (minmax), z-score normalization (zscore), and log10 normalization (log10). The following sections provide a configuration example and the calculation formula for each method.

log10

Note

Configuration example: method=log10,threshold=1e-10,default=1e-10. Calculation formula: x = x > threshold ? log10(x) : default

zscore

Note

Configuration example: method=zscore,mean=0.0,standard_deviation=10.0. Calculation formula: x = (x - mean) / standard_deviation

minmax

Note

Configuration example: method=minmax,min=2.1,max=2.2. Calculation formula: x = (x - min) / (max - min)
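The three formulas can be checked with a small Python sketch that parses a normalizer string and applies the matching formula. The function names are hypothetical, and the fallback values used when threshold and default are omitted (1e-10 each, taken from the configuration example) are an assumption rather than documented defaults.

```python
import math


def parse_normalizer(spec):
    """Parse 'method=...,k=v,...' into (method, {k: float(v)})."""
    params = dict(item.split("=", 1) for item in spec.split(","))
    method = params.pop("method")
    return method, {k: float(v) for k, v in params.items()}


def normalize(x, spec):
    """Apply the normalization described by spec to the number x."""
    method, p = parse_normalizer(spec)
    if method == "log10":
        threshold = p.get("threshold", 1e-10)  # assumed fallback
        default = p.get("default", 1e-10)      # assumed fallback
        return math.log10(x) if x > threshold else default
    if method == "zscore":
        return (x - p["mean"]) / p["standard_deviation"]
    if method == "minmax":
        return (x - p["min"]) / (p["max"] - p["min"])
    raise ValueError(f"unsupported method: {method}")


if __name__ == "__main__":
    print(normalize(30, "method=minmax,min=0,max=60"))
    print(normalize(5.0, "method=zscore,mean=0.0,standard_deviation=10.0"))
    print(normalize(100, "method=log10,threshold=1e-10,default=1e-10"))
```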

Combo feature

A combo feature is a combination generated from the Cartesian product of multiple fields or expressions. In most cases, the crossed fields come from different tables. For example, a field in a user feature table can be crossed with a field in an item feature table.

The combo feature vector is generated in one-hot encoding mode after feature values are combined.

Sample code:

{
  "FeatureConf": [
    {
      "feature_type": "combo_feature",
      "expression": ["user:age_class", "arm:item_id"],
      "hash_bucket_size": 200,
      "share_weight": true
    },
    {
      "feature_type": "combo_feature",
      "expression": ["user:age_class", "arm:level"],
      "num_buckets": [5, 8],
      "share_weight": true
    }
  ]
}

Note

Number of output feature values: |F1| * |F2| * ... * |Fn|, where |Fn| indicates the number of values of the nth field on which the feature depends.

If the hash_bucket_size parameter is specified, the combined feature values are hashed to buckets. The number of buckets is determined by the hash_bucket_size parameter.
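The Cartesian product and the two bucketing options can be sketched in Python as follows. The underscore used to join values, the CRC32 hash, and the row-major index used for a num_buckets list are assumptions for illustration; the actual joining and hashing scheme in PAI-Rec may differ.

```python
import itertools
import zlib


def combo_values(*fields):
    """Cross the value lists of several fields into combined strings."""
    return ["_".join(str(v) for v in combo)
            for combo in itertools.product(*fields)]


def hash_to_bucket(value, hash_bucket_size):
    """Map a combined value to one of hash_bucket_size buckets."""
    return zlib.crc32(value.encode("utf-8")) % hash_bucket_size


def cross_index(indices, num_buckets):
    """Row-major index of a combination when every field is bucketed.

    With num_buckets [5, 8] the result lies in range(5 * 8).
    """
    index = 0
    for i, size in zip(indices, num_buckets):
        if not 0 <= i < size:
            raise ValueError(f"index {i} out of range for {size} buckets")
        index = index * size + i
    return index


if __name__ == "__main__":
    crossed = combo_values(["young", "adult"], ["item1", "item2", "item3"])
    print(len(crossed), crossed[0])  # |F1| * |F2| combinations
    print([hash_to_bucket(v, 200) for v in crossed])
    print(cross_index([2, 3], [5, 8]))
```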

Lookup feature

A lookup feature is generated by matching a key against a set of key-value pairs and retrieving the corresponding value.

This type of feature depends on the map and key fields. The map field is a multi-value field of the STRING type, with each string in the k1:v1 format. The key field can be of any type. To generate a lookup feature, the system extracts the value of the key field, converts it to a string, and matches it against the keys in the map field. The value of the matching pair becomes the final feature.

The sources of the map and key fields can be any combinations of items, users, and context. A lookup feature supports only JSON-formatted configurations.

Sample code:

{
  "FeatureConf": [
    {
      "feature_type": "lookup_feature",
      "map": "user:userid_kv__author__click_cnt_15",
      "key": "arm:userId",
      "normalizer": "method=log10",
      "share_weight": true
    }
  ]
}
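The matching step can be sketched in Python as follows. The function name lookup, the use of \x1D as the delimiter between key-value pairs, the conversion of matched values to float, and the return of None on a miss are assumptions; the source does not specify the behavior when no key matches.

```python
def lookup(map_value, key, separator="\x1d", default=None):
    """Return the value paired with key in a 'k1:v1<sep>k2:v2' string.

    The key is converted to a string before matching, as described for
    lookup features. Entries without a colon are skipped.
    """
    wanted = str(key)
    for entry in map_value.split(separator):
        k, sep, v = entry.partition(":")
        if sep and k == wanted:
            return float(v)
    return default


if __name__ == "__main__":
    # Hypothetical user-side map: clicks per author over 15 days.
    click_cnt = "author7:3\x1dauthor9:12"
    print(lookup(click_cnt, "author9"))
    print(lookup(click_cnt, "author1"))  # no match
```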

Geohash feature

Geohash features are generated after the system converts latitudes and longitudes into strings of a specified length and then performs hash operations. The geohash algorithm divides a geographical location into several grids and assigns an encoded hash value to each grid.

Sample code:

{
  "FeatureConf": [
    {
      "feature_type": "geohash_feature",
      "expression": ["latitude", "longitude"],
      "geohash_precision": 4,
      "hash_bucket_size": 128
    }
  ]
}

The final geohash values are hashed to buckets. The number of buckets is determined by the hash_bucket_size parameter.
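The two steps, geohash encoding and bucketing, can be sketched in Python. The encoder follows the standard public geohash algorithm (interleaved longitude and latitude bits, base-32 alphabet); the CRC32 hash used for bucketing and the function names are assumptions, because the source does not name the hash function.

```python
import zlib

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits yield one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def geohash_bucket(lat, lon, geohash_precision, hash_bucket_size):
    """Hash the geohash string into hash_bucket_size buckets (CRC32 assumed)."""
    code = geohash_encode(lat, lon, geohash_precision)
    return zlib.crc32(code.encode("ascii")) % hash_bucket_size


if __name__ == "__main__":
    print(geohash_encode(57.64911, 10.40744, 4))
    print(geohash_bucket(57.64911, 10.40744, 4, 128))
```

Nearby points share a geohash prefix, so a lower geohash_precision groups locations into larger grid cells before hashing.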

Binary feature

The feature value of a binary feature is 0 or 1. Binary features are suitable for describing two-valued user attributes such as gender.

The feature value of a binary feature is generated by determining whether the input value is in the set specified by the vocab_list parameter. If the input value matches an element in the specified set, the feature value is 1. Otherwise, the feature value is 0.

Sample code:

{
  "FeatureConf": [
    {
      "feature_type": "binary_feature",
      "expression": "gender",
      "vocab_list": ["M"]
    }
  ]
}
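The membership test behind a binary feature is small enough to show in a few lines of Python. The function name binary_feature is hypothetical, and exact string matching against vocab_list is an assumption.

```python
def binary_feature(value, vocab_list):
    """Return 1 if the input value is in vocab_list, otherwise 0."""
    return 1 if value in vocab_list else 0


if __name__ == "__main__":
    print(binary_feature("M", ["M"]))
    print(binary_feature("F", ["M"]))
```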

If the input value is 0 or 1, use a raw feature rather than a binary feature in the feature configuration.