PAI-Rec requires three datasets to generate recommendations: a user table, an item table, and a behavior log table. The more fields you provide, the better the recommendation quality. Field names in your data do not need to match the names in the tables below.
Data requirements at a glance
The following table summarizes the required and recommended fields across all three datasets.
| Dataset | Required fields | Recommended fields |
|---|---|---|
| User table | user_id, imei (required for non-logged-in users) | — |
| Item table | item_id, pub_time, price | title, category_level, cate_id_path, cate_name_path, cate1_id, cate2_id, cate_id, cate1_name, cate2_name, cate_name, brand_id, properties, spu_id |
| Behavior log table | user_id, item_id, event, event_time, scene | request_id, exp_id |
User table
The user table contains information about all users registered in your system. Create one partition per day with a full snapshot of all users for that day.
| Field | Required | Description |
|---|---|---|
| user_id | Yes | Unique identifier of a user |
| user_id_type | No | Registration type: App, Mobile phone number, WeChat, Other |
| imei | Yes, for users who haven't logged in | International Mobile Equipment Identity (IMEI), used as the device ID |
| gender | No | Valid values: male, female, unknown |
| age / birthday | No | Age or date of birth |
| purchasing | No | Purchasing power derived from historical data or a predictive model |
| country | No | Country |
| province | No | Region, state, or province |
| city | No | City |
| register_time | No | Registration timestamp in seconds (example: 1520017038) |
| education | No | Education background |
| career | No | Occupation |
| last_login_time | No | Last login timestamp in seconds (example: 1520017038) |
| source | No | Acquisition channel (example: TouTiao, WeChat) |
| content | No | Free-text description of the user |
| tags | No | Tags describing the user's interests (example: Football, fitness, outdoor) |
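Fields such as register_time and last_login_time are timestamps in seconds. The following is a minimal sketch of producing such a value, assuming the fields hold Unix epoch seconds (which the example value 1520017038 suggests) and that your source system stores timezone-aware datetimes. The helper name `to_epoch_seconds` is hypothetical and not part of PAI-Rec.

```python
from datetime import datetime, timezone


def to_epoch_seconds(dt: datetime) -> int:
    """Convert a timezone-aware datetime to whole seconds since the Unix epoch."""
    if dt.tzinfo is None:
        # Naive datetimes are ambiguous, so reject them instead of guessing a zone.
        raise ValueError("datetime must be timezone-aware")
    return int(dt.timestamp())


# Hypothetical registration time, expressed in UTC.
register_time = to_epoch_seconds(
    datetime(2018, 3, 2, 18, 57, 18, tzinfo=timezone.utc)
)
print(register_time)
```

Rejecting naive datetimes is a design choice in this sketch: it avoids silently applying the local time zone of whichever machine runs the export.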
Item table
The item table contains information about all items in your system. Create one partition per day with a full snapshot of all items for that day.
Core fields
| Field | Required | Description |
|---|---|---|
| item_id | Yes | Unique identifier of an item |
| pub_time | Yes | Publication timestamp in seconds |
| price | Yes | Actual sales price (float) |
| title | Recommended | Item title, used for semantic analysis. If blank, the recommendation algorithm may not perform as expected. |
| brand_id | Recommended | Brand ID |
| spu_id | Recommended | Standard Product Unit (SPU) ID |
| properties | Recommended | Item properties in JSON format. Example: {"material": "cotton", "style": "commuting"} |
| item_type | No | Type of the item |
| source_id | No | Source e-commerce platform (example: Taobao, Tmall, JD) |
| sub_title | No | Subtitle of the item |
| expire_time | No | Expiration timestamp in seconds |
| description | No | Item details |
| origin_price | No | Original price before discount |
| discount | No | Discount ratio: price / origin_price |
| tags | No | Tags attached by business operations staff, such as a promotion activity ID |
| color | No | Color category |
| postage | No | Shipping cost; set to 0 for free shipping |
| image_url | No | URL to download the item image |
| video_url | No | URL to download the item video |
| shop_dsr | No | Detailed Seller Ratings (DSR): item description accuracy, customer service, and delivery quality |
| sku_id | No | Stock Keeping Unit (SKU) ID |
| shop_id | No | Store ID |
| prov | No | Region, state, or province where the item is located |
| city | No | City where the item is located |
| rate | No | Positive feedback rate |
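Two of the fields above are derived rather than copied: discount is the ratio price / origin_price, and properties is a JSON string. The sketch below shows one way to compute both. The function name `build_item_fields` and the four-decimal rounding are assumptions made for illustration, not PAI-Rec requirements.

```python
import json


def build_item_fields(price: float, origin_price: float, properties: dict) -> dict:
    """Derive discount and serialize properties for one item row."""
    if origin_price <= 0:
        # The ratio is undefined without a positive original price.
        raise ValueError("origin_price must be positive")
    return {
        "price": float(price),
        "origin_price": float(origin_price),
        # Discount ratio as defined in the table: price / origin_price.
        "discount": round(price / origin_price, 4),
        # ensure_ascii=False keeps non-ASCII property values readable.
        "properties": json.dumps(properties, ensure_ascii=False),
    }


row = build_item_fields(80.0, 100.0, {"material": "cotton", "style": "commuting"})
print(row["discount"], row["properties"])
```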
Category fields
Category fields power the recommendation algorithm's understanding of your product taxonomy. The hierarchy must follow the Mutually Exclusive, Collectively Exhaustive (MECE) principle: categories at the same level must not overlap semantically, and together they must cover every item.
| Field | Required | Description |
|---|---|---|
| category_level | Recommended | Depth of the category hierarchy (example: 3 for a three-level hierarchy) |
| cate_id_path | Recommended | Full category path as IDs, separated by underscores (example: 100_200_300) |
| cate_name_path | Recommended | Full category path as names, separated by underscores (example: Electronics_Phones_Smartphones) |
| cate1_id | Recommended | Level-1 category ID |
| cate2_id | Recommended | Level-2 category ID |
| cate_id | Recommended | Leaf-level category ID (the deepest level in the hierarchy) |
| cate1_name | Recommended | Level-1 category name |
| cate2_name | Recommended | Level-2 category name |
| cate_name | Recommended | Leaf-level category name |
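All nine category fields can be derived from a single ordered path of (ID, name) pairs, which helps keep them consistent with each other. The following is a minimal sketch under that assumption. The function name `build_category_fields` and the input shape are hypothetical, and leaving cate2_id and cate2_name empty for a one-level hierarchy is a choice made here, not something the tables specify.

```python
def build_category_fields(path):
    """Derive the category fields from an ordered list of (id, name) pairs,
    starting at level 1 and ending at the leaf category."""
    if not path:
        raise ValueError("category path must contain at least one level")
    ids = [str(cate_id) for cate_id, _ in path]
    names = [name for _, name in path]
    has_level2 = len(path) > 1
    return {
        "category_level": len(path),
        # Paths are joined with underscores, so IDs and names should not
        # contain underscores themselves.
        "cate_id_path": "_".join(ids),
        "cate_name_path": "_".join(names),
        "cate1_id": ids[0],
        "cate1_name": names[0],
        "cate2_id": ids[1] if has_level2 else None,
        "cate2_name": names[1] if has_level2 else None,
        # The leaf is the deepest level in the hierarchy.
        "cate_id": ids[-1],
        "cate_name": names[-1],
    }


fields = build_category_fields(
    [(100, "Electronics"), (200, "Phones"), (300, "Smartphones")]
)
print(fields["cate_id_path"], fields["cate_name_path"])
```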
Behavior log table
The behavior log table records user interactions with items. Provide at least 30 days of data, and preferably 60 days.
Include behaviors from across the entire site, not just recommendation scenarios. Covering search and popular-item scenarios gives the algorithm a more complete view of user intent.
| Field | Required | Description |
|---|---|---|
| user_id | Yes | User ID |
| item_id | Yes | Item ID |
| event | Yes | Behavior type (see Event types) |
| event_time | Yes | Timestamp of the behavior in seconds |
| scene | Yes | Scenario ID: home_feed (home feed), hot_items (popular items), or search (search results). In search scenarios, also populate the query field. |
| imei | No | Device ID |
| event_value | No | Behavior value. The format varies by event type (see Event types). |
| request_id | Recommended | Unique identifier of the recommendation request returned by PAI-Rec. If blank, sample accuracy is reduced and real-time features cannot be enabled. This field can be left blank when creating a recommendation solution, but adding it later requires modifying the training sample code, re-preparing samples, and retraining the model. |
| exp_id | Recommended | Experiment bucket ID returned by the PAI-Rec Recommend API. Set to default or other values if the result was not generated by PAI-Rec. |
| request_info | No | Tracking information returned by the Recommend API. Log it as-is. |
| query | No | Search query. Required when scene is search. |
| page | No | Page ID (for item detail pages, the item ID) |
| source_page | No | Previous page, used to attribute conversions by traffic source |
| position | No | Position of the item in the recommendation list |
| app_version | No | App version number |
| net_type | No | Network type: 3G, 4G, 5G, Wi-Fi |
| login | No | Whether the user is logged in |
| device_platform | No | Client platform: iOS, Android, H5, Msite |
| device_system | No | Device operating system: iOS, Android, PC |
| device_model | No | Device model (example: iPhone 5) |
| device_brand | No | Device manufacturer (example: Apple, Xiaomi, Huawei) |
| longitude | No | Longitude |
| latitude | No | Latitude |
| ip | No | Client IP address, used to derive country and city features |
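The required fields and the conditional rule for query can be checked before a record is written to the log table. The sketch below is one way to do that. `validate_log_record` is a hypothetical helper, not part of any PAI-Rec SDK, and the sample values are made up.

```python
REQUIRED_FIELDS = ("user_id", "item_id", "event", "event_time", "scene")


def validate_log_record(record: dict) -> list:
    """Return a list of problems found in one behavior log record.
    An empty list means the record passes these checks."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing required field: {field}")
    # query is optional in general, but required for search scenarios.
    if record.get("scene") == "search" and not record.get("query"):
        problems.append("query is required when scene is search")
    return problems


# Hypothetical record from a search results page with no query logged.
record = {
    "user_id": "u_1",
    "item_id": "i_9",
    "event": "click",
    "event_time": 1520017038,
    "scene": "search",
}
print(validate_log_record(record))
```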
Event types
The behavior log supports the following event types. Each event maps to a specific value for the event field and a specific format for event_value.
| Behavior | Value of event | Format of event_value | Notes |
|---|---|---|---|
| Expose an item | expose | Leave blank | — |
| Click an item | click | Leave blank | — |
| Like an item | like | Leave blank | — |
| Dislike an item | unlike | Leave blank | — |
| Comment on an item | comment | Comment text | Comment content is used to analyze shopping experience and item quality. |
| Add to favorites | collect | Leave blank | — |
| Dwell on an item | stay | Duration | Any time unit is acceptable, but must be consistent across all entries. |
| Add to cart | cart | <quantity>,<unit_price> | Example: 1,10000. Unit price is in USD, accurate to the cent. |
| Purchase | buy | <quantity>,<unit_price> | Example: 1,10000. Unit price is in USD, accurate to the cent. One buy entry corresponds to one item_id only — split orders with multiple items into separate entries. |
| Rate an item | evaluate | Integer | Use discrete integers in a consistent ascending or descending order. For example, with star ratings of 1 to 5 in ascending order, a higher value indicates a more positive review. |
| Negative feedback | dislike | Leave blank | — |
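Because one buy entry corresponds to a single item_id, an order containing several items has to be expanded into several log entries. The following is a minimal sketch of that expansion. The function name, the order-line tuples, and the sample IDs are hypothetical, and unit_price is passed through unchanged in whatever representation your logs already use.

```python
def split_order_into_buy_events(user_id, event_time, scene, order_lines):
    """Expand one order into one buy entry per item.

    order_lines is a list of (item_id, quantity, unit_price) tuples.
    """
    return [
        {
            "user_id": user_id,
            "item_id": item_id,
            "event": "buy",
            "event_time": event_time,
            "scene": scene,
            # event_value format for buy: <quantity>,<unit_price>
            "event_value": f"{quantity},{unit_price}",
        }
        for item_id, quantity, unit_price in order_lines
    ]


# Hypothetical order with two different items.
entries = split_order_into_buy_events(
    "u_1", 1520017038, "home_feed", [("i_1", 1, 10000), ("i_2", 2, 2500)]
)
for entry in entries:
    print(entry["item_id"], entry["event_value"])
```

The same formatting applies to cart events, which share the `<quantity>,<unit_price>` format.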