Artificial Intelligence Recommendation: News industry

Last Updated: Feb 20, 2024

Data description

For scenes in the news industry, you must prepare the following three data tables:

  1. Item table: This table contains all the recent news that can be recommended for the current scene.

    The number of items that you can add is subject to a quota. We recommend that you deduplicate items before you upload the table. The item_id and item_type fields together uniquely identify an item.

  2. User table: This table contains all users who have recently registered on the system.

    The number of users that you can add is subject to a quota. We recommend that you deduplicate users before you upload the table. You can use the imei field alone, or a combination of the user_id and imei fields, to uniquely identify a user. In the latter case, for example, you can use the user_id field to identify users who have logged on and the imei field to identify users who have not logged on. Make sure that all users are unique. When you request recommendation results, you must specify the unique identifier of the user. Otherwise, personalized recommendations cannot be generated.

  3. Behavior table: This table contains recent behavior data in the current scene.

    We recommend that you provide behavior data from the last one to two weeks. If you cannot provide behavior data for technical reasons, or no historical data is available because the scene is new, you can use the test data provided by Artificial Intelligence Recommendation (AIRec). In this case, the recommendation model may return results that do not meet your requirements for the following two weeks. Recommendations become more precise as data accumulates.
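
The deduplication recommended for the item and user tables can be sketched in Python as follows. This is a minimal illustration, not part of AIRec: the helper names (item_key, user_key, deduplicate) are hypothetical, rows are assumed to be dictionaries keyed by field name, and the choice to keep the most recent duplicate is an assumption.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Row = Dict[str, str]


def item_key(row: Row) -> Tuple[str, str]:
    """item_id and item_type together identify an item."""
    return (row["item_id"], row["item_type"])


def user_key(row: Row) -> Tuple[str, str]:
    """Assumed convention: user_id for logged-on users, imei otherwise."""
    user_id = row.get("user_id", "")
    if user_id:
        return ("user_id", user_id)
    return ("imei", row.get("imei", ""))


def deduplicate(rows: Iterable[Row], key: Callable[[Row], Hashable]) -> List[Row]:
    """Keep the last row seen for each key, in first-seen key order."""
    latest: Dict[Hashable, Row] = {}
    for row in rows:
        latest[key(row)] = row
    return list(latest.values())
```

For example, deduplicate(items, item_key) collapses two rows that share both item_id and item_type, but keeps an article and a video that happen to share the same item_id.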

We recommend that you specify as many optional fields in the tables as possible. The more valid optional fields you specify, the better the recommendation results are. If an optional field is not specified, the system uses its default value.

Table schema

Important

1. You must specify the fields that are marked as Required in the Required column of the following tables. Fields that are marked as Required or Recommended have a significant impact on the recommendation results. The fields are described in the Value description column.

2. Data is required to start an AIRec instance. You can use a MaxCompute table to upload historical data to start an AIRec instance. In this case, you can leave optional fields empty in the table. However, the table must contain all fields. For more information about the statements to create a table, see the "CREATE TABLE statements" section of this topic.
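
The requirement that a table contains all fields, even when optional ones are empty, can be sketched in Python as follows. The column order is copied from the behavior table in the "CREATE TABLE statements" section. The helper names are hypothetical, and both the use of an empty string for an unspecified field and the CSV output are assumptions about your upload pipeline, not AIRec requirements.

```python
import csv
import io
from typing import Dict, Iterable, List

# Column order taken from the behavior_table CREATE TABLE statement.
BEHAVIOR_COLUMNS: List[str] = [
    "trace_id", "trace_info", "platform", "device_model", "imei",
    "app_version", "net_type", "longitude", "latitude", "ip", "login",
    "report_src", "scene_id", "user_id", "item_id", "item_type",
    "module_id", "page_id", "position", "bhv_type", "bhv_value", "bhv_time",
]


def to_full_row(record: Dict[str, str], columns: List[str]) -> List[str]:
    """Return one value per column; unspecified fields become empty strings."""
    unknown = set(record) - set(columns)
    if unknown:
        raise ValueError(f"fields not in the table schema: {sorted(unknown)}")
    return [record.get(column, "") for column in columns]


def to_csv(records: Iterable[Dict[str, str]], columns: List[str]) -> str:
    """Serialize records as CSV lines that always contain every column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for record in records:
        writer.writerow(to_full_row(record, columns))
    return buffer.getvalue()
```

Rejecting unknown field names early helps catch typos such as bhv_typ before the data reaches the table.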

Item

| Field name | Data type | Required | Field description | Valid value | Value description | Example |
| --- | --- | --- | --- | --- | --- | --- |
| item_id | string | Required | The unique ID of an item. The value can contain only letters and digits. | Custom | 1. The item_id and item_type fields uniquely identify an item. 2. The value of the item_id field can be up to 50 characters in length. Note: The reported item IDs must be recorded for later use. | 34513 |
| item_type | string | Required | The type of the item. | image, article, video, shortvideo, item, recipe, and audio. If the enumerated values do not meet your business requirements, contact technical support. | The uploaded data must match the specified item type. Otherwise, mixed sorting does not take effect. | article |
| status | string | Required | Specifies whether the item can be recommended. | 0 and 1 | 1: The item can be recommended. 0: The item cannot be recommended. Note: 1. If you change the value from 0 to 1, the item is immediately added to the recommendation list. 2. If you change the value from 1 to 0, the item is immediately removed from the list. | 1 |
| duration | string | Required for the video industry and optional for other industries | The length of the video. The value must be greater than or equal to 0 and less than 36000. Unit: seconds. | Custom | The length of the video. | 1000 |
| scene_id | string | Required | The ID of the scene. News is launched to different scenes. Scenes vary based on the types of users and web pages. | Custom | 1. We recommend that you use an acronym or a combination of letters and digits. 2. Do not use colons (:). 3. Do not set this field to -102. This value is an internal value reserved by the system. 4. If only one scene is available, set this field to 1. 5. You can set this field to multiple scene IDs and separate them with commas (,). The scene IDs match different web pages to which news is launched. For more information, see Use scene IDs. | a101,b102 |
| pub_time | string | Required | The time at which the item is released. The value is a UNIX timestamp that is accurate to the second. This field is used to determine whether the item is the latest item. | Custom | If you have high requirements for timeliness, this field is required. This field is used for the recommendation of new items. | 1520327038 |
| expire_time | string | Recommended | The time at which the item expires. The value is a UNIX timestamp that is accurate to the second. | Custom | 1. If the current system time of your server is later than the value of this field, the item is expired and no longer recommended. 2. If all items in the table expire, the service cannot be started. 3. If this field is left empty, the item never expires. 4. This field is used to ensure the timeliness of recommended items. For more information, see Ensure the timeliness of the news. | 1520327038 |
| last_modify_time | string | Optional | The time at which the item information was last modified. The value is a UNIX timestamp that is accurate to the second. | Custom | If you make major updates to a released item and have high requirements for timeliness, you can update this field. This field is similar to pub_time. Both of the fields are used to identify new items. | 1520327038 |
| title | string | Recommended | The title of the item. | Custom | This field is used for in-depth semantic analysis. If this field is left empty, some results of the algorithm may be lost. We recommend that you set this field. | The Digital Era Provides the Greatest Opportunity |
| weight | string | Recommended | Specifies whether the item is weighted. Note: 1. For a weighted item, set this field to 100. For an unweighted item, set this field to 1. 2. You must set this field to 100 or 1. Other values are invalid. 3. We recommend that you keep the number of weighted items less than or equal to 10% of the total number of items. | Custom | 1. If you leave this field empty, the default value 1 is used. 2. A weighted item is more likely to be recommended. | 1 |
| category_level | string | Recommended | The category level, such as level 3. | Custom | If this value does not match the value of the category_path field, the discretization feature is affected. | 3 |
| category_path | string | Recommended | The category path, with categories separated by underscores (_). | Custom | 1. A category path can contain multiple categories. You must separate the categories with underscores (_). 2. Commas (,) or colons (:) are not allowed. Category paths are used in discretization policies. | 12_1024_56 |
| tags | string | Recommended | The tags of the item. Separate multiple tags with commas (,). | Custom | 1. Tags are used to describe the characteristics of items. You must manage your own tag library. 2. The algorithm model performs characteristic analysis based on tags. 3. You can create a maximum of 100 tags for one piece of news. We recommend that you create a maximum of 50,000 tags in each tag pool. 4. If tags are confidential business data, we recommend that you convert the tags into digits based on the related mapping rules and upload the desensitized data. | digitalization, artificial intelligence, AI |
| author | string | Recommended | The author. | Custom | 1. Separate multiple authors with commas (,). A single item can have a maximum of 100 authors. 2. Discretization can be implemented based on authors. | Tom |
| content | string | Optional | The body part of the item. | Custom | You can set this field to the key segment of the content. A maximum of 5,000 characters are allowed. This field is used for semantic analysis. | The High-level Panel on Digital Cooperation submitted its report "The Age of Digital Interdependence" to the UN Secretary-General on Monday, June 10, 2019. Jack Ma, co-chairman of the High-level Panel on Digital Cooperation, said, "I believe the digital era is the greatest opportunity for us. The biggest risk is missing this grand opportunity." |
| channel | string | Recommended | The channel to which the news belongs, such as economy. One item has only one channel. | Custom | | |
| organization | string | Optional | The organizations. Separate multiple organizations with commas (,). | Custom | | |
| pv_cnt | string | Optional | The number of exposures in one month. | Custom | During service start, if behavior data in the current scene is sparse, you can add behavior data of other scenes to this field. Non-real-time data is acceptable. If the maintenance cost of these fields is high after the model becomes stable, you can process them at a low priority. | 100000 |
| click_cnt | string | Optional | The number of clicks in one month. | Custom | Non-real-time data is acceptable. You can process this field at a low priority. | 1000 |
| like_cnt | string | Optional | The number of likes in one month. | Custom | Non-real-time data is acceptable. You can process this field at a low priority. | 100 |
| unlike_cnt | string | Optional | The number of dislikes in one month. | Custom | Non-real-time data is acceptable. You can process this field at a low priority. | 100 |
| comment_cnt | string | Optional | The number of comments in one month. | Custom | Non-real-time data is acceptable. You can process this field at a low priority. | 100 |
| collect_cnt | string | Optional | The number of favorites in one month. | Custom | Non-real-time data is acceptable. You can process this field at a low priority. | 100 |
| share_cnt | string | Optional | The number of shares in one month. | Custom | Non-real-time data is acceptable. You can process this field at a low priority. | 100 |
| download_cnt | string | Optional | The number of downloads in one month. | Custom | Non-real-time data is acceptable. You can process this field at a low priority. | 100 |
| tip_cnt | string | Optional | The number of rewards in one month. | Custom | Non-real-time data is acceptable. You can process this field at a low priority. | 100 |
| subscribe_cnt | string | Optional | The number of follows in one month. | Custom | Non-real-time data is acceptable. You can process this field at a low priority. | 100 |
| source_id | string | Optional | The platform by which the item is released to the scene. | Custom | For example, you can use 1 to indicate Taobao and 2 to indicate Tmall. | 1 |
| country | string | Optional | The country code. | Custom | Set this field to an ISO 3166-1 alpha-3 code. | CHN |
| city | string | Optional | The name of the city. | Custom | | Hangzhou or Shanghai |
| features | string | Optional | The item characteristics, which are strings. | Custom | Separate item characteristics with commas (,). The characteristics must be descriptive. | |
| num_features | string | Optional | The item characteristics, which are numerical values. | Custom | Separate item characteristics with commas (,). Make sure that the number of commas (,) in this field is the same for all items. | |
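
Several constraints from the Item table can be checked before upload. The following Python sketch is illustrative only: validate_item is a hypothetical helper, rows are assumed to be dictionaries of strings, and the rule that category_level must equal the number of segments in category_path is an interpretation of "match", not a documented definition.

```python
import re
from typing import Dict, List

ITEM_TYPES = {"image", "article", "video", "shortvideo", "item", "recipe", "audio"}


def validate_item(row: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one item row (empty if none)."""
    errors: List[str] = []
    if not re.fullmatch(r"[A-Za-z0-9]{1,50}", row.get("item_id", "")):
        errors.append("item_id must be 1 to 50 letters or digits")
    if row.get("item_type") not in ITEM_TYPES:
        errors.append("item_type is not one of the enumerated values")
    if row.get("status") not in {"0", "1"}:
        errors.append("status must be 0 or 1")
    scene_ids = row.get("scene_id", "").split(",")
    if any(s == "" or ":" in s or s == "-102" for s in scene_ids):
        errors.append("scene_id is empty, contains a colon, or is -102")
    if not re.fullmatch(r"[0-9]+", row.get("pub_time", "")):
        errors.append("pub_time must be a UNIX timestamp in seconds")
    if row.get("weight", "") not in {"", "1", "100"}:
        errors.append("weight must be 1 or 100 (empty defaults to 1)")
    duration = row.get("duration", "")
    if duration and not (duration.isdigit() and 0 <= int(duration) < 36000):
        errors.append("duration must be >= 0 and < 36000 seconds")
    path = row.get("category_path", "")
    if "," in path or ":" in path:
        errors.append("category_path must not contain commas or colons")
    level = row.get("category_level", "")
    # Assumption: the level equals the number of underscore-separated segments.
    if path and level and str(len(path.split("_"))) != level:
        errors.append("category_level does not match category_path")
    tags = row.get("tags", "")
    if tags and len(tags.split(",")) > 100:
        errors.append("an item can have at most 100 tags")
    if len(row.get("content", "")) > 5000:
        errors.append("content can be at most 5,000 characters")
    return errors
```

Returning a list of messages rather than raising on the first problem makes it easier to report every issue in a row at once.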

User

| Field name | Data type | Required | Field description | Valid value | Value description | Example |
| --- | --- | --- | --- | --- | --- | --- |
| user_id | string | Required for users who have logged on | The unique ID of a user. | Custom | 1. This field is required for users who have registered. 2. This field uniquely identifies a user. | 1234567 |
| user_id_type | string | Optional | The registration type of the user. | 1, 2, 3, and 4 | 1: app account. 2: mobile phone number. 3: WeChat account. 4: other. | 2 |
| imei | string | Required for users who have not logged on | For an Android user, set this field to the MD5 hash value of the International Mobile Equipment Identity (IMEI). For an iOS user, set this field to the MD5 hash value of the Identifier for Advertisers (IDFA). | Custom | 1. This field is required for users who have not registered. 2. If the MAC address or the device number is invalid, internal customer portrait information cannot be used. Only the exposure blocking feature is retained. | MD5 hash value of IMEI 358800091015835: 74f25e604e1a9dde7471fe2e25ae54d0, MD5 hash value of IDFA 41B2FD07-695A-4A27-8D26-C30ECE6F7EAD: 06e1565409c9fc4887036b974421**** |
| third_user_name | string | Optional | The name of a third-party user. | Custom | | jack |
| third_user_type | string | Optional | The name of a third-party platform. | Custom | | wechat |
| phone_md5 | string | Optional | The MD5 hash value of a mobile phone number. The value must be 32 characters in length. | Custom | | d41d8cd98f00b204e9800998ecf8**** |
| gender | string | Optional | The gender of the user. | male, female, and unknown | If gender information is sensitive, you can use digits. For example, use 0 to indicate male, 1 to indicate female, and 2 to indicate unknown. | male |
| age | string | Optional | The age of the user. | Custom | | 22 |
| age_group | string | Optional | The age group. | Custom | | 20-25 |
| country | string | Optional | The code of the country. | Custom | Set this field to an ISO 3166-1 alpha-3 code. | CHN |
| city | string | Optional | The name of the city. | Custom | | Hangzhou or Shanghai |
| ip | string | Optional | The last logon IP address. | Custom | | 202.113.XX.XX |
| device_model | string | Optional | The device model. | Custom | | iphoneX |
| tags | string | Optional | User tags. Separate multiple tags with commas (,). | Custom | Use tags to describe the user. | football, fitness, outdoor |
| source | string | Optional | The source of the user. | Custom | | Toutiao |
| content | string | Optional | The description of the user. | Custom | | |
| register_time | string | Optional | The registration time. The value is a UNIX timestamp that is accurate to the second. | Custom | | 1520007038 |
| last_login_time | string | Optional | The last logon time. The value is a UNIX timestamp that is accurate to the second. | Custom | | 1520017038 |
| last_modify_time | string | Optional | The time at which the user information was last modified. The value is a UNIX timestamp that is accurate to the second. | Custom | | 1520327038 |
| features | string | Optional | The user characteristics, which are strings. | Custom | Separate the user characteristics, such as customer portrait, with commas (,). | |
| num_features | string | Optional | The user characteristics, which are numerical values. | Custom | Separate user characteristics with commas (,). Make sure that the number of commas (,) in this field is the same for all users. | |
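
The imei and phone_md5 fields hold 32-character MD5 hash values. A minimal Python sketch of producing and checking such values follows. The helper names are hypothetical, and hashing the raw identifier as UTF-8 text without any normalization (such as changing letter case or removing hyphens) is an assumption; confirm the expected input form before you rely on it.

```python
import hashlib
import re


def md5_hex(value: str) -> str:
    """Return the lowercase 32-character MD5 hex digest of a string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def is_md5_hex(value: str) -> bool:
    """Check that a value looks like a 32-character lowercase MD5 digest."""
    return re.fullmatch(r"[0-9a-f]{32}", value) is not None
```

A check such as is_md5_hex(row["imei"]) can catch rows where a raw device identifier was uploaded instead of its hash.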

Behavior

| Field name | Data type | Required | Field description | Valid value | Value description | Example |
| --- | --- | --- | --- | --- | --- | --- |
| item_id | string | Required | The ID of the item. | Custom | The value must be the same as the value of the item_id field in the item table. | 34513 |
| item_type | string | Required | The type of the item. | image, article, video, shortvideo, item, recipe, and audio. If the enumerated values do not meet your business requirements, contact technical support. | The value must be the same as the value of the item_type field in the item table. | image |
| bhv_type | string | Required | The behavior types, such as expose, stay, click, collect, and download. | expose and click | The number of click entries must be less than the number of expose entries. Otherwise, the system may determine that the data is abnormal, and the service cannot be started. | expose |
| trace_id | string | Required | The request tracking ID. This field is used in A/B testing to determine whether an Alibaba recommendation engine is used. | Alibaba and selfhold | 1. If the behavior data is generated based on an Alibaba recommendation engine, set this field to Alibaba. If the behavior data is generated based on a self-developed or self-operated recommendation system, set this field to selfhold. 2. This field is used to generate analytical reports and compare the results in the console. | Alibaba |
| trace_info | string | Required | The request tracking information. The information is returned when the Recommend API operation is called. You need only to put the information in logs. | Custom | 1. If the trace_id field is set to selfhold, set the trace_info field to 1. 2. If the trace_id field is set to Alibaba, the trace_info field is returned in the recommendation result. A value of Alibaba indicates that the behavior is performed on an item that is recommended by AIRec. When you upload behavior data, you can retain the value of the trace_info field for this item. | 1007.5911.12351.1002000::::::: |
| scene_id | string | Required | The ID of the scene. | Custom | 1. The ID of the scene where the behavior entry is generated. The value must be one of the scene IDs for the item that corresponds to the behavior. Only one scene ID is allowed. 2. The value of the scene_id field in the behavior table must be included in the value of the scene_id field in the item table. 3. If you do not need to distinguish between scenes, use the default value 1. If the scene ID of the behavior cannot be traced, set this field to -102. For more information, see Use scene IDs. | a101 |
| bhv_time | string | Optional | The time at which the behavior occurs. The value is a UNIX timestamp that is accurate to the second. | Custom | Set this field to the time at which the user performs the behavior. | 1520327038 |
| bhv_value | string | Recommended | Behavior details, such as the number of clicks, time spent on the page, and number of purchased items. | Custom | 1. For clicks, set this field to 1. 2. For exposures, set this field or leave it empty based on your business requirements. 3. For other behaviors, contact technical support. | 500 |
| user_id | string | Required for users who have logged on | The ID of a user. | Custom | 1. The value must be the same as that in the user table. 2. If the user does not log on, you can leave this field empty. | 1234567 |
| platform | string | Optional | The client platform. | Custom | ios, android, and h5 | ios |
| imei | string | Required for users who have not logged on | For an Android user, set this field to the MD5 hash value of IMEI. For an iOS user, set this field to the MD5 hash value of IDFA. | Custom | 1. This field is required for users who have not registered. 2. If the MAC address or the device number is invalid, internal customer portrait information cannot be used. Only the exposure blocking feature is retained. 3. The value must be an MD5 hash value with 32 characters in length. | e2fcdb0f4dce45e35fe2823d7973**** |
| app_version | string | Optional | The version number of the app. | Custom | | 4.1.10 |
| net_type | string | Optional | The type of the network. | Custom | 2G, 3G, 4G, and WIFI | 4G |
| ip | string | Optional | The IP address of the client. | Custom | | 234.45.13.14 |
| login | string | Optional | Specifies whether the user has logged on. | 0 and 1 | 0: The user has not logged on. 1: The user has logged on. | 1 |
| report_src | string | Optional | The source of the report. | 1 and 2 | 1: the server. 2: the client. | 2 |
| device_model | string | Optional | The device model. | Custom | | iphoneX |
| longitude | string | Optional | The longitude. | Custom | | 128.4 |
| latitude | string | Optional | The latitude. | Custom | | 78.1 |
| module_id | string | Optional | The ID of the module. | Custom | | 114 |
| page_id | string | Optional | The ID of the page. | Custom | | 4 |
| position | string | Optional | The position of the item. | Custom | | 5 |
| message_id | string | Optional | The unique identifier of a behavior entry. | Custom | If you do not specify this field, the system uses the item_id, item_type, user_id, imei, bhv_type, and bhv_time fields to deduplicate behavior entries. | 5 |
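
Two rules from the Behavior table lend themselves to a client-side check before upload: the fallback deduplication key that applies when message_id is not specified, and the relationship between trace_id and trace_info. The Python sketch below mirrors those rules under stated assumptions. The helper names are hypothetical, rows are assumed to be dictionaries of strings, and the actual server-side deduplication may differ in detail.

```python
from typing import Dict, List, Tuple

# Fields used for deduplication when message_id is not specified.
FALLBACK_FIELDS = ("item_id", "item_type", "user_id", "imei", "bhv_type", "bhv_time")


def behavior_key(row: Dict[str, str]) -> Tuple[str, ...]:
    """Return the key that identifies one behavior entry."""
    message_id = row.get("message_id", "")
    if message_id:
        return ("message_id", message_id)
    return tuple(row.get(field, "") for field in FALLBACK_FIELDS)


def check_trace(row: Dict[str, str]) -> List[str]:
    """Check that trace_id and trace_info are consistent with each other."""
    errors: List[str] = []
    trace_id = row.get("trace_id", "")
    trace_info = row.get("trace_info", "")
    if trace_id not in {"Alibaba", "selfhold"}:
        errors.append("trace_id must be Alibaba or selfhold")
    elif trace_id == "selfhold" and trace_info != "1":
        errors.append("trace_info must be 1 when trace_id is selfhold")
    elif trace_id == "Alibaba" and not trace_info:
        errors.append("trace_info from the recommendation result is missing")
    return errors
```

Entries with equal behavior_key values can be collapsed before upload, for example by keeping them in a dictionary keyed by behavior_key(row).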

Behavior type

| bhv_type | Description |
| --- | --- |
| expose | The "expose" behavior on an item. The behavior table must contain expose entries. The number of expose entries must be greater than the number of click entries. |
| click | The "click" behavior on an item. The behavior table must contain click entries. |
| like | The "like" behavior on an item. |
| unlike | The "dislike" behavior on an item. |
| comment | The "comment" behavior on an item. |
| collect | The "favorite" behavior on an item. |
| stay | The "stay" behavior on an item. |
| share | The "share" behavior on an item. |
| download | The "download" behavior on an item. |
| tip | The "reward" behavior on an item. |
| subscribe | The "follow" behavior on an item. |
| dislike | The behavior of providing negative feedback. For more information, see Negative feedback. |
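
The requirements on behavior types (expose and click entries must exist, and click entries must be fewer than expose entries) can be verified on a batch before upload. The Python sketch below is a hedged illustration: check_behavior_counts is a hypothetical helper, and the set of accepted values is copied from the list of behavior types in this section.

```python
from collections import Counter
from typing import Iterable, List

BHV_TYPES = {
    "expose", "click", "like", "unlike", "comment", "collect",
    "stay", "share", "download", "tip", "subscribe", "dislike",
}


def check_behavior_counts(bhv_types: Iterable[str]) -> List[str]:
    """Return problems with the mix of behavior types in a batch."""
    counts = Counter(bhv_types)
    problems: List[str] = []
    unknown = sorted(set(counts) - BHV_TYPES)
    if unknown:
        problems.append(f"unknown bhv_type values: {unknown}")
    if counts["expose"] == 0:
        problems.append("the behavior table must contain expose entries")
    if counts["click"] == 0:
        problems.append("the behavior table must contain click entries")
    if counts["click"] >= counts["expose"]:
        problems.append("click entries must be fewer than expose entries")
    return problems
```

For example, check_behavior_counts(row["bhv_type"] for row in rows) returns an empty list for a batch that satisfies all three requirements.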

CREATE TABLE statements

If you use MaxCompute to upload data required for starting an AIRec instance, you can refer to the following CREATE TABLE statements:

--- Create a behavior table in the news industry.
DROP TABLE IF EXISTS behavior_table;
CREATE TABLE IF NOT EXISTS `behavior_table`
(
    trace_id STRING COMMENT "Request tracking ID"
    ,trace_info STRING COMMENT "Request tracking information"
    ,platform STRING COMMENT "Client platform"
    ,device_model STRING COMMENT "Device model"
    ,imei STRING COMMENT "Device ID"
    ,app_version STRING COMMENT "App version number"
    ,net_type STRING COMMENT "Network type"
    ,longitude STRING COMMENT "Longitude"
    ,latitude STRING COMMENT "Latitude"
    ,ip STRING COMMENT "Client IP address"
    ,login STRING COMMENT "Whether the user has logged on"
    ,report_src STRING COMMENT "Source of the report"
    ,scene_id STRING COMMENT "Scene ID"
    ,user_id STRING COMMENT "User ID"
    ,item_id STRING COMMENT "Item ID"
    ,item_type STRING COMMENT "Type of the item "
    ,module_id STRING COMMENT "Module ID"
    ,page_id STRING COMMENT "Page ID"
    ,position STRING COMMENT "Position of the item"
    ,bhv_type STRING COMMENT "Behavior type"
    ,bhv_value STRING COMMENT "Behavior details"
    ,bhv_time STRING COMMENT "Time at which the behavior occurs"
)
PARTITIONED BY 
(
    ds STRING
)
LIFECYCLE 30
;


--- Create a user table in the news industry.
DROP TABLE IF EXISTS user_table;
CREATE TABLE IF NOT EXISTS `user_table`
(
    user_id STRING COMMENT "Unique ID of the user"
    ,user_id_type STRING COMMENT "Registration type of the user"
    ,third_user_name STRING COMMENT "Third-party user name"
    ,third_user_type STRING COMMENT "Third-party platform name"
    ,phone_md5 STRING COMMENT "MD5 hash value of the mobile phone number of the user."
    ,imei STRING COMMENT "Device ID of the user"
    ,content STRING COMMENT "User content"
    ,gender STRING COMMENT "Gender"
    ,age STRING COMMENT "Age"
    ,age_group STRING COMMENT "Age group"
    ,country STRING COMMENT "Country or region"
    ,city STRING COMMENT "City"
    ,ip STRING COMMENT "Last logon IP address"
    ,device_model STRING COMMENT "Device model"
    ,register_time STRING COMMENT "Registration time"
    ,last_login_time STRING COMMENT "Last logon time"
    ,last_modify_time STRING COMMENT "Time at which the user information was last modified"
    ,tags STRING COMMENT "User tags"
    ,source STRING COMMENT "Source of the user"
    ,features STRING COMMENT "Additional user characteristics, which are strings"
    ,num_features STRING COMMENT "Additional user characteristics, which are numerical values"
)
PARTITIONED BY 
(
    ds STRING
)
LIFECYCLE 30
;


--- Create an item table in the news industry.
DROP TABLE IF EXISTS item_table;
CREATE TABLE IF NOT EXISTS `item_table`
(
    item_id STRING COMMENT 'Unique ID of the item'
    ,item_type STRING COMMENT 'Type of the item '
    ,title STRING COMMENT 'Item title'
    ,content STRING COMMENT 'Body part of the item'
    ,pub_time STRING COMMENT 'Release time'
    ,status STRING COMMENT 'Whether the item can be recommended'
    ,expire_time STRING COMMENT 'Time at which the item expires, accurate to the second'
    ,last_modify_time STRING COMMENT 'Last modification time of the item information, accurate to the second'
    ,scene_id STRING COMMENT 'Scene ID'
    ,duration STRING COMMENT 'Duration, in seconds'
    ,category_level STRING COMMENT 'Category level, such as level 3'
    ,category_path STRING COMMENT 'Category path, with categories separated by underscores (_)'
    ,tags STRING COMMENT 'Tags, separated by commas (,)'
    ,channel STRING COMMENT 'Channel to which the news belongs'
    ,organization STRING COMMENT 'Organizations, separated by commas (,)'
    ,author STRING COMMENT 'Authors, separated by commas (,)'
    ,pv_cnt STRING COMMENT 'Number of exposures in one month'
    ,click_cnt STRING COMMENT 'Number of clicks in one month'
    ,like_cnt STRING COMMENT 'Number of likes in one month'
    ,unlike_cnt STRING COMMENT 'Number of dislikes in one month'
    ,comment_cnt STRING COMMENT 'Number of comments in one month'
    ,collect_cnt STRING COMMENT 'Number of favorites in one month'
    ,share_cnt STRING COMMENT 'Number of shares in one month'
    ,download_cnt STRING COMMENT 'Number of downloads in one month'
    ,tip_cnt STRING COMMENT 'Number of rewards in one month'
    ,subscribe_cnt STRING COMMENT 'Number of follows in one month'
    ,source_id STRING COMMENT 'Platform by which the item is released to the scene'
    ,country STRING COMMENT 'Country code'
    ,city STRING COMMENT 'City name'
    ,features STRING COMMENT 'Discrete characteristics of the item'
    ,num_features STRING COMMENT 'Continuous characteristics of the item'
    ,weight STRING COMMENT 'Weight of the item'
)
PARTITIONED BY
(
    ds STRING
)
LIFECYCLE 30
;