This topic describes the Swing algorithm created by Alibaba.
Introduction to the Swing algorithm
Swing is a new matching algorithm created by Alibaba. Traditional algorithms, such as common neighbors, Adamic/Adar, cosine similarity, Jaccard similarity, and Rooted PageRank, calculate node similarity based on node proximity. Swing instead considers the graph structure and can extend to two-hop nodes, which lets it use higher-order structure in the graph. Swing is robust to noise and significantly improves accuracy compared with traditional collaborative filtering algorithms.
Item-to-item (I2I) indexes generated by Swing serve as core basic data and are widely used in many recommendation scenarios of Taobao on mobile phones and PCs. Swing is also used in the advertising services of TTPOD and Alimama of Alibaba Group, where it brings clear benefits.
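The idea behind Swing can be illustrated with a minimal Python sketch. It assumes the commonly cited form of the algorithm: every pair of users who both clicked items i and j adds to the similarity of i and j, and the contribution shrinks as the two users share more items overall. The function name swing_similarity, the use of unordered pairs of distinct users, and the way alpha1, alpha2, and beta enter the weights are assumptions made for illustration, not the implementation behind swing_rec_ext.

```python
from collections import defaultdict
from itertools import combinations


def swing_similarity(user_items, alpha1=5, alpha2=1, beta=0.3):
    """Toy Swing similarity. user_items maps a user to the set of clicked item IDs."""
    # Assumed user weight: very active users count less.
    weight = {
        user: 1.0 / (len(items) + alpha1) ** beta
        for user, items in user_items.items()
    }
    sim = defaultdict(float)
    # Look at every unordered pair of distinct users.
    for u, v in combinations(user_items, 2):
        common = user_items[u] & user_items[v]
        if len(common) < 2:
            continue  # the pair supports no item-item relation
        # The more items two users share, the less each shared pair counts.
        pair_score = weight[u] * weight[v] / (alpha2 + len(common))
        for i, j in combinations(sorted(common), 2):
            sim[(i, j)] += pair_score
            sim[(j, i)] += pair_score
    return dict(sim)


user_items = {
    "A": {1, 2, 3},
    "B": {1, 2},
    "C": {1, 2, 4, 5, 6},
}
sim = swing_similarity(user_items)
print(sorted(sim))
```

In this toy data, only items 1 and 2 are clicked together by more than one user, so they are the only pair that receives a score. Item 3 co-occurs with item 1 for user A alone, which a plain co-occurrence count would still reward.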
Sample preparation
Create an input table.
CREATE TABLE IF NOT EXISTS swing_test_input
(
user_id bigint,
item_list string -- Mandatory. Items clicked by the user, in click order. Each item consists of at least the item_id, norm, and timestamp sub-fields.
)
lifecycle 7;
Items in item_list are separated by semicolons (;). Each item consists of at least three sub-fields: item_id, norm, and timestamp. item_id must be the first sub-field. The timestamp value must be in the %Y%m%d%H%M%S format. If you do not need timestamps, specify the same value for all items. norm represents the recent popularity (number of clicks) of an item. If you do not need it, specify 1 for all items. List the items in item_list in click order, from the earliest to the most recent.
The value of item_id must be of the numeric type.
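The format rules above can be sketched in Python as follows. The helper build_item_list is hypothetical, and the comma used between sub-fields is an assumption taken from the output table format; the text above only specifies the semicolon between items.

```python
from datetime import datetime


def build_item_list(clicks):
    """Build an item_list string from (item_id, norm, clicked_at) tuples of one user."""
    # Items must be listed from the earliest click to the most recent.
    ordered = sorted(clicks, key=lambda click: click[2])
    parts = []
    for item_id, norm, clicked_at in ordered:
        timestamp = clicked_at.strftime("%Y%m%d%H%M%S")
        # Sub-field order: item_id first, then norm, then timestamp.
        # The comma separator is an assumption.
        parts.append(f"{int(item_id)},{norm},{timestamp}")
    return ";".join(parts)


clicks = [
    (1002, 3, datetime(2025, 8, 9, 10, 5, 0)),
    (1001, 7, datetime(2025, 8, 9, 9, 30, 15)),
]
item_list = build_item_list(clicks)
print(item_list)  # 1001,7,20250809093015;1002,3,20250809100500
```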
Create an output table.
CREATE TABLE IF NOT EXISTS swing_test_result
(
item_id BIGINT COMMENT 'Anchor item ID',
item_list STRING COMMENT 'List of similar items'
)
LIFECYCLE 7;
The output table consists of two columns: item_id and item_list. The item_list column is in the following format: item_id1,score1,coccur1,ori_score1;item_id2,score2,coccur2,ori_score2. ori_score1 is the original similarity score. score1 is the score after normalization by the maximum value. coccur1 is the number of co-occurrences.
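A downstream consumer might read the item_list column as sketched below. The SimilarItem type, the parse_result_item_list helper, and the sample row are made up for illustration; the types chosen for each sub-field are assumptions.

```python
from typing import NamedTuple


class SimilarItem(NamedTuple):
    item_id: int
    score: float      # normalized score
    coccur: int       # number of co-occurrences
    ori_score: float  # original similarity score


def parse_result_item_list(item_list):
    """Split an output item_list value into one record per similar item."""
    result = []
    for entry in item_list.split(";"):
        if not entry:
            continue  # tolerate an empty value or a trailing semicolon
        item_id, score, coccur, ori_score = entry.split(",")
        result.append(
            SimilarItem(int(item_id), float(score), int(coccur), float(ori_score))
        )
    return result


# Illustrative row, not real output.
row = "2001,1.0,12,0.48;2002,0.5,7,0.24"
similar = parse_result_item_list(row)
for item in similar:
    print(item.item_id, item.score, item.coccur, item.ori_score)
```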
PAI command
pai -name swing_rec_ext
-project algo_public
-DinputTable='swing_test_input/ds=20250809'
-DoutputTable='swing_test_result/ds=20250809'
-DmaxClickPerUser='500'
-DmaxUserPerItem='600'
-Dtopk='100'
-Dalpha1='5'
-Dalpha2='1'
-Dbeta='0.3'
Algorithm parameters
Parameter | Description | Type and default value |
inputTable | Input table: User click sequence; supports both partitioned and non-partitioned tables. | Use "/" to concatenate the table name and partition value. |
outputTable | Output table: i2i index; supports both partitioned and non-partitioned tables. | Use "/" to concatenate the table name and partition value. |
maxClickPerUser | The maximum length of the item list of each user. If an item list exceeds this limit, it is truncated and only the most recent items are retained. | Integer. Default value: 600. |
maxTimeSpan | The maximum number of days between two clicks within which the two clicked items are considered to have a neighbor relationship. | Integer. Default value: 1. |
maxUserPerItem | The maximum number of users who clicked an item that are used to calculate the k-nearest neighbors of that item. | Integer. Default value: 700. |
topk | The number of nearest neighbors retained for each trigger item. | Integer. Default value: 200. |
alpha1 | A Swing algorithm parameter. For more information, see Formula [1]. | Integer. Default value: 5. |
beta | A Swing algorithm parameter. For more information, see Formula [1]. | Real number. Default value: 0.3. |
alpha2 | A Swing algorithm parameter. For more information, see Formula [1]. | Integer. Default value: 1. |
pos_time | The zero-based position of the timestamp sub-field within each item. In the preceding example, the position is 2. | Integer. Default value: 2. |
pos_norm | The zero-based position of the item popularity (norm) sub-field within each item. In the preceding example, the position is 1. | Integer. Default value: 1. |
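How pos_time, pos_norm, maxClickPerUser, and maxTimeSpan act on one user's item_list can be sketched in Python. This is an interpretation of the parameter descriptions, not the algorithm's actual code. The helpers parse_clicks and within_time_span are hypothetical, the comma between sub-fields is an assumption, and a day is treated here as a 24-hour span.

```python
from datetime import datetime


def parse_clicks(item_list, pos_time=2, pos_norm=1, max_click_per_user=600):
    """Parse one user's item_list and keep only the most recent clicks."""
    clicks = []
    for entry in item_list.split(";"):
        fields = entry.split(",")  # comma separator is an assumption
        clicks.append({
            "item_id": int(fields[0]),  # item_id is always the first sub-field
            "norm": float(fields[pos_norm]),
            "time": datetime.strptime(fields[pos_time], "%Y%m%d%H%M%S"),
        })
    # The list runs from earliest to most recent, so the tail is the newest part.
    return clicks[-max_click_per_user:]


def within_time_span(a, b, max_time_span=1):
    """Return True if two clicks are at most max_time_span days apart."""
    seconds = abs((a["time"] - b["time"]).total_seconds())
    return seconds <= max_time_span * 86400


raw = "1001,7,20250809093015;1002,3,20250809100500;1003,1,20250811120000"
all_clicks = parse_clicks(raw)
recent = parse_clicks(raw, max_click_per_user=2)
print([c["item_id"] for c in recent])
print(within_time_span(all_clicks[0], all_clicks[1]))
print(within_time_span(all_clicks[1], all_clicks[2]))
```

With a limit of two clicks, the earliest item (1001) is dropped. Items 1001 and 1002 were clicked about half an hour apart and count as neighbors under the default span of one day, while 1002 and 1003 are more than two days apart and do not.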
Formula [1]:
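A commonly cited form of the Swing similarity is given here as an assumption about how alpha1, alpha2, and beta enter the computation, not as a confirmed statement of the formula used by swing_rec_ext:

```latex
w_u = \frac{1}{\left(|I_u| + \alpha_1\right)^{\beta}}, \qquad
\mathrm{sim}(i, j) = \sum_{u \in U_i \cap U_j} \; \sum_{v \in U_i \cap U_j}
  \frac{w_u \, w_v}{\alpha_2 + |I_u \cap I_v|}
```

In this form, U_i is the set of users who clicked item i, I_u is the set of items clicked by user u, and w_u is a weight that reduces the influence of very active users. alpha2 smooths the overlap term, so a pair of users who share many items contributes less to each item pair.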