
Platform For AI:Use the improved SWING similarity algorithm

Last Updated:Feb 26, 2025

This topic describes how the improved Sliding Window and Incremental Generation (SWING) similarity algorithm works, how to download the toolkit, the toolkit parameters, and answers to frequently asked questions.

Introduction to improvements

Improvement point 1: limit on the number of common neighbors

The original SWING similarity algorithm is not suitable for scenarios in which very few users interact with both items. From a statistical perspective, such insufficient data causes the algorithm to generate results with significant errors.

The following section provides an example:

The following figure shows the medium to low-frequency video A, the niche video B, the popular video C, 10 football fans X1 to X10, and the music lover Y. The 10 football fans watch 200 videos related to videos A and C. A friend recommends the niche video B to fan X1, but X1 likes video A and does not like video B. Music lover Y watches various music-related videos, including videos A and B. In this case, only two users watch both videos A and B. As a result, the original SWING similarity algorithm calculates s(A,B) = 0.33 > s(A,C) = 0.22, even though recommending video C is better than recommending video B when a user is watching video A.

(Figure: interaction graph of videos A, B, and C, football fans X1 to X10, and music lover Y)

Solution: The issue is caused by the small number of users who watch both videos. To resolve the issue, the results must be adjusted based on the number of users who watch both videos. The improved algorithm adds the Id(.) indicator function to the SWING formulas for denoising, and introduces behavioral weights to suppress overly active users and overly popular items. Improved formula:

(Figure: Formula [1], the improved SWING formula with the Id(.) indicator function and behavioral weights)
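The two improvements described above can be sketched in Python. This is a rough illustration that assumes the commonly published Swing formulation; the names alpha and min_common_users are illustrative stand-ins for the toolkit's alpha1 and common.user.number.threshold parameters, and the square-root user weight is one common choice, not necessarily the exact weight used by the toolkit.

```python
from itertools import combinations

def swing_scores(user_items, alpha=1.0, min_common_users=2):
    """Sketch of Swing item similarity with the common-neighbor
    threshold described above. `user_items` maps user -> set of items."""
    # Invert the interactions: item -> set of users.
    item_users = {}
    for u, items in user_items.items():
        for i in items:
            item_users.setdefault(i, set()).add(u)
    # Behavioral weight that suppresses very active users (illustrative).
    w = {u: 1.0 / len(items) ** 0.5 for u, items in user_items.items()}
    scores = {}
    for i, j in combinations(sorted(item_users), 2):
        common = item_users[i] & item_users[j]
        # Id(.) indicator: discard pairs with too few common neighbors.
        if len(common) < min_common_users:
            continue
        s = 0.0
        for u, v in combinations(sorted(common), 2):
            overlap = len(user_items[u] & user_items[v])
            s += w[u] * w[v] / (alpha + overlap)
        scores[(i, j)] = s
    return scores
```

Item pairs with fewer common users than the threshold are dropped entirely, which is exactly the denoising effect described for the video A and B example.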

Improvement point 2: support for the i2i scenario

In the i2i scenario, the SWING algorithm is optimized by learning the co-occurrence of global and scene-specific clicks to predict which items users may click in a scene. This is expressed by using the following formula:

(Figure: SWING formula for the i2i scenario)

The following figures show diagrams of global click and scene-specific click scenarios. Compared with the global click scenario, the scene-specific click scenario does not consider users who do not perform click operations and retains only the edges associated with user clicks in a scene.

(Figure: diagrams of the global click and scene-specific click scenarios)

Use the improved SWING similarity algorithm

  1. Download the swing-1.0.jar algorithm package to your on-premises machine.

If the package cannot be downloaded, copy the preceding link to a browser to open the package.

  2. Add the .jar package to a MaxCompute project.

    (Figure: adding the swing-1.0.jar package as a MaxCompute resource)

    Note: You only need to add the package to the project once.

Input and output formats

The input table, which can be a partitioned table, must contain at least two columns:

  • user_id: the user ID or session ID (recommended), of the BIGINT or STRING type. The type is not checked in the code.

  • item_list: the list of items, of the STRING type. Separate items with semicolons (;). The ID of each item must be of the BIGINT type.

Each item consists of at least three fields, for example, item_id,norm,timestamp,scene. Take note that item_id must be the first field.

The norm field indicates the recent popularity of an item, such as the number of user clicks on the item. It is the magnitude used to penalize extremely popular items and mitigate the "Harry Potter effect". If you do not know how to calculate norm, or the "Harry Potter effect" is not serious for your items, you can leave this field empty.

The "Harry Potter effect" refers to the phenomenon in which a very popular and widely known item, such as the Harry Potter books or movies, dominates the recommendation list and makes it difficult for other items to get the same attention.

Note: The value of the norm field is related only to the item itself. It indicates the global popularity of the item and is not related to specific users. The norm value for the same item must be consistent across all records in the training dataset.

The timestamp field must be in the %Y%m%d%H%M%S format, such as 20190805223205. If you do not need the timestamp field, fill it with the same value for every item. The items in the list must be sorted in chronological order, from the earliest click to the most recent click.
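As a quick sanity check, the timestamp format and the chronological-order requirement can be validated with Python's datetime module. These helpers are illustrative and not part of the toolkit:

```python
from datetime import datetime

def parse_item_timestamp(ts: str) -> datetime:
    """Parse a timestamp in the required %Y%m%d%H%M%S format;
    malformed values raise ValueError so bad rows fail fast."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

def is_chronological(timestamps) -> bool:
    """Check that timestamps run from the earliest click to the
    most recent click, as the input format requires."""
    parsed = [parse_item_timestamp(t) for t in timestamps]
    return all(a <= b for a, b in zip(parsed, parsed[1:]))
```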

The scene field is optional. This field indicates the scene in which the user clicks an item and is used to specify the i2i scenario.

| user_id | item_list |
| --- | --- |
| 12031602 | 558448406561,137,20190805223205;585456515773,39397,20190806170331;10200442969,81,20190807223820 |
| 3954442742 | 658448406561,137,20190805223206;485456515773,39397,20190806170335 |

Note: An item list cannot contain duplicate values of the item_id field in the same record. We recommend that you retain only one <user,item> pair per day and use the concatenation of the user ID and the date as a virtual session ID in the user_id column of the input table.
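The input rules above (comma-separated fields, semicolon-separated items, chronological order, no duplicate item_id values, and a user+date virtual session ID) can be sketched as a small row builder. These helpers are illustrative, not part of the toolkit, and assume the default delimiters:

```python
def build_item_list(items, item_delim=";", field_delim=","):
    """Serialize (item_id, norm, timestamp[, scene]) tuples into the
    item_list format: fields joined by commas, items by semicolons."""
    # Sort by the timestamp field (index 2): earliest click first.
    ordered = sorted(items, key=lambda it: it[2])
    seen, rows = set(), []
    for it in ordered:
        # Skip duplicate item_ids within the same record.
        if it[0] in seen:
            continue
        seen.add(it[0])
        rows.append(field_delim.join(str(f) for f in it))
    return item_delim.join(rows)

def virtual_session_id(user_id, date):
    """Concatenate the user ID and the date into a per-day
    virtual session ID, as recommended above."""
    return f"{user_id}_{date}"
```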

An output table can contain a partition column and must contain the following columns:

  • item_id: the item ID of the BIGINT type.

  • similar_items: the list of similar items.

The similar_items field must be in the item_id1,score1,coccur1,ori_score1;item_id2,score2,coccur2,ori_score2;... format. In this format, score1 is the similarity score normalized by the largest value, coccur1 is the number of co-occurrences, and ori_score1 is the original similarity score.
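A downstream consumer can split this format back into structured values. The following parser is a hypothetical helper that assumes the default delimiters:

```python
def parse_similar_items(s, item_delim=";", field_delim=","):
    """Parse a similar_items value into a list of
    (item_id, score, coccur, ori_score) tuples."""
    result = []
    for entry in s.split(item_delim):
        item_id, score, coccur, ori_score = entry.split(field_delim)
        result.append((int(item_id), float(score), int(coccur), float(ori_score)))
    return result
```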

Note: The output table must be created in advance with the correct column types. You can specify custom column names.
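For example, assuming the default column names and a ds partition column (the table and column names here are illustrative), the output table could be created with a MaxCompute SQL statement like the following:

```sql
-- Illustrative DDL; adjust names and the partition column as needed.
-- item_id must be of the BIGINT type, and similar_items of the STRING type.
CREATE TABLE IF NOT EXISTS swing_output (
    item_id       BIGINT,
    similar_items STRING
)
PARTITIONED BY (ds STRING);
```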

Sample result:

| item_id | similar_items |
| --- | --- |
| 1084315 | 7876717,0.000047,2,0.003601;6929557,0.000250,2,0.019373;1084342,0.000780,4,0.060325;1089552,0.000963,4,0.074516;1083467,0.008233,5,0.637016;66042,0.012925,6,1.000000 |
| 1090195 | 1090172,0.015136,1,1.000000 |

Reference commands

In DataWorks, create an ODPS MR node and run the following command to submit the job. If you create an ODPS SQL node instead, an error may occur.

jar [<GENERIC_OPTIONS>] <MAIN_CLASS> [ARGS];
        -conf <configuration_file>         Specify an application configuration file
        -resources <resource_name_list>    File/table resources used in the mapper or reducer, separated by commas
        -classpath <local_file_list>       Classpaths used to run mainClass
        -D<name>=<value>                   Property-value pair that is used to run mainClass
ARGS: <in_table/input_partition> <out_table/output_partition>

Examples:

Commands for public cloud (out-of-band) users:

##@resource_reference{"swing-1.0.jar"}
jar -resources swing-1.0.jar 
  -classpath swing-1.0.jar 
  -DtopN=150
  -Dmax.user.behavior.count=500
  -Dcommon.user.number.threshold=0
  -Dmax.user.per.item=600
  -Ddebug.info.print.number=10
  -Dalpha1=5
  -Dalpha2=1
  -Dbeta=0.3
  -Dodps.stage.mapper.split.size=1
  com.alibaba.algo.PaiSwing
  swing_click_input_table/ds=${bizdate}
  swing_output/ds=${bizdate}
;

Note: The complete command must include the @resource_reference comment in the first line. Specify parameter values in the -D<key>=<value> format.

Commands for in-band users:

jar -resources swing-1.0.jar 
  -classpath http://schedule@{env}inside.cheetah.alibaba-inc.com/scheduler/res?id=XXXXX
  -DtopN=150
  -Dmax.user.behavior.count=500
  -Dcommon.user.number.threshold=0
  -Dmax.user.per.item=600
  -Ddebug.info.print.number=10
  -Dalpha1=5
  -Dalpha2=1
  -Dbeta=0.3
  -Dodps.stage.mapper.split.size=1
  com.alibaba.algo.PaiSwing
  swing_click_input_table/ds=${bizdate}
  swing_output/ds=${bizdate}
;

Parameters

| Parameter | Description | Default value |
| --- | --- | --- |
| common.user.number.threshold | The minimum number of users who interact with both items. If you set this parameter to an extremely high value, very few similar items are generated. Configure this parameter based on your business requirements. | 0 |
| max.user.per.item | The maximum number of users retained for each item. | 700 |
| max.user.behavior.count | The maximum number of items associated with each user. If the number of items exceeds this limit, the most recent items are retained based on the timestamp. | 600 |
| debug.info.print.number | The number of records for which debug information is output. | 10 |
| alpha1 | A SWING algorithm parameter. For more information, see Formula [1]. | 5 |
| beta | A SWING algorithm parameter. For more information, see Formula [1]. | 0.3 |
| alpha2 | A SWING algorithm parameter. For more information, see Formula [1]. | 1 |
| user.column.name | The name of the user or session ID column. | user_id |
| item.list.column.name | The name of the item list column. | item_list |
| topN | The maximum number of KNNs retained for each trigger item. | 200 |
| odps.stage.mapper.split.size | The amount of data processed by each mapper. Unit: MB. | 256 |
| odps.stage.reducer.num | The number of reducers used to calculate item similarity. | 200 |
| item.delimiter | The separator between items in the item list in the input table. | ; |
| item.field.delimiter | The separator between the fields of an item in the input table. | , |
| pos_norm | The position of the norm field in an item, starting from 0. | 1 |
| pos_time | The position of the timestamp field in an item, starting from 0. | 2 |
| pos_scene | The position of the scene field in an item, starting from 0. | 3 |
| target.scene.name | The name of the scene used for i2i modeling. | Global i2i |
| max.time.span | The maximum number of days between two clicks on two neighboring items. | 1 |
| do_supplement_by_adamic_adar | Specifies whether to use the Adamic-Adar algorithm to supplement the top N similar items. | true |

FAQ

1. java.lang.ClassCastException: com.aliyun.odps.io.LongWritable cannot be cast to com.aliyun.odps.io.Text

FAILED: ODPS-0123131:User defined function exception - Traceback:
java.lang.ClassCastException: com.aliyun.odps.io.LongWritable cannot be cast to com.aliyun.odps.io.Text
 at com.aliyun.odps.udf.impl.batch.TextBinary.put(TextBinary.java:55)
 at com.aliyun.odps.udf.impl.batch.BaseWritableSerde.put(BaseWritableSerde.java:20)
 at com.aliyun.odps.udf.impl.batch.BatchUDTFCollector.collect(BatchUDTFCollector.java:54)
 at com.aliyun.odps.udf.UDTF.forward(UDTF.java:164)
 at com.aliyun.odps.mapred.bridge.LotTaskUDTF.collect(LotTaskUDTF.java:62)
 at com.aliyun.odps.mapred.bridge.LotReducerUDTF$ReduceContextImpl.write(LotReducerUDTF.java:167)
 at com.aliyun.odps.mapred.bridge.LotReducerUDTF$ReduceContextImpl.write(LotReducerUDTF.java:162)
 at com.aliyun.odps.mapred.bridge.LotReducerUDTF$ReduceContextImpl.write(LotReducerUDTF.java:151)
 at com.alibaba.algo.Paiswing$swingI2IReducer.reduce(Paiswing.java:346)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at com.aliyun.odps.mapred.bridge.utils.MapReduceUtils.runReducer(MapReduceUtils.java:160)
 at com.aliyun.odps.mapred.bridge.LotReducerUDTF.run(LotReducerUDTF.java:330)
 at com.aliyun.odps.udf.impl.batch.BatchStandaloneUDTFEvaluator.run(BatchStandaloneUDTFEvaluator.java:53)

Cause: The item_id column in the output table must be of the BIGINT type and cannot be of the STRING type. Make sure that the schema in the CREATE TABLE statement is correct.