Platform For AI: Improved swing similarity algorithm

Last Updated: Mar 09, 2026

Learn about the improved swing similarity algorithm, including deployment, parameters, and FAQ.

Improved swing algorithm

Improvement 1: Limit common neighbor count

The original swing algorithm is unreliable when too few users interact with both items of a pair. From a statistical perspective, a similarity score estimated from so few samples has a large margin of error.

Consider the extreme case shown in the following figure. There are three videos (A, B, and C), ten die-hard Real Madrid fans (X1 to X10), and one music lover (Y) who only listens to music. The Real Madrid fans watch content related to their team, including popular video C and less popular video A. Assume they have all watched the same 200 videos about Real Madrid. One fan, X1, receives a recommendation for an obscure music video B from a friend. After watching it, X1 dislikes the dark classical music and still prefers the rousing team anthem in video A. The music lover Y listens to all kinds of music and watches both A and B. In this scenario, only two users (X1 and Y) co-viewed videos A and B, yet the swing algorithm produces the score s(A,B)=0.33 > s(A,C)=0.22. However, when a user watches video A, recommending video C is clearly a better choice than recommending video B.

image
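The two scores in this example can be reproduced with a short sketch. It assumes the basic swing form, in which every unordered pair of users (u, v) who both clicked items i and j contributes 1/(alpha + |I_u ∩ I_v|), with alpha = 1. That formula, the function name, and the placeholder video IDs are assumptions for illustration and are not taken from this page.

```python
from itertools import combinations

def swing_score(clicks, i, j, alpha=1.0):
    """Basic swing similarity (assumed form): sum over unordered user pairs
    that clicked both i and j of 1 / (alpha + |I_u & I_v|)."""
    common_users = [u for u, items in clicks.items() if i in items and j in items]
    return sum(
        1.0 / (alpha + len(clicks[u] & clicks[v]))
        for u, v in combinations(common_users, 2)
    )

# 200 Real Madrid videos, including popular video C and less popular video A.
real_madrid = {"A", "C"} | {f"rm_{k}" for k in range(198)}  # placeholder IDs

clicks = {f"X{n}": set(real_madrid) for n in range(1, 11)}
clicks["X1"].add("B")      # X1 also watched music video B
clicks["Y"] = {"A", "B"}   # the music lover watched A and B

s_ab = swing_score(clicks, "A", "B")  # one pair (X1, Y), overlap {A, B}
s_ac = swing_score(clicks, "A", "C")  # 45 fan pairs, overlap of 200 videos each
print(f"s(A,B)={s_ab:.2f}  s(A,C)={s_ac:.2f}")
```

The single pair (X1, Y) contributes 1/3, which is more than the 45 fan pairs contribute together (45/201), so the noisy pair wins.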

Solution: This problem occurs when too few users watch both videos, so the results are processed according to the number of users who co-viewed the two videos. An indicator function Id(.) is added to the swing formula for denoising, and behavioral weights are introduced to down-weight overly popular users or items. The improved formula is as follows:

image.png
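One way to read the description above is sketched below. It assumes a user weight w_u = 1/(alpha1 + |I_u|)^beta, a pair term w_u * w_v / (alpha2 + |I_u ∩ I_v|), and an indicator that zeroes the score unless more than `threshold` users clicked both items. The parameter names mirror `alpha1`, `alpha2`, `beta`, and `common.user.number.threshold` from this page, but the exact functional form, including the strict comparison against the threshold, is an assumption.

```python
from itertools import combinations

def improved_swing(clicks, i, j, alpha1=5.0, alpha2=1.0, beta=0.3, threshold=0):
    """Hypothetical improved swing score; the functional form is assumed."""
    common = [u for u, items in clicks.items() if i in items and j in items]
    # Indicator function: discard item pairs with too few common users.
    if len(common) <= threshold:
        return 0.0
    # Behavioral weight: users with long click histories count less.
    w = {u: 1.0 / (alpha1 + len(clicks[u])) ** beta for u in common}
    return sum(
        w[u] * w[v] / (alpha2 + len(clicks[u] & clicks[v]))
        for u, v in combinations(common, 2)
    )

clicks = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"A", "C"},
    "u4": {"A", "C", "D"},
    "u5": {"A", "C", "E"},
}
print(improved_swing(clicks, "A", "B", threshold=2))  # only 2 common users: filtered
print(improved_swing(clicks, "A", "C", threshold=2))  # 4 common users: scored
```

With `threshold=2`, the pair (A, B) is dropped because only two users clicked both items, while (A, C) keeps a positive score.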

Improvement 2: Support for scenario i2i

Scenario item-to-item (i2i): Built on the swing algorithm, this method learns the co-occurrence of global clicks and scenario-specific clicks, so its results favor items that users click within a specific scenario. The formula is as follows:

image.png

The following figure shows the structure of global i2i and scenario i2i. Compared to global i2i, scenario i2i excludes users with no scenario-specific intent (no clicks within the scenario). It only retains edges between users and their clicks within the scenario.

image
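The filtering step described above can be sketched as follows. The sketch assumes each click is an (item_id, scene) pair; the function name and the scene names are hypothetical. It covers only the pruning of users and edges, not how the remaining edges are combined with global clicks in the score.

```python
def scenario_edges(clicks_by_user, target_scene):
    """Keep only users with at least one click in target_scene, and only
    their clicks inside that scene."""
    kept = {}
    for user, clicks in clicks_by_user.items():
        in_scene = [item for item, scene in clicks if scene == target_scene]
        if in_scene:  # users with no scenario-specific intent are dropped
            kept[user] = in_scene
    return kept

clicks_by_user = {
    "u1": [(101, "home"), (102, "search")],
    "u2": [(101, "home"), (103, "home")],
}
edges = scenario_edges(clicks_by_user, "search")
print(edges)
```

Here `u2` never clicked in the `search` scene and is excluded, while `u1` keeps only item 102.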

Deployment

  1. Download the algorithm package to your local machine: swing-1.0.jar.

If download fails, copy the preceding link and paste it into your browser.

  2. Add the JAR package as a resource in your MaxCompute project.

    image.png

    Note: Deploy the package only once per project.

Input and output formats

The input table, which can be a partitioned table, must contain at least the following columns:

  • user_id: User ID or session ID. Session ID is recommended. The type can be BIGINT or STRING. The code does not check the specific type.

  • item_list: A list of items, stored as a STRING. Use semicolons (;) to separate items. Each item is identified by an ID of BIGINT type.

Each clicked item consists of at least three comma-separated fields, by default in the order `item_id,norm,timestamp`, optionally followed by `scene`. The `item_id` must be the first field.

The `norm` field represents the recent popularity of an item, such as the number of users who clicked it. It is used to penalize extremely popular items and so mitigate the "Harry Potter effect". If you do not know how to calculate `norm`, or if the "Harry Potter effect" is not significant in your data, leave this field empty.

The "Harry Potter effect" refers to a phenomenon where a very popular and widely known item, such as the Harry Potter series of books or movies, dominates recommendation lists, which makes it difficult for other items to receive equal attention.

Note: The `norm` value is related only to the current item and represents its global popularity. It is not related to the current user. For the same item, the `norm` value must be consistent across all records in the training dataset.

The `timestamp` must follow the %Y%m%d%H%M%S format, such as `20190805223205`. If timestamp is not required, use the same value for all items. Organize the `item_list` in chronological order from earliest click to most recent.

The `scene` field is optional. It specifies the scenario where the user behavior occurred and is used to support scenario i2i.

| user_id | item_list |
| --- | --- |
| 12031602 | 558448406561,137,20190805223205;585456515773,39397,20190806170331;10200442969,81,20190807223820 |
| 3954442742 | 658448406561,137,20190805223206;485456515773,39397,20190806170335 |

Note: The `item_list` in a single record must not contain duplicate `item_id` values. Keep only one <user, item> pair per day, and derive a virtual session ID per user per day from the `user_id` in the input table, for example with `concat(user_id, date)`.
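As an illustration of these rules, the following sketch turns raw click events into input rows. The function name and event layout are hypothetical, and keeping the first click when an item repeats within a session is an assumption; the page only requires that duplicates be removed.

```python
from collections import defaultdict
from datetime import datetime

def build_rows(events):
    """events: iterable of (user_id, item_id, norm, click_time).
    Returns {session_id: item_list}, one session per user per day."""
    sessions = defaultdict(dict)
    # Process clicks from earliest to most recent so item_list stays chronological.
    for user_id, item_id, norm, ts in sorted(events, key=lambda e: e[3]):
        session_id = f"{user_id}{ts:%Y%m%d}"  # equivalent of concat(user_id, date)
        # Assumption: keep only the first click on each item within a session.
        sessions[session_id].setdefault(item_id, (norm, ts))
    return {
        sid: ";".join(
            f"{item},{norm},{ts:%Y%m%d%H%M%S}" for item, (norm, ts) in items.items()
        )
        for sid, items in sessions.items()
    }

events = [
    (12031602, 585456515773, 39397, datetime(2019, 8, 5, 23, 10, 0)),
    (12031602, 558448406561, 137, datetime(2019, 8, 5, 22, 32, 5)),
    (12031602, 558448406561, 137, datetime(2019, 8, 5, 22, 40, 0)),  # duplicate item
]
rows = build_rows(events)
print(rows)
```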

The output table has the following format. Partition key columns are supported.

  • item_id: Anchor item ID of BIGINT type.

  • similar_items: A list of similar items.

The `similar_items` field uses the format `item_id1,score1,coccur1,ori_score1;item_id2,score2,coccur2,ori_score2;...`. In this format, `ori_score1` is the original similarity score, `score1` is the score after max-value normalization, and `coccur1` is the number of co-occurrences.

Note: You must create the output table in advance. Ensure that the column types are correct. You can use custom column names.

Example result:

| item_id | similar_items |
| --- | --- |
| 1084315 | 7876717,0.000047,2,0.003601;6929557,0.000250,2,0.019373;1084342,0.000780,4,0.060325;1089552,0.000963,4,0.074516;1083467,0.008233,5,0.637016;66042,0.012925,6,1.000000 |
| 1090195 | 1090172,0.015136,1,1.000000 |
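Downstream jobs usually need to split `similar_items` back into records. The following is a minimal parsing sketch; the class and function names are hypothetical, and the field names simply follow the positional order documented above.

```python
from typing import NamedTuple

class SimilarItem(NamedTuple):
    item_id: int
    score: float      # second field
    coccur: int       # third field: number of co-occurrences
    ori_score: float  # fourth field

def parse_similar_items(value):
    """Split a similar_items string into a list of SimilarItem records."""
    result = []
    for entry in value.split(";"):
        if not entry:
            continue  # tolerate a trailing separator
        item_id, score, coccur, ori_score = entry.split(",")
        result.append(
            SimilarItem(int(item_id), float(score), int(coccur), float(ori_score))
        )
    return result

items = parse_similar_items("1084342,0.000780,4,0.060325;66042,0.012925,6,1.000000")
print(items[-1].item_id, items[-1].coccur)
```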

Reference commands

In DataWorks, create an ODPS MR node. An ODPS SQL node may cause errors. Use the following command to submit the job:

jar [<GENERIC_OPTIONS>] <MAIN_CLASS> [ARGS];
        -conf <configuration_file>         Specify an application configuration file
        -resources <resource_name_list>    file\table resources used in mapper or reducer, seperate by comma
        -classpath <local_file_list>       classpaths used to run mainClass
        -D<name>=<value>                  Property value pair, which will be used to run mainClass
ARGS: <in_table/input_partition> <out_table/output_partition>

Example:

Command for public cloud (external) users:

##@resource_reference{"swing-1.0.jar"}
jar -resources swing-1.0.jar 
  -classpath swing-1.0.jar 
  -DtopN=150
  -Dmax.user.behavior.count=500
  -Dcommon.user.number.threshold=0
  -Dmax.user.per.item=600
  -Ddebug.info.print.number=10
  -Dalpha1=5
  -Dalpha2=1
  -Dbeta=0.3
  -Dodps.stage.mapper.split.size=1
  com.alibaba.algo.PaiSwing
  swing_click_input_table/ds=${bizdate}
  swing_output/ds=${bizdate}
;

Note: The comment on the first line (`##@resource_reference{"swing-1.0.jar"}`) is part of the command and must be included. Specify each parameter as `-D<key>=<value>`, with no space after `-D`.
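If the command is generated by a script, a small helper can keep the `-D<key>=<value>` formatting consistent. The helper below is a hypothetical convenience that only builds the command text in the layout of the public cloud example; it does not submit anything to MaxCompute.

```python
def build_swing_command(params, input_table, output_table, jar="swing-1.0.jar"):
    """Assemble the jar command text; params maps parameter names to values."""
    lines = [
        f'##@resource_reference{{"{jar}"}}',  # required first-line comment
        f"jar -resources {jar}",
        f"  -classpath {jar}",
    ]
    # No space between -D and the key.
    lines += [f"  -D{key}={value}" for key, value in params.items()]
    lines += [
        "  com.alibaba.algo.PaiSwing",
        f"  {input_table}",
        f"  {output_table}",
        ";",
    ]
    return "\n".join(lines)

cmd = build_swing_command(
    {"topN": 150, "alpha1": 5, "alpha2": 1, "beta": 0.3},
    "swing_click_input_table/ds=${bizdate}",
    "swing_output/ds=${bizdate}",
)
print(cmd)
```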

Command for internal users:

jar -resources swing-1.0.jar 
  -classpath http://schedule@{env}inside.cheetah.alibaba-inc.com/scheduler/res?id=XXXXX
  -DtopN=150
  -Dmax.user.behavior.count=500
  -Dcommon.user.number.threshold=0
  -Dmax.user.per.item=600
  -Ddebug.info.print.number=10
  -Dalpha1=5
  -Dalpha2=1
  -Dbeta=0.3
  -Dodps.stage.mapper.split.size=1
  com.alibaba.algo.PaiSwing
  swing_click_input_table/ds=${bizdate}
  swing_output/ds=${bizdate}
;

Obtain classpath in the Pandora container:

In DataWorks, right-click the swing resource plan and click 'Historical Versions' to obtain the file path, which starts with `http`.

Parameters

| Parameter | Description | Type and default value |
| --- | --- | --- |
| common.user.number.threshold | Threshold on the number of common users (users who interacted with both items), which sets the filtering strength. A value that is too high may leave too few similar items. Adjust this parameter based on your business scenario. | Default: 0. |
| max.user.per.item | The number of user click sequences used to calculate k-nearest neighbors for each item. | Integer. Default: 700. |
| max.user.behavior.count | Maximum length of a user's behavior sequence. If a sequence exceeds this length, it is truncated to keep the most recent items. | Integer. Default: 600. |
| debug.info.print.number | The number of records for which to output debug information. | Integer. Default: 10. |
| alpha1 | A parameter for the swing algorithm. For more information, see Formula [1]. | Integer. Default: 5. |
| beta | A parameter for the swing algorithm. For more information, see Formula [1]. | Real number. Default: 0.3. |
| alpha2 | A parameter for the swing algorithm. For more information, see Formula [1]. | Integer. Default: 1. |
| user.column.name | The name of the user ID or session ID column. | String. Default: "user_id". |
| item.list.column.name | The name of the column for the list of {Item ID, Norm}. | String. Default: "item_list". |
| topN | The number of k-nearest neighbors to retain for each trigger item. | Integer. Default: 200. |
| odps.stage.mapper.split.size | Concurrency control: the amount of data processed per mapper. | Integer, in MB. Default: 256. |
| odps.stage.reducer.num | Concurrency control: the number of reducers used to calculate item pair similarity. | Integer. Default: 200. |
| item.delimiter | The separator for the item list in the input table. | Default: semicolon (;). |
| item.field.delimiter | The separator for the fields of each item in the input table. | Default: comma (,). |
| pos_norm | The field index for item popularity, starting from 0. In the preceding example, the value is 1. | Integer. Default: 1. |
| pos_time | The field index for the timestamp, starting from 0. In the preceding example, the value is 2. | Integer. Default: 2. |
| pos_scene | The field index for the scenario name, starting from 0. | Integer. Default: 3. |
| target.scene.name | The name of the target scenario, used for scenario i2i modeling. | Default: global i2i. |
| max.time.span | Maximum interval in days between clicks on two items for them to be considered neighbors. | Integer. Default: 1. |
| do_supplement_by_adamic_adar | Specifies whether to use the Adamic/Adar algorithm to supplement results if the number of similar items is less than `topN`. | Boolean. Default: true. |
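Two of these parameters act directly on the click sequence, and their effect can be pictured with a small sketch. The function names are hypothetical, and treating `max.time.span` as an elapsed-time limit of N x 24 hours (rather than calendar days) is an assumption.

```python
from datetime import datetime

def truncate_recent(item_list, max_count):
    """max.user.behavior.count: keep the most recent max_count items of a
    chronologically ordered sequence."""
    return item_list[-max_count:] if max_count > 0 else []

def within_time_span(ts_a, ts_b, max_days):
    """max.time.span: True if two clicks are at most max_days apart
    (assumed to mean max_days * 24 hours of elapsed time)."""
    fmt = "%Y%m%d%H%M%S"
    delta = datetime.strptime(ts_a, fmt) - datetime.strptime(ts_b, fmt)
    return abs(delta.total_seconds()) <= max_days * 86400

print(truncate_recent([1, 2, 3, 4, 5], 3))
print(within_time_span("20190805223205", "20190806170331", 1))
print(within_time_span("20190805223205", "20190807223820", 1))
```

With the timestamps from the input example, the first two clicks are less than a day apart and count as neighbors under this reading, while the first and third clicks do not.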

FAQ

1. java.lang.ClassCastException: com.aliyun.odps.io.LongWritable cannot be cast to com.aliyun.odps.io.Text

FAILED: ODPS-0123131:User defined function exception - Traceback:
java.lang.ClassCastException: com.aliyun.odps.io.LongWritable cannot be cast to com.aliyun.odps.io.Text
 at com.aliyun.odps.udf.impl.batch.TextBinary.put(TextBinary.java:55)
 at com.aliyun.odps.udf.impl.batch.BaseWritableSerde.put(BaseWritableSerde.java:20)
 at com.aliyun.odps.udf.impl.batch.BatchUDTFCollector.collect(BatchUDTFCollector.java:54)
 at com.aliyun.odps.udf.UDTF.forward(UDTF.java:164)
 at com.aliyun.odps.mapred.bridge.LotTaskUDTF.collect(LotTaskUDTF.java:62)
 at com.aliyun.odps.mapred.bridge.LotReducerUDTF$ReduceContextImpl.write(LotReducerUDTF.java:167)
 at com.aliyun.odps.mapred.bridge.LotReducerUDTF$ReduceContextImpl.write(LotReducerUDTF.java:162)
 at com.aliyun.odps.mapred.bridge.LotReducerUDTF$ReduceContextImpl.write(LotReducerUDTF.java:151)
 at com.alibaba.algo.Paiswing$swingI2IReducer.reduce(Paiswing.java:346)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at com.aliyun.odps.mapred.bridge.utils.MapReduceUtils.runReducer(MapReduceUtils.java:160)
 at com.aliyun.odps.mapred.bridge.LotReducerUDTF.run(LotReducerUDTF.java:330)
 at com.aliyun.odps.udf.impl.batch.BatchStandaloneUDTFEvaluator.run(BatchStandaloneUDTFEvaluator.java:53)

This error occurs when the `item_id` column of the output table is declared as STRING. The column must be of BIGINT type. Check the column types in the CREATE TABLE statement for the output table.

Reference

For more information about the basic Swing algorithm, see Swing algorithm tool.