This topic describes how the improved Sliding Window and Incremental Generation (SWING) similarity algorithm works, how to download the toolkit, the toolkit parameters, and frequently asked questions (FAQ).
Introduction to improvements
Improvement point 1: limit in the number of common neighbors
The original SWING similarity algorithm is not suitable for scenarios in which the number of users who interact with both items is extremely low. From a statistical perspective, such insufficient data causes the algorithm to generate results with significant errors.
The following section provides an example:
The following figure shows the medium- to low-frequency video A, the niche video B, the popular video C, 10 football fans X1 to X10, and the music lover Y. The 10 football fans watch 200 videos related to videos A and C. A friend recommends the niche video B to fan X1, but X1 prefers video A and does not like video B. Music lover Y watches various music-related videos, including videos A and B. In this case, only two users watch both videos A and B, yet the SWING similarity algorithm yields s(A,B) = 0.33 > s(A,C) = 0.22. Recommending video C to a user who is watching video A is therefore better than recommending video B.
Solution: The issue is caused by the small number of users who watch both videos. To resolve it, the results must be adjusted based on the number of users who watch both videos. Specifically, an indicator function Id(.) is added to the SWING formula to achieve denoising, and behavioral weights are introduced to suppress overly popular users and items. Improved formula:
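The improved formula itself is not reproduced in this excerpt. As a hedged reconstruction only, based on the standard SWING formulation and the toolkit's alpha1, alpha2, and beta parameters, it plausibly takes the following shape, where the threshold c is assumed to correspond to the common.user.number.threshold parameter. This is an assumption, not the toolkit's authoritative definition:

```latex
% Plausible reconstruction of the improved SWING score (assumption, not authoritative).
% U_i = users who clicked item i, I_u = items clicked by user u,
% Id(.) = indicator function, c = common.user.number.threshold,
% alpha_1, alpha_2, beta = toolkit parameters, w_u = behavioral weight of user u.
s(i,j) \;=\; \mathrm{Id}\!\left(\lvert U_i \cap U_j \rvert \ge c\right)
\sum_{u \in U_i \cap U_j} \; \sum_{v \in U_i \cap U_j}
w_u \, w_v \, \frac{1}{\alpha_1 + \lvert I_u \cap I_v \rvert},
\qquad
w_u \;=\; \frac{1}{\left(\alpha_2 + \lvert I_u \rvert\right)^{\beta}}
```

Under this reading, the indicator term implements the denoising described above, and the weights w_u penalize users with very long click histories.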
Improvement point 2: support for the i2i scenario
In the i2i (item-to-item) scenario, the SWING algorithm is optimized by learning the co-occurrence of global clicks and scene-specific clicks to predict which items users are likely to click in a given scene. This is expressed by the following formula:
The following figures show diagrams of the global-click and scene-specific-click scenarios. Compared with the global-click scenario, the scene-specific-click scenario ignores users who do not click and retains only the edges associated with user clicks in the scene.
Use the improved SWING similarity algorithm
Download the swing-1.0.jar algorithm package to your on-premises machine.
If the package cannot be downloaded, copy the preceding link to a browser to open the package.
Add the .jar package to a MaxCompute project.
Note: You only need to deploy the package once per project.
Input and output formats
An input table, such as a partitioned table, must contain at least two columns:
user_id: the user ID or session ID (recommended) of the BIGINT or STRING type, which is not checked in the code.
item_list: the list of items, of the STRING type. Separate items with semicolons (;). The item_id of each item is of the BIGINT type.
Each item consists of at least three fields, such as item_id,norm,timestamp,scene. Note that item_id must be placed first.
norm indicates the recent popularity of an item, such as the number of user clicks on the item. This is the magnitude used to penalize extremely popular items and handle the "Harry Potter effect". If you do not know how norm is calculated, or the "Harry Potter effect" is not serious for your items, you can leave this field empty.
The "Harry Potter effect" refers to the phenomenon in which a very popular and widely known item, such as the Harry Potter books or movies, dominates the recommendation list and makes it difficult for other items to get the same attention.
Note: The value of the norm field is related only to the current item. This value indicates the global popularity of the item and is not related to the current users. The value of the norm field for the same item must be consistent across multiple records in the entire training dataset.
The timestamp field must be in the %Y%m%d%H%M%S format, such as 20190805223205. If you do not require the timestamp field, fill it with the same timestamp value for each item. The items in the list must be sorted in chronological order from the earliest click to the most recent click.
The scene field is optional. This field indicates the scene in which the user clicks an item and is used to specify the i2i scenario.
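As a small illustration (not part of the toolkit), the following Python snippet shows how to produce and parse timestamps in the required %Y%m%d%H%M%S layout, for example when preparing the input table:

```python
from datetime import datetime

# Format a click time into the %Y%m%d%H%M%S layout expected in the item list.
click_time = datetime(2019, 8, 5, 22, 32, 5)
ts = click_time.strftime("%Y%m%d%H%M%S")  # "20190805223205"

# Parse it back, for example to sort items chronologically before writing a row.
parsed = datetime.strptime(ts, "%Y%m%d%H%M%S")
```

The same format codes work in any language with strftime-style formatting.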
| user_id | item_list |
| --- | --- |
| 12031602 | 558448406561,137,20190805223205;585456515773,39397,20190806170331;10200442969,81,20190807223820 |
| 3954442742 | 658448406561,137,20190805223206;485456515773,39397,20190806170335 |
Note: An item list cannot contain duplicate item_id values within the same record. We recommend that you retain only one <user,item> pair per day and fill the user_id column of the input table with the concatenation of the user ID and the date as a virtual session ID.
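To make the input format concrete, here is a minimal Python sketch of a hypothetical helper (not part of the toolkit) that builds one input row: it sorts the items chronologically and derives a virtual session ID by concatenating the user ID and the date, as recommended above. The scene name "feed" is made up for illustration.

```python
def build_input_row(user_id, clicks):
    """Build one input-table row from (item_id, norm, timestamp, scene) tuples.

    Hypothetical helper: delimiters mirror the documented defaults
    (';' between items, ',' between fields); timestamps use %Y%m%d%H%M%S.
    """
    # The item list must be ordered from the earliest to the most recent click.
    clicks = sorted(clicks, key=lambda c: c[2])
    item_list = ";".join(",".join(str(field) for field in c) for c in clicks)
    # Virtual session ID: user ID concatenated with the date of the first click.
    session_id = f"{user_id}_{clicks[0][2][:8]}"
    return session_id, item_list

sid, row = build_input_row(12031602, [
    ("585456515773", 39397, "20190806170331", "feed"),
    ("558448406561", 137, "20190805223205", "feed"),
])
```

Each returned pair corresponds to one record in the user_id and item_list columns.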
An output table can contain a partition column and must contain the following columns:
item_id: the item ID of the BIGINT type.
similar_items: the list of similar items.
The similar_items field must be in the item_id1,score1,coccur1,ori_score1;item_id2,score2,coccur2,ori_score2;... format. ori_score is the original similarity score, score is the score normalized by the largest original score, and coccur is the number of co-occurrences.
Note: An output table must be created in advance. The column type must be correct. You can specify custom column names.
Sample result:
| item_id | similar_items |
| --- | --- |
| 1084315 | 7876717,0.000047,2,0.003601;6929557,0.000250,2,0.019373;1084342,0.000780,4,0.060325;1089552,0.000963,4,0.074516;1083467,0.008233,5,0.637016;66042,0.012925,6,1.000000 |
| 1090195 | 1090172,0.015136,1,1.000000 |
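For reference, a short illustrative Python sketch (not part of the toolkit) that parses a similar_items cell into typed tuples, assuming the default ";" and "," delimiters:

```python
def parse_similar_items(cell):
    """Parse a similar_items cell into (item_id, score, coccur, ori_score) tuples."""
    neighbors = []
    for record in cell.split(";"):
        item_id, score, coccur, ori_score = record.split(",")
        neighbors.append((int(item_id), float(score), int(coccur), float(ori_score)))
    return neighbors

neighbors = parse_similar_items(
    "1083467,0.008233,5,0.637016;66042,0.012925,6,1.000000"
)
# score is ori_score normalized by the largest ori_score, so the ratio
# score / ori_score is approximately the same for every neighbor in a cell.
```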
Reference commands
In DataWorks, create an ODPS MR node and run the following commands to submit the job. If you use an ODPS SQL node instead, an error may occur.
jar [<GENERIC_OPTIONS>] <MAIN_CLASS> [ARGS];
-conf <configuration_file> Specify an application configuration file
-resources <resource_name_list> file/table resources used in mapper or reducer, separated by commas
-classpath <local_file_list> classpaths used to run mainClass
-D<name>=<value> Property value pair, which will be used to run mainClass
ARGS: <in_table/input_partition> <out_table/output_partition>
Examples:
Commands for public cloud (out-of-band) users:
##@resource_reference{"swing-1.0.jar"}
jar -resources swing-1.0.jar
-classpath swing-1.0.jar
-DtopN=150
-Dmax.user.behavior.count=500
-Dcommon.user.number.threshold=0
-Dmax.user.per.item=600
-Ddebug.info.print.number=10
-Dalpha1=5
-Dalpha2=1
-Dbeta=0.3
-Dodps.stage.mapper.split.size=1
com.alibaba.algo.PaiSwing
swing_click_input_table/ds=${bizdate}
swing_output/ds=${bizdate}
;
Note: The complete command must include the first comment line (##@resource_reference{...}). Specify parameter values in the -D<key>=<value> format.
Commands for in-band users:
jar -resources swing-1.0.jar
-classpath http://schedule@{env}inside.cheetah.alibaba-inc.com/scheduler/res?id=XXXXX
-DtopN=150
-Dmax.user.behavior.count=500
-Dcommon.user.number.threshold=0
-Dmax.user.per.item=600
-Ddebug.info.print.number=10
-Dalpha1=5
-Dalpha2=1
-Dbeta=0.3
-Dodps.stage.mapper.split.size=1
com.alibaba.algo.PaiSwing
swing_click_input_table/ds=${bizdate}
swing_output/ds=${bizdate}
;
Parameters
| Parameter | Description | Default value |
| --- | --- | --- |
| common.user.number.threshold | The minimum number of users who interact with both items at the same time. If you set this parameter to an extremely high value, very few similar items are generated. Configure this parameter based on your business requirements. | 0 |
| max.user.per.item | The maximum number of K-nearest neighbors (KNNs) for each item. | 700 |
| max.user.behavior.count | The maximum number of items associated with each user. If the number of items exceeds this limit, only the most recent items are retained based on the timestamp. | 600 |
| debug.info.print.number | The number of records for which debug information is printed. | 10 |
| alpha1 | A SWING algorithm parameter. For more information, see Formula [1]. | 5 |
| beta | A SWING algorithm parameter. For more information, see Formula [1]. | 0.3 |
| alpha2 | A SWING algorithm parameter. For more information, see Formula [1]. | 1 |
| user.column.name | The name of the user or session ID column. | user_id |
| item.list.column.name | The name of the item list column. | item_list |
| topN | The maximum number of KNNs retained for each trigger item. | 200 |
| odps.stage.mapper.split.size | The amount of data processed by each mapper. Unit: MB. | 256 |
| odps.stage.reducer.num | The number of reducers used to calculate item similarity. | 200 |
| item.delimiter | The separator between items in the item list of the input table. | ; |
| item.field.delimiter | The separator between the fields of an item in the input table. | , |
| pos_norm | The position of the norm field in the item information, which starts from 0. | 1 |
| pos_time | The position of the timestamp field in the item information, which starts from 0. | 2 |
| pos_scene | The position of the scene field in the item information, which starts from 0. | 3 |
| target.scene.name | The name of the scene used for i2i modeling. | Global i2i |
| max.time.span | The maximum number of days between clicks on two neighboring items. | 1 |
| do_supplement_by_adamic_adar | Specifies whether to use the Adamic-Adar algorithm to supplement the top N similar items. | true |
FAQ
1. java.lang.ClassCastException: com.aliyun.odps.io.LongWritable cannot be cast to com.aliyun.odps.io.Text
FAILED: ODPS-0123131:User defined function exception - Traceback:
java.lang.ClassCastException: com.aliyun.odps.io.LongWritable cannot be cast to com.aliyun.odps.io.Text
at com.aliyun.odps.udf.impl.batch.TextBinary.put(TextBinary.java:55)
at com.aliyun.odps.udf.impl.batch.BaseWritableSerde.put(BaseWritableSerde.java:20)
at com.aliyun.odps.udf.impl.batch.BatchUDTFCollector.collect(BatchUDTFCollector.java:54)
at com.aliyun.odps.udf.UDTF.forward(UDTF.java:164)
at com.aliyun.odps.mapred.bridge.LotTaskUDTF.collect(LotTaskUDTF.java:62)
at com.aliyun.odps.mapred.bridge.LotReducerUDTF$ReduceContextImpl.write(LotReducerUDTF.java:167)
at com.aliyun.odps.mapred.bridge.LotReducerUDTF$ReduceContextImpl.write(LotReducerUDTF.java:162)
at com.aliyun.odps.mapred.bridge.LotReducerUDTF$ReduceContextImpl.write(LotReducerUDTF.java:151)
at com.alibaba.algo.Paiswing$swingI2IReducer.reduce(Paiswing.java:346)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.aliyun.odps.mapred.bridge.utils.MapReduceUtils.runReducer(MapReduceUtils.java:160)
at com.aliyun.odps.mapred.bridge.LotReducerUDTF.run(LotReducerUDTF.java:330)
at com.aliyun.odps.udf.impl.batch.BatchStandaloneUDTFEvaluator.run(BatchStandaloneUDTFEvaluator.java:53)
The value of item_id in the output table must be of the BIGINT type, not the STRING type. Make sure that the schema in the create table statement is correct.