The story of beer and diapers is a classic example of data mining in practice. Beer and diapers appear to be unrelated products, yet when they are placed next to each other on shelves, sales of both increase. The challenge is to find hidden correlations between seemingly unrelated products. Collaborative filtering, one of the algorithms commonly used in data mining, addresses this challenge by uncovering hidden correlations among customers and products.
Collaborative filtering is an association rule-based algorithm. Take shopping behavior as an example, with customers A and B and products X, Y, and Z. If customers A and B have both purchased products X and Y, collaborative filtering concludes that they have similar shopping interests. Because customer A has also purchased product Z, collaborative filtering recommends product Z to customer B. In this case, the recommendation is driven by the similarity between the customers’ interests.
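The A/B and X/Y/Z example can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by the platform component: the `purchases` data, the choice of Jaccard similarity, and the `min_similarity` threshold are all assumptions made for the example.

```python
# Toy purchase history matching the example above (hypothetical data).
purchases = {
    "A": {"X", "Y", "Z"},
    "B": {"X", "Y"},
}


def jaccard(a, b):
    """Fraction of products two customers share, out of all products either bought."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, purchases, min_similarity=0.5):
    """Suggest products bought by sufficiently similar customers but not by the target.

    min_similarity is an arbitrary threshold chosen for this sketch.
    """
    owned = purchases[target]
    suggestions = set()
    for other, items in purchases.items():
        if other != target and jaccard(owned, items) >= min_similarity:
            suggestions |= items - owned
    return sorted(suggestions)


print(recommend("B", purchases))  # ['Z']
```

Customers A and B share two of the three products either has bought, so they count as similar, and the only product A has that B lacks is Z.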
This project shows how to use customer shopping behaviors recorded before July to find correlations between products, and then uses those correlations to recommend relevant products to customers. The project also evaluates the recommendation results. For example, customer A purchased product X before July, and product X is strongly correlated with product Y. The system then recommends product Y to customer A after July and calculates the probability of customer A purchasing product Y.
Data source: the Tianchi challenges. Two datasets are provided: the shopping behaviors recorded before July and the shopping behaviors recorded after July.
The attributes are as follows:
| Field | Meaning | Type | Description |
| --- | --- | --- | --- |
| user_id | User ID | string | User ID of a customer. |
| item_id | Product ID | string | ID of a product. |
| active_type | Shopping behavior | string | A value of 0 indicates that the product page is viewed by the customer. A value of 1 indicates that the product is purchased. A value of 2 indicates that the product is added to the customer’s favorites. A value of 3 indicates that the product is added to the customer’s shopping cart. |
| active_date | Purchased at | string | Time when the product is purchased. |
The following figure shows the data entries:
The following figure shows the workflow of this project:
Load the dataset recorded before July, use SQL scripts to extract the shopping behaviors, and import the data to the collaborative filtering component. Set the TopN attribute to 1 for the collaborative filtering component. This allows the collaborative filtering component to find the most similar item for each input item and calculate its weight. Analyze the shopping behaviors and then make predictions about items that are most likely to be purchased by the same customer.
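The step above can be approximated outside the platform as follows. This is a rough sketch under stated assumptions: the sample `rows` are invented, the extraction is assumed to keep only purchases (`active_type` of `"1"`), and Jaccard similarity over the sets of buyers stands in for whatever similarity measure the collaborative filtering component actually applies. The `top_n` parameter mirrors the TopN setting.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows in the (user_id, item_id, active_type, active_date) layout.
rows = [
    ("u1", "i1", "1", "06-01"),
    ("u1", "i2", "1", "06-02"),
    ("u2", "i1", "1", "06-03"),
    ("u2", "i2", "1", "06-04"),
    ("u2", "i3", "1", "06-05"),
    ("u3", "i3", "0", "06-06"),  # page view only, filtered out below
]


def top_n_similar(rows, top_n=1):
    """Return, for each purchased item, its top_n most similar items with scores."""
    # Extraction step: keep purchases only and group buyers by item.
    buyers = defaultdict(set)
    for user_id, item_id, active_type, _active_date in rows:
        if active_type == "1":
            buyers[item_id].add(user_id)

    # Score every pair of items that shares at least one buyer.
    scores = defaultdict(list)
    for a, b in combinations(sorted(buyers), 2):
        shared = len(buyers[a] & buyers[b])
        if shared == 0:
            continue
        sim = shared / len(buyers[a] | buyers[b])  # Jaccard similarity (assumption)
        scores[a].append((b, sim))
        scores[b].append((a, sim))

    # Keep the best top_n candidates per item; ties are broken by item ID.
    return {
        item: sorted(candidates, key=lambda pair: (-pair[1], pair[0]))[:top_n]
        for item, candidates in scores.items()
    }


similar = top_n_similar(rows, top_n=1)
print(similar["i1"])  # [('i2', 1.0)]
```

With `top_n=1`, each item keeps exactly one partner, which is the behavior the TopN setting of 1 is described as producing.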
The following figure shows the relevant settings:
The following figure shows the collaborative filtering results. The itemid column shows the IDs of the target products. The similarity column holds two colon-separated values: the ID of the product that is most strongly correlated with the target product, and the weight of that correlation, which this project reads as the probability of that product also being purchased.
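A value in the similarity column can be split into its two parts with a small helper. This is only a sketch: the function name `parse_similarity` and the sample value are hypothetical, and it assumes one `id:weight` pair per cell, as produced when TopN is 1.

```python
def parse_similarity(cell):
    """Split an 'item_id:weight' string into (item_id, weight)."""
    # rpartition splits on the last colon, so the weight is always the final part.
    item_id, sep, weight = cell.rpartition(":")
    if not sep or not item_id:
        raise ValueError(f"expected 'item_id:weight', got {cell!r}")
    return item_id, float(weight)


print(parse_similarity("1000:0.25"))  # ('1000', 0.25)
```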
The preceding steps show how to list all strongly correlated products. The following figure shows the workflow that uses the product similarity list to make recommendations and evaluate the results. For example, if customer A purchased product X, and product X is strongly correlated with product Y, product Y is then recommended to customer A.
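The recommendation rule described above amounts to joining each purchase with the similarity list. The sketch below assumes hypothetical inputs: `purchases` as (customer, product) pairs and `similar_item` as a mapping from each product to its single most similar product. It does not reproduce the platform workflow itself.

```python
def build_recommendations(purchases, similar_item):
    """Map each (customer, purchased product) to (customer, correlated product)."""
    recommendations = set()  # a set removes duplicate rows automatically
    for user_id, item_id in purchases:
        target = similar_item.get(item_id)
        if target is not None:
            recommendations.add((user_id, target))
    return recommendations


purchases_before_july = [("A", "X"), ("A", "X"), ("B", "W")]  # hypothetical data
similar_item = {"X": "Y"}  # product X is strongly correlated with product Y

print(build_recommendations(purchases_before_july, similar_item))  # {('A', 'Y')}
```

Customer B receives no recommendation here because product W has no entry in the similarity list.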
This figure shows the statistics components. Full table scan component 1 shows the recommendation list created from the shopping behaviors recorded before July. After duplicate rows are removed, the final list contains 18,065 entries. Full table scan component 2 shows how many products in the recommendation list were actually purchased by the customers. In this project, 90 of the recommended products were purchased.
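One way to read these two numbers is as a hit rate: the share of recommendations that were followed by a purchase. The sketch below is an interpretation rather than a metric stated by the project; it assumes each of the 90 purchases corresponds to one of the 18,065 recommendation entries, and the function name `hit_rate` is hypothetical.

```python
def hit_rate(recommended, purchased_later):
    """Share of recommended (customer, product) pairs that were later purchased."""
    recommended = set(recommended)
    if not recommended:
        return 0.0
    return len(recommended & set(purchased_later)) / len(recommended)


# Small hypothetical example: one of two recommendations is purchased.
print(hit_rate({("A", "Y"), ("B", "Z")}, {("A", "Y"), ("C", "X")}))  # 0.5

# The counts reported above, under the assumption stated in the lead-in.
print(f"{90 / 18065:.2%}")  # 0.50%
```

A rate of roughly half a percent is consistent with the conclusion that the recommendations leave room for improvement.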
The recommendation results suggest the following improvements to the project:
The project should account for all factors that may influence the recommendation results. For example, shopping behaviors are time-sensitive. The dataset in this project covers shopping behaviors recorded over several months, and outdated records may prevent you from getting the expected recommendation results. Additionally, the project focuses only on the hidden correlations between products and does not consider the attributes of the recommended products, such as whether a product is purchased frequently or infrequently. If customer A bought a cell phone last month, they are unlikely to buy another one the next month. Cell phones are an infrequently purchased product.
To increase the accuracy of the predictions, this project should use a model trained by machine learning. The latent product associations should be used only as supplementary data.
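The point about time-sensitive behaviors could be addressed by giving older records less weight before they reach the similarity calculation. The following is one possible sketch, not something the project implements: exponential decay, the 30-day `half_life_days` default, and the sample dates are all assumptions.

```python
from datetime import date


def decay_weight(event_date, reference_date, half_life_days=30.0):
    """Weight a behavior record so that it counts half as much every half_life_days."""
    age_days = (reference_date - event_date).days
    # Records dated after the reference date are treated as age 0.
    return 0.5 ** (max(age_days, 0) / half_life_days)


reference = date(2024, 7, 1)  # hypothetical cutoff date
print(decay_weight(date(2024, 7, 1), reference))  # 1.0
print(decay_weight(date(2024, 6, 1), reference))  # 0.5
```

A purchase made on the cutoff date keeps its full weight, while one made 30 days earlier counts half as much.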