Why is it said that Ali engineers know fashion best

1. Introduction

The generation and recommendation of clothing collocations play an increasingly important role in the fashion industry. A clothing collocation usually consists of multiple fashion items, which need to look harmonious both visually and logically. Unlike traditional product recommendation, collocation recommendation first requires the collocations themselves to be produced, and this process is full of challenges. As a result, collocation creation usually requires the involvement of fashion experts. As of March 2018, 1.5 million content creators had joined Taobao. Even so, there is still a huge gap between the total number of collocations created and the number required by the Taobao channel. This article focuses on closing this manpower gap, aiming to generate personalized collocations for every Taobao user.

There are two requirements for the generation and recommendation of clothing collocations: 1) the generated collocations should be visually compatible; 2) the recommended collocations should be personalized. The first requirement is usually measured by a compatibility indicator. Early research computed collocation rationality between pairs of items, while recent work uses deep networks to model it. The second requirement is whether the recommended collocation matches the user's personal taste (personalization). Previous work required the user to explicitly provide a query, such as a picture of a piece of clothing, in order to recommend other clothes that match it. Our work focuses on automatically generating personalized collocations without any explicit user input, by learning from the user's historical behavior instead.

In previous studies, these two requirements were usually addressed separately. We aim to build a bridge between collocation generation and collocation recommendation, so that a single architecture can satisfy both requirements. Specifically, we capture the user's preferences from their historical behavior and generate personalized, reasonable collocations on that basis. For the rationality requirement, we propose the FOM model, which uses self-attention to learn the relevance of each item to all other items. We design a masking task: each time, one item in the collocation is masked, and the model must predict the masked item from the remaining items. For the personalization requirement, we propose the POG model, which uses a Transformer encoder-decoder framework to model both the user's historical behavior and collocation rationality. Finally, we built a platform called "Dida", with the POG model at its core, to support personalized collocation production for Taobao's iFashion channel.

We have the following four contributions:

We propose the POG model: an encoder-decoder model that generates personalized collocations while jointly considering collocation rationality and user personalization.
In offline evaluation, our model significantly outperforms other methods, raising FITB (Fill In The Blank) accuracy to 68.79% (a relative increase of 5.98%) and CP (Compatibility Prediction) to 86.32% (a relative increase of 25.81%).
We deployed the POG model on the online platform "Dida". In online experiments, POG improves CTR (Click-Through Rate) by 70% compared with the traditional CF method.
We release all datasets used in the experiments, including 1.01 million collocations, 583,000 collocation items, and 3.57 million users with a total of 280 million clicks.

2. Dataset

Taobao experts produce thousands of collocations every day, all of which are manually reviewed before going live. At the time of writing, a total of 1.43 million collocations had been created and reviewed. We selected the 80 most common clothing leaf categories (such as sweaters, coats, T-shirts, boots, and earrings) and removed items from other, low-frequency leaf categories. After this filtering, we recounted the number of items in each collocation and removed collocations with fewer than 4 items. This left a total of 1.01 million collocations containing 583,000 items.

In addition, we collected users' click behavior on fashion items and collocations in the iFashion channel over the past 3 months. The 3.57 million users who each browsed more than 40 collocations are defined as active users. Within each user's click history, every collocation click is treated as an anchor point and is kept only if it is preceded by more than 10 individual item clicks; a training sample then consists of this collocation click together with the last 50 item clicks before it. In the end, we obtained 19.2 million training samples, covering 4.46 million items and 127,000 collocations. Each item in the dataset comes with a white-background image, a title, and leaf-category information.
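As an illustration, the following Python sketch shows one way such training samples could be assembled from a single user's chronologically ordered click log; the dict format and field names ("type", "id") are hypothetical, not the actual log schema:

def build_samples(clicks, min_item_clicks=10, max_history=50):
    # clicks: one user's click log in chronological order; each click is a dict
    # like {"type": "item" or "collocation", "id": ...} (hypothetical fields).
    samples = []
    item_history = []                              # item clicks seen so far
    for click in clicks:
        if click["type"] == "item":
            item_history.append(click["id"])
        elif len(item_history) > min_item_clicks:  # a collocation click with enough context
            samples.append({
                "user_history": list(item_history[-max_history:]),  # last 50 item clicks
                "collocation": click["id"],
            })
    return samples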

To the best of our knowledge, our dataset is the largest publicly available clothing-collocation dataset with detailed item information, and it is the first to provide user behavior data on collocations. The dataset has been made publicly available.

3. Model

3.1 Multimodal Modeling

For each item f, we compute a non-linear embedding to represent its characteristics. The feature sources are mainly visual and textual. In this work, information from three sources is used: 1) a vector extracted from the item's white-background image with a CNN model; 2) a vector extracted from the item title with TextCNN; 3) the item's Graph Embedding vector from Alibaba's Behemoth platform.

We want similar items to be close in the embedding space and dissimilar items to be far apart. We therefore concatenate the vectors obtained from item images, text, and collaborative filtering, pass them through a fully connected layer, and learn the embedding with a triplet loss. For each item, items of the same leaf category are defined as positive examples and items of different leaf categories as negative examples, so the triplet loss is calculated as follows:
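As a reference, a standard triplet-loss formulation consistent with this description (the exact notation and margin value used in the paper may differ) is:

L_{triplet} = \sum_{(f, f^{+}, f^{-})} \max\big(0,\; d(e_f, e_{f^{+}}) - d(e_f, e_{f^{-}}) + \alpha\big)

where e_f is the fused embedding of item f after the fully connected layer, f^{+} and f^{-} are a positive and a negative example, d(·,·) is a distance function, and \alpha is the margin.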

Minimizing this loss forces the distance between an item and its positive examples to be smaller than the distance to its negative examples.

3.2 FOM Model

A collocation consists of a set of fashion items, and each item relates to the other items with different weights. To capture these associations between items, we design a masking task based on a bidirectional Transformer encoder. Each time, one item in the collocation is masked, and the model has to fill in the blank by choosing among several candidate items, thereby learning the relationship between the masked item and the other items. Because every item in the collocation is forced to learn from its context items, the pairwise collocation rationality between items is captured by the self-attention mechanism.

Given a collocation F = {f1, ..., ft, ..., fn}, a special token [MASK] represents the masked item, and the other, unmasked items are represented by their multi-modal embeddings. The resulting input is denoted Fmask, and the loss of the FOM model is expressed as:
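A maximum-likelihood form consistent with this description (notation is illustrative) is:

L_{FOM} = - \sum_{t=1}^{n} \log \Pr\big(f_t \mid F_{mask(t)}\big)

where F_{mask(t)} denotes the collocation F with item f_t replaced by [MASK].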

Pr(.) represents the probability that the model selects the correct masked item.

The model architecture is as follows. The model is based on the Transformer encoder, but position embeddings are not used, because we treat the items in a collocation as a set rather than an ordered sequence. First, Fmask passes through a two-layer fully connected network (which we call the conversion layer) to obtain H0, which is then fed into a Transformer encoder. The encoder consists of multiple layers, each containing a Multi-Head self-attention (MH) sub-layer and a Position-wise Feed-Forward Network (PFFN) sub-layer.

After l layers, let gmask denote the output at the [MASK] position; a softmax layer on top of gmask computes the probability of recovering the true masked item:
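A sampled-softmax form consistent with this description (notation is illustrative) is:

\Pr\big(f_t \mid F_{mask}\big) = \frac{\exp\big(g_{mask}^{\top} h_{f_t}\big)}{\sum_{f \in C} \exp\big(g_{mask}^{\top} h_f\big)}

where C is the candidate set described below and h_f is the conversion-layer embedding of candidate f.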

Here, hmask is the conversion-layer embedding of the masked item. In principle the softmax could be computed over all items, but that is far too expensive in practice, so we randomly sample 3 items from the item pool that do not appear in the collocation and add the ground-truth masked item; these 4 items form the candidate set scored by the softmax.
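As an illustration only, a minimal PyTorch-style sketch of this masked-item prediction follows; the layer sizes, candidate count, and names such as conversion_layer are placeholders rather than the production implementation:

import torch
import torch.nn as nn

class FOM(nn.Module):
    """Sketch of the FOM masking task: a Transformer encoder without position
    embeddings scores a small candidate set for the masked position."""
    def __init__(self, feat_dim=128, d_model=64, n_layers=2, n_heads=4):
        super().__init__()
        # "Conversion layer": two fully connected layers mapping multi-modal
        # item features into the model dimension.
        self.conversion_layer = nn.Sequential(
            nn.Linear(feat_dim, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        self.mask_token = nn.Parameter(torch.randn(d_model))  # learned [MASK] vector
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)

    def forward(self, item_feats, mask_pos, cand_feats):
        # item_feats: (B, n, feat_dim) multi-modal features of the collocation items
        # mask_pos:   (B,) index of the masked item in each collocation
        # cand_feats: (B, 4, feat_dim) ground-truth item plus 3 sampled negatives
        h = self.conversion_layer(item_feats).clone()        # H0
        rows = torch.arange(h.size(0))
        h[rows, mask_pos] = self.mask_token                  # insert the [MASK] embedding
        g = self.encoder(h)                                  # set input: no position embedding
        g_mask = g[rows, mask_pos]                           # output at the [MASK] position
        h_cand = self.conversion_layer(cand_feats)           # candidate conversion-layer embeddings
        return torch.einsum("bd,bkd->bk", g_mask, h_cand)    # scores; train with cross-entropy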

3.3 POG Model

Having learned collocation rationality, we then consider how to incorporate the user's personalization information so as to generate collocations that are both personalized and reasonable. We borrow the encoder-decoder structure commonly used in machine translation to "translate" a user's historical behavior into a personalized collocation. A user is represented by the sequence of items they have clicked, U = {u1, ..., ui, ..., um}, and a collocation the user has clicked is represented as F = {f1, ..., ft, ..., fn}. At each step, we predict the next collocation item based on the collocation items generated so far and the sequence of items clicked by the user. Therefore, for each (U, F) pair, the objective function of POG can be written as:
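Written out, a likelihood form consistent with this description (illustrative notation) is:

\Pr(F \mid U) = \prod_{t=0}^{n-1} \Pr\big(f_{t+1} \mid f_1, \ldots, f_t, U\big), \quad\text{i.e.}\quad L_{POG} = - \sum_{t=0}^{n-1} \log \Pr\big(f_{t+1} \mid f_1, \ldots, f_t, U\big)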

Pr(.) represents the probability of producing the (t+1)-th collocation item given the previously produced collocation items and the user's behavior.

The model architecture is as follows. In POG, the encoder takes the items clicked by the user as input. Given a special symbol [START], the decoder produces one collocation item at each step, recursively taking the output of the previous step as its input; generation stops when the special symbol [END] is produced. We call the encoder the Per network and the decoder the Gen network. The Per network provides the user's personalization information, while the Gen network combines the personalization information passed from Per with its own collocation-rationality information.

In the Per network, a two-layer fully connected network (the conversion layer) is followed by a p-layer Transformer encoder. The Gen network is first initialized from a pre-trained FOM. In the Gen network, the conversion layer is followed by a q-layer Transformer decoder, and each decoder layer consists of three sub-layers. The first is the Masked Multi-Head self-attention (MMH) sub-layer, whose mask ensures that the item currently being generated only attends to the items generated before it. The second is an MH sub-layer that brings in the user's personalization information from the Per network. The third is the PFFN sub-layer.
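For illustration, a compact PyTorch-style sketch of the Per/Gen structure follows; the layer counts p and q, dimensions, and module names are placeholders, and the FOM-based initialization is omitted:

import torch
import torch.nn as nn

class POG(nn.Module):
    """Sketch of POG: the Per network (encoder) reads the user's clicked items,
    the Gen network (decoder) produces collocation items step by step."""
    def __init__(self, feat_dim=128, d_model=64, p=2, q=2, n_heads=4):
        super().__init__()
        self.user_conv = nn.Sequential(nn.Linear(feat_dim, d_model), nn.ReLU(),
                                       nn.Linear(d_model, d_model))
        self.item_conv = nn.Sequential(nn.Linear(feat_dim, d_model), nn.ReLU(),
                                       nn.Linear(d_model, d_model))
        self.per = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True), p)
        self.gen = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, 4 * d_model, batch_first=True), q)

    def forward(self, user_feats, outfit_feats):
        # user_feats:   (B, m, feat_dim) items clicked by the user (Per input)
        # outfit_feats: (B, t, feat_dim) [START] plus collocation items produced so far
        memory = self.per(self.user_conv(user_feats))        # personalization signal
        tgt = self.item_conv(outfit_feats)
        t = tgt.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)  # MMH mask
        out = self.gen(tgt, memory, tgt_mask=causal)         # (B, t, d_model)
        return out                                           # out[:, i] scores the (i+1)-th item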

For each training log, the items clicked by the user are used as the input of Per, and the items in the collocation the user clicked are used as the ground truth for Gen. Let ht+1 denote the conversion-layer vector of the ground-truth item at step t+1; the probability of predicting the (t+1)-th item is then:
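A sampled-softmax form matching this description (illustrative notation; vt denotes the Gen output at step t, as in the prediction phase below) is:

\Pr\big(f_{t+1} \mid f_1, \ldots, f_t, U\big) = \frac{\exp\big(v_t^{\top} h_{t+1}\big)}{\sum_{f \in C} \exp\big(v_t^{\top} h_f\big)}

where C contains the ground-truth item plus the sampled negatives.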

As before, 3 randomly sampled items plus the ground-truth item form the candidate set of the softmax.

In the prediction phase, for the output vector vt at each step, we retrieve the nearest item vector from the candidate pool and output that item. Generation ends when the [END] mark appears.
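A minimal sketch of this greedy generation loop with nearest-neighbour retrieval, built on the POG sketch above (treating [END] as a designated pool entry, and dot-product similarity as the retrieval metric, are simplifications):

import torch

@torch.no_grad()
def generate_outfit(model, user_feats, start_feat, pool_feats, max_len=8, end_index=0):
    # model:      the POG sketch above
    # user_feats: (1, m, feat_dim) clicked items of one user
    # start_feat: (1, 1, feat_dim) embedding standing in for the [START] symbol
    # pool_feats: (P, feat_dim) candidate item pool; index end_index stands in for [END]
    pool_vecs = model.item_conv(pool_feats)               # map the pool into model space
    generated, feats = [], start_feat
    for _ in range(max_len):
        v_t = model(user_feats, feats)[:, -1]             # decoder output of the current step
        scores = pool_vecs @ v_t.squeeze(0)               # similarity to every pool item
        item = int(scores.argmax())                       # retrieve the nearest item
        if item == end_index:                             # stop when [END] is retrieved
            break
        generated.append(item)
        feats = torch.cat([feats, pool_feats[item].view(1, 1, -1)], dim=1)
    return generated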

3.4 Dida Platform

We deployed POG on an online platform named Dida. The Dida platform provides functions such as product selection, collocation generation, image composition, and personalized recommendation, and it supports an item pool at the million scale. As of this writing, more than 1 million Ali operations staff have generated collocations on the platform, about 6 million collocations are generated automatically every day, and about 5.4 million users have seen their personalized collocations.

4. Experiments

4.1 Collocation Rationality

Previous work on rationality evaluation typically uses two indicators, FITB and CP:
FITB (Fill In The Blank): given a collocation with one item removed, select from several candidate items the one that forms a reasonable collocation with the remaining items (a sketch of this computation follows below).
CP (Compatibility Prediction): collocations produced by experts are positive examples, and collocations composed of randomly selected items are negative examples; the model judges whether a given collocation is compatible.
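For concreteness, FITB accuracy can be computed as in the following sketch, where score_fn stands for whatever compatibility score a model assigns (a hypothetical interface, not the paper's code):

def fitb_accuracy(score_fn, questions):
    # score_fn(partial_outfit, candidate) -> compatibility score (higher is better)
    # questions: iterable of (partial_outfit, candidates, answer_index) tuples
    correct = 0
    total = 0
    for partial_outfit, candidates, answer_index in questions:
        scores = [score_fn(partial_outfit, c) for c in candidates]
        correct += int(scores.index(max(scores)) == answer_index)
        total += 1
    return correct / total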

4.1.1 Multimodal Comparison

We find that, in both the FITB and CP experiments, text is the best-performing single modality, and adding the image modality and the CF modality each further improves the results.

4.1.2 Multi-Model Comparison

We compare against several previous models: F-LSTM, Bi-LSTM, and SetNN. Note that F-LSTM and Bi-LSTM are sequence-based models that treat the input as an ordered sequence, while SetNN and FOM are set-based models that do not assume any input order. We ran two groups of experiments, with ordered and unordered input respectively:

In the FITB experiments, the sequence-based models are sensitive to input order, while the set-based models are not; both F-LSTM and Bi-LSTM perform better with ordered input. FOM performs best among all models regardless of whether the input is ordered or unordered: with unordered input it is 18.04% higher than the second-best model, Bi-LSTM, and with ordered input it is 5.98% higher.

In the CP experiments, the sequence-based models remain sensitive to input order. With unordered input, FOM is 34.90% higher than the second-best Bi-LSTM; with ordered input, FOM is 25.81% higher.

4.2 Personalized Recommendation

We verify the effectiveness of personalized recommendation through online experiments. The compared generation models include F-LSTM and Bi-LSTM; SetNN has no generation capability and is therefore not considered. In the baseline delivery strategy we call RR, generated collocations are delivered to users at random. To compare with traditional single-item recommendation, we also designed a CF strategy: a collocation is first generated with each item as a trigger, a CF method then computes each user's most-preferred item, and the collocation triggered by that item is delivered to the user. We recorded the online CTR of each method over 7 consecutive days, as shown below:

The results show that, regardless of which generation model is used, the CTR of the CF delivery strategy is higher than that of RR, and the CTR of the POG method is higher than that of CF.
