Uncovering FashionAI Technology

1. Starting from recommendation technology

User behavior

Let's start from recommendation technology. The first approach is recommendation based on user behavior: the user's clicks, browsing, and purchases. Recommendation technology has made it more efficient for users to find products and has also increased company revenue. But once recommendation efficiency reaches a certain level, bottlenecks appear. A classic example: after you buy a top, the system keeps pushing more tops at you. This problem has been criticized for years, and work that stays grounded in user behavior will evolve in the direction of fixing it.
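To make the behavior-based approach concrete, here is a minimal sketch (with made-up session data; item names are purely illustrative) of the simplest version: count which items appear together in user sessions and recommend the top co-occurring items. It also hints at why such systems drift toward near-duplicates of what you already bought.

```python
from collections import defaultdict
from itertools import permutations

def build_cooccurrence(sessions):
    """Count how often two distinct items appear in the same user session."""
    co = defaultdict(lambda: defaultdict(int))
    for items in sessions:
        for a, b in permutations(set(items), 2):
            co[a][b] += 1
    return co

def recommend(co, item, k=3):
    """Return the items most often seen alongside `item`."""
    ranked = sorted(co[item].items(), key=lambda kv: -kv[1])
    return [name for name, _ in ranked[:k]]

sessions = [
    ["top", "skirt"],
    ["top", "skirt", "coat"],
    ["top", "top"],          # repeat views of one item add no pair
    ["skirt", "coat"],
]
co = build_cooccurrence(sessions)
print(recommend(co, "top"))  # "skirt" co-occurs twice, "coat" once
```

This is pure co-occurrence with no notion of collocation logic, which is exactly the gap the rest of the talk addresses.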

User portraits

The second is the user portrait (user profile). Many people work on user insights and try to describe users with precise portraits, but I have always been skeptical of them. When someone buys clothes, you may have behavioral data: browsing, clicks, and purchases. But suppose you instead knew the user's skin-tone color number, height, weight, and body measurements: how much more accurate would that portrait be than the former? By that standard, today's so-called user insights and user portraits are actually very rough.

Knowledge graphs

Third, we can build knowledge graphs to support related-item recommendation: buy a fishing rod and get recommended other fishing gear; buy a car light and get recommended other auto accessories. But as of today, the results of association recommendation are not good enough, and many difficulties remain.

These are the things recommendation techniques usually consider. So let's use clothing recommendation to see what other possibilities exist. In an offline clothing store, what is the core metric for a shopping guide? Related purchases. When a customer buys a single piece of clothing on their own, it does not count toward the shopping guide's contribution; the guide's performance is measured by getting the customer to buy other, related pieces. So what matters is the related purchase, and the core logic behind related purchases is outfit matching (collocation). When we make recommendations in a specific domain, there is recommendation logic particular to that domain, the logic that plays out in daily life.

2. Why do we need to rebuild industry knowledge?

Next, let's look at what it takes to put together a good outfit. The reason most users cannot match clothes well is that matching requires a great deal of knowledge and experience. The attributes and design elements of a garment are the starting point, and they must be sufficiently accurate and rich; if they are not, reliable matching is impossible.

The typical knowledge graph connects many knowledge points through human experience or user data, and the knowledge points themselves are mostly generated from common sense. For example: I am a person; these people are my friends; those are my superiors. The knowledge point "me" is generated through common sense.

There is another class of methods: expert systems. Every field has its experts, such as doctors in medicine, and an expert system encodes the professional experience they have accumulated. Expert systems were roughly the standard approach in artificial intelligence before knowledge graphs rose to prominence.

Beneath the relationships there is another, more basic layer: the knowledge points themselves. If the knowledge points have problems, the knowledge relationships built on top of them will too. Running AI algorithms on such a foundation gives results that are not good enough, which may be one reason artificial intelligence is hard to put into practice. One must have the courage to rebuild the knowledge-point system itself.

To give an example from Taobao: the upper part of the picture below shows the knowledge system used by our operations staff and designers. The example is "collar type", covering round neck, slanted collar, sailor collar, and so on. You can see that the structure is flat and scattered. Knowledge used to be passed from person to person, and within a small circle, say a group of designers, it could stay very vague as long as the members could still communicate. It is like a doctor's scrawled handwriting: other doctors can read it, but patients cannot. A great deal of knowledge exists only for communication between people and carries plenty of ambiguity and incompleteness. Take clothing style: one label is called "workplace style" and another "neutral style", and the two are visually indistinguishable. If human vision cannot tell two labels apart, yet a machine claims recognition accuracy above 80%, then something must be wrong.

There is another category of problem: the people applying labels may misunderstand them. To take an extreme example, at one time half of the women's clothing on Taobao was labeled "Korean style" by merchants. Most of it was not Korean style at all; merchants used the label simply because Korean style sold well. This shows that merchant labels cannot be fully trusted, and that attributes need to be judged directly from the image.

3. Knowledge reconstruction for machine learning

Over the past few years we approached Taobao's and Tmall's clothing operations teams and consolidated several editions of their operational knowledge into something systematic, but it was still not good enough. Last year we held the FashionAI competition and cooperated with the clothing department of Hong Kong Polytechnic Institute, and later with Beijing Institute of Fashion Technology and Zhejiang Polytechnic Institute. In fact, the knowledge system handed over directly by clothing experts is not sufficient, because what we need is a knowledge system oriented toward machine learning. For a machine, everything must reduce to clear-cut 0/1 distinctions, and the principles we summarized (completeness, freedom from ambiguity, visual separability, and so on) must be satisfied as far as possible.

We organize the once-scattered knowledge according to a division logic. Take the collar: we divide it by fabric, design method, and neckline edge, summarizing the scattered knowledge points along several dimensions. What began as loose sand eventually becomes a tree-structured body of knowledge. We have sorted out the commonly used attributes of women's clothing, 206 types in total, not counting the open, constantly expanding attribute of "popular design techniques". This organizing work is far more complicated than people imagine; it took three to four years. Beyond the knowledge itself, we also had to weigh the difficulty and necessity of collecting data for each knowledge point. For example, the suit collar in women's clothing can be subdivided into nine types that are almost visually inseparable. In that case it is enough to stop at the granularity of "women's suit collar" and subdivide no further.
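A sketch of what such a tree-structured attribute system might look like in code. The category names below are a tiny illustrative fragment invented for this example; the real taxonomy of 206 attributes is far larger and organized differently.

```python
# Hypothetical fragment of a tree-structured clothing attribute taxonomy.
taxonomy = {
    "collar": {
        "by neckline shape": ["round neck", "V-neck", "slanted collar"],
        "by design method": ["sailor collar", "suit collar"],
    },
    "sleeve": {
        "by silhouette": ["bell sleeve", "puff sleeve"],
    },
}

def leaf_attributes(tree, path=()):
    """Walk the tree, yielding (path, label) for every leaf attribute."""
    for key, value in tree.items():
        if isinstance(value, dict):
            yield from leaf_attributes(value, path + (key,))
        else:
            for label in value:
                yield path + (key,), label

labels = [label for _, label in leaf_attributes(taxonomy)]
print(len(labels))  # 7 leaf labels in this toy fragment
```

The point of the tree is that every label sits under an explicit division dimension, rather than in one flat, scattered list.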

Sometimes it is hard to judge in advance whether an attribute can yield a good model, and the attribute's definition then has to be iterated over multiple rounds: discover a problem with the definition, go back and redefine it, re-collect data, and retrain the model until it meets the requirements. After knowledge reconstruction was completed, recognition accuracy on more than a dozen attributes generally rose by 20%, a very large improvement.

We now have 206 women's clothing styles, 166 semantic colors, and knowledge systems for materials, scenes, temperatures, and more. How do you define color? In the fashion industry, "yellow" is almost meaningless; what is meaningful is "lemon yellow", which was popular in women's clothing last year. RGB space has 256 × 256 × 256 colors, and the Pantone color table has 2,310 colors related to clothing, but that table is all color numbers, which consumers cannot understand. On top of it we built a layer of 560 colors with semantic correspondence, determined together with Beijing Institute of Fashion Technology. Clustering clothes at that granularity is still too fine, so we built another layer of 166 colors, with names like "lemon yellow" and "mustard green". These are semantic colors, and only at this level can consumers understand them.
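One simple way such a semantic layer can work is nearest-neighbor lookup from a raw RGB value into a named palette. The palette values below are hypothetical stand-ins, not the actual 166-color system described in the text.

```python
# Hypothetical semantic palette: name -> representative RGB value.
PALETTE = {
    "lemon yellow": (250, 233, 90),
    "mustard green": (189, 183, 107),
    "navy blue": (0, 0, 128),
}

def nearest_semantic_color(rgb):
    """Map an arbitrary RGB triple to the nearest named palette color,
    by squared Euclidean distance in RGB space."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(rgb, c))
    return min(PALETTE, key=lambda name: dist2(PALETTE[name]))

print(nearest_semantic_color((255, 240, 100)))  # lemon yellow
```

A production system would use a perceptually uniform color space rather than raw RGB distance, and would have to handle the lighting and chromatic-aberration issues mentioned below.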

There are still many technical details, such as how to handle lighting and chromatic aberration, and many difficulties with them. Here I will mainly talk about knowledge reconstruction for machine learning.

4. AI makes the big project of knowledge reconstruction feasible

The next question: given 206 styles of women's clothing, how do we collect enough data to train the models, especially when a single definition may need multiple rounds of iterative correction?

For example, the sleeve style in the picture below is called a wind-bell sleeve. A qualified dataset needs roughly 3,000 to 4,000 pictures, and collecting enough high-quality images is a big challenge. In 2016, producing a high-quality dataset of 3,000 to 4,000 images required labeling more than 100,000 images; the annotation retention rate was only 1.5%. Our method then was similar to academia's: search for many pictures with a keyword, then have people label them. Worse, you often simply cannot find enough pictures of wind-bell sleeves, because unlabeled images cannot be found by search in the first place. So knowledge reconstruction really was a huge challenge, and no one had the courage to attempt it before, because it simply could not be done.
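The labeling arithmetic is easy to check. Taking the text's 1.5% retention rate at face value, a 3,000-image dataset implies on the order of 200,000 raw images to label:

```python
def raw_images_needed(target_size, retention_rate):
    """Raw images to label so that `target_size` survive annotation,
    at the given retention rate (e.g. 0.015 for 1.5%)."""
    return round(target_size / retention_rate)

print(raw_images_needed(3000, 0.015))  # 200000
print(raw_images_needed(4000, 0.015))  # 266667
```

Which is consistent with the "more than 100,000 images" figure above, and shows why the collection cost dominated.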

In 2016 it took us 200 days to complete one attribute-recognition model, including the time spent iterating on the definition. In 2017 we needed 40 days; in 2018, 2.5 days; now it takes about 15 hours, and by the end of 2019 we plan to cut it to half a day. This is a huge change, and it is why we proposed "few-shot learning". About three years ago few people in academia were raising this issue, but we saw it clearly, because it was what we were suffering from, and we had to start solving it.

When academia talks about "few-shot learning" and small-data learning, the emphasis is mostly on obtaining a good model directly from a small number of samples. The path we chose is different; we approach the problem from the side.

Today we have covered the 96 most commonly used attributes of women's clothing using our few-shot learning tool SECT (Small, Enough, Comprehensive), going from "few" to "enough" to "good enough". Most importantly, SECT has not only proven itself in the FashionAI business; it can also handle general content recognition, or more precisely, it performs well on tasks such as simple content classification.

In general content recognition, we have used the SECT system to complete more than 70 tag-recognition tasks, with tags such as "illustration", "balcony", and "shoes on foot". It has begun to change how business staff and algorithm staff work together. Before deep learning, business staff did not dare ask algorithm staff for a recognition model, because the development cycle was too long: to recognize something, the business side had to sit down and discuss it with the algorithm side, the algorithm side hand-designed features, and shipping a production-ready model took at least half a year to a year. After deep learning took off in 2013, the problem changed shape. Algorithm staff would now say: with deep learning, you only need to collect enough pictures and I will design a good model for you; if the model is not good, the quality of the data you collected is not good. But when operations staff actually tried to collect 5,000 pictures, they found the cost was still very high.

It is still difficult for us to use SECT to solve the "detection" problem in machine vision. Or rather, the detection task is not a "few-shot" problem as we understand it; it should be called a "weak supervision" problem under the detection task, and weak supervision is different from few-shot learning.

5. Outlook for the future

As I understand it, big data should be divided into two kinds. One is data where your business insights and pattern analysis can only be done at large scale; that is real big data. The other exists only because today's machine learning is weak and needs that much data to produce a model; I call that pseudo-big data, because as AI grows stronger and stronger, the number of samples needed is bound to shrink.

In the past, some companies advertised how much data they had, face data for example, and treated data as an asset. That claim will gradually lose its force, because AI capabilities keep getting stronger and the amount of data we need keeps getting smaller. How far will SECT evolve? Perhaps mid-level and junior algorithm engineers will no longer be needed: a business user feeds the system a dozen or so pictures (no more than 50), a model comes back shortly, they test whether it works, and if not, they iterate until it does. It is no longer the old world where the labeling phase, the training phase, and the testing phase were far apart. The whole iteration keeps getting faster, and if it can be cut to the hour or minute level, this effectively becomes a human-computer interactive learning system, which will bring great changes.
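The loop described above can be sketched schematically. Everything here is a stand-in: `train_stub` and `evaluate_stub` fake the real SECT pipeline (with accuracy pretending to improve as labels accumulate), purely to show the shape of the human-in-the-loop iteration.

```python
def train_stub(examples):
    """Stand-in for model training; the real system returns a trained model."""
    return {"n_examples": len(examples)}

def evaluate_stub(model, target_accuracy=0.9):
    """Stand-in for evaluation: pretend accuracy grows with labeled data."""
    return min(0.5 + 0.05 * model["n_examples"], 1.0) >= target_accuracy

def interactive_learning(seed_examples, fetch_more, max_rounds=10):
    """Iterate: train, test, and ask the human for a few more labels
    until the model is good enough or we run out of rounds."""
    examples = list(seed_examples)
    for round_no in range(1, max_rounds + 1):
        model = train_stub(examples)
        if evaluate_stub(model):
            return model, round_no
        examples.extend(fetch_more())  # human labels a few more images
    return model, max_rounds

model, rounds = interactive_learning(
    seed_examples=range(5),        # start with a handful of pictures
    fetch_more=lambda: range(2),   # two extra labels per round
)
print(rounds)
```

The interesting property is that when each round shrinks to hours or minutes, the boundary between "labeling", "training", and "testing" phases disappears, which is exactly the change the text anticipates.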

An operator on the Taobao content platform said that more models have been produced in the past two months than in the previous three years. The algorithm students in our own group also use it to solve problems beyond attribute recognition. For example, just before I came to Silicon Valley, students in the group needed to identify whether the person in a photo was standing or sitting, whether the person was dark-skinned or yellow-skinned, and so on: six discriminative models to be produced in a short time. Today we can put such models online within a week or two, with accuracy, recall, and generalization all meeting requirements. In the past, this would have been impossible without a year or so.

Many people in the industry summarize the limitations of deep learning as needing big data and lacking interpretability. I think that in the next few years we will arrive at a new understanding of what a "sample" is and what "interpretable" means. Last year we published an article called "How to Make a Practical Image Dataset" in "Visual Exploration", edited by Mr. Zhu Songchun. This year we plan to write a sequel, "How to Make a Practical Image Dataset (2)", focusing on our experience with and outlook on few-shot learning.
