The Secret Behind Taobao's AI-Powered Personalized Recommendations

This article introduces Alibaba's Artificial Intelligence Online Serving (AI OS) and the evolution of its technical architecture and practices.

By Shen Jiaxiang (Wufu), senior researcher in the Search and Recommendation Division of the Alibaba Group

1) Introduction to AI OS

This article describes Artificial Intelligence Online Serving (AI OS) and the evolution of its technical architecture, provides a technical overview, and introduces cloud-native products and practices.

AI OS is an online service platform developed by the Alibaba search engineering team that integrates personalized search, recommendation, and advertising. The AI OS engine system supports various business scenarios, including all Taobao Mobile search pages, Taobao Mobile information flows (You May Like), venues for major promotion activities, product recommendations on the Taobao homepage, and personalized recommendation and product selection by category and industry. It covers more than 80% of Taobao Mobile users. AI OS uses a single set of technologies to support search, recommendation, and advertising, which is rare among large Internet companies. Alibaba's platform technical strategy essentially rests on two pillars: e-commerce technology and big data AI technology. The e-commerce OS includes commodity management, category management, operation management, and transaction links. In the era of big data and deep learning, AI delivery, search recommendations, and ad serving have become technical scenarios independent of traditional e-commerce. In addition to Taobao Mobile, AI OS supports e-commerce scenarios across the Alibaba Group, such as Lazada in Southeast Asia, Juhuasuan, Fliggy, Youku, DingTalk, Cainiao, HeMa, Eleme, and Koubei, and it even cooperates with Alipay within the Alibaba economy.


In the deep learning era, the AI OS engine system architecture has evolved significantly. Unlike other Internet companies, however, Alibaba does not develop deep learning technologies independently of its search and recommendation systems. From the technical level to business scenarios, AI OS integrates search, recommendation, information flow, advertising, and deep learning to form a basic engine platform. The platform helps each part develop based on the other parts.

The following figure shows the technologies and concepts involved in AI OS based on AI OS business scenarios.


The first layer displays basic capabilities required by the distributed engine system. Recall and sorting are required for search, recommendation, and advertising. Distributed communication, high-performance index storage, and capabilities for efficient and flexible indexing and updating are required after the distributed engine system is expanded.
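The two-stage pattern of recall followed by sorting can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy documents, the function names, and the scoring rule (count of matched query terms) are not taken from AI OS, which runs these stages over distributed indexes.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def recall(index, query):
    """Recall stage: collect every document that shares at least one query term."""
    candidates = set()
    for term in query.lower().split():
        candidates |= index.get(term, set())
    return candidates

def rank(docs, candidates, query, top_k=2):
    """Sorting stage: score each candidate by the number of matched query terms."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(docs[d].lower().split())), d) for d in candidates]
    # Highest score first; ties are broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [d for _, d in scored[:top_k]]

# Hypothetical toy corpus.
docs = {
    "d1": "red running shoes",
    "d2": "blue running jacket",
    "d3": "red dress",
}
index = build_inverted_index(docs)
candidates = recall(index, "red shoes")        # {"d1", "d3"}
results = rank(docs, candidates, "red shoes")  # ["d1", "d3"]
```

Recall is deliberately cheap and broad, while sorting spends more effort on a much smaller candidate set. In a production engine the sorting stage would typically be a learned model rather than a term count.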

The intermediate layer represents the technical and scenario requirements of the deep learning era. For example, deep learning requires sample processing, training, and online prediction, and their application scenarios are personalized delivery. Personalized delivery is represented in search, recommendation, and advertising. Indexes need to support real-time updates, which is important in e-commerce systems.

The last layer contains the resource management, high availability, computing engine support, O&M control, and plug-in support for AI OS.

The following figure shows the technologies of the AI OS engineering system.


The underlying layer is Hippo, which is an effective resource management system.

The top layer includes Taobao, cloud, and advertising businesses, which have been growing in recent years and have gradually migrated to AI OS. Many Alibaba technologies and businesses are developed in a bottom-up model. With a strong sense of innovation, we build platform-based search and recommendation from the bottom up until they reach the 70% to 80% mark, and then promote them to a strategic level to accelerate mid-end building and form a full-coverage layout.

The right part of the figure shows the AI OS middleware, which contains basic components related to actual business features, including:

Service positioning: A system that runs tens of thousands of machines needs to have its own mechanism for service positioning.

Service monitoring: This component provides service monitoring at a granularity of seconds. The metrics of internal applications are important for distributed system debugging.

Index distribution: This is an important basic component for engine data updates.

Message queue: This is a high-performance message component built using fragmented host resources. It features low CPU consumption and network throughput and is basically a free component.

Layer-2 scheduling and auto scaling: These are important methods for internal minute-level resource scheduling for search, recommendation, and advertising during major promotion activities.

The algorithm, offline, training, and computing platforms on the left were developed in the deep learning era. Sample and feature processing involves the algorithm platform Nebular, which needs to interwork with the training engine X-Deep Learning (XDL). The computing platform provides support for algorithm samples and sample training and is also a powerful technical fulcrum in the Alibaba Group. It grew up with search, and the two promote and support each other.

The middle part shows the most important technical developments in recent years and is closely related to businesses. With Mobile Intelligence, we do more than simple recommendation changes and mixed result sorting: we train deep models and run predictions with them on terminals. Taobao Mobile information flows are the largest application scenario for deep learning, training, and prediction in the world. This is our special area of exploration.

The HA3 search engine is Alibaba's classic engine, which supports full-text retrieval. The commercialization engine is a recall engine that supports advertising businesses, keyword matching with ads, and targeted scenario push. The graph engine iGraph is a large graph retrieval engine that supports online graph computing and retrieval. It has multiple online derivation capabilities, including personalized user relationships and knowledge graph. These engines support real-time data update because the AI OS framework supports data and update management. The AI OS framework depends on the technologies on the right and has extended deep learning capabilities.

2) Evolution of the AI OS Technical Architecture

The AI OS architecture can be used as a reference for startups whose scale expands gradually.


Although AI OS has been in development for over 10 years, it originally focused only on the Taobao search business. From 2013 to 2015, we optimized search engine performance and developed a platform-based search engine within the Alibaba system. Search was built on the classic architecture of query processing + search engine + summary service. The query processing part involves personalized storage, which was implemented with simple key-value (KV) stores. This architecture is adopted by many startups and is also a classic cloud product solution.
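As a rough illustration of the query processing + search engine + summary service architecture described above, the following sketch wires three small functions together. The in-memory dictionaries, item ids, user ids, and the category preference rule are hypothetical stand-ins; the real services are separate distributed systems.

```python
# Hypothetical in-memory stand-ins for the three services.
USER_PROFILE_KV = {            # personalized storage kept as a simple KV
    "u1": {"preferred_category": "shoes"},
    "u2": {"preferred_category": "jackets"},
}
INDEX = {"running": [101, 102], "shoes": [101, 103]}   # term -> item ids
CATEGORY = {101: "shoes", 102: "jackets", 103: "shoes"}
SUMMARIES = {101: "Running shoes", 102: "Running jacket", 103: "Leather shoes"}

def process_query(user_id, raw_query):
    """Query processing: normalize the query and attach the user's profile."""
    profile = USER_PROFILE_KV.get(user_id, {})
    return {"terms": raw_query.lower().split(), "profile": profile}

def search(query):
    """Search engine: return ids matching all terms, preferred category first."""
    postings = [set(INDEX.get(term, [])) for term in query["terms"]]
    matched = set.intersection(*postings) if postings else set()
    preferred = query["profile"].get("preferred_category")
    # False sorts before True, so items in the preferred category come first.
    return sorted(matched, key=lambda item: (CATEGORY[item] != preferred, item))

def summarize(item_ids):
    """Summary service: turn item ids into display text."""
    return [SUMMARIES[item] for item in item_ids]

def handle_request(user_id, raw_query):
    return summarize(search(process_query(user_id, raw_query)))

print(handle_request("u1", "Running"))  # ['Running shoes', 'Running jacket']
print(handle_request("u2", "Running"))  # ['Running jacket', 'Running shoes']
```

The same query returns a different order for the two users because the only personalized state is the KV lookup in the query processing step, which is what makes this layout simple to operate.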

From 2015 to 2018, the information flow business emerged, and we abstracted and accumulated data at the underlying search layer (Suez or AI OS framework) and derived the graph engine, prediction engine, search engine, and recommendation engine, which formed the main framework of AI OS. In this process, we also unified the basic search and information flow frameworks of the entire group. This process relied on the promotion of the bottom-up search platform within the Alibaba Group. It was incorporated into the group strategy after proving itself in the industry and earning the admiration of the group.

From 2018 to 2019, we promoted the full-graph architecture based on the deep learning open-source framework TensorFlow. In the deep neural network (DNN) iteration process, directed acyclic graphs (DAGs) in the full-graph architecture describe businesses in a more standardized and general way. We promoted this full-graph architecture to all business lines, including deep learning, business logic adjustment, scenario iteration, and feature adjustment (such as coarse sorting, refined sorting, statistics, and filtering). This architecture dramatically improved the business iteration efficiency.

Previously, code-level plug-in development, such as C++ or Java plug-ins, was used to customize the business logic. Although this met business needs, it involved high maintenance and upgrade costs. The DAG operator-based expression solves this problem and has been an important technical breakthrough of the past two years. Once the operator graph is defined, only the operator implementations need to change, so the operator graph remains unchanged across version upgrades. As a result, the coupling between business iteration and platform upgrades is greatly reduced.
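The idea that the operator graph stays fixed while operator implementations change can be sketched in a few lines. This is only an illustration using Python's standard `graphlib` module (Python 3.9+), not the TensorFlow-based full-graph architecture itself; the node names reuse stages mentioned in the text, and the operator bodies and item fields are invented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The graph names operators and their dependencies (node -> predecessors).
# It describes the business flow and does not change between versions.
GRAPH = {
    "recall": [],
    "coarse_sort": ["recall"],
    "filter": ["coarse_sort"],
    "fine_sort": ["filter"],
}

def run_graph(graph, operators, request):
    """Execute operators in dependency order, feeding each its predecessors' outputs."""
    outputs = {}
    for node in TopologicalSorter(graph).static_order():
        inputs = [outputs[dep] for dep in graph[node]]
        outputs[node] = operators[node](request, *inputs)
    return outputs

# Version 1 of the operator implementations (hypothetical logic).
OPERATORS_V1 = {
    "recall": lambda req: list(req["items"]),
    "coarse_sort": lambda req, items: sorted(items, key=lambda it: -it["sales"]),
    "filter": lambda req, items: [it for it in items if it["in_stock"]],
    "fine_sort": lambda req, items: [it["id"] for it in items][: req["top_k"]],
}

# Version 2 swaps one implementation; GRAPH itself is untouched.
OPERATORS_V2 = {
    **OPERATORS_V1,
    "coarse_sort": lambda req, items: sorted(items, key=lambda it: -it["ctr"]),
}

request = {
    "top_k": 2,
    "items": [
        {"id": "a", "sales": 10, "ctr": 0.1, "in_stock": True},
        {"id": "b", "sales": 5, "ctr": 0.3, "in_stock": True},
        {"id": "c", "sales": 20, "ctr": 0.2, "in_stock": False},
    ],
}
result_v1 = run_graph(GRAPH, OPERATORS_V1, request)["fine_sort"]  # ["a", "b"]
result_v2 = run_graph(GRAPH, OPERATORS_V2, request)["fine_sort"]  # ["b", "a"]
```

Because business logic is expressed as the graph and the platform owns the executor, either side can be upgraded without editing the other, which is the decoupling the paragraph above describes.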

In this process, we also used the search and recommendation technologies in interesting scenarios, such as the Cainiao logistics engine, which is essentially graph retrieval and computing. This engine grew up with our engine system, including the graph engine iGraph. In this scenario, the engine supports the flow and route optimization of hundreds of millions of parcels each day. In addition, DingTalk messages are encrypted in each link from top to bottom, which cannot be achieved in common search engines. The encryption is implemented by engine iteration. In addition to deep learning, we gradually introduced SQL capabilities.

As the mid-end strategy was further implemented in the group, software abstraction and capability derivation developed on the cloud. We have connected our businesses with those of Ant Financial and found new breakthrough points. We also put platform-based, general-purpose ideas into practice, use the most effective methods to solve problems, and develop classic application products, such as AIRec (one-stop selected product delivery system) and OpenSearch (one-stop intelligent content search service). The AIRec platform supports 1000+ personalized scenarios in the Alibaba Group, covering Taobao, Tmall, Juhuasuan, and Double 11 business lines, and supports more than 1 billion instances of selected product delivery. OpenSearch offers high search quality and supports large-scale search businesses in the form of a product. It covers most search businesses in the Alibaba Group and supports more than 10,000 business applications. During Double 11, OpenSearch played an important role and supported a peak of 1 million QPS.

3) AI OS Technical Overview

This chapter describes the main AI OS components.

E2E Deep Learning Platform: Nebular and AOP


Deep learning has greatly accelerated model engineering development. Model iteration became more and more frequent, and the network structure became more and more complex, which posed great challenges to the algorithm iteration efficiency, data computing efficiency, and model delivery reliability. We proposed and built the one-stop deep learning modeling platform Nebular for large-scale business scenarios. With Nebular, you can quickly complete the whole algorithm modeling process from feature introduction and sample feature change to model training, evaluation, and delivery. Nebular provides a comprehensive data model verification system to ensure that your offline modeling and model delivery have production-level reliability. Nebular supports full, incremental, and online learning and uses high-level abstraction to switch between different learning modes at low cost.

Large Distributed Deep Learning Framework: XDL


XDL is a distributed deep learning framework based on the open-source framework for scenarios, such as advertising, search, and recommendation. It is specially designed and optimized for high-dimensional sparse features, Internet structured data, and structured models. XDL is the core driving force for intelligent marketing AI technology. It supports deep understanding of users and the intelligent delivery of various marketing products, such as Alimama Express, Branding Ads, and Super Recommendation.

Prediction Engine: RTP


In the traditional incremental learning mode, the current model is restored, trained further, and updated daily. Alternatively, the current model is trained in real-time stream mode and updated hourly. In both cases, the new model takes effect only after a full switch. The AI OS-based prediction engine RTP integrates TensorFlow capabilities, supports real-time updates of large deep models, and makes full use of the real-time data distribution to improve the precision of click-through rate (CTR) and conversion rate (CVR) estimation and, in turn, business results. Model features can be updated in real time, and models can be trained in incremental mode. RTP decomposes online graphs, extracting the model parameter weights that can be updated so that a large connected, executable subgraph is formed, and sends model data messages in real time. This links offline stream model training with online real-time updates. As a result, the online model update cycle is shortened from hours to minutes, and the model takes effect in seconds rather than minutes.
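To make the contrast with a full model switch concrete, here is a minimal sketch of applying incremental parameter updates delivered as messages to a model that keeps serving. It is a plain-Python toy under stated assumptions, not RTP or TensorFlow code: the `OnlineModel` class, the message format with `version` and `weights` fields, and the sparse logistic scoring are all hypothetical.

```python
import math

class OnlineModel:
    """Toy sparse CTR model whose weights can be patched while it serves."""

    def __init__(self, weights, bias=0.0):
        self.weights = dict(weights)
        self.bias = bias
        self.version = 0

    def predict_ctr(self, feature_ids):
        """Logistic score over the weights of the active sparse features."""
        logit = self.bias + sum(self.weights.get(f, 0.0) for f in feature_ids)
        return 1.0 / (1.0 + math.exp(-logit))

    def apply_update(self, message):
        """Apply one incremental update message; stale messages are ignored."""
        if message["version"] <= self.version:
            return False
        # Only the weights named in the message change; the rest stay in place.
        self.weights.update(message["weights"])
        self.version = message["version"]
        return True

model = OnlineModel({"user:u1": 0.0, "item:i1": 0.0})
features = ["user:u1", "item:i1"]
before = model.predict_ctr(features)   # 0.5, since all weights are zero

# A hypothetical message produced by stream training.
applied = model.apply_update({"version": 1, "weights": {"item:i1": 1.0}})
after = model.predict_ctr(features)    # higher than before

# Replaying an old message has no effect.
stale = model.apply_update({"version": 1, "weights": {"item:i1": -5.0}})
```

Only the changed weights travel over the wire, so an update can take effect as soon as its message is applied instead of waiting for a complete model to be rebuilt and switched in.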

Full-Graph Recommendation Engine: TPP


The Personalization Platform (TPP) provides open and consistent solutions for many personalized businesses in the Alibaba Group. It enables search and recommendation technologies to easily serve business development and allows businesses to quickly find the technologies they need on the platform. TPP is one of the entry points of AI OS. You can compile solution code on TPP and provide external services in the form of scenarios. You do not need to worry about applying for host resources or about the application deployment structure, and you do not have to write the service framework. You only need to implement your recommendation logic functions and manage the solution lifecycle, from compilation and debugging to release, on the TPP product page.
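The division of labor described above, where the business writes only the recommendation logic and the platform supplies the serving framework, can be sketched as follows. This is not the TPP API: the `scenario` decorator, the `serve` function, the scenario name, and the context fields are all hypothetical names chosen for illustration.

```python
SCENARIOS = {}  # platform-side registry: scenario name -> recommendation function

def scenario(name):
    """Platform-provided decorator that registers a recommendation function."""
    def register(func):
        SCENARIOS[name] = func
        return func
    return register

# --- Business side: only the recommendation logic is written here. ---
@scenario("you_may_like")
def recommend(context):
    """Return unseen candidates, preserving candidate order."""
    seen = set(context["viewed"])
    unseen = [item for item in context["candidates"] if item not in seen]
    return unseen[: context["size"]]

# --- Platform side: routing and response wrapping. ---
def serve(name, context):
    """Look up the scenario and wrap its result in a response."""
    if name not in SCENARIOS:
        return {"status": "unknown_scenario", "items": []}
    return {"status": "ok", "items": SCENARIOS[name](context)}

response = serve(
    "you_may_like",
    {"viewed": ["a"], "candidates": ["a", "b", "c"], "size": 1},
)
# response == {"status": "ok", "items": ["b"]}
```

In a real platform, resource allocation, deployment, and release management would sit behind `serve`, which is why the scenario author never touches them.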

4) AI OS Cloud-Native Products and Practices

Starting from 2014, we have gradually promoted technical capabilities accumulated in the group to external customers. The following figure shows the product portfolio built based on the AI OS system.



OpenSearch is an O&M-free, one-click cloud platform based on Alibaba AI OS technologies. It is a service-based and product-based search platform that completely hides the complexity of the underlying search system and supports businesses in the form of a standardized product. OpenSearch offers high search quality, and the search performance can be optimized online. You only need to submit content and set a configuration to enjoy search capabilities. You can flexibly configure and select correlations, modify queries, customize word segmentation, and submit industry dictionaries.

We used OpenSearch to unify middle and long tail search businesses in the Alibaba Group. The number of businesses that configured self-service access has reached 1,000, covering all the business units of the Alibaba Group. After several years of engagement, we attracted thousands of users in Alibaba Cloud, including users with typical application scenarios, such as content, e-commerce, and video.


AIRec was developed based on the Alibaba Group AIRec platform and provides one-stop personalized recommendation services. After you submit content and user behavior, AIRec ensures data privacy, applies personalized recommendation algorithms based on different industry templates, and adjusts the performance in real time to provide services. In addition to Alibaba's traditional e-commerce field, we have also made considerable algorithm investments in other industries, such as content and video. After leveraging internal technologies, we can naturally develop new scenarios in a more abstract and general way.

In this process, users must overcome data source tracking specifications and barriers to use. However, we are committed to providing easy-to-use products. In addition, we provide monitoring, debugging, and operation intervention policies in our products to help users improve performance. In the future, AIRec may become an absolute necessity for new small and medium enterprises.


The Elasticsearch service based on the open-source ecosystem was jointly launched by Alibaba and Elastic. Adhering to the open ecosystem, we integrate the flexible and easy-to-use Elasticsearch with our stable, efficient scheduling and control system and continuously iterate in-house innovative features based on user requirements. We will provide whatever users need. Elasticsearch adapts to our system, and they support and promote each other.

Under these three vertical products is the basic cloud technology we have built up, such as ElasticFlow. Data needs to be joined before being imported to the search engine. For example, Elasticsearch needs ElasticFlow. In addition, it ensures the offline out-of-the-box capability of OpenSearch. At this layer, we need detailed computing engine capabilities. The computing engine customizes data collection, development, sharing, and model training capabilities for search and recommendation on Alibaba's computing platform. These capabilities horizontally streamline and share products on the platform and also indicate the technical capabilities accumulated in the AI OS system.

The lower layer is the control platform, the basic search platform, and Alibaba Cloud basic products. We have some ecosystem products at this layer.

The following describes several typical user cases.


During Chinese New Year, we entered into a partnership with Tomorrow Advancing Life (TAL). Its live stream cloud education platform experienced a spike in traffic during the partnership period. It used Alibaba Elasticsearch, Logstash, Kibana, and Beats products. During the coronavirus (COVID-19) epidemic, TAL's peak traffic doubled, and their Elasticsearch clusters needed to be scaled out. We scaled out clusters in minutes, which ensured an excellent experience for TAL users. Doubling the resource scale normally requires users to provision resources and expand data, and then to return the resources when they are no longer needed. Our auto scaling feature can meet TAL's requirements and help TAL minimize costs.
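The scale-out and scale-in decision behind a case like this can be sketched as a small sizing rule. This is a generic illustration, not the Alibaba Cloud Elasticsearch auto scaling logic: the function name, the per-node capacity, the 20% headroom, and the minimum node count are assumptions.

```python
def target_nodes(current_qps, qps_per_node, min_nodes=2, headroom_pct=20):
    """Return the node count needed to serve current_qps with spare headroom.

    Integer arithmetic is used so that the ceiling is exact.
    """
    numerator = current_qps * (100 + headroom_pct)
    denominator = 100 * qps_per_node
    needed = -(-numerator // denominator)  # ceiling division
    return max(min_nodes, needed)

# Hypothetical capacity of 300 QPS per node.
normal = target_nodes(1000, 300)    # 4 nodes
peak = target_nodes(2000, 300)      # 8 nodes when traffic doubles
off_peak = target_nodes(100, 300)   # falls back to the 2-node minimum
```

Running the same rule as traffic falls is what returns resources automatically, so the cluster is only as large as the current load requires.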


The preceding figure shows a typical e-commerce app. This app uses OpenSearch, focuses on sports fashion, has many followers, and pays special attention to search performance. The customer previously built its own search system. However, it lacked experience in scaling and search sorting, and its no-result rate was 60%. Later, we worked with the customer to optimize their search system by using OpenSearch and Alibaba's search algorithms, including word segmentation, query semantic understanding, and query modification. After the optimized search was released, the no-result rate decreased by 80% and the transaction conversion rate improved by 9%. Search optimization promoted an increase in Gross Merchandise Volume (GMV).
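The no-result figures above are easier to read as a worked calculation. A rate of 60% cannot fall by 80 percentage points, so the sketch below assumes the 80% is a relative decrease; the variable names are illustrative only.

```python
from fractions import Fraction

no_result_before = Fraction(60, 100)  # 60% of queries returned nothing
relative_drop = Fraction(80, 100)     # assumed to be a relative decrease

# Remaining no-result rate after the optimization.
no_result_after = no_result_before * (1 - relative_drop)

print(f"{float(no_result_after):.0%}")  # 12%
```

Under that reading, roughly one query in eight still returns no result, down from three in five.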


The preceding figure shows the well-known IT community CSDN, which uses OpenSearch and AIRec. It previously submitted URLs to a third party, which crawled the pages and provided the indexes. With these indexes, it relied on the large search engine for queries, restricted to its own website. This is a common approach. However, in this mode, the ability to monetize traffic can be restricted by other parties. In addition, the large search solution cannot optimize the search performance. Submitted URLs may not be indexed, so result recall and correlation cannot be ensured. After the OpenSearch solution was used to customize these capabilities, the search performance surpassed that of the in-house search solution and the original cooperation solution. PV_CTR more than doubled compared to the in-house solution. CSDN can now smoothly support its website search service.


ZhongAn Online P&C Insurance is a major customer of Alibaba Cloud Elasticsearch. We helped it improve search performance, reduce costs, meet multi-table associated query and HA requirements in database retrieval acceleration scenarios, and meet the geo-disaster recovery deployment requirements of financial enterprises.

