By Xinhao Chen (Head of TapTap/IEM/AI Platform)
Serverless saves TapTap a large amount of operation and maintenance (O&M) and development manpower in building applications. With its help, we have brought our infrastructure, or rather our resource management, to a relatively high level in the industry without dedicating any staff to infrastructure. In concrete terms, a single-digit headcount provides a full set of AI and big data support for all of TapTap's search, advertising, and recommendation business.
— Xinhao Chen
X.D. Network Inc., founded in 2003, is a global game developer and publisher with extensive experience in R&D, distribution, and agency operations.
In 2016, X.D. Network Inc. launched TapTap, a mobile game community and application store. Players can download free and paid mobile games through official channels and communicate with other players in the community. As of June 2022, TapTap had more than 50 million monthly active users worldwide.
Unlike app stores that take a share of revenue, TapTap has insisted on zero revenue sharing for its channel, so its commercialization is currently driven mainly by advertising. TapTap runs native ads in the app. Because they closely match the surrounding non-commercial content, they give users a better experience. Placements include the game recommendations on the homepage, the content recommendations on the discovery page, the suggested words shown during a search, and the landing page at the end of a search. All advertisements are interspersed among this organic content.
TapTap's Serverless practice grew out of the actual requirements of these business scenarios: the automatic update and deployment of the deep learning models that search, advertising, and recommendation rely on, the model experiment recording platform that algorithm staff rely on, and NLP analysis and processing of new content.
In the early stage, most of TapTap's backend services were deployed on ECS and were managed and deployed through Rundeck, which was not ideal in terms of efficiency and management. Here are four aspects of requirements I have summed up for the upgrade of infrastructure plans:
Here are two mainstream solution architectures for us to choose from. One is a complete solution of ECS instance + in-house Kubernetes, and the other is a Serverless architecture, which uses Serverless Application Engine (SAE) and Function Compute (FC).
After comparison, we chose the latter. On the one hand, Serverless frees us from purchasing machines: there is no need to buy ECS instances in advance. It also comes with optional default environments, so unless we have special requirements, we are spared the trouble of setting up environments. On the other hand, Serverless integrates many basic components, which allows us to go online without O&M work.
In terms of subsequent maintenance, Serverless products bill at a finer granularity than ECS: pay-as-you-go billing can be minute-level or even second-level. Compared with the Kubernetes + ECS model, Serverless also saves plenty of labor costs in early development and subsequent O&M.
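To make the billing point concrete, here is a minimal sketch of how the billing unit changes what a short task is charged for. The 90-second runtime and the three granularities are illustrative assumptions, not TapTap figures or Alibaba Cloud price rules.

```python
import math


def billed_seconds(runtime_s: float, granularity_s: int) -> int:
    """Round a runtime up to the next whole billing unit."""
    return math.ceil(runtime_s / granularity_s) * granularity_s


# A hypothetical 90-second task under three billing granularities.
for label, unit in [("per second", 1), ("per minute", 60), ("per hour", 3600)]:
    print(label, billed_seconds(90, unit))
```

The shorter and burstier the workload, the larger the gap between the time actually used and the time billed at a coarse granularity.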
Here is how we understand the two Serverless products, based on TapTap's actual experience of using them:
FC decouples the scheduling and triggering logic from the business logic. Engineers and algorithm developers control triggering and scheduling in the Function Compute console, so they can focus on designing the business logic without additional development work. This makes Function Compute well suited to event-driven business scenarios: resources are requested to run the business logic only when an event occurs.
In our opinion, SAE resembles an enhanced version of Kubernetes, with richer features and a full set of microservice capabilities. It significantly reduces maintenance costs and works out of the box. SAE is better suited to microservice transformation. For example, old services running on ECS can be migrated directly to SAE and gain a complete set of containerized O&M solutions without any investment in O&M manpower.
Together, the two products cover most of TapTap's business scenarios and allow all application services to run Serverless.
1) Fully Automated Model Deployment and Hour-Level Update Service Triggered by Object Storage Service (OSS)
TapTap provides an automatic model deployment and update service that is triggered by OSS. After algorithm staff have trained a model (whether it is a TensorFlow model, a PyTorch model, or a Machine Learning Platform for AI model in another format), they only need to export it to the specified OSS bucket to trigger the model update and deployment service. In other words, exporting the model is what starts the deployment. Algorithm staff can therefore deploy, update, and later elastically scale their models without relying on other engineering manpower.
2) Model Experiment Management Platform Triggered by HTTP (Web Service)
Algorithm staff submit model training tasks through an internal model experiment management and parameter platform built on HTTP triggers. For each task, we automatically record its training parameters, log addresses, and log instances, so that every experiment is traceable and manageable. This is a Web service with a frontend, but it is also an internal service that does not require high QPS or performance, so it runs on FC. This keeps costs very low, especially because FC has free quotas.
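A minimal sketch of the recording step follows, assuming a WSGI-style entry point (one way a Python HTTP function can be written) and a JSON request body with hypothetical `params` and `log_address` fields. The in-memory `EXPERIMENTS` dict stands in for whatever persistent store the real platform uses.

```python
import io
import json
import time
import uuid

# In-memory stand-in for a real experiment database (assumption).
EXPERIMENTS = {}


def record_experiment(params, log_address):
    """Store one experiment record and return its generated ID."""
    experiment_id = uuid.uuid4().hex
    EXPERIMENTS[experiment_id] = {
        "params": params,
        "log_address": log_address,
        "submitted_at": time.time(),
    }
    return experiment_id


def handler(environ, start_response):
    """WSGI-style entry point: record the submitted training task."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = json.loads(environ["wsgi.input"].read(length) or b"{}")
    experiment_id = record_experiment(
        body.get("params", {}), body.get("log_address", "")
    )
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps({"experiment_id": experiment_id}).encode("utf-8")]


def fake_environ(payload):
    """Build a minimal WSGI environ for trying the handler locally."""
    raw = json.dumps(payload).encode("utf-8")
    return {
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
```

Calling `handler(fake_environ({"params": {"lr": 0.01}}), lambda status, headers: None)` locally returns a JSON body containing the new experiment ID.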
3) A New Content NLP Processing and Parsing Service Triggered by Kafka
When a user posts in TapTap, we push the post through Kafka to the NLP analysis service for processing and parsing, and save the result for later searches. Each piece of content a user posts invokes the service exactly once, which lets us control the cost precisely.
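The per-message pattern can be sketched as below. The event shape (a JSON list of records with `key` and `value` fields) is an assumption made for illustration, and `analyze` is a toy placeholder for the real NLP processing, which the article does not describe.

```python
import json


def analyze(text):
    """Toy placeholder for NLP parsing: unique lowercase words, in order."""
    keywords = []
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word and word not in keywords:
            keywords.append(word)
    return keywords


def handler(event, context):
    """Process each posted message once and return searchable keywords.

    Assumes the trigger delivers a JSON list of records, each with a
    "key" (post ID) and a "value" (post text).
    """
    records = json.loads(event)
    return [
        {"post_id": r["key"], "keywords": analyze(r["value"])} for r in records
    ]
```

Because the function runs only when messages arrive, the number of invocations tracks the number of posts.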
4) Regular Weekly and Daily Statistics on Resource Consumption
Statistics on MaxCompute and EAS resource consumption are triggered weekly or daily. TapTap automatically pulls unstructured consumption bills from the Alibaba Cloud backend, aggregates them by staff member, task, and model, and pushes the results to everyone in the group. This raises cost awareness among team members and promotes better cost management for the business.
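The aggregation step can be sketched as follows, assuming the bills have already been parsed into records. The field names (`staff`, `task`, `model`, `cost`) and the sample values are hypothetical.

```python
from collections import defaultdict


def aggregate_costs(bills, key):
    """Sum the "cost" field of each bill, grouped by the given field."""
    totals = defaultdict(float)
    for bill in bills:
        totals[bill[key]] += bill["cost"]
    return dict(totals)


# Hypothetical parsed bill records.
bills = [
    {"staff": "alice", "task": "ctr-train", "model": "ctr-v1", "cost": 1.5},
    {"staff": "alice", "task": "ctr-serve", "model": "ctr-v1", "cost": 2.0},
    {"staff": "bob", "task": "rank-train", "model": "rank-v2", "cost": 4.0},
]

# One report per dimension mentioned in the text.
report = {dim: aggregate_costs(bills, dim) for dim in ("staff", "task", "model")}
```

A timer trigger would run this weekly or daily and push `report` to the group.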
For our first SAE deployment, we chose the group's estimation service. The service integrates the model inference, feature development, and sample return required by search, recommendation, and advertising. It is a middle-platform microservice that lets every business line access the group's most mature online estimation service at very low cost. Examples include estimating the click-through rate of the recommended words on the current search page and of the international version of the game.
With SAE, TapTap can deliver Serverless services quickly. Because SAE hides much of the resource management, environment management, and basic O&M component management, TapTap can quickly launch an independent estimation service for a new scenario or a new business.
At the same time, TapTap integrates SAE's alarm platform, event center, and log service. Through DingTalk alarms, TapTap can monitor the status of online services in real time and see whether an OOM, a restart, or an error log has occurred.
In addition, the service is connected to the Dubbo Go framework, which directly gives it capabilities such as service registration and discovery, direct IP connection, and graceful startup and shutdown. Compared with the previous ECS-based mode, this solution has great advantages in O&M management, launch, and subsequent cost control. It covers most of the process from launch to subsequent O&M and significantly reduces development costs in the group.