With hundreds of billions of daily requests, tens of billions of model features, an average ad response time of under 50 milliseconds, and cost savings of 90%, how does Mobvista construct its cloud O&M system?
On May 29, at the Infrastructure Management on the Cloud sub-forum of the Alibaba Cloud Developer Conference, Yu Longshui, a Senior O&M Architect at Mobvista, delivered a speech titled "O&M System Construction under Cloud Architecture." In the speech, Yu elaborated on how to use the cloud effectively and pointed out five directions for cost optimization on the cloud.
Yu Longshui, Senior O&M Architect of Mobvista
This article is based on Yu Longshui's speech.
First, let's look at the architecture of a simple cloud ecosystem. Compared with traditional IDCs, the architecture of cloud computing has five characteristics:
Since migrating to the cloud is more in line with the future development of enterprises, how can we use the cloud to serve enterprises better?
Every enterprise has its own particularities, but some ideas about using the cloud are common to all:
1. Configure the VPC. Generally, the network configuration includes VPC and Subnet.
On the public cloud, a Subnet is tied to an availability zone: one Subnet corresponds to one availability zone. The association between gateways and Subnets requires a unified design. The most commonly overlooked issue is reserving too few IP addresses when designing the Subnet, which causes resource scale-out to fail later.
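The sizing concern above can be checked before a Subnet is created. The following is a minimal sketch, not a platform tool: the function names, the 30% headroom, and the default of four platform-reserved addresses are assumptions for illustration, so check your provider's documentation for the real reserved count.

```python
import ipaddress


def usable_addresses(cidr: str, reserved_by_platform: int = 4) -> int:
    """Addresses left for instances after the platform takes its share.

    reserved_by_platform is an assumed default, not a documented constant.
    """
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - reserved_by_platform, 0)


def fits_peak(cidr: str, peak_instances: int, headroom_percent: int = 30,
              reserved_by_platform: int = 4) -> bool:
    """True if the subnet can hold the expected peak plus scale-out headroom."""
    # Integer arithmetic (rounded up) avoids floating-point surprises.
    needed = (peak_instances * (100 + headroom_percent) + 99) // 100
    return usable_addresses(cidr, reserved_by_platform) >= needed


if __name__ == "__main__":
    for cidr in ("10.0.1.0/24", "10.0.0.0/23"):
        print(cidr, usable_addresses(cidr), fits_peak(cidr, peak_instances=200))
```

Under these assumptions, a /24 looks adequate for 200 instances today but leaves no room for a 30% scale-out, which is exactly the failure described above.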
2. Select the computing resources and take elasticity and scalability into account
The public cloud platform supports multiple instance billing modes. For example, the Alibaba Cloud platform supports subscription, on-demand, and preemptible instances. A reasonable combination of different types of instances enables users to utilize resources efficiently at a low cost. Users also need to plan for load balancing when selecting computing resources. Cloud platforms generally provide a load balancing service, such as SLB or ELB.
3. Storage Resource Awareness – Using storage resources properly may cost much less than using computing resources.
4. Permission Management – On the public cloud, permission management involves two aspects:
For example, imagine you want to access an RDS or OSS instance from an ECS server. You can grant permissions to the ECS server itself without exposing any access key: at the backend, you authorize the ECS server to access a specific OSS instance or resource. Many people overlook this or are unwilling to do it, but it is safer than distributing access keys.
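As a rough illustration of the keyless approach, an ECS instance with an attached RAM role can read temporary credentials from the instance metadata service instead of storing an access key. This is a sketch under assumptions: the role name is hypothetical, the helper names are invented here, and the response fields and timestamp format shown should be confirmed against the current Alibaba Cloud documentation.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Callable
from urllib.request import urlopen

METADATA_BASE = "http://100.100.100.200/latest/meta-data"


def http_get(url: str) -> str:
    """Plain HTTP GET; only works from inside an ECS instance."""
    with urlopen(url, timeout=2) as resp:
        return resp.read().decode("utf-8")


def load_role_credentials(role_name: str,
                          fetch: Callable[[str], str] = http_get) -> dict:
    """Fetch temporary credentials for the RAM role attached to this instance."""
    raw = fetch(f"{METADATA_BASE}/ram/security-credentials/{role_name}")
    return json.loads(raw)


def needs_refresh(creds: dict, now: datetime,
                  margin: timedelta = timedelta(minutes=5)) -> bool:
    """True when the temporary credentials are about to expire.

    Assumes an 'Expiration' field formatted like 2022-01-01T06:00:00Z.
    """
    expires = datetime.strptime(creds["Expiration"], "%Y-%m-%dT%H:%M:%SZ")
    return now + margin >= expires.replace(tzinfo=timezone.utc)
```

Because `fetch` is injectable, the parsing and refresh logic can be exercised off-instance with a canned response.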
Then, let's look at intelligent monitoring. The monitoring involves three layers, including the resource layer, business layer, and application layer.
The first layer is the resource layer, including resources, such as the CPU, memory, disk, I/O, and network traffic. They are the underlying resources of the cloud platform. Monitoring their statuses ensures stable resource operations.
The second layer is the business layer. It monitors the items that matter to a specific business. For instance, in the advertising industry, recall volume, logs, and ad response latency are monitored to make sure that all abnormalities are captured and handled in a timely manner.
The third layer is the application layer. Users only need to check the status of the application itself, such as port availability and service liveness, to ensure that the application works properly.
After constructing a monitoring system, we must make it intelligent. Intelligent monitoring means that machines in different groups are monitored against the items relevant to their own projects. This way, new machines are monitored by group or category.
Creating a monitored cloud server starts from several public templates, such as CPU, memory, and disk I/O. Dedicated templates are then created from the public templates plus business-specific content. VM instances and Agents are created based on a dedicated template, and each Agent is associated with its template automatically. All machines in a group are therefore monitored against the same template. The result is a reasonably complete system.
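The template layering described above can be sketched as a small data model. Everything here is illustrative: the check names, thresholds, and group names are assumptions, not Mobvista's actual configuration.

```python
from dataclasses import dataclass
from typing import Dict

# Public template: resource-layer checks shared by every machine (assumed values).
PUBLIC_CHECKS: Dict[str, float] = {
    "cpu_percent": 90,
    "memory_percent": 90,
    "disk_io_util_percent": 85,
}


@dataclass
class Template:
    name: str
    checks: Dict[str, float]


def dedicated_template(name: str, business_checks: Dict[str, float]) -> Template:
    """Build a dedicated template from the public one plus business checks."""
    return Template(name, {**PUBLIC_CHECKS, **business_checks})


# Hypothetical machine groups and their dedicated templates.
TEMPLATES_BY_GROUP: Dict[str, Template] = {
    "ad-serving": dedicated_template("ad-serving", {"response_latency_ms": 50}),
}


def template_for(agent_group: str) -> Template:
    """Associate an Agent with its group's template, or the public one."""
    default = Template("public", dict(PUBLIC_CHECKS))
    return TEMPLATES_BY_GROUP.get(agent_group, default)
```

A new machine only has to report its group; the checks it inherits follow from the template.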
With a monitoring system in place, an alert system is required. The alerting feature can be classified into two categories:
Some services can recover automatically, in which case there is no need to send a message. This requires attaching actions, such as auto recovery, to the alerts.
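One way to read the auto-recovery idea is "try the attached action first, notify only if it does not help." The sketch below assumes a hypothetical mapping from service name to a recovery callable that returns True on success; it is not a real alerting API.

```python
from typing import Callable, Dict


def handle_alert(service: str,
                 recovery_actions: Dict[str, Callable[[], bool]],
                 notify: Callable[[str], None]) -> str:
    """Run the recovery action if one exists; notify a human otherwise."""
    action = recovery_actions.get(service)
    if action is not None and action():
        # Recovered automatically, so no message is sent.
        return "recovered"
    notify(f"{service} needs manual attention")
    return "notified"
```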
Compared with the automation of IDCs, automation on the cloud is much more necessary because the resources on the cloud are much more complicated than those in IDCs. In addition, the cloud involves more machines in greater variety. In this case, how can automatic O&M management be implemented? O&M personnel can develop their own tools or use the automation tools provided by the cloud platform.
The cloud platform provides automation tools, including SDKs in various languages and a CLI. The CLI is a command-line tool. O&M personnel are generally more comfortable with the command line than most developers, and they can accomplish part of the automation with the CLI alone.
The cloud platform also provides instance metadata through a dedicated IP endpoint. When users call this endpoint from a machine, they get information about that machine, including its instance ID, public IP, and network interface information. Such information is a building block for many tasks. With these tools, we can automate management.
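As an example of the local lookup described above, the sketch below reads a few fields about the current machine from the Alibaba Cloud instance metadata address. The helper names are assumptions, and the list of fields is a small subset; consult the platform documentation for the full set available in your region.

```python
from typing import Callable, Dict
from urllib.request import urlopen

METADATA_BASE = "http://100.100.100.200/latest/meta-data"
FIELDS = ("instance-id", "private-ipv4", "public-ipv4", "mac")


def http_get(url: str) -> str:
    """Plain HTTP GET; only reachable from inside an ECS instance."""
    with urlopen(url, timeout=2) as resp:
        return resp.read().decode("utf-8")


def describe_self(fetch: Callable[[str], str] = http_get) -> Dict[str, str]:
    """Collect basic facts about the machine this code runs on."""
    return {name: fetch(f"{METADATA_BASE}/{name}").strip() for name in FIELDS}
```

A bootstrap script could call `describe_self()` to decide which group a new machine belongs to before registering it elsewhere.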
Generally, the most frequently used features are batch management and group management. The establishment of a CMDB is the core of resource management because all resources must be managed within the CMDB. Here are two tips on how to establish a CMDB:
After establishing a CMDB and installing the relevant batch tools, the next step is the routine release. All O&M personnel have to deal with releases, but things get interesting when a new version is released on the cloud.
If a machine happens to be added or upgraded by scaling during release, is the version of the machine the latest or outdated?
Here are several techniques for handling this:
Versions are controlled by GitLab or SVN. When a version is released, first upload a copy to a fixed path, which can be OSS or other managed storage. Then, synchronize the version to all machines that are currently online through the batch tool.
If automatic scaling is triggered at this time to scale machine capacity out, the machines can use the latest code from the fixed path directly. This ensures that the versions of both the scaled and released machines are the latest.
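The ordering above (publish to the fixed path first, then sync online machines, and let newly scaled machines pull from the same path) can be modeled in a few lines. In this sketch, plain dictionaries stand in for the OSS bucket and the fleet, and the key name is hypothetical.

```python
from typing import Dict

FIXED_KEY = "releases/current"  # hypothetical fixed path in the object store


def release(version: str, store: Dict[str, str], machines: Dict[str, str]) -> None:
    """Publish to the fixed path first, then sync every online machine."""
    store[FIXED_KEY] = version
    for machine_id in machines:
        machines[machine_id] = store[FIXED_KEY]


def bootstrap(machine_id: str, store: Dict[str, str],
              machines: Dict[str, str]) -> None:
    """A machine added by auto scaling pulls whatever the fixed path holds."""
    machines[machine_id] = store[FIXED_KEY]
```

Because the fixed path is updated before any machine is touched, a machine that boots at any point during or after the batch sync receives the new version.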
Automatic monitoring refers to automatic registration and automatic deregistration. Due to the instability of the cloud, machines may fail to be created during scale-out. Monitoring personnel cannot configure and add each machine for monitoring manually: the delay is unpredictable, and exceptions that occur at night cannot be handled in a timely manner. Therefore, automatic monitoring is required.
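Automatic registration and deregistration amount to reconciling two sets: the instances that are actually running and the instances the monitoring system knows about. A minimal sketch, with hypothetical instance IDs:

```python
from typing import Set, Tuple


def reconcile(running: Set[str], monitored: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (instances to register, instances to deregister)."""
    to_register = running - monitored
    to_deregister = monitored - running
    return to_register, to_deregister
```

Running this on a schedule, or on scaling events, also covers machines that failed to be created: they never appear in `running`, so they are never registered.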
The establishment of an advertising platform is tedious, especially in the case of multi-region deployment. Deployment may be needed in Singapore and then in the United States or Germany. Therefore, in terms of automatic deployment, we should make full use of the orchestration tools provided by the public cloud. These tools are named ROS on Alibaba Cloud and CloudFormation on AWS.
Orchestration tools are driven by templates written in a dedicated declarative format. A template is a YAML file that lists all the resources needed. Environment setup, images, and network construction can be defined in the template in advance, which enables all-in-one deployment of the basic environment.
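To give a feel for what such a template contains, the sketch below builds a small ROS-style template for a VPC and one VSwitch and renders it as JSON, which ROS accepts alongside YAML (the Python standard library has no YAML writer). The logical resource names, CIDR blocks, and zone ID are placeholders, and the property names should be checked against the ROS resource reference before use.

```python
import json
from typing import Any, Dict


def basic_network_template(vpc_cidr: str, vswitch_cidr: str,
                           zone_id: str) -> Dict[str, Any]:
    """Describe a VPC and one VSwitch as an orchestration template."""
    return {
        "ROSTemplateFormatVersion": "2015-09-01",
        "Resources": {
            "Vpc": {
                "Type": "ALIYUN::ECS::VPC",
                "Properties": {"CidrBlock": vpc_cidr},
            },
            "VSwitch": {
                "Type": "ALIYUN::ECS::VSwitch",
                "Properties": {
                    # Ref ties the VSwitch to the VPC declared above.
                    "VpcId": {"Ref": "Vpc"},
                    "ZoneId": zone_id,
                    "CidrBlock": vswitch_cidr,
                },
            },
        },
    }


def render(template: Dict[str, Any]) -> str:
    """Serialize the template for submission to the orchestration service."""
    return json.dumps(template, indent=2)
```

For multi-region deployment, the same function can be called once per region with different CIDR blocks and zone IDs, so every region gets an identical base environment.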
We use the cloud because of its low cost. However, if we do not use it properly, it could be costlier than IDC.
Five Directions for Cost Savings:
1. Adopt Preemptible Instances – The Solution with the Highest Cost Savings
Preemptible instances are a type of on-demand instance billed at a lower price than regular pay-as-you-go instances. Proper use of preemptible instances can reduce computing costs by up to 90%.
2. Adopt Auto Scaling Service
Using the Auto Scaling service is a defining feature of running a business on the cloud. It helps enterprises reduce costs by at least 30%.
3. Make Full Use of the Features of OSS
Not everything needs to be stored on compute nodes. OSS offers multiple storage classes with different pricing, including Standard, Infrequent Access (IA), Archive, and Cold Archive. Users can write automated scripts or tools to track how often data is accessed and move it to the appropriate storage class accordingly, which can reduce storage costs significantly.
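The access-frequency tracking mentioned above boils down to a rule that maps "days since last access" to a storage tier. The thresholds below are illustrative assumptions, not OSS defaults; minimum storage durations and retrieval fees for the colder tiers should be factored in before adopting any such rule.

```python
def storage_class_for(days_since_last_access: int) -> str:
    """Pick a storage tier from how long ago an object was last read.

    The 30/90/365-day cut-offs are assumed for illustration only.
    """
    if days_since_last_access < 30:
        return "Standard"
    if days_since_last_access < 90:
        return "IA"
    if days_since_last_access < 365:
        return "Archive"
    return "ColdArchive"
```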
4. Control Resources
Cleaning up idle resources regularly can save a lot of money.
5. Make Good Use of Tags
Users can use tags for grouping. This helps users determine the resource usage by teams to classify costs and analyze optimization points.
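A cost breakdown by tag is a simple aggregation once resources carry tags. The sketch below assumes a hypothetical record shape (a `tags` dictionary and a `monthly_cost` figure per resource); real billing exports will differ.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def cost_by_tag(resources: Iterable[Dict[str, Any]],
                tag_key: str = "team") -> Dict[str, float]:
    """Sum monthly cost per tag value; untagged resources get their own bucket."""
    totals: Dict[str, float] = defaultdict(float)
    for resource in resources:
        owner = resource.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += resource["monthly_cost"]
    return dict(totals)
```

The "untagged" bucket is often the first optimization point, since it shows spend that no team has claimed.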
Finally, let's introduce SpotMax, our in-house tool that enhances the usage of Spot instances in scaling groups. It also provides a solution when Spot instances are reclaimed or resources are insufficient.
The basic feature of this tool is to replenish resources before Spot instances are reclaimed and then add resources for scaling, which avoids resource loss. For example, users of preemptible instances are informed in advance of when an instance will be reclaimed, which gives them enough time to replace it with new resources. In a more complex scenario, users who want to add resources may not get Spot instances. In this case, they should add an on-demand machine and replace it once Spot instances become available.
The same applies when a Spot scale-out fails: add on-demand instances instead. After the on-demand instances are added, the system continues to poll. Once Spot instances are obtained, they replace the on-demand instances. This keeps the service stable while minimizing cost.
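The fallback-and-replace loop described above can be sketched as follows. This is not SpotMax's implementation: the function names, the instance record shape, and the injected `request_spot`, `request_on_demand`, and `release` callables are all assumptions standing in for real cloud API calls.

```python
from typing import Callable, Dict, List, Optional

Instance = Dict[str, str]  # e.g. {"id": "i-1", "kind": "spot"}


def acquire(request_spot: Callable[[], Optional[Instance]],
            request_on_demand: Callable[[], Instance]) -> Instance:
    """Prefer a Spot instance; fall back to on-demand if none is available."""
    spot = request_spot()
    return spot if spot is not None else request_on_demand()


def on_reclaim_notice(fleet: List[Instance], doomed_id: str,
                      request_spot: Callable[[], Optional[Instance]],
                      request_on_demand: Callable[[], Instance]) -> List[Instance]:
    """Replenish capacity before the reclaimed instance disappears."""
    replacement = acquire(request_spot, request_on_demand)
    return [i for i in fleet if i["id"] != doomed_id] + [replacement]


def swap_pass(fleet: List[Instance],
              request_spot: Callable[[], Optional[Instance]],
              release: Callable[[Instance], None]) -> List[Instance]:
    """One polling pass: swap on-demand instances for Spot when possible."""
    result: List[Instance] = []
    for instance in fleet:
        if instance["kind"] == "on-demand":
            spot = request_spot()
            if spot is not None:
                release(instance)  # give back the pricier machine
                instance = spot
        result.append(instance)
    return result
```

Calling `swap_pass` on a timer reproduces the "continue to poll" behavior: on-demand capacity keeps the service stable until Spot capacity returns, then is released.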
Based on this scheme, we have developed SpotMax. It helps to cut costs by up to 90%.
Spot instances currently provide support for Mobvista's global advertising business and produce good results. Mobvista's advertising platform ranks first in China and is among the top 10 in the world, covering over 200 countries. The platform deals with over 100 billion daily requests, and some models have ten billion features, but the average response time is kept within 50 milliseconds.