How Enterprises Can Use the Cloud Efficiently | O&M System Construction under Cloud Architecture

This article explains methods to use the cloud effectively and five directions for cost optimization on the cloud.

With hundreds of billions of daily requests, tens of billions of model features, an average ad response time of less than 50 milliseconds, and cost savings of 90%, how does Mobvista construct its cloud O&M system?

On May 29, at the Infrastructure Management on the Cloud sub-forum of the Alibaba Cloud Developer Conference, Yu Longshui, a Senior O&M Architect at Mobvista, delivered a speech titled O&M System Construction under Cloud Architecture. In it, Yu explained how to use the cloud effectively and pointed out five directions for cost optimization on the cloud.

Yu Longshui, Senior O&M Architect of Mobvista

This article is based on Yu Longshui's speech.

Why Do We Choose the Cloud?

First, let's look at the architecture of a simple cloud ecosystem. Compared with traditional IDCs, the architecture of cloud computing has five characteristics:


  1. Higher Development Requirements: Applications migrated to the cloud must account for single-server capacity, request response, load balancing, and forwarding.
  2. Changes in the Number of Servers: Resources on the cloud can be scaled freely and are no longer fixed, which is very challenging for O&M personnel.
  3. High Resource Utilization: The most important feature of the cloud computing platform is its high elasticity, which enables it to provide cloud resources based on user needs and improve resource utilization overall.
  4. Unlimited Business Growth: After cloud migration, users can use cloud resources flexibly and adjust cloud usage at any time. Current businesses and future businesses are free from the constraints of resources.
  5. Higher Security: The public cloud of Alibaba Cloud has a set of built-in security solutions that all Alibaba Cloud users can use for free. For example, during a DDoS attack, they notify users of the trigger time, processing status, and results, ensuring higher security.

How Can the Cloud Serve Enterprises Better?

Since migrating to the cloud is more in line with the future development of enterprises, how can we use the cloud to serve enterprises better?

Common Ideas of Using the Cloud


Considering the particularities of each enterprise, let's look at some common ideas about using the cloud:

1.  Configure the VPC. Generally, the network configuration includes VPC and Subnet.

On the public cloud, a Subnet is tied to an availability zone: one Subnet corresponds to one availability zone. The association between gateways and Subnets requires a unified design. The issue most easily overlooked is reserving too few IP addresses when designing a Subnet, which causes resource scale-out to fail later.
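The IP headroom check described above can be sketched in a few lines. This is a minimal illustration, not a platform tool: the number of addresses the platform reserves in each Subnet differs by provider, so `RESERVED_PER_SUBNET` is an assumption, and the CIDR blocks are made up.

```python
import ipaddress

# Assumption: the platform reserves a few addresses in every Subnet.
# The exact count differs by provider, so check the provider's documentation.
RESERVED_PER_SUBNET = 4


def usable_addresses(cidr: str, reserved: int = RESERVED_PER_SUBNET) -> int:
    """Return how many addresses in the Subnet are left for instances."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - reserved, 0)


def can_scale_to(cidr: str, peak_hosts: int) -> bool:
    """Check whether the Subnet has room for the expected peak host count."""
    return peak_hosts <= usable_addresses(cidr)


if __name__ == "__main__":
    for cidr in ("10.0.1.0/24", "10.0.0.0/20"):
        print(cidr, usable_addresses(cidr), can_scale_to(cidr, peak_hosts=400))
```

Running a check like this against the expected peak size of each scaling group, before the Subnet is created, is cheaper than re-planning the network after a scale-out has already failed.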

2.  Select the computing resources and take elasticity and scalability into account

The public cloud platform supports multiple instance billing modes. For example, the Alibaba Cloud platform supports subscription, on-demand, and preemptible instances. A reasonable combination of different types of instances enables users to utilize resources efficiently at a low cost. Users also need to plan for load balancing when selecting computing resources. The cloud platform generally provides a corresponding load balancing service, such as SLB or ELB.

3.  Storage Resource Awareness – Using storage resources properly may cost much less than using computing resources.

4.  Permission Management – On the public cloud, permission management involves two aspects:

  • The first is user permissions: users apply for permission to operate on-cloud resources.
  • The second is RAM (Resource Access Management) authorization for resources. It is easy to understand that users can control resources. Fewer people realize that one resource can also be authorized to control another resource, but this is very necessary.

For example, imagine you want to access an RDS instance or an OSS bucket from an ECS server. To do so, you can grant permissions to the ECS server itself without exposing any AccessKey: you authorize the ECS server to access a specific OSS bucket or resource at the backend. Many people tend to ignore or are unwilling to do this, but it is safer.
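As a rough illustration of resource-to-resource authorization, the sketch below builds a least-privilege policy document that could be attached to a role assumed by an ECS server. The bucket name is hypothetical, and the field layout follows the general shape of RAM policy documents as an assumption; verify it against the RAM documentation before use.

```python
import json


def oss_read_only_policy(bucket: str) -> dict:
    """Build a policy that only allows reading objects from one bucket."""
    return {
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                # Read-only actions; no write or delete permissions are granted.
                "Action": ["oss:GetObject", "oss:ListObjects"],
                "Resource": [
                    f"acs:oss:*:*:{bucket}",
                    f"acs:oss:*:*:{bucket}/*",
                ],
            }
        ],
    }


if __name__ == "__main__":
    # "example-ad-logs" is a made-up bucket name used only for illustration.
    print(json.dumps(oss_read_only_policy("example-ad-logs"), indent=2))
```

Because the permission is bound to the server rather than to a stored key, nothing sensitive needs to live in application code or configuration files.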

More Intelligent Monitoring

Then, let's look at intelligent monitoring. The monitoring involves three layers, including the resource layer, business layer, and application layer.


The first layer is the resource layer, including resources, such as the CPU, memory, disk, I/O, and network traffic. They are the underlying resources of the cloud platform. Monitoring their statuses ensures stable resource operations.

The second layer is the business layer. It monitors relevant items within a specific business. For instance, in the advertising industry, recall volume, logs, and advertising return latency are monitored to make sure that all abnormalities are captured and handled in a timely manner.

The third layer is the application layer. Here, users only need to check the status of the application itself, such as whether its ports are reachable and whether the service is alive, to ensure that the application works properly.


After constructing a monitoring system, we must make it intelligent. Intelligent monitoring enables machines in different groups to monitor their respective projects accordingly. This way, we can conduct grouped or categorized monitoring of new machines.

When a cloud server is created for monitoring, several public templates, such as CPU, memory, and disk I/O, are applied. Dedicated templates are then built from the public templates plus business-specific content. VM instances and Agents are created based on a dedicated template, and each Agent is associated with its template automatically, so that a group of machines is monitored with the same template. The result is a relatively complete system.
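The template layering described above can be sketched as follows. The group names and business checks are hypothetical, and the registry is an in-memory stand-in for a real monitoring server.

```python
# Public templates shared by every monitored machine.
PUBLIC_TEMPLATES = ["cpu", "memory", "disk_io"]

# Hypothetical business-specific checks per machine group.
DEDICATED_CHECKS = {
    "ad-server": ["recall_volume", "ad_return_latency"],
    "log-collector": ["log_backlog"],
}


def build_template(group: str) -> list:
    """Dedicated template = public templates + the group's business checks."""
    return PUBLIC_TEMPLATES + DEDICATED_CHECKS.get(group, [])


def register_agent(registry: dict, hostname: str, group: str) -> None:
    """Attach a new machine to the template of its group."""
    registry[hostname] = {"group": group, "checks": build_template(group)}


if __name__ == "__main__":
    registry = {}
    register_agent(registry, "ad-01", "ad-server")
    register_agent(registry, "log-01", "log-collector")
    for host, entry in registry.items():
        print(host, entry["checks"])
```

A machine in an unknown group still receives the public templates, so no new server goes completely unmonitored.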

The Escalation of Alert Level

With a monitoring system in place, an alert system is required. The alerting feature can be classified into two categories:

  1. The first category involves sending messages. This can be classified further into high-level cases and low-level cases.
  2. The second category is alert escalation. For example, an alert that has lasted 5 minutes is low level. If it lasts over 15 minutes, it can be escalated to the middle or high level, and the range of alert contacts can be extended at the same time.

Some services can recover automatically, in which case there is no need to send a message. This requires the alert to trigger actions, such as an auto-recovery feature.
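A minimal sketch of the escalation logic, using the 5-minute and 15-minute thresholds from the example above. The contact lists and level names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    level: str
    contacts: list
    notify: bool


def decide(duration_min: float, auto_recovered: bool = False) -> Decision:
    """Pick the alert level and contacts from how long the alert has lasted."""
    if auto_recovered:
        # The service healed itself, so nobody needs to be messaged.
        return Decision("resolved", [], notify=False)
    if duration_min > 15:
        # Escalated: widen the contact range (names are hypothetical).
        return Decision("high", ["on-call", "team-lead", "manager"], notify=True)
    if duration_min >= 5:
        return Decision("low", ["on-call"], notify=True)
    # Too short-lived to alert on yet.
    return Decision("pending", [], notify=False)


if __name__ == "__main__":
    for minutes in (2, 5, 20):
        print(minutes, decide(minutes))
```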

Tools, Release, and Deployment of O&M Automation

Compared with the automation of IDCs, automation on the cloud is much more necessary because the resources on the cloud are much more complicated than those in IDCs. In addition, the cloud involves more machines in greater variety. In this case, how can automatic O&M management be implemented? O&M personnel can develop their own tools or use the automation tools provided by the cloud platform.


Three O&M Automation Tools

The cloud platform provides some automation tools, including SDKs in various languages and a CLI. The CLI is a command-line tool. O&M personnel are generally more comfortable with the command line than most developers, and they can accomplish partial automation with the CLI alone.

The cloud platform also provides instance metadata through a dedicated IP endpoint. When users call this endpoint from a local machine, they get information about that machine, including its instance ID, public IP, and network card information. Such information enables users to do many things. With these tools, we can automate management.
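The sketch below shows one way a machine could look itself up through such an endpoint. The address and item names (`instance-id`, `public-ipv4`, `mac`) are assumptions based on how Alibaba Cloud ECS metadata is commonly documented; confirm them for your platform. The `fetch` parameter is injectable so the logic can be exercised outside a cloud instance.

```python
from urllib.request import urlopen

# Assumption: the ECS metadata service is reachable at this address.
METADATA_BASE = "http://100.100.100.200/latest/meta-data/"

# Assumed item names for instance ID, public IP, and network card (MAC).
ITEMS = {"instance_id": "instance-id", "public_ip": "public-ipv4", "mac": "mac"}


def http_get(url: str, timeout: float = 2.0) -> str:
    """Fetch a metadata URL; only works from inside a cloud instance."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def describe_self(fetch=http_get) -> dict:
    """Collect a few facts about the local machine from the metadata service."""
    return {name: fetch(METADATA_BASE + path) for name, path in ITEMS.items()}


if __name__ == "__main__":
    # Offline demonstration with a fake fetcher instead of a real HTTP call.
    fake = {"instance-id": "i-demo", "public-ipv4": "203.0.113.10", "mac": "00:16:3e:00:00:01"}
    print(describe_self(fetch=lambda url: fake[url.rsplit("/", 1)[-1]]))
```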

Generally, the most frequently used features are batch management and group management. The establishment of a CMDB is the core of resource management because all resources must be managed within the CMDB. Here are two tips on how to establish a CMDB:

  1. Generally, resource information on the cloud platform can be retrieved through the SDK. Users can query the corresponding instance information and then write it to the CMDB. Various approaches are available. For example, demos published by others in the GitHub open-source community that collect resource information can be reused directly.
  2. Users should use the Go language for CMDB development whenever possible. The Go language provides much better concurrency and affinity than other languages. The Python language is poor in concurrency and takes a long time to call CMDB interfaces. With a CMDB in place, machines can be classified by CMDB data when they are operated on through Ansible or Salt.
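To illustrate how CMDB data can drive batch tools, the sketch below groups CMDB rows into the JSON layout that Ansible accepts from a dynamic inventory script. It is written in Python only to stay consistent with the other sketches in this article, not as a statement about how the CMDB itself should be built. The rows and field names are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical CMDB rows; in practice they would be synced from the cloud SDK.
CMDB = [
    {"instance_id": "i-001", "ip": "10.0.1.11", "project": "ad-server", "region": "singapore"},
    {"instance_id": "i-002", "ip": "10.0.1.12", "project": "ad-server", "region": "singapore"},
    {"instance_id": "i-003", "ip": "10.0.2.21", "project": "log-collector", "region": "frankfurt"},
]


def to_inventory(rows, group_by="project") -> dict:
    """Group CMDB rows into an Ansible dynamic-inventory structure."""
    groups = defaultdict(lambda: {"hosts": []})
    hostvars = {}
    for row in rows:
        groups[row[group_by]]["hosts"].append(row["ip"])
        hostvars[row["ip"]] = {"instance_id": row["instance_id"], "region": row["region"]}
    inventory = dict(groups)
    # "_meta" lets Ansible read per-host variables without extra calls.
    inventory["_meta"] = {"hostvars": hostvars}
    return inventory


if __name__ == "__main__":
    print(json.dumps(to_inventory(CMDB), indent=2))
```

Changing `group_by` to `"region"` produces region-based groups from the same data, which is the kind of classification a CMDB makes cheap.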

O&M Automation: Release

After establishing a CMDB and installing the relevant batch tools, the next step is the routine release. All O&M personnel have to deal with releases, but things get interesting when a new version is released on the cloud.

If a machine happens to be added or upgraded by scaling during release, is the version of the machine the latest or outdated?

Here are several techniques:


Versions are controlled by GitLab or SVN. When a version is released, first upload a copy to a fixed path, which can be OSS or another managed storage service. Then, synchronize the version to all machines that are currently online through the batch tool.

If automatic scaling is triggered at this time to scale machine capacity out, the machines can use the latest code from the fixed path directly. This ensures that the versions of both the scaled and released machines are the latest.
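The fixed-path approach can be sketched as below. A local directory stands in for OSS, and the `LATEST` pointer file and `releases/` layout are assumptions made for illustration, not a description of Mobvista's actual layout.

```python
import tempfile
from pathlib import Path


def publish(store: Path, version: str, payload: bytes) -> None:
    """Upload the build first, then move the 'latest' pointer to it."""
    releases = store / "releases"
    releases.mkdir(parents=True, exist_ok=True)
    (releases / f"{version}.tar.gz").write_bytes(payload)
    # The pointer is updated last, so readers never see a missing build.
    (store / "LATEST").write_text(version)


def bootstrap(store: Path) -> tuple:
    """Called by a freshly scaled-out machine: fetch whatever is current."""
    version = (store / "LATEST").read_text().strip()
    return version, (store / "releases" / f"{version}.tar.gz").read_bytes()


def demo() -> tuple:
    """Publish two versions, then start a 'new machine' mid-release."""
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)
        publish(store, "v1", b"old build")
        publish(store, "v2", b"new build")
        return bootstrap(store)


if __name__ == "__main__":
    print(demo())
```

Because every new machine reads the pointer at startup, a scale-out that happens during a release picks up the same version that the batch tool is pushing to the online machines.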

Automatic monitoring refers to automatic registration and automatic deregistration. Due to the instability of the cloud, machines may fail to be created during scale-out, and machines come and go at unpredictable times. O&M personnel cannot add each machine to monitoring manually, and exceptions at night cannot be handled in a timely manner. Therefore, automatic monitoring is required.


  • Automatic Registration: Take Zabbix, the most commonly used tool, as an example. The agent registers itself automatically when it starts. Configuration items such as host metadata report feature values to the Zabbix server, which classifies hosts by metadata and matches them to different templates. As such, different instances are registered to different templates and grouped automatically.
  • Deregistration: Deregistration is harder than registration. We recommend using the cloud monitoring service of the cloud platform. With cloud monitoring in place, each instance or resource operation produces an event. Once the event is captured, its content is analyzed, and the corresponding host is deregistered on the server.
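The registration and deregistration flow above can be sketched with an in-memory registry standing in for the monitoring server. The event names, the metadata format (`group:region`), and the template naming are all hypothetical; real event names depend on the cloud monitoring service in use.

```python
# Hypothetical event names that signal a machine has gone away.
RELEASE_EVENTS = {"instance_released", "instance_reclaimed"}


class MonitorRegistry:
    """In-memory stand-in for the monitoring server's host list."""

    def __init__(self):
        self.hosts = {}

    def register(self, instance_id: str, metadata: str) -> str:
        """Pick a template from the agent's metadata, as auto-registration does."""
        group = metadata.split(":", 1)[0]
        template = "tpl-" + group
        self.hosts[instance_id] = template
        return template

    def handle_event(self, event: dict) -> bool:
        """Deregister the host when a release event arrives; report if removed."""
        if event.get("name") not in RELEASE_EVENTS:
            return False
        return self.hosts.pop(event.get("instance_id"), None) is not None


if __name__ == "__main__":
    registry = MonitorRegistry()
    print(registry.register("i-001", "ad-server:singapore"))
    print(registry.handle_event({"name": "instance_released", "instance_id": "i-001"}))
    print(registry.hosts)
```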

O&M Automation: Deployment

The establishment of an advertising platform is tedious, especially in the case of multi-region deployment. Deployment may be needed in Singapore and then in the United States or Germany. Therefore, in terms of automatic deployment, we should make full use of the orchestration tools provided by the public cloud. These tools are named ROS on Alibaba Cloud and CloudFormation on AWS.

Orchestration tools are driven by declarative templates. The template is a YAML file that declares all the resources needed. The environment setup for the resources, images, and network construction can be defined in advance, which enables one-click deployment of the basic environment.
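As a rough sketch of multi-region templating, the code below renders one network template per region as a Python dictionary, which can be serialized to JSON or YAML. The resource type names and properties are written from memory of the ROS template format and should be treated as assumptions; the zone IDs, CIDR blocks, and the `ad-platform` name are hypothetical.

```python
import ipaddress
import json

# Example regions for Singapore, the United States, and Germany.
# Zone IDs and CIDR blocks are made up for illustration.
REGIONS = {
    "ap-southeast-1": {"zone": "ap-southeast-1a", "vpc_cidr": "10.10.0.0/16"},
    "us-west-1": {"zone": "us-west-1a", "vpc_cidr": "10.20.0.0/16"},
    "eu-central-1": {"zone": "eu-central-1a", "vpc_cidr": "10.30.0.0/16"},
}


def render_template(zone: str, vpc_cidr: str) -> dict:
    """Describe a VPC and one Subnet (VSwitch) for a single region."""
    first_subnet = next(ipaddress.ip_network(vpc_cidr).subnets(new_prefix=24))
    return {
        "ROSTemplateFormatVersion": "2015-09-01",
        "Resources": {
            "Vpc": {
                "Type": "ALIYUN::ECS::VPC",
                "Properties": {"CidrBlock": vpc_cidr, "VpcName": "ad-platform"},
            },
            "VSwitch": {
                "Type": "ALIYUN::ECS::VSwitch",
                "Properties": {
                    "VpcId": {"Ref": "Vpc"},
                    "ZoneId": zone,
                    "CidrBlock": str(first_subnet),
                },
            },
        },
    }


def render_all() -> dict:
    """Render the same basic environment for every region."""
    return {r: render_template(c["zone"], c["vpc_cidr"]) for r, c in REGIONS.items()}


if __name__ == "__main__":
    print(json.dumps(render_all()["ap-southeast-1"], indent=2))
```

Generating the per-region templates from one function keeps the environments identical apart from the parameters that must differ.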


Five Directions for Cost Optimization on the Cloud


We use the cloud because of its low cost. However, if we do not use it properly, it could be costlier than IDC.

Five Directions for Cost Savings:

1.  Adopt Preemptible Instances – The Solution with the Highest Cost Savings

Preemptible instances are on-demand instances billed at a lower price than regular pay-as-you-go instances, with the trade-off that the platform can reclaim them. Proper use of preemptible instances can reduce computing costs by up to 90%.


2.  Adopt Auto Scaling Service

For businesses, the Auto Scaling service is one of the defining features of running on the cloud. It helps enterprises reduce costs by at least 30%.

3.  Make Full Use of the Features of OSS

Not everything needs to be stored on compute nodes. OSS supports multiple storage classes with different billing, including Standard, Infrequent Access, Archive, and Cold Archive. Users can write automated scripts or tools to track data access frequency and move data to the appropriate storage class, which can reduce storage costs significantly.
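A script of the kind described above might start from a rule like the following. The day thresholds are hypothetical and should be tuned to the real access pattern and pricing; the class identifiers are assumed to match the OSS storage class names.

```python
def choose_storage_class(days_since_last_access: int) -> str:
    """Map access recency to a storage class (thresholds are hypothetical)."""
    if days_since_last_access < 30:
        return "Standard"
    if days_since_last_access < 90:
        return "IA"  # Infrequent Access
    if days_since_last_access < 365:
        return "Archive"
    return "ColdArchive"


def plan_transitions(objects: dict) -> dict:
    """Given {prefix: (current_class, idle_days)}, list the moves to make."""
    moves = {}
    for prefix, (current, idle_days) in objects.items():
        target = choose_storage_class(idle_days)
        if target != current:
            moves[prefix] = target
    return moves


if __name__ == "__main__":
    # Made-up prefixes and idle times for illustration.
    inventory = {
        "logs/2024-01/": ("Standard", 400),
        "models/current/": ("Standard", 1),
        "reports/q3/": ("IA", 45),
    }
    print(plan_transitions(inventory))
```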

4.  Control Resources

Cleaning up idle resources regularly can save a lot of money.

5.  Make Good Use of Tags

Users can use tags for grouping. This helps users determine the resource usage by teams to classify costs and analyze optimization points.
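Cost classification by tag can be sketched as a simple aggregation. The resource records, tag key, and costs below are made up; real figures would come from the platform's billing export.

```python
from collections import defaultdict


def cost_by_tag(resources: list, tag_key: str = "team") -> dict:
    """Sum monthly cost per tag value; untagged resources are called out."""
    totals = defaultdict(float)
    for res in resources:
        owner = res.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += res["monthly_cost"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical billing records.
    resources = [
        {"id": "i-001", "monthly_cost": 120.0, "tags": {"team": "ads"}},
        {"id": "i-002", "monthly_cost": 80.0, "tags": {"team": "ads"}},
        {"id": "i-003", "monthly_cost": 50.0, "tags": {"team": "data"}},
        {"id": "disk-9", "monthly_cost": 10.0, "tags": {}},
    ]
    print(cost_by_tag(resources))
```

The "untagged" bucket is often a useful starting point when looking for idle resources to clean up.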


Finally, let's introduce SpotMax, which enhances the usage of Spot instances in scaling groups. It also provides a solution for cases where Spot instances are reclaimed or Spot resources are insufficient.

The basic feature of this tool is to replenish resources before Spot instances are reclaimed, adding resources to the scaling group to avoid capacity loss. For example, users of preemptible instances are notified of the reclamation time in advance, which gives them enough time to replace the affected resources with new ones. In a more complex scenario, users who want to add resources may not be able to get Spot instances. In that case, they should add an on-demand machine and replace it once Spot instances become available.

The same applies when a Spot scale-out fails. If the scale-out fails, on-demand instances are added. After that, the system continues to poll, and when Spot instances are obtained, they replace the on-demand instances. This ensures the stability of the service and minimizes the cost.
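The fallback-and-replace idea can be illustrated with a toy simulation. This is not SpotMax's implementation: the scaling group is a plain list, and `spot_available` is a made-up stand-in for whatever capacity the Spot market offers at that moment.

```python
def scale_out(group: list, count: int, spot_available: int) -> list:
    """Add `count` machines, preferring Spot and falling back to on-demand."""
    result = list(group)
    for _ in range(count):
        if spot_available > 0:
            result.append("spot")
            spot_available -= 1
        else:
            # Spot request failed: keep capacity with an on-demand machine.
            result.append("on-demand")
    return result


def rebalance(group: list, spot_available: int) -> list:
    """Polling step: swap on-demand machines for Spot once Spot is obtainable."""
    result = []
    for kind in group:
        if kind == "on-demand" and spot_available > 0:
            result.append("spot")
            spot_available -= 1
        else:
            result.append(kind)
    return result


if __name__ == "__main__":
    group = scale_out([], count=3, spot_available=1)
    print(group)
    print(rebalance(group, spot_available=2))
```

The group size never drops during either step, which is the property that keeps the service stable while cost is driven down over time.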

Based on this scheme, we have developed SpotMax. It helps to cut costs by up to 90%.

Spot instances currently provide support for Mobvista's global advertising business and produce good results. Mobvista's advertising platform ranks first in China and is among the top 10 in the world, covering over 200 countries. The platform deals with over 100 billion daily requests, and some models have ten billion features, but the average response time is kept within 50 milliseconds.
