Best Practices for Cloud Operation and Maintenance of Advertising Business

01 Business Background Introduction

We are a programmatic advertising aggregation SaaS provider. Our self-developed ad aggregation management system integrates traffic management and UR tools to monetize in-app advertising and increase revenue. Through a one-stop ad aggregation platform, we provide professional multi-dimensional data reports, aggregation management, flexible operating methods, and a diverse set of optimization tools.

The figure above shows how our service resources changed as the business grew. In the early days, one server and one database supported the business. Today we run more than 200 servers plus a range of cloud resources such as databases, middleware, and load balancers. At peak business hours the number of ECS instances reached 400, and at one point exceeded 500. As the business grew rapidly, traffic surged as well, posing a major challenge for both engineering and operations.

02 Three Major Technical Pain Points in Business Development

Our early setup had little flexibility for handling changes in peak traffic: the only way to cope with fluctuations was to provision servers in advance. On special days, such as the 618 and Double 11 promotions, traffic grows explosively, so engineers had to prepare enough servers ahead of time to absorb the surge and keep the business running smoothly.

The early deployment method was also fairly traditional: we ran the project on a designated server, wrote the corresponding startup scripts, generated an ECS image, and then created ECS instances in batches to keep the service running.

The early technical architecture was simple, with only three services in total handling our online business. Requests entered service A through load balancing, and A called B and C through service discovery. Together, these three services support a large part of our current business. In this architecture all servers were fixed, so server resources could not be adjusted in real time as traffic changed, which led to significant redundancy and waste.

This left us with three pain points: redundant server resources, difficulty handling sudden traffic spikes, and time-consuming image generation (typically tens of minutes or even hours), which made it hard to respond to business changes and releases in real time.

03 Our Solution

Our current solution has three parts: we added an elastic scaling group, we publish by having ECS instances automatically synchronize the latest services, and we use Jenkins + K8s for one-click deployment. A release is deployed to one server; the other servers periodically synchronize the latest services and restart them. This avoids regenerating ECS images and makes publishing more flexible.
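
The periodic synchronization described above could look roughly like the following sketch. It assumes each server can read a published version marker and has some way to pull a release and restart the service. `sync_once` and the three callables passed to it are hypothetical names chosen for illustration; they are not an Alibaba Cloud API.

```python
from typing import Callable


def sync_once(
    local_version: str,
    fetch_published_version: Callable[[], str],
    pull_release: Callable[[str], None],
    restart_service: Callable[[], None],
) -> str:
    """Run one sync cycle and return the version this server now runs.

    All three callables are hypothetical hooks: how a version is
    published and how a release is copied depends on the environment.
    """
    published = fetch_published_version()
    if published == local_version:
        return local_version  # already current: skip the restart
    pull_release(published)   # fetch the newly published build
    restart_service()         # restart so the new build takes effect
    return published


if __name__ == "__main__":
    log = []
    running = sync_once(
        "v1",
        fetch_published_version=lambda: "v2",
        pull_release=lambda version: log.append(f"pull {version}"),
        restart_service=lambda: log.append("restart"),
    )
    print(running, log)
```

A scheduler such as cron would call this on each server at a fixed interval. Because an unchanged version returns early, servers are not restarted needlessly.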

The figure above shows one of our technical architectures. Compared with the earlier architecture, it adds an elastic scaling group. The scaling group avoids significant redundancy and waste of server resources and also responds better to sudden increases in traffic. Server costs can be reduced further by running a portion of the group on preemptible instances.

This is another of our business architectures, based on K8s. It is a popular, general-purpose design divided into four layers: data, service, gateway, and presentation.

1. Data layer: may include a MySQL database, Redis cache, OSS object storage, and so on.

2. Service layer: built around K8s, including our own Kubernetes, which consumes some resources only when publishing and almost none otherwise.

3. Gateway layer: mainly exposes services externally through load balancing. We are currently trialing a service mesh for the business and are still in the testing stage. If testing proves stable, we may switch all load balancing over to the service mesh.

4. Presentation layer: includes PC, mobile, mini programs, and so on.

This architecture also uses Alibaba Cloud platform services, such as the service registry and logging, which are interspersed throughout the entire business process.

The following sections cover common problems we encounter in practice and how we solve them.

1. How to cope with the surge in traffic

Many businesses run into this problem in operation. Traffic surges fall into two types. The first is predictable: for example, user volume rises during the morning or evening peak. This kind of surge can be handled with a scheduled scale-out policy. The second is unpredictable. It is generally handled by defining custom monitoring metrics through the API or SDK and calling Alibaba Cloud's interfaces to scale out early, or scale in, as those metrics fluctuate. In practice, this means writing scripts tailored to your own business on top of Alibaba Cloud's SDK and API.
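
As a minimal sketch of how the two cases can be combined in such a script, the code below computes a target instance count from a schedule (predictable peaks) and from a custom metric (unpredictable peaks) and takes the larger of the two. The peak windows, instance counts, the requests-per-second metric, and all function names are assumptions for illustration. The call that would pass the result to the cloud provider's scaling interface is deliberately left out.

```python
import math
from datetime import time

# Hypothetical values for illustration only.
BASELINE = 200  # instances kept outside peak windows
PEAK_WINDOWS = [
    (time(7, 0), time(10, 0), 300),   # morning peak
    (time(18, 0), time(22, 0), 400),  # evening peak
]


def scheduled_capacity(now: time) -> int:
    """Capacity planned in advance for predictable peaks."""
    for start, end, count in PEAK_WINDOWS:
        if start <= now < end:
            return count
    return BASELINE


def metric_capacity(requests_per_second: float, rps_per_instance: float) -> int:
    """Capacity implied by a custom metric, here requests per second."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return math.ceil(requests_per_second / rps_per_instance)


def desired_capacity(now: time, requests_per_second: float,
                     rps_per_instance: float) -> int:
    """Take the larger of the scheduled plan and the metric-driven need."""
    return max(
        scheduled_capacity(now),
        metric_capacity(requests_per_second, rps_per_instance),
    )
```

Taking the maximum means an unexpected spike during a quiet period still triggers a scale-out, while a quiet metric during a planned peak never pulls capacity below the schedule.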

2. How to select monitoring indicators

Metrics that can drive scaling include average load, CPU usage, memory usage, network traffic, and disk I/O; choose among them based on the application type. Java applications, for example, are relatively memory intensive: when memory usage reaches 70%-80%, CPU usage may still be very low. If we monitored only CPU usage, it could stay within the normal range while memory was already at 90% or higher, so CPU would be the wrong metric to watch. Monitoring metrics should be chosen flexibly for each application.
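
The Java example can be sketched as a small lookup that picks the metric to watch per application type. The mapping, the 80% threshold, and the function name are hypothetical and would need to be tuned for a real workload.

```python
from typing import Dict

# Hypothetical mapping from application type to the metric that
# tends to saturate first for that kind of workload.
PRIMARY_METRIC: Dict[str, str] = {
    "java": "memory",
    "compute": "cpu",
    "proxy": "network",
}


def should_scale_out(app_type: str, usage_percent: Dict[str, float],
                     threshold: float = 80.0) -> bool:
    """Check the metric that matters for this application type."""
    metric = PRIMARY_METRIC.get(app_type, "cpu")  # default to CPU
    return usage_percent[metric] >= threshold


if __name__ == "__main__":
    usage = {"cpu": 25.0, "memory": 90.0, "network": 10.0}
    print(should_scale_out("java", usage))     # decision based on memory
    print(should_scale_out("compute", usage))  # decision based on CPU
```

With the sample readings, the Java service is flagged for scale-out because memory is at 90%, even though a CPU-only rule would see 25% and do nothing.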

3. How to select threshold values for scaling out and scaling in

This refers mainly to the threshold values that trigger scaling out and scaling in for an Alibaba Cloud elastic scaling group. Based on our practice, we do not recommend using the same value for both, such as scaling out when CPU usage is above 50% and scaling in when it is below 50%. Both scale-out and scale-in have a cooldown time, so if CPU usage fluctuates around 50%, the group may keep triggering opposing actions and never reach its scale-out or scale-in goal.
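
A toy simulation makes the effect easier to see. It is a simplified model, not Alibaba Cloud's actual scaling logic: each action changes the group by one instance and then blocks further actions for `cooldown` samples. The 70%/30% pair is a hypothetical example of separated thresholds.

```python
from typing import List, Sequence


def simulate(cpu_samples: Sequence[float], scale_out_above: float,
             scale_in_below: float, cooldown: int, start: int = 10) -> List[int]:
    """Return the instance count after each CPU sample."""
    count = start
    wait = 0  # samples left before another action is allowed
    history = []
    for cpu in cpu_samples:
        if wait > 0:
            wait -= 1            # still cooling down: ignore this sample
        elif cpu > scale_out_above:
            count += 1
            wait = cooldown
        elif cpu < scale_in_below:
            count -= 1
            wait = cooldown
        history.append(count)
    return history


def count_actions(history: Sequence[int], start: int = 10) -> int:
    """Count how many times the instance count changed."""
    actions, previous = 0, start
    for count in history:
        if count != previous:
            actions += 1
        previous = count
    return actions


if __name__ == "__main__":
    samples = [52, 48] * 5  # CPU hovering around 50%
    equal = simulate(samples, 50, 50, cooldown=2)
    separated = simulate(samples, 70, 30, cooldown=2)
    print(equal, count_actions(equal))
    print(separated, count_actions(separated))
```

In this model, equal thresholds make the group alternate between scaling out and scaling in and end where it started, while separated thresholds leave it stable under the same load.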

04 Summary of Four Practical Lessons

We have summarized four practical lessons:

First, do not hand over all ECS instances to the elastic scaling group. Because a scaling group is quite flexible, scaling rules that are not strict enough can cause the number of ECS instances to grow out of control or even drop to zero, which affects the business.

Second, set an appropriate upper limit on instances in the elastic scaling group. This is related to the first point. If no upper limit is set, or the limit is unreasonable, then runaway scaling, application exceptions, or continuously rising monitoring metrics can drive the server count to abnormal levels and inflate costs.
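
The first two lessons amount to keeping a fixed baseline outside the scaling group and clamping whatever the scaling rules request. A minimal sketch, with hypothetical function names and numbers:

```python
def clamp_group_size(requested: int, min_size: int, max_size: int) -> int:
    """Keep the scaling group's size within configured bounds."""
    if min_size < 0 or max_size < min_size:
        raise ValueError("require 0 <= min_size <= max_size")
    return max(min_size, min(requested, max_size))


def total_servers(fixed_servers: int, requested_group_size: int,
                  min_size: int, max_size: int) -> int:
    """Fixed servers stay up regardless of what the group does."""
    return fixed_servers + clamp_group_size(requested_group_size, min_size, max_size)


if __name__ == "__main__":
    # A faulty rule shrinks the group to zero: the fixed servers remain.
    print(total_servers(100, 0, min_size=0, max_size=300))
    # A runaway metric requests 5000 instances: the upper limit caps it.
    print(total_servers(100, 5000, min_size=0, max_size=300))
```

Even in the worst case, the business keeps its fixed servers and cost is bounded by the fixed servers plus the group's upper limit.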

Third, deploy an appropriate proportion of preemptible instances. During promotions, preemptible instances can be priced as low as 10% of the regular price. If the business architecture is suitable, allocating a certain proportion of preemptible instances can effectively reduce costs.
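
To get a rough sense of the saving, the sketch below computes the blended hourly cost of a fleet. The 30% ratio and the unit price are hypothetical, and the 10% price factor is the best case mentioned above. It ignores the risk that preemptible instances are reclaimed, which is why only part of the fleet should use them.

```python
def blended_hourly_cost(total_instances: int, preemptible_ratio: float,
                        regular_price: float,
                        preemptible_price_factor: float) -> float:
    """Hourly cost of a fleet that mixes regular and preemptible instances."""
    if not 0.0 <= preemptible_ratio <= 1.0:
        raise ValueError("preemptible_ratio must be between 0 and 1")
    preemptible = round(total_instances * preemptible_ratio)
    regular = total_instances - preemptible
    return (regular * regular_price
            + preemptible * regular_price * preemptible_price_factor)


if __name__ == "__main__":
    all_regular = blended_hourly_cost(100, 0.0, 1.0, 0.1)
    mixed = blended_hourly_cost(100, 0.3, 1.0, 0.1)
    print(all_regular, mixed)
```

With 100 instances at a unit price of 1.0, moving 30 of them to preemptible instances at 10% of the price gives 70 + 3 = 73 units per hour, a saving of about 27%.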

Fourth, make good use of cloud-based operations automation services. Alibaba Cloud provides many useful tools, such as ECS Cloud Assistant, which can apply vulnerability fixes or software upgrades to servers in batches.
