Building an operation and maintenance system under a cloud architecture

Why use the cloud?

Let's start with a relatively simple cloud architecture. Compared with a traditional IDC, a cloud architecture has five characteristics:

1. Higher development requirements. When migrating to the cloud, you need to consider single-machine capacity, request handling capacity, load capacity, and forwarding capacity.

2. The number of servers changes. Cloud resources scale elastically and are no longer fixed, which is a real challenge for operations and maintenance (O&M) staff.

3. High resource utilization. The defining feature of a cloud platform is its elasticity: it provides resources according to user demand, which improves overall resource utilization.

4. Unlimited business growth. Once on the cloud, you can adjust your usage at any time, so neither current nor future business is constrained by resources.

5. High security. Public clouds such as Alibaba Cloud ship with a built-in security solution that you can use for free as long as you use the cloud. It reports when a DDoS attack was triggered, how it was handled, and the result, so security improves accordingly.

How to efficiently use the cloud to serve enterprises?

Since moving to the cloud suits an enterprise's direction better than an IDC does, how can the cloud be used efficiently to serve the enterprise?

The most common idea of using the cloud

Every business has its own particulars, so let's start with the most common approach to using the cloud.

Configure the VPC first. Network configuration usually comes down to choosing the VPC and its subnets.

On a public cloud, a subnet is tied to an availability zone: one subnet corresponds to one zone. Which gateways each subnet must be associated with should also be designed up front, in a unified way.

The point most easily overlooked is IP reservation. If too few addresses are reserved when the subnet is designed, new machines cannot be launched when resources are expanded later.
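As a rough check of the IP-reservation point, the sketch below uses Python's standard ipaddress module to compare a subnet's capacity with a planned peak instance count. The CIDR blocks, the peak count, and the number of addresses the platform reserves per subnet are assumptions for illustration; the real reserved count depends on the provider.

```python
import ipaddress


def usable_ips(cidr: str, reserved_by_platform: int = 4) -> int:
    """Addresses left for instances after the platform's reserved ones.

    reserved_by_platform is an assumed value; providers differ.
    """
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - reserved_by_platform, 0)


def has_headroom(cidr: str, planned_peak: int, reserved_by_platform: int = 4) -> bool:
    """True if the subnet can still hold the planned peak number of machines."""
    return usable_ips(cidr, reserved_by_platform) >= planned_peak


if __name__ == "__main__":
    # Hypothetical plan: up to 300 machines in one availability zone.
    for cidr in ("10.0.1.0/24", "10.0.0.0/22"):
        print(cidr, usable_ips(cidr), has_headroom(cidr, planned_peak=300))
```

Under these assumptions a /24 is too small for a 300-machine peak, while a /22 leaves room to grow.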

Select compute resources. Decide whether the plan needs elasticity and scalability.

Public cloud platforms offer several billing methods for instances. Alibaba Cloud, for example, has yearly, monthly, pay-as-you-go, and preemptible billing. Only a mix of these in suitable proportions achieves both high resource utilization and low cost. When planning compute resources, also plan load balancing; cloud platforms generally provide a load balancer, such as SLB or ELB.

Be aware of storage resources. Used well, storage may save more money than compute.

Permission management. On a public cloud, permission management has two parts:

The first is user permissions: a user applies for permission to operate cloud resources.

The second is resource permissions through RAM. That a user can control resources is easy to understand. Fewer people realize that one resource can also be authorized to operate another resource, yet this is very necessary.

For example, suppose an ECS instance needs to operate on RDS or OSS. Instead of exposing an access key, you grant the permission to the ECS instance itself, authorizing it from the console for a specific OSS bucket or other resource. Many people ignore this step or are unwilling to set it up, but it is comparatively safer.
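To make the ECS-to-OSS case concrete, here is a minimal sketch of a read-only policy document that could be attached to a role assumed by an instance. The layout follows my understanding of Alibaba Cloud RAM policy syntax and should be checked against the RAM documentation before use; the bucket name and the helper function are hypothetical.

```python
import json


def oss_read_only_policy(bucket: str) -> dict:
    """Build a policy that only allows reading one bucket (assumed syntax)."""
    return {
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["oss:GetObject", "oss:ListObjects"],
                # The bucket itself (for listing) and the objects inside it.
                "Resource": [
                    f"acs:oss:*:*:{bucket}",
                    f"acs:oss:*:*:{bucket}/*",
                ],
            }
        ],
    }


if __name__ == "__main__":
    # "ad-logs" is a made-up bucket name.
    print(json.dumps(oss_read_only_policy("ad-logs"), indent=2))
```

The point of the design is that the instance holds no long-lived access key: the permission lives on the role, and it is scoped to a single bucket.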

Monitoring must be intelligent

Next, let's take a look at intelligent monitoring. Monitoring is mainly divided into three layers: resource, business, and application.

First, the resource layer. CPU status, memory, disk, IO, and network traffic are the underlying resources of the cloud platform. Monitoring their status keeps the resources running stably.

Second, the business layer. These are metrics deep inside the business. An advertising business, for example, monitors recall volume, log status, and ad response latency so that business exceptions are caught and handled in time.

Third, the application layer. Here you only look at the state of the application itself, such as port checks and service liveness, to confirm that the application serves normally when resources are normal.

Once monitoring is in place, it also needs a degree of intelligence. Here that mainly means letting different groups of machines be monitored against different groups of items, so that new machines are grouped and classified for monitoring automatically.

Monitoring for a cloud server is built from templates. CPU, memory, and disk IO, which every machine uses, belong to public templates. Proprietary templates are then built on top of the public ones for each kind of business. Under the proprietary templates sit the cloud hosts and their Agents, and each Agent automatically associates itself with the right templates. Every machine in a group is thus monitored with the same templates, which yields a fairly complete system.
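The layering of public and proprietary templates can be sketched as a simple lookup. The group names and template names below are hypothetical; a real monitoring system would hold them in its own configuration.

```python
# Templates every machine gets, regardless of business.
PUBLIC_TEMPLATES = ["cpu", "memory", "disk_io"]

# Proprietary templates per business group (hypothetical names).
BUSINESS_TEMPLATES = {
    "ad-serving": ["recall_volume", "ad_response_latency"],
    "log-pipeline": ["log_backlog"],
}


def templates_for(group: str) -> list[str]:
    """Public templates first, then whatever the machine's group adds."""
    return PUBLIC_TEMPLATES + BUSINESS_TEMPLATES.get(group, [])


if __name__ == "__main__":
    print(templates_for("ad-serving"))
    print(templates_for("unknown-group"))  # falls back to public templates only
```

A new machine only needs to report its group; the template set follows from that, which is what lets newly launched machines be monitored without manual setup.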

Escalating monitoring alarms

Monitoring needs alarms, and alarm handling falls into two categories:

The first is sending messages, at a high or a low level.

The second is alarm escalation. For example, an alarm that has been firing for 5 minutes is low level; after 15 minutes it can be escalated to middle or high level, and the set of people notified can be widened at the same time.

Some services can recover automatically, in which case no message needs to be sent. This requires recovery actions, so the system must provide something like an auto-recovery function.
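The escalation and auto-recovery rules can be sketched as follows. The 5-minute and 15-minute thresholds come from the example in the text; the 30-minute threshold for the high level and the function names are assumptions.

```python
from typing import Callable, Optional


def alert_level(minutes_firing: float) -> Optional[str]:
    """Map how long an alarm has been firing to a severity level."""
    if minutes_firing >= 30:  # assumed threshold, not from the text
        return "high"
    if minutes_firing >= 15:
        return "middle"
    if minutes_firing >= 5:
        return "low"
    return None


def handle_alert(minutes_firing: float,
                 recover: Optional[Callable[[], bool]] = None) -> str:
    """Try the recovery action first; notify only if it is absent or fails."""
    if recover is not None and recover():
        return "recovered"
    level = alert_level(minutes_firing)
    return f"notify:{level}" if level else "wait"


if __name__ == "__main__":
    print(handle_alert(20, recover=lambda: True))   # service restarted itself
    print(handle_alert(20, recover=lambda: False))  # recovery failed, escalate
    print(handle_alert(2))                          # too early to notify
```

Putting recovery ahead of notification keeps self-healing services from generating messages at all.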

Tools, Release and Deployment of O&M Automation
Automation is far more necessary on the cloud than in an IDC, because cloud resources are more complex, with more machines and more instance types. How should O&M automation be managed? O&M staff can either develop tooling themselves or use the automation tools the cloud platform provides.

Three O&M automation tools

The cloud platform provides automation tools, including SDKs in various languages and a CLI. The CLI is a command-line tool. O&M staff are usually more at home on the command line than most developers, so they can use the CLI to achieve partial automation.

The cloud platform also provides instance metadata through a dedicated IP endpoint. When you call this endpoint from a machine, it returns information about that machine, including the instance ID, public IP, and network interface details, which you can use in your own automation.

With these tools in hand, automated management becomes possible.
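A minimal sketch of reading instance metadata from inside a machine is shown below. The address and key names are those I understand Alibaba Cloud ECS to use and should be verified against the platform documentation; other clouds use a different address. The function names are my own.

```python
import urllib.request
from typing import Callable, Dict

# Assumed metadata endpoint for Alibaba Cloud ECS; only reachable from an instance.
METADATA_BASE = "http://100.100.100.200/latest/meta-data/"


def http_get(url: str, timeout: float = 2.0) -> str:
    """Fetch a URL and return its body as stripped text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def instance_info(fetch: Callable[[str], str] = http_get) -> Dict[str, str]:
    """Collect a few metadata values; fetch is injectable for testing."""
    keys = ["instance-id", "public-ipv4", "mac"]
    return {key: fetch(METADATA_BASE + key) for key in keys}


if __name__ == "__main__":
    # Off-cloud demonstration with a stand-in fetcher instead of a real call.
    fake = lambda url: "value-of-" + url.rsplit("/", 1)[-1]
    print(instance_info(fake))
```

On a real instance, calling instance_info() with no argument would query the endpoint directly; a startup script could then use the result to register the machine or pick its configuration.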

The most common needs are batch execution and grouping. The CMDB is the core here: all resources must be managed in it. Two tips for building a CMDB:

The cloud platform's SDK usually exposes resource information, so you can query instance details directly and write them into the CMDB. There are many ways to do this, and plenty of open-source implementations on GitHub that you can adapt.

Prefer Go for CMDB development. Go's concurrency support is comparatively much better; with Python, concurrency is comparatively poor and calls to the CMDB interface wait longer. With a CMDB in place, tools such as Ansible or Salt can take their sorting and grouping from it when operating on machines.
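To illustrate how CMDB records can drive grouping for a batch tool, the sketch below turns a list of records into the JSON structure Ansible accepts from a dynamic inventory script. It is written in Python for brevity, even though the text recommends Go for the CMDB service itself; the record fields and sample values are hypothetical.

```python
import json
from collections import defaultdict


def build_inventory(records):
    """Group CMDB records by business group into an Ansible-style inventory."""
    groups = defaultdict(lambda: {"hosts": []})
    hostvars = {}
    for rec in records:
        groups[rec["group"]]["hosts"].append(rec["ip"])
        hostvars[rec["ip"]] = {"instance_id": rec["instance_id"]}
    inventory = dict(groups)
    # _meta.hostvars lets Ansible read per-host variables in one pass.
    inventory["_meta"] = {"hostvars": hostvars}
    return inventory


if __name__ == "__main__":
    # Made-up records standing in for rows queried from the CMDB.
    sample = [
        {"instance_id": "i-1", "ip": "10.0.0.1", "group": "ad-serving"},
        {"instance_id": "i-2", "ip": "10.0.0.2", "group": "ad-serving"},
        {"instance_id": "i-3", "ip": "10.0.1.1", "group": "log-pipeline"},
    ]
    print(json.dumps(build_inventory(sample), indent=2))
```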

Operation and maintenance automation - release

With the CMDB built and the batch tool ready, the next routine task is releasing versions. Every O&M engineer faces the release problem, and on the cloud it gets interesting.

For example, suppose a machine happens to be scaling up just as a version is being released. Does the newly launched machine get the new version or the old one?

Here are a few tips for handling this.

Releases are under version control, typically GitLab or SVN. When releasing, first upload a copy of the new version to a fixed path, which can be OSS or another managed store. That is the first copy. Then use the batch tool to push the version to all machines currently alive online.

If auto scaling is then triggered and machines are added, each new machine pulls the latest code from the fixed path at startup. Either way, whether through scaling or an active release, every machine ends up on the latest version.
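The ordering matters here: the fixed path is updated before the live machines are. The in-memory sketch below models that sequence; the class and function names are hypothetical, and a real setup would replace ReleaseStore with OSS and the loop with the batch tool.

```python
class ReleaseStore:
    """Stands in for the fixed path (for example an object in OSS)."""

    def __init__(self):
        self.latest = None

    def publish(self, version):
        self.latest = version


class Machine:
    def __init__(self, name):
        self.name = name
        self.version = None

    def boot(self, store):
        # A newly scaled-out machine pulls whatever the fixed path holds.
        self.version = store.latest


def release(version, store, live_machines):
    store.publish(version)          # step 1: update the fixed path first
    for machine in live_machines:   # step 2: push to machines already online
        machine.version = version


if __name__ == "__main__":
    store = ReleaseStore()
    fleet = [Machine("a"), Machine("b")]
    release("v2", store, fleet)
    newcomer = Machine("c")
    newcomer.boot(store)            # scale-out during or after the release
    print([m.version for m in fleet + [newcomer]])
```

Because step 1 completes before step 2 starts, a machine that boots in the middle of a release still picks up the new version.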

Now automated monitoring, which here means automatic registration and automatic deregistration. Because cloud resources come and go, machines may be created at any time during scale-out. Adding them to monitoring by hand does not work: the delay is unpredictable and nobody can do it promptly at night. Monitoring must therefore handle this automatically.

Automatic registration: With a commonly used tool such as Zabbix, the agent must register itself when it starts. Configure it with metadata so that it reports a characteristic value to the Zabbix server; the server classifies on that metadata and picks the matching templates, so different instances are registered under different templates and grouped automatically.

Deregistration: The hardest part is not registration but deregistration. Make good use of the cloud platform's own monitoring service: every operation on an instance or resource produces an event, so you can capture the event, analyze its content, and then perform the deregistration on the server side.
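Event-driven deregistration can be sketched as a small handler. The event format and event type names below are invented for illustration and do not correspond to any particular platform's schema; the registry stands in for the monitoring server's host list.

```python
import json

# Hypothetical event types that mean the instance is gone for good.
TERMINAL_EVENTS = {"instance_released", "spot_reclaimed"}


def handle_event(raw_event: str, registered: set) -> bool:
    """Remove the instance from the registry if the event says it is gone.

    Returns True when a deregistration actually happened.
    """
    event = json.loads(raw_event)
    instance_id = event.get("instance_id")
    if event.get("type") in TERMINAL_EVENTS and instance_id in registered:
        registered.discard(instance_id)
        return True
    return False


if __name__ == "__main__":
    registry = {"i-1", "i-2"}
    handle_event('{"type": "instance_released", "instance_id": "i-1"}', registry)
    handle_event('{"type": "instance_rebooted", "instance_id": "i-2"}', registry)
    print(sorted(registry))
```

Only terminal events trigger removal, so a reboot or a resize does not drop a machine from monitoring.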

O&M automated deployment

Building an advertising platform is cumbersome, especially when it is deployed in multiple regions: it might go to Singapore, the United States, or Frankfurt. For automated deployment, make good use of the orchestration tools the public cloud provides. Alibaba Cloud's is called ROS, and AWS's is called CloudFormation.

An orchestration tool has its own template language. A template is a YAML file that declares all the resources you use. Environment setup, images, and network creation can all be preset in it, which gives you one-click deployment of the basic environment.

Five Directions for Cost Optimization on the Cloud

We use the cloud because of its cost, but if the cloud is not used well, the cost will be higher than IDC.

In terms of cost savings, there are five directions to share with you:

1. Preemptible instances, the option with the largest savings. Preemptible instances are a kind of on-demand instance offered at a discount to regular pay-as-you-go instances. Making full use of them can save up to 90% of compute costs.

2. Elastic scaling. Only once auto scaling is in use can a business really be said to be on the cloud. With auto scaling, costs can be reduced by at least 30%.

3. Make full use of OSS. Not everything needs to be stored on compute nodes. OSS has several billing tiers: standard, infrequent access, archive, and cold storage. A script or tool that checks how often data is accessed and moves it to the matching tier can greatly reduce storage costs.

4. Control resources. Regularly cleaning up idle resources also saves a considerable amount.

5. Tags. Use tags to group resources by team, so that you can see each team's usage, classify costs, and find optimization points.
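The storage-tier idea can be sketched as a rule that maps how recently an object was accessed to a tier. The day thresholds and tier labels here are assumptions chosen for illustration; real cut-offs depend on access patterns and on each tier's minimum storage duration and retrieval fees.

```python
def suggest_storage_tier(days_since_last_access: int) -> str:
    """Pick a storage tier from the age of the last access (assumed cut-offs)."""
    if days_since_last_access < 30:
        return "standard"
    if days_since_last_access < 90:
        return "infrequent-access"
    if days_since_last_access < 365:
        return "archive"
    return "cold-storage"


if __name__ == "__main__":
    # Hypothetical objects with the number of days since they were last read.
    objects = {"today.log": 1, "last-quarter.csv": 75, "2019-backup.tar": 900}
    for name, age in objects.items():
        print(name, "->", suggest_storage_tier(age))
```

A scheduled job could run this over an access log and then apply the suggested tier through the storage service's own lifecycle or conversion mechanism.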

Finally, I would like to introduce SpotMax, a tool we developed that extends the use of Spot instances in the manner of a scaling group. It addresses questions such as what to do when a Spot instance is reclaimed or when Spot capacity is unavailable.

The most basic function: preemptible instances notify you of their termination time in advance, which gives you a window to add new resources to replace the old ones. SpotMax replenishes resources ahead of the reclamation and adds them to the scaling group, so no capacity is lost. Going a step further, a replacement Spot instance may not be obtainable at that moment. In that case it adds an on-demand machine, keeps probing, and replaces the on-demand instance once Spot capacity can be obtained.

The same applies to failed capacity expansion. Expansion scales up with Spot instances; if that fails, on-demand instances are added instead. Polling then continues, and the on-demand instances are replaced when Spot becomes available. This keeps the business stable while keeping costs as low as possible.
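The fallback-and-replace loop described above can be modeled with a stand-in cloud client. This is an illustrative sketch of the general technique, not SpotMax's actual implementation; FakeCloud and both functions are hypothetical.

```python
class FakeCloud:
    """Stand-in for a cloud API; Spot launches fail when capacity is out."""

    def __init__(self, spot_available: bool):
        self.spot_available = spot_available

    def launch(self, kind: str):
        if kind == "spot" and not self.spot_available:
            return None
        return {"kind": kind}


def add_capacity(cloud):
    """Prefer Spot; fall back to on-demand so capacity is never missing."""
    return cloud.launch("spot") or cloud.launch("on-demand")


def swap_back(cloud, pool):
    """One polling pass: replace on-demand instances with Spot where possible."""
    result = []
    for instance in pool:
        if instance["kind"] == "on-demand":
            spot = cloud.launch("spot")
            result.append(spot if spot else instance)
        else:
            result.append(instance)
    return result


if __name__ == "__main__":
    cloud = FakeCloud(spot_available=False)
    pool = [add_capacity(cloud)]      # Spot is out, so this is on-demand
    cloud.spot_available = True       # Spot capacity returns later
    pool = swap_back(cloud, pool)
    print(pool)
```

In a real system swap_back would run on a timer, and the on-demand instance would be drained only after its Spot replacement is serving traffic.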

We developed SpotMax ourselves on the basis of this approach. With SpotMax, savings of up to 90% are possible.

Spot instances currently support Mobvista's global advertising business, with very good results: the advertising platform ranks first in China and in the top ten worldwide, covers traffic from more than 200 countries, and handles about 100 billion ad requests per day. Some models have more than 10 billion features, yet the average ad response time stays essentially below 50 milliseconds.
