Enterprise Cloud Monitoring: The Basic Monitoring Solution for Enterprises Going to the Cloud

After more than a decade of development, the cloud has become central to new digital experiences. More and more customers are moving to the cloud, and the customer base has evolved from the small and medium-sized webmasters of the early days to today's true enterprise customers, covering almost all industries, both emerging and traditional.
In real enterprise scenarios, customers often have massive resources spread across multi-level product forms such as IaaS, PaaS, and SaaS. Multi-cloud, multi-account, and hybrid cloud have become the norm.
Monitoring Challenges for Enterprise Customers After Going to the Cloud
In this context, managing and controlling resources effectively on the cloud poses huge challenges for enterprise customers:
The first is scale and complexity.

Some customers have thousands of cloud resources, and some use hundreds of products. How are these massive resources running, what is their water level (that is, their utilization level), and how stable are they? Is there serious waste or a serious shortage?

For users with a single account, how can effective isolation be achieved and security improved while still providing convenience?

For users with multiple accounts, how can monitoring management be effectively unified and the efficiency of operation and maintenance improved?

The second is how to integrate complex deployment forms such as multi-cloud and hybrid cloud.

Most of today's enterprise customers use a combination of multi-cloud and hybrid cloud. In addition, their monitoring demands cover not only infrastructure monitoring but also application performance monitoring, application availability monitoring, and business operation monitoring. Customers want a simple and efficient way to unify monitoring at these different levels: from infrastructure monitoring, to application availability and application performance monitoring, to business indicator monitoring, how can effective unification be achieved? And when customers go to the cloud, how can the original IT system be connected with the monitoring system on the cloud, and how can multiple clouds work together under efficient, unified monitoring?

These are some of the typical problems faced by customers in the cloud today.
CloudMonitor's Enterprise Cloud Monitoring Solution
In this context, CloudMonitor has carried out large-scale upgrades and restructuring of its product functions over the past few years, focusing on launching many functions specifically for the monitoring needs of enterprise customers. The result is a product function system that combines a basic version and an enterprise version.
Basic version

It includes basic functions such as host monitoring, cloud product monitoring, container monitoring, application grouping, and alarming, and meets the basic monitoring needs of customers.

Enterprise version

At the same time, we have launched a series of functions that meet the monitoring needs of enterprise customers, forming a relatively complete enterprise version of the CloudMonitor product function system.

It includes enterprise monitoring for multi-account, multi-cloud, and hybrid cloud environments; one-stack monitoring that combines infrastructure, application, and business monitoring; a real-time data export service; second-level monitoring for some products; and a resource water level report for optimizing enterprise customers' resources.


The basic version and the enterprise version together cover the monitoring scenarios of customers at different levels. These scenarios and solutions are introduced one by one below.
First, in my personal view, one of the most important scenarios that enterprises face after migrating to the cloud is multi-account. More and more large and medium-sized enterprises choose to move the business of multiple departments or different projects to the cloud, and their resources on the cloud grow rapidly.

An Alibaba Cloud account is a container for resources on the cloud. If a customer deploys the resources of all services on the cloud under a single account, it may face many challenges.

1. For example, a company's internal innovation incubation projects or core businesses usually require confidentiality. If all projects or businesses are deployed under the same account, the requirements for strong resource isolation and strict confidentiality cannot be met.

2. In terms of security management, the security level requirements of test and production environments, and of core and non-core businesses, are often different. Deploying everything in a single-account environment makes security management complicated and inflexible.

3. In terms of risk management and control, if a staff member accidentally exposes the AccessKey (AK) information of the account on the public network, it may lead to the leakage of all business information in the account or even paralyze all business.

Therefore, more and more large and medium-sized enterprises choose multi-account environments to deploy cloud services to meet the demands of strong resource isolation, secure and flexible management, and decentralization of security risks.

In a multi-account environment, the central management team of an enterprise needs to manage the users of each account, assign permissions, and manage the infrastructure resources of each account. The central security team needs to understand the overall security situation of the company at a glance and exercise centralized security control. How can the central team avoid frequently logging in and out of different accounts, how can network connectivity between accounts be achieved, and how can management efficiency be improved in a multi-account environment?

At this point, we want to manage multiple accounts on the cloud, and the resources within them, in an automated and efficient way. Alibaba Cloud's Resource Directory product provides a solution to the problems above.

Resource Directory is a cloud service for centrally managing an enterprise's multiple accounts and the cloud resources within those accounts. Its core capabilities include building an organizational structure, managing and controlling account permissions within the organization, and centrally managing cross-account resources. By quickly creating accounts and managing them in groups and hierarchies, we can easily build an account structure on the cloud that matches how the business is managed.

If the business management structure is complex, you can create a directory structure with up to 5 layers to meet your needs. When business changes, we can flexibly adjust the directory structure to adapt to business changes.

The ability to create accounts in seconds and delete them with one click makes the management of the organizational structure flexible and burden-free.

Resource Directory has currently been integrated with more than ten other Alibaba Cloud services, covering identity and permissions, compliance auditing, security, and operation and maintenance scenarios. The centralized teams of an enterprise, such as the central operation and maintenance, security, and audit teams, can use the cloud services integrated with Resource Directory to manage cross-account resources centrally. This eliminates the need to log in and out of multiple accounts to view and operate resources, meets the overall security and audit compliance requirements of the enterprise, and improves the efficiency of cloud business management.

Resource Directory is a free service; you only pay for the cloud resources activated under your accounts. It makes multi-account organization and resource management safer and more convenient. You are welcome to try it in the Resource Directory console.

For operation and maintenance scenarios, CloudMonitor also integrates with Resource Directory to form a multi-account monitoring solution.
In response to the unified multi-account monitoring needs of an enterprise's central operation and maintenance team, CloudMonitor provides two solutions:
The first is the unification of data across accounts.
It is based on the permission management of Resource Directory and brings the monitoring data of different accounts together, so that the data of all sub-accounts can be viewed under one account, with unified monitoring dashboards and unified alarms.
In this mode, cross-account data aggregation is easy to achieve, and all of the enterprise's resources on the cloud can be understood from a higher level, including the number of resources, the resource water level, and resource stability.
The number of resources, the resource water level, and the stability of different accounts can also be easily compared.
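As a rough illustration of the cross-account aggregation described above, the following Python sketch groups monitoring samples by account and derives the resource count and CPU water level of each account. The sample layout (account ID, resource ID, CPU percentage) and the function name summarize_by_account are assumptions made for this example; they are not part of the CloudMonitor API.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_account(samples):
    """Summarize samples given as (account_id, resource_id, cpu_percent) tuples.

    The tuple layout is hypothetical and only serves this illustration.
    """
    per_account = defaultdict(dict)
    for account_id, resource_id, cpu in samples:
        per_account[account_id].setdefault(resource_id, []).append(cpu)

    summary = {}
    for account_id, resources in per_account.items():
        # Average each resource first so that resources with more samples
        # do not dominate the account-level water level.
        per_resource_avg = [mean(values) for values in resources.values()]
        summary[account_id] = {
            "resource_count": len(resources),
            "avg_cpu": round(mean(per_resource_avg), 2),
            "max_cpu": max(max(values) for values in resources.values()),
        }
    return summary
```

With a summary like this per account, comparing resource quantity and water level between accounts becomes a simple dictionary lookup.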
The second solution is for customers who do not have a strong need to aggregate data: the monitoring and alarming of different accounts are managed separately by different teams under their own accounts, but contacts should not have to be created repeatedly under each account. For this case, CloudMonitor lets you create one contact group under the management account, create a webhook for that contact group, hand the webhook to the other sub-accounts, and configure it as the alarm notification target under those sub-accounts. Alarm notifications from the sub-accounts are then sent through the webhook to the contact group under the management account, which achieves unified monitoring at the alarm level. The alarm information of all sub-accounts can also be seen under the management account.
The disadvantage of this solution is that alarm policies must be configured separately under each account, and neither a unified monitoring dashboard nor cross-account data aggregation is possible.
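The webhook flow above can be sketched in a few lines of Python. This is only a minimal sketch of the idea: the payload fields (accountId, ruleName, state), the ContactGroup class, and the handle_webhook function are all hypothetical names chosen for illustration, and the real CloudMonitor webhook payload may look different.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContactGroup:
    """Contact group under the management account (hypothetical model)."""
    name: str
    contacts: List[str]
    # Stands in for real delivery channels such as SMS or email.
    inbox: List[Dict] = field(default_factory=list)

    def notify(self, alarm: Dict) -> None:
        for contact in self.contacts:
            self.inbox.append({"to": contact, "alarm": alarm})


def handle_webhook(body: str, group: ContactGroup, seen: List[Dict]) -> Dict:
    """Handle one webhook request body sent by a sub-account alarm rule."""
    payload = json.loads(body)
    alarm = {
        "account": payload.get("accountId", "unknown"),  # assumed field name
        "rule": payload.get("ruleName", ""),             # assumed field name
        "state": payload.get("state", "ALERT"),          # assumed field name
    }
    seen.append(alarm)   # alarm history visible under the management account
    group.notify(alarm)  # fan out to the shared contact group
    return alarm
```

Each sub-account keeps its own alarm rules and only shares the notification target, which matches the trade-off described above: unified notification, but no unified dashboard or data aggregation.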
It is worth mentioning that both the Tokyo Olympics and the Beijing Winter Olympics adopted the data unification solution, unifying the data of multiple accounts to provide customers with high-quality monitoring and escort (event assurance) services.

It can be said that the effective integration of CloudMonitor and Resource Directory provides a flexible and easy-to-use solution for enterprise customers' multi-account monitoring and management, and meets customers' needs for unified monitoring of multiple accounts.
Single Account Resource Group
In addition to the multi-account model, some companies choose the single-account model, especially agile Internet companies such as Aunt Qian.

Under the single large account model, customers manage all resources under one account, which is convenient to manage. However, the large number of resources presents enterprises with a huge resource management challenge.

Take a very common example.

As the resource manager for Project X, I need to know:
How many resources does Project X use? How can the resources of Project X be easily viewed and counted?
Who should have access to Project X? Should project members with different responsibilities have the same permissions? For example, should operations and development have different permissions?
These problems are very common, and you may have encountered them too. We need an efficient resource management approach to help address them.
Alibaba Cloud's resource group is such a product. It provides a mechanism for managing resources in groups, which addresses the complexity of resource grouping and authorization management within a single Alibaba Cloud account.

For enterprises that manage by project, resources can be grouped according to the project they belong to, and project members can be granted permissions on the corresponding resource group. Of course, resource groups can be divided flexibly according to the needs of enterprise management: by project, by application and environment, or by department or business.

For the single-account, multi-resource-group mode, CloudMonitor connects resource groups with application groups and can automatically create a corresponding application group for each resource group. Each such application group is accessible only to RAM sub-accounts that have been granted permissions on the resource group. Only these authorized RAM sub-accounts can view the resources in the group, along with their water level and load, and create alarms for them.

In this way, monitoring and management by resource group can be easily achieved.
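To make the mapping concrete, here is a minimal Python sketch of how one application group per resource group, plus a permission filter per sub-account, could be modeled. The record fields (id, resource_group), the grants mapping, and both function names are assumptions for illustration only and do not describe how CloudMonitor implements this internally.

```python
def build_application_groups(resources):
    """Create one application group per resource group.

    Each resource is a dict with hypothetical keys "id" and "resource_group".
    """
    groups = {}
    for res in resources:
        groups.setdefault(res["resource_group"], []).append(res["id"])
    return groups


def visible_groups(app_groups, grants, user):
    """Return only the application groups whose resource group is granted to the user.

    grants maps a RAM sub-account name to the set of resource groups it may access.
    """
    allowed = grants.get(user, set())
    return {name: members for name, members in app_groups.items() if name in allowed}
```

A sub-account that has no grant sees no groups at all, which is the isolation behavior the paragraph above describes.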

At the same time, enterprise cloud monitoring provides a monitoring dashboard from the perspective of resource groups, which gives an aggregated view at the resource group dimension; for example, the resource data, maximum load, and average load of each resource group.

In addition, enterprise cloud monitoring provides a water level evaluation report at the resource group dimension. You can view a load comparison analysis from the perspective of resource groups, as well as a report on the resource quantity and load status of all resource groups from an overall perspective.

The dashboards and water level reports will be discussed further later.

In short, the combination of resource groups and application grouping gives customers the ability to supervise and control resources by group, enabling better monitoring and management of large-scale resources under a single enterprise account.
Mass resource monitoring solution based on application grouping:
This section introduces best practices for large-scale resource monitoring. With a massive number of resources, users often face the problem of how to bring all of them under monitoring quickly. Application grouping in CloudMonitor addresses this need with flexible and powerful resource grouping capabilities. Dynamic groups can be created quickly by resource group, by tag, or by fuzzy matching on resource names, and combining groups with alarm templates quickly covers large-scale resources with monitoring. In addition, a blacklist policy can exclude resources that do not need to be monitored, such as TEST-related resources.
With dynamic grouping, users do not need to maintain the corresponding monitoring rules manually when they add, release, or change resources, which greatly improves the efficiency of monitoring and management. It lets users get started quickly in the early stage of cloud migration and efficiently complete monitoring coverage after the business has moved to the cloud, so that the cloud can be used with more confidence. If an application group is created from a resource group, it also inherits the resource group permissions of the sub-accounts, which supports isolated monitoring and management by group.
CloudMonitor also provides global alarm capabilities, such as one-click alarms and user-dimension alarms, which allow new users to add resource monitoring quickly and with almost no configuration after they go to the cloud.
Customer case (Aunt Qian): dynamic application grouping based on resource groups and tags is recommended, combined with alarm templates and alarm blacklist policies, to achieve large-scale and personalized monitoring.
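The combination of dynamic grouping, blacklist, and alarm template can be sketched as follows. This is an illustrative Python model, not CloudMonitor's actual matching logic: the resource record layout, the wildcard syntax (shell-style patterns via Python's fnmatch module), and the function names select_members and apply_template are all assumptions.

```python
from fnmatch import fnmatchcase


def select_members(resources, tag=None, name_pattern=None, blacklist_patterns=()):
    """Pick group members by tag and/or name pattern, then drop blacklisted ones.

    resources: list of dicts with hypothetical keys "name" and "tags".
    tag: optional (key, value) pair that a member must carry.
    """
    members = []
    for res in resources:
        if tag is not None:
            key, value = tag
            if res.get("tags", {}).get(key) != value:
                continue
        if name_pattern is not None and not fnmatchcase(res["name"], name_pattern):
            continue
        if any(fnmatchcase(res["name"], pattern) for pattern in blacklist_patterns):
            continue  # e.g. TEST resources that should not raise alarms
        members.append(res["name"])
    return members


def apply_template(members, template):
    """Expand an alarm template (a list of rule dicts) over every group member."""
    return [{"resource": member, **rule} for member in members for rule in template]
```

Because membership is recomputed from tags and names, re-running select_members after resources are added or released yields the updated rule set without any manual maintenance.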
Multi-cloud Hybrid Cloud Scenario
Another scenario for enterprise customers going to the cloud is multi-cloud and hybrid cloud. According to a Gartner report, more than 81% of enterprises choose multi-cloud and hybrid cloud to avoid being locked into a single cloud service. From an economic and stability perspective, multi-cloud and hybrid cloud is the preferred choice for enterprises.
In this context, the demand for integrated monitoring across multi-cloud and hybrid cloud environments arises naturally. We have also sorted out the various integration scenarios that customers need.
To sum up, there are two main scenarios, integrating alarms into the cloud and integrating data into the cloud, each with multiple sub-scenarios.
For alarm integration, CloudMonitor provides an alarm webhook, which makes it easy to integrate offline (off-cloud) alarm information into the cloud.
For data integration, CloudMonitor converts offline data into Prometheus metrics through ArgusAgent and integrates it into CloudMonitor for unified display and unified alarming.
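To show what "converting offline data into Prometheus metrics" means in practice, the following Python sketch renders a few readings in the Prometheus text exposition format. It is not ArgusAgent's implementation; the metric name, label names, and the to_exposition function are hypothetical and only illustrate the target format.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_exposition(name, help_text, samples):
    """Render gauge samples as Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs; names are chosen by the caller.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Once offline readings are in this shape, they can be displayed and alerted on with the same tooling as cloud-side metrics, which is the point of the unified approach described above.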
Quite a few customers also want to integrate data or alarms from the cloud into off-cloud systems or third-party monitoring and operation and maintenance systems. CloudMonitor is open in this direction as well, providing an alarm webhook, a data query API, a data export API, and other functions as part of its open capabilities, to meet customer demands and improve the experience. Among second-party systems, both ARMS and SLS provide integration with CloudMonitor. Together these satisfy the different monitoring demands of enterprise customers.
As for integration with third-party cloud vendors, the principle is the same: CloudMonitor supports integrating data from third-party cloud vendors into Alibaba Cloud, and also supports opening its own data to them.

To elaborate a little: the MetricList and MetricLast interfaces originally provided by CloudMonitor are query APIs. They are designed for on-demand queries, do not support high concurrency, and are not well suited to scenarios such as real-time export of the full set of monitoring data.
For this reason, enterprise cloud monitoring provides a real-time data export service with APIs that work like real-time consumption, to better serve integrated monitoring in multi-cloud and hybrid cloud scenarios.

After all, what matters is the customer experience and avoiding data silos across different deployment forms.
Basic + application + business one-stack monitoring
Generally speaking, enterprise customers have another requirement: they need one system that covers multi-level monitoring of infrastructure, application, business, and experience.

In enterprise cloud monitoring, we reintegrated the original SLS log monitoring and custom monitoring into a set of business monitoring functions that includes SLS log monitoring, local log monitoring, local Prometheus monitoring, local custom monitoring, and more.

Combined with the original basic monitoring and application monitoring, this forms a one-stack monitoring capability.

The business monitoring of the new version of CloudMonitor has the following features:
1. Local log monitoring: it does not collect full logs but only collects indicators, which can greatly reduce costs. This function is similar to Sunfire inside Alibaba Group; colleagues who have used it should find it familiar.
2. Enhanced SLS log monitoring: supports aggregating metrics across Logstores and across regions.
3. Prometheus indicators: automatic discovery based on application grouping. Only ArgusAgent is required and no other components need to be installed. It supports custom indicators and, combined with Prometheus exporters, can monitor common middleware such as JVM, Spring, Nginx, and Tomcat, as well as offline middleware such as MySQL and Redis.
4. Enhanced custom monitoring: data is still reported through the CLI or SDK, with support for the Prometheus protocol and PromQL alarms, which is simple and flexible.
5. All indicators support unified display in Grafana.
6. All indicators support unified alarming with PromQL.
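The idea behind local log monitoring in item 1, extracting indicators on the host instead of shipping full logs, can be illustrated with a short Python sketch. The log line layout (ISO timestamp, HTTP status, latency in milliseconds) and the aggregate function are assumptions for this example and do not reflect the agent's real configuration or parsing rules.

```python
import re
from collections import Counter

# Assumed line layout: "<ISO timestamp> <status> <latency_ms> <rest of line>"
LINE_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\S*\s+"
    r"(?P<status>\d{3})\s+(?P<latency_ms>\d+)"
)


def aggregate(lines):
    """Reduce raw log lines to per-minute, per-status counts and average latency."""
    counts = Counter()
    latency_sum = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # lines that do not fit the layout are skipped, not uploaded
        key = (match["minute"], match["status"])
        counts[key] += 1
        latency_sum[key] += int(match["latency_ms"])
    return {
        key: {"count": counts[key], "avg_latency_ms": latency_sum[key] / counts[key]}
        for key in counts
    }
```

Only the small aggregated dictionary would leave the host, which is why this approach costs far less than collecting every log line.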
A best-practice guide on how to monitor JVM, Tomcat, Spring, Nginx, Redis, MySQL, and other components is available for reference.
Another common operation and maintenance monitoring scenario for enterprises is escort (assurance support) for major promotions. For this, enterprise cloud monitoring has launched a data source that supports the Prometheus protocol, as well as hosted Grafana. The data source exposes CloudMonitor directly as a Prometheus data source, which can be connected to offline Grafana and other display systems through the Internet or a VPC network. With this data source as the core, we support the unified collection of cloud product data, business monitoring data, offline data, data from other clouds, and cross-account data. This effectively meets the demand for unified dashboards across multi-cloud, hybrid cloud, and multi-account environments.
We have also added rich dimensions to the cloud product monitoring data of this data source, improved data accuracy, and added second-level monitoring data for some products. In addition, we provide a large number of preset system dashboards, such as cross-cluster container monitoring dashboards, cross-region and cross-account RDS monitoring dashboards, and EIP and ECS dashboards that aggregate public network traffic. An escort dashboard can also be generated from a business perspective with one click by specifying business tags, which greatly reduces the effort for customers and TAM escort staff to configure escort dashboards. So far, nearly a thousand enterprise customers in different industries have used this function.
Resource optimization needs
As mentioned repeatedly, today's enterprise customers usually have resources at large scale. An important feature of the cloud is its extreme elasticity, which makes it very convenient for customers to obtain resources, including massive amounts of them. For example, when DingTalk was used for live streaming during the epidemic, its resource scale soon reached 100,000. Not surprisingly, a lot of resources end up wasted or idle. Against the background of the epidemic, a large number of customers have been cutting costs and improving efficiency. Over the past two years, customers have paid more attention to cost and efficiency and want comprehensive information about the resources they own: which resources are idle, which are tight, how their resource water level compares with that of peers in the same industry, and what the approximate water level in the industry is, so that they can strike a better balance between cost and efficiency.
In cooperation with the algorithm team, CloudMonitor launched the enterprise resource water level report function last year. It statistically analyzes the water level of each of the customer's resources, including the maximum, minimum, average, and P90/P95/P99 values, and provides statistics along multiple dimensions, such as by product, by business, and by resource group. These statistics help customers fully understand the operating water level of different resources and different businesses. Through intelligent algorithms, the report also produces rich analyses such as multi-dimensional resource water level radar charts and industry comparisons, giving customers a full picture of the resources under their accounts. Currently, reports can be generated on a daily, weekly, or monthly basis, for example one report every week.
Historical reports are saved, forming cross-year trends of resource quantity, resource water level, alarm volume, and so on, so customers get a more comprehensive resource analysis. To generate a report, users can open the resource water level report page under "Enterprise Cloud Monitoring" in the new version of the CloudMonitor console.
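The water level statistics mentioned above (maximum, minimum, average, and P90/P95/P99) can be computed as in the following Python sketch. It uses the nearest-rank definition of a percentile, which is an assumption on our part; the report may use a different percentile method, and the function names are hypothetical.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, for 0 < p <= 100."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def water_level(values):
    """Summarize the utilization samples of one resource (non-empty list)."""
    ordered = sorted(values)
    return {
        "max": ordered[-1],
        "min": ordered[0],
        "avg": sum(ordered) / len(ordered),
        "p90": percentile(ordered, 90),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }
```

A resource whose P99 stays low over a whole reporting period is a candidate for downsizing, while a resource whose average is close to its maximum is likely tight; grouping these summaries by product, business, or resource group gives the report dimensions described above.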
Website monitoring and dial testing
In addition to the infrastructure monitoring scenarios above, some customers provide external services and want to monitor the stability of those services. For example, they want to know the latency and success rate when end users on different carriers' networks in different regions access their websites, to find out promptly whether their domain names are reachable in different regions, and to detect domain hijacking early. Companies with multinational businesses also want to know how their services are accessed globally: for example, whether a customer's EIP is reachable from Europe, from Australia, or from the networks of third-party cloud vendors.
Building on the original site monitoring, CloudMonitor has launched a network analysis and monitoring service. It offers detection points covering carriers nationwide as well as detection points on major overseas cloud vendors, and provides customers with multi-protocol network reachability monitoring and analysis with nationwide and global coverage. In the past few years, website monitoring has uncovered a large number of carrier network problems, CDN performance problems, and so on. Recently, a customer reported that our CDN network was unstable and that its TLS handshake time was worse than a competitor's. After repeated investigation, it was finally found that the Lua module of the Tengine used by the CDN had occasional GC time problems. Fixing this will also improve the experience of our CDN product.
In addition, our colleagues responsible for after-sales support and tickets often run into the following situation: a customer reports a network problem, our own investigation finds nothing wrong, and both parties need to capture packets and check repeatedly. One such problem usually takes several weeks, and most of them still end without a result. In the website monitoring of CloudMonitor, we provide a point-to-point network troubleshooting function: when a probe detects a problem, a point-to-point traceroute or mtr can be initiated manually, and the results are fed back once they are available. This can greatly improve the efficiency and success rate of troubleshooting, and it can also reduce the proportion of customer tickets.
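As an illustration of how probe results from different regions and carriers turn into the latency and success-rate figures discussed above, here is a small Python sketch. The result record fields (region, carrier, ok, latency_ms) and the summarize_probes function are hypothetical and are not the data model of CloudMonitor site monitoring.

```python
from collections import defaultdict


def summarize_probes(results):
    """Aggregate probe results per (region, carrier) pair.

    results: iterable of dicts with assumed keys region, carrier, ok, latency_ms.
    """
    buckets = defaultdict(list)
    for result in results:
        buckets[(result["region"], result["carrier"])].append(result)

    summary = {}
    for key, items in buckets.items():
        # Only successful probes contribute to latency; failures count against
        # the success rate instead.
        ok_latencies = [item["latency_ms"] for item in items if item["ok"]]
        summary[key] = {
            "success_rate": len(ok_latencies) / len(items),
            "avg_latency_ms": (
                sum(ok_latencies) / len(ok_latencies) if ok_latencies else None
            ),
        }
    return summary
```

A region and carrier pair whose success rate drops while others stay healthy points to a carrier-side or regional problem, which is the kind of case where a manual point-to-point traceroute or mtr is worth starting.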
Website Monitoring Best Practices
The main functions of the website monitoring and analysis product were introduced above. Network analysis and monitoring is mainly oriented to availability from the external network. Some customers use CloudMonitor availability monitoring (application grouping) together with network analysis and monitoring (formerly site monitoring) to build availability monitoring that combines internal and external networks. The detection results of availability monitoring and website monitoring can also be aggregated into enterprise cloud monitoring to form a more comprehensive stability dashboard. For example, a leading company in the catering industry gathers the availability detection results of thousands of stores into a global stability dashboard in enterprise cloud monitoring. This greatly reduces spending on monitoring operation and maintenance, and also greatly improves the efficiency of monitoring and management in a complex environment.
The overall picture of cloud monitoring
Finally, let's look at CloudMonitor as a whole, with a big picture of its current overall functions.
Finally, to sum up
CloudMonitor is committed to helping customers make good use of the cloud. It is the monitoring infrastructure for customers after they go to the cloud. It works out of the box, is flexible and powerful, and is open, so it can be easily integrated with second-party and third-party monitoring systems. CloudMonitor provides integrated monitoring capabilities that span infrastructure, application, business, and more, and gives customers a good choice. It supports enterprise scenarios such as hybrid cloud, multi-cloud, and multi-account.

