Network O&M Observability and Automated Fault Detection - Cloud Network Well-architected Design

Overview

Summary

With the development of digital transformation, enterprises rely more and more on cloud computing technologies to perform business operations. Cloud network O&M is business-critical to the work efficiency and operation security of cloud platforms. It not only affects data transmission security, but also influences service availability.

Compared with traditional IT architectures, services and features in cloud environments are more complex and abstract. Traditionally, parameters and underlying resources are manually configured. As the number of parameters and influential factors rapidly increases, automation tools are required to assist decision making. Therefore, building an intelligent O&M system in the cloud is of great importance. The intelligent O&M system can efficiently identify and fix potential issues to ensure service continuity and stability.

The goal of O&M is to quickly locate, fix, and prevent potential failures and build networks with optimized architectures and performance. To achieve this goal, Alibaba Cloud designed the following solution:

Alerting: Deploy CloudMonitor to monitor system status in real time and trigger alerts when anomalies are detected. CloudMonitor minimizes service interruptions because it can quickly detect and respond to issues.
Inspection: Periodically run full dimensional inspections on networks to identify and fix potential risks. Periodic inspections help you reduce the risk of major accidents.
Observation: Use Artificial Intelligence for IT Operations (AIOps) methods to perform continuous observation on network environments. Tracking and analysis on key metrics help you discover the change trend and make plans in advance. In addition, you can make optimization suggestions and improve network stability and performance based on the key metrics.

Keywords

Network Intelligence Service (NIS): NIS provides a set of AIOps tools for you to manage the entire lifecycle of cloud networks from network planning to network O&M. For example, you can use NIS to perform traffic analysis, network inspections, network performance monitoring, network diagnostics, path analysis, and topology creation. NIS helps you optimize your network architecture, improve network O&M efficiency, and reduce network operations costs.
CloudMonitor: CloudMonitor is a service that monitors resources and Internet applications.
Virtual Private Cloud (VPC): A VPC is a custom private network that you can create on Alibaba Cloud. VPCs are logically isolated from each other at Layer 2. You can create and manage cloud service instances in your VPC, such as Elastic Compute Service (ECS), Server Load Balancer (SLB), and ApsaraDB RDS.
Elastic IP Address (EIP): An EIP is a public IP address that you can purchase and hold as an independent resource.
NAT Gateway: NAT gateways translate network addresses.
Application Load Balancer (ALB): ALB is an Alibaba Cloud service that runs at the application layer and is optimized to balance traffic over HTTP, HTTPS, and Quick UDP Internet Connections (QUIC). ALB is highly elastic and can process large volumes of Layer 7 traffic on demand. ALB supports complex routing. ALB is deeply integrated with other cloud-native services and is designed to serve as a cloud-native Ingress gateway of Alibaba Cloud.
Network Load Balancer (NLB): NLB is a Layer 4 load balancing service intended for the Internet of Everything (IoE) era. NLB offers ultra-high performance and can automatically scale on demand. An NLB instance supports up to 100 million concurrent connections, which is ideal for services that require high concurrency.
Classic Load Balancer (CLB): CLB distributes inbound network traffic across multiple backend servers based on forwarding rules. CLB helps improve the performance and availability of your applications.
Cloud Enterprise Network (CEN): CEN is a high availability network that runs on the global private network of Alibaba Cloud. CEN uses transit routers to establish inter-region connections between VPCs to allow VPCs to communicate with data centers and establish flexible, reliable, and enterprise-class networks in the cloud.
VPN Gateway: VPN Gateway provides secure and reliable network connections that connect enterprise data centers, office networks, and Internet clients to Alibaba Cloud through encrypted and private tunnels.
Express Connect circuit: Express Connect circuits are cables or optical fibers that connect data centers. Express Connect circuits are typically deployed and maintained by Internet service providers (ISPs). Express Connect circuits are classified into dedicated Express Connect circuits and shared Express Connect circuits based on the deployment mode.
Express Connect: Express Connect is a networking service that connects data centers to Alibaba Cloud. You can use Express Connect to establish high-speed, reliable, and secure private connections between data centers and cloud networks. Express Connect helps you improve network communication quality and security because data transmission over Express Connect is trusted and controllable.
Virtual border routers (VBRs): VBRs are an abstraction of Express Connect circuits that are isolated and virtualized by using the Layer 3 overlay and vSwitch technologies in the Software Defined Network (SDN) architecture. A VBR is deployed between the customer-premises equipment (CPE) and a VPC to exchange data between the VPC and data center.

Design principles

We recommend that you take into consideration the following principles:

Alert-driven O&M response mechanism

Event subscription mechanism: Configure alert rules that are triggered at the specified time to notify you of the potential system anomalies, performance issues, or security risks at the earliest opportunity.
Emergency response to high-severity alerts: Configure an emergency response mechanism for high-severity alerts, with a specific response plan and a designated owner for each alert.
Periodic audit in the event center: Configure a periodic plan to audit the historical events in the event center. Event analysis helps you identify error trends and potential risks, and take measures in advance to prevent service interruptions.

Troubleshooting mechanism for high-severity risks

We recommend that you perform periodic network inspections to identify and fix potential risks. You can build a network O&M system to monitor the network status and quickly respond to risks that may compromise network performance and security.

Observation-oriented network optimization mechanism

Make sure that traffic analysis remains enabled so that the system can continuously monitor and analyze network metrics such as the throughput, packet loss rate, latency, and user distribution. Such metrics help O&M engineers optimize the service architecture based on traffic status.
Use topology generators to help O&M engineers track the network status in real time and optimize the network structure.
Use network insight providers to monitor Internet status and detect network issues so that you can optimize Internet management.

Key design

Use alerts to detect and locate errors

Configure alert rules

Alert rules for system events

System events: System events include the failure events and O&M events of various cloud services. If you subscribe to system events, alert notifications are sent to you or a specified third-party system as soon as an event is triggered. You must configure the subscription scope of system events, including services, event types, event names, event levels, application groups, event content, and event resources.
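The subscription scope described above behaves like a filter over incoming events. The following minimal sketch illustrates that idea only. The `Subscription` class, the field names, and the event name `BandwidthOverLimit` are hypothetical and are not part of the CloudMonitor API.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    """Hypothetical subscription scope; an empty set means 'match any'."""
    services: set = field(default_factory=set)
    levels: set = field(default_factory=set)
    event_names: set = field(default_factory=set)

    def matches(self, event: dict) -> bool:
        checks = (
            (self.services, event.get("service")),
            (self.levels, event.get("level")),
            (self.event_names, event.get("name")),
        )
        # Every non-empty dimension must contain the event's value.
        return all(not allowed or value in allowed for allowed, value in checks)


sub = Subscription(services={"ALB", "NLB"}, levels={"CRITICAL"})
event = {"service": "ALB", "level": "CRITICAL", "name": "BandwidthOverLimit"}
print(sub.matches(event))
```

A narrower scope produces fewer notifications, so a dimension is best left empty only when every value in that dimension is relevant to the recipient.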

We recommend that you enable all CloudMonitor modules that are related to system events. The system event framework designed by Alibaba Cloud ensures that you can receive and monitor business-critical alerts. This mechanism improves system stability and security because you are notified of important events at the earliest opportunity.

For more information about the system events supported by CloudMonitor, see Supported cloud services and their system events.

Network system events can be classified into the following types:

Bandwidth and performance limits

Over-limit events: The upper limit on private bandwidth, Internet bandwidth, ALB, CLB, or NLB bandwidth, or number of connections on ALB, CLB, or NLB is reached.
Packet loss events: Packets are dropped due to bandwidth exhaustion on ALB, CLN, VPCs, or NAT gateways.
Over-limit QPS and request events: The HTTP 503 error is triggered when the upper limit on the ALB QPS is reached.

Connection management and session control

Over-limit sessions and dropped connections: New connections are dropped because the number of sessions on the ALB or CLB instance has reached the upper limit or the number of new connections on the NLB instance suddenly increases.
Connection failures: The number of connection failures on the CLB or NLB instance suddenly increases.

Routes and network stability

Over-limit routes: The number of CEN routes or dynamically allocated BGP routes reaches the upper limit.
Network jitters: CEN or VPC network jitters.
Connection errors: Errors on Express Connect circuits or BGP connections.

VPN and IPsec events

Over-limit bandwidth and connections: The upper limit on VPN bandwidth and IPsec negotiations is reached.
Health checks: A VPN gateway or an IPsec connection passes or fails health checks.

Endpoint and connection management

Operations on endpoints: Accept, reject, add, or delete an endpoint.

Certificate issues

Certificate and security issues: The certificate of an SLB or VPN gateway expires.

Business alerts

Threshold-triggered events: If the conditions in a threshold-triggered alert rule are met, events are triggered. If you subscribe to threshold-triggered events, you can configure fine-grained custom alert notifications. For example, you can merge and denoise alerts and specify custom alert notification methods. You must configure the subscription scope of threshold-triggered events, including services, metrics, severity levels, and application groups.

We recommend that you configure fine-grained alert rules and thresholds for business-critical metrics in CloudMonitor. Then, the system can perform trend analysis and anomaly inspection to identify potential errors and risks. This is a powerful security measure for the O&M team to ensure service availability and improve user experience.
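A threshold-triggered rule can be pictured as a check that fires only after a metric stays above its threshold for several consecutive periods, which avoids alerting on a single spike. The sketch below is illustrative. The function name, the threshold of 80, and the sample values are assumptions, not CloudMonitor defaults.

```python
def breaches(samples, threshold, consecutive=3):
    """Return True if `consecutive` samples in a row exceed `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, otherwise reset it.
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return True
    return False


# Hypothetical bandwidth utilization samples, in percent.
bandwidth_utilization = [62, 81, 85, 79, 88, 91, 93]
print(breaches(bandwidth_utilization, threshold=80))
```

Raising `consecutive` reduces noise at the cost of slower detection, so business-critical metrics may warrant a smaller value.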

For more information about the monitoring metrics supported by CloudMonitor, see Appendix 1: Metrics.

Subscribe to alert notifications

Alert notifications are classified into Critical, Warning, Notification (Info), and Resolved based on the alert severity.

We recommend that you configure a proper notification method for each level of alerts. For Critical alerts that can cause direct and continuous impacts on your business, we recommend that you specify telephone call as the primary notification method and immediately respond to such alerts. For ignorable alerts that do not cause adverse impacts on your business, we recommend that you view and manage the alerts during a specific time window on a daily basis. This way, you can fix the issues while focusing on business-critical agendas.
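One way to express this recommendation is a routing table from alert level to notification channels. In the sketch below, only the choice of a phone call as the primary channel for Critical alerts comes from the text above. The remaining channel assignments and all names are assumptions for illustration.

```python
# Hypothetical routing table: alert level -> ordered notification channels.
ROUTES = {
    "CRITICAL": ["phone", "sms"],
    "WARNING": ["sms", "email"],
    "INFO": ["email"],
    "RESOLVED": ["email"],
}


def channels_for(level):
    """Return the channels for a level, falling back to email if unknown."""
    return ROUTES.get(level.upper(), ["email"])


print(channels_for("critical"))
```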

For more information about alert templates, see Manage notification templates.

Alerts triggered by system events

Subscribe to events in CloudMonitor: Log on to the CloudMonitor console, choose Event Center > Event Subscription, and then create a subscription policy to subscribe to system events.

Business alerts

Create alert rules
If you want to monitor the usage of cloud resources, you can create an alert rule. If the monitoring metrics of a resource meet specified alert conditions, CloudMonitor automatically sends alert notifications to you. This way, you can identify and resolve issues at the earliest opportunity.
You can create alert rules based on CloudMonitor metrics or custom business metrics. To create an alert rule, log on to the CloudMonitor console, choose Alerts > Alert Rules, and then click Create Alert Rule.
Subscribe to threshold-triggered events
CloudMonitor allows you to configure custom alert notifications for event subscription policies. For example, you can subscribe to threshold-triggered events, merge and denoise alerts, upgrade alert contact groups, specify custom alert notification methods, and push alert notifications to destination channels based on data templates in the JSON format.
To subscribe to events, log on to the CloudMonitor console, choose Event Center > Event Subscription, and then click Create Subscription Policy.
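Merging and denoising alerts, as mentioned above, can be approximated by suppressing repeats of the same alert within a time window. The following sketch shows one such approach under stated assumptions: the alert fields (`ts`, `resource`, `metric`) and the 300-second window are hypothetical, and the actual CloudMonitor merging logic may differ.

```python
def denoise(alerts, window_seconds=300):
    """Keep the first alert per (resource, metric) within each window."""
    last_sent = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["resource"], alert["metric"])
        # Send only if this key has not been sent within the window.
        if key not in last_sent or alert["ts"] - last_sent[key] >= window_seconds:
            kept.append(alert)
            last_sent[key] = alert["ts"]
    return kept


alerts = [
    {"ts": 0, "resource": "alb-1", "metric": "qps"},
    {"ts": 60, "resource": "alb-1", "metric": "qps"},
    {"ts": 120, "resource": "nlb-1", "metric": "new_conn"},
    {"ts": 400, "resource": "alb-1", "metric": "qps"},
]
print([a["ts"] for a in denoise(alerts)])
```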

Manage alerts

Alerts triggered by system events

Events detected by CloudMonitor are displayed on the Event Center > Notification History page. O&M engineers can take measures to manage and fix issues based on the detailed information provided by the event center.

Critical alerts require immediate response to minimize the impact. For low-severity alerts, we recommend that O&M engineers check the event center on a daily basis to ensure system stability and performance.

Business alerts

To ensure business operations efficiency, you can view business alerts triggered by custom alert rules on the Event Center > Notification History page in the CloudMonitor console.

We recommend that you configure conditions based on your business requirements to enable Function Compute or automation scripts to automatically fix issues. Alternatively, you can manage alerts on the Notification History page on a regular basis. This not only facilitates problem-solving, but also optimizes resource utilization through automation measures.
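An automation script of this kind often reduces to a dispatch table that maps an alert type to a remediation and escalates anything it does not recognize. The sketch below is a generic illustration. The payload format, the alert type names, and the remediation functions are all hypothetical and do not represent the Function Compute or CloudMonitor interfaces.

```python
import json


def scale_out_bandwidth(alert):
    # Placeholder: a real function would call the cloud API here.
    return f"scale out bandwidth on {alert['resource']}"


def release_idle_eip(alert):
    # Placeholder: a real function would release the resource here.
    return f"release idle EIP {alert['resource']}"


# Hypothetical mapping from alert type to remediation.
REMEDIATIONS = {
    "bandwidth_over_limit": scale_out_bandwidth,
    "idle_eip": release_idle_eip,
}


def handle_alert(payload):
    """Pick a remediation for a JSON alert, or escalate to a person."""
    alert = json.loads(payload)
    action = REMEDIATIONS.get(alert.get("type"))
    if action is None:
        return f"escalate {alert.get('type')} to the on-call engineer"
    return action(alert)


print(handle_alert('{"type": "idle_eip", "resource": "eip-123"}'))
```

Keeping an explicit escalation path for unknown alert types means that automation never silently drops an alert it cannot handle.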

Use inspections to identify and eliminate potential risks

Configure inspections for different types of risks

Stability risks
In the design of a high availability (HA) architecture, if primary/secondary servers are improperly configured, switchover may fail. This may compromise system continuity and stability. In addition, improper resource deployment policies may spread the failure impact, which is also known as an expanded explosion radius. In such cases, more servers or components are affected. As a result, the overall service stability may be significantly reduced.
To prevent such risks, you can run inspections to optimize resource deployment policies and ensure that switchover can be implemented as configured. This helps you improve system disaster recovery and take measures to eliminate potential risks.
Security risks
Access control lists (ACLs) may fail to block unauthorized access due to coarse-grained filtering. Security groups may grant permissions to unnecessary ports and services. As a result, the risk of attacks is increased due to violations against the principle of least privilege (PoLP).
You can run inspections to thoroughly check ACLs and security groups to ensure that only authorized access is allowed to necessary destinations. This improves overall network security.
Performance risks
Network latency may be increased by performance bottlenecks or bypasses. Packet loss may occur if network traffic frequently exceeds the maximum bandwidth.
We recommend that you use inspections to monitor your network latency and scale out resources based on inspection results. This helps you ensure quality of service (QoS) even if the amount of data transfers increases.
Resource waste
Low resource utilization results in resource waste. If you select an improper billing method, spending on resources may unexpectedly increase, which reduces the cost-benefit ratio.
You can run inspections to optimize resource deployment policies and increase resource utilization. You can select a proper billing method based on detailed cost-benefit analysis to control your budget and increase the cost-benefit ratio.
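As an illustration of the security risk check described above, the following sketch flags security group rules that are open to all source addresses on ports that are not meant to be public. The rule format and the set of allowed public ports are assumptions. The network inspection feature may apply different criteria.

```python
import ipaddress

# Assumption: only web ports are expected to be reachable from anywhere.
ALLOWED_PUBLIC_PORTS = {80, 443}


def risky_rules(rules):
    """Return rules open to every source address on a non-public port."""
    findings = []
    for rule in rules:
        network = ipaddress.ip_network(rule["source"])
        open_to_world = network.prefixlen == 0  # for example, 0.0.0.0/0
        if open_to_world and rule["port"] not in ALLOWED_PUBLIC_PORTS:
            findings.append(rule)
    return findings


rules = [
    {"port": 443, "source": "0.0.0.0/0"},
    {"port": 22, "source": "0.0.0.0/0"},
    {"port": 3306, "source": "10.0.0.0/8"},
]
print(risky_rules(rules))
```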

For more information, see Network inspection.

Run inspections

We recommend that you run inspections on a regular basis, such as every week, to monitor your network status and identify and analyze potential issues that can reduce resource utilization. Continuous monitoring and assessment help you maintain a stable network architecture, reduce costs, and ensure service continuity.

To view weekly network inspection reports, log on to the NIS console, click Network Inspection in the left-side navigation pane, click View historical reports in the Newest Inspection Report column, and then click Re-start in the upper-right corner.

Assess the overall network status based on the pass rate: The health assessment of the network is made based on the pass rate of inspections. O&M engineers can quickly determine the overall network performance and identify potential issues based on the trend of report scores.
Handle risks by risk level: The inspection items are sorted in descending order of priority from the highest risk to the lowest risk. You can take different measures for different levels of risks based on the professional suggestions provided by the inspection reports. This process not only helps you efficiently handle high-risk issues that may compromise system stability, but also provides clear suggestions on how to optimize your network environment.
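The two steps above, assessing the pass rate and handling failed items by risk level, can be sketched as follows. The report structure, the item names, and the three risk levels are hypothetical and are used only to show the calculation.

```python
# Hypothetical risk levels, from highest to lowest priority.
RISK_ORDER = {"high": 0, "medium": 1, "low": 2}


def pass_rate(items):
    """Fraction of inspection items that passed (1.0 for an empty report)."""
    if not items:
        return 1.0
    return sum(1 for item in items if item["passed"]) / len(items)


def failed_by_risk(items):
    """Failed items, sorted from the highest risk to the lowest."""
    failed = [item for item in items if not item["passed"]]
    return sorted(failed, key=lambda item: RISK_ORDER[item["risk"]])


report = [
    {"name": "idle EIP", "risk": "low", "passed": False},
    {"name": "single-zone ALB", "risk": "high", "passed": False},
    {"name": "ACL scope", "risk": "medium", "passed": True},
    {"name": "bandwidth headroom", "risk": "high", "passed": True},
]
print(pass_rate(report))
print([item["name"] for item in failed_by_risk(report)])
```

Tracking the pass rate across weekly reports shows whether the network is trending toward or away from the recommended configuration.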

Handle potential risks

Examples:

Control costs
- EIPs: Run inspections to detect and release idle EIPs to prevent resource waste.
- CEN: Allocate inter-region bandwidth resources based on actual traffic volumes to prevent resource waste.
Improve stability
- Over-limit risks: Bandwidth exhaustion or insufficient resource specifications.
- Single points of failure (SPOFs) in a zone: If you deploy an ALB instance, an NLB instance, or a transit router in a single zone, instability issues may arise.
- SPOFs on connections: If you use only one Express Connect circuit, one GA acceleration zone, or one VPN tunnel, connectivity issues may arise.
- Service unavailability: Service errors may occur.
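The single point of failure checks listed above amount to counting zones per instance and links per site. The sketch below illustrates both checks. The resource IDs, zone names, site names, and data layout are hypothetical.

```python
from collections import Counter


def single_zone_resources(resources):
    """IDs of resources deployed in fewer than two distinct zones."""
    return [r["id"] for r in resources if len(set(r["zones"])) < 2]


def single_link_sites(links):
    """Sites that reach the cloud over fewer than two links."""
    counts = Counter(link["site"] for link in links)
    return sorted(site for site, n in counts.items() if n < 2)


resources = [
    {"id": "alb-a", "zones": ["zone-a", "zone-b"]},
    {"id": "nlb-b", "zones": ["zone-a"]},
]
links = [
    {"site": "dc-east", "circuit": "pc-1"},
    {"site": "dc-east", "circuit": "pc-2"},
    {"site": "dc-west", "circuit": "pc-3"},
]
print(single_zone_resources(resources))
print(single_link_sites(links))
```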

Perform global network optimization based on observability

Use observation tools

Generate topologies - Visualize the entire network

Network topologies display the connections and relationships between network resources in visualized charts. Network topologies help you quickly learn about the architecture of networks on Alibaba Cloud, verify network configurations, troubleshoot network issues, and perform centralized O&M on cloud network resources.

Topology	Displayed information
VPC	Resources, including ECS instances, vSwitches, and routers; routes, including network elements inside and outside VPCs and route tables
CEN	Transit routers worldwide, VPCs connected to transit routers, and transit routers connected to each other
SLB	SLB zones, virtual IP addresses (VIPs), EIPs, and security groups

Traffic analysis - Sort network traffic from multiple dimensions
The traffic analysis feature can be used to monitor real-time network traffic, analyze historical network traffic, and generate visualized time series charts in the NIS console based on analysis results. You can troubleshoot issues based on the traffic data and collected metrics.
- Internet traffic analysis: You can use this feature to analyze the traffic in each region based on different types of resources that are associated with public IP addresses, including the traffic of the public IP addresses that are associated with CLB instances, the traffic of the public IP addresses that are associated with ECS instances, the traffic of the public IP addresses that are associated with Internet NAT gateways, the traffic of EIPs, and the traffic of the EIPs added to the same Internet Shared Bandwidth instance.
- Hybrid cloud traffic analysis: You can use this capability to analyze the inbound traffic and outbound traffic that flow through VBRs that are connected to transit routers in hybrid clouds.
- Inter-region traffic analysis: You can use this capability to analyze the inbound traffic and outbound traffic that flow through transit routers across regions. The traffic data is displayed in the form of 1-tuples, 2-tuples, and 5-tuples.
- Intra-region traffic analysis: You can use this capability to analyze the inbound traffic and outbound traffic that flow through transit routers that are connected to VPCs within the same region.
- Internet NAT gateway traffic analysis: You can use this capability to analyze the traffic of Internet NAT gateways and generate visualized time series charts on the Overview page in the NIS console.
Internet quality - Impacts caused by Internet quality degradation
- Detect Internet quality degradation based on the round-trip time (RTT) and retransmission rate.
- Detect Internet quality degradation events, including the time range, ISP, area, and traffic volume.
- Detect the public IP addresses affected by Internet quality degradation.
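Detecting degradation from the RTT and retransmission rate, and reporting the time range of each event, can be sketched as grouping consecutive degraded samples into windows. The thresholds (200 ms and 5%) and the sample format are assumptions chosen for illustration, not values used by NIS.

```python
def degradation_windows(samples, rtt_limit_ms=200.0, retrans_limit=0.05):
    """Group consecutive degraded samples into (start, end) time windows.

    Each sample is a (timestamp, rtt_ms, retransmission_rate) tuple.
    """
    windows, start, prev = [], None, None
    for ts, rtt_ms, retrans_rate in samples:
        bad = rtt_ms > rtt_limit_ms or retrans_rate > retrans_limit
        if bad and start is None:
            start = ts  # a new degradation window opens
        elif not bad and start is not None:
            windows.append((start, prev))  # the window closed at the last bad sample
            start = None
        prev = ts
    if start is not None:
        windows.append((start, prev))
    return windows


samples = [
    (0, 40, 0.01),
    (60, 250, 0.02),
    (120, 90, 0.08),
    (180, 45, 0.01),
    (240, 300, 0.10),
]
print(degradation_windows(samples))
```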

On-demand observation

Network topology
In the NIS console, find the Network Topology module, select a network instance, and then click Generate Topology. This module also supports topology drilldowns that can obtain information from different network layers. This feature analyzes and visualizes the resource allocation status of the network to facilitate network management and O&M.
1. VPC topologies: VPC topologies are categorized into resource topologies and route topologies. A VPC topology displays the topology of routes and correlations between resource entities deployed in VPCs. In the VPC topology, you can view the basic information about related network instances, analyze these instances, and analyze reachability.
2. CEN topologies: A CEN topology displays the intra-region and inter-region connections between the transit routers that are deployed on a CEN instance based on real-time configurations. In the CEN topology, you can view the connections between global cloud resources established by transit routers and view the basic information about related network instances. This helps you learn about and manage the cloud network in an intuitive way.
3. SLB topologies: An SLB topology displays the connections between listeners and backend server groups of an SLB instance. You can view the basic information about the network instances in an SLB topology and analyze these instances to check whether traffic is routed as expected.
Traffic analysis
You can use the traffic analysis feature of NIS to monitor real-time network traffic and analyze historical network traffic. The traffic analysis feature helps you analyze traffic by source IP address (1-tuple), by source and destination IP addresses (2-tuple), and by source IP address, source port, destination IP address, destination port, and protocol (5-tuple). You can use this feature to sort network traffic, such as the top N instances.
You need to separately enable the following features before you can use them: Internet traffic analysis, hybrid cloud traffic analysis, inter-region traffic analysis, and intra-region traffic analysis.
- You can enable the Internet traffic analysis feature for specific regions or public IP addresses. If you select a region, this feature is enabled for all public IP addresses in the region.
- You can enable the hybrid cloud traffic analysis feature for the specific VBR connections on transit routers.
- You can enable the inter-region traffic analysis feature for the specific inter-region connections on transit routers.
- You can enable the intra-region traffic analysis feature for the specific VPC connections on transit routers.
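The tuple-based views and top N sorting described above can be illustrated by aggregating flow records on keys of different widths. The flow record layout and field names below are hypothetical. NIS performs this aggregation in the console, and this sketch only shows the idea.

```python
from collections import Counter

# Key fields for each aggregation granularity.
FIELDS = {
    1: ("src_ip",),
    2: ("src_ip", "dst_ip"),
    5: ("src_ip", "src_port", "dst_ip", "dst_port", "protocol"),
}


def top_n(flows, tuple_size, n=3):
    """Sum bytes per key and return the n largest (key, bytes) pairs."""
    totals = Counter()
    for flow in flows:
        key = tuple(flow[name] for name in FIELDS[tuple_size])
        totals[key] += flow["bytes"]
    return totals.most_common(n)


flows = [
    {"src_ip": "10.0.0.1", "src_port": 5000, "dst_ip": "10.1.0.9",
     "dst_port": 443, "protocol": "tcp", "bytes": 700},
    {"src_ip": "10.0.0.1", "src_port": 5001, "dst_ip": "10.1.0.8",
     "dst_port": 443, "protocol": "tcp", "bytes": 500},
    {"src_ip": "10.0.0.2", "src_port": 6000, "dst_ip": "10.1.0.9",
     "dst_port": 53, "protocol": "udp", "bytes": 900},
]
print(top_n(flows, tuple_size=1, n=1))
print(top_n(flows, tuple_size=2, n=1))
```

In this sample, the top source by 1-tuple differs from the top pair by 2-tuple, which is why drilling down from coarse to fine granularity helps locate the flow that actually dominates.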
Insight provider
You can use insight providers to obtain real-time information about Internet quality assessment, learn about Internet quality degradation in a timely manner, and receive Internet quality events and event impact analysis.
When you create an insight provider, you must configure monitored objects for the insight provider. Ten minutes after the insight provider is created, the insight provider starts to collect resource traffic and push specific metrics. You can click the insight provider name to view the network quality assessment scores, Internet quality degradation events, and public IP addresses that are affected by Internet quality degradation events. Such information reflects the Internet quality and helps you make informed business decisions and adjustments.

Analysis and optimization

Optimization based on network topology observation
1. A network topology shows the entire network architecture, which helps you obtain the architecture summary, path analysis, and resource allocation status.
2. Network topologies help you efficiently identify potential issues by using the following checks:
  - Redundancy check: ensures that you have a redundancy mechanism to prevent SPOFs.
  - Configuration check: checks whether your configurations follow the best practices and helps you correct improper settings.
  - Security check: checks for potential security risks, such as ports and services that do not need to be exposed.
3. We recommend that you take the following measures to manage low-utilization or idle resources:
  - Resource recycling: Release IP addresses and ports that are no longer used.
  - Configuration optimization: Optimize resource allocation and disable services that are no longer used.
Traffic and business optimization based on traffic analysis
1. Internet optimization
  Internet traffic analysis accurately identifies the geographic distribution of users. You can deploy services in popular areas to reduce network latency and improve user experience.
  Internet traffic analysis continuously monitors Internet status based on key metrics, such as the bandwidth utilization, source IP addresses, destination IP addresses, source ports, destination ports, protocols, and RTT. Such information not only helps you identify the peak hours of your business, but also provides proofs for capacity planning and traffic management. You can maintain high availability and stability for your business even during peak hours with high workloads.
2. Internal network optimization
  To optimize traffic in your internal network, we recommend that you detect the top N sources that generate the highest volume of traffic and perform drilldown analysis to identify and fix anomalies. This measure helps you prioritize key businesses and reduce performance degradation caused by non-critical businesses. Another measure is to inspect the TCP retransmission rate on a regular basis to assess the packet loss rate, which may compromise business continuity. You can make business adjustments based on the preceding observation results and improve network quality and reliability.
Internet issue identification based on insight providers
Insight providers provide information about client locations and ISP networks, use an intelligent baseline algorithm to check whether performance degradation or availability degradation events occur, and provide event details to help you perform troubleshooting, including traffic analysis and Internet probes. You can also view the RTT and traffic information and monitor Internet status in real time by using the Internet traffic source map. Such information helps you make Internet adjustments in a timely manner to prevent business loss.
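The baseline algorithm used by insight providers is not described in this document. As a rough illustration of the general idea, the sketch below compares a new RTT sample against a baseline built from recent history by using the mean plus a multiple of the standard deviation. This technique, the factor `k=3.0`, and the sample values are assumptions and do not represent the NIS implementation.

```python
from statistics import mean, stdev


def is_anomalous(history, current, k=3.0):
    """Flag `current` if it exceeds the historical mean by k standard deviations.

    `history` needs at least two samples for the standard deviation.
    """
    baseline = mean(history)
    spread = stdev(history)
    return current > baseline + k * spread


# Hypothetical RTT history in milliseconds.
rtt_history = [40, 42, 39, 41, 43, 40, 38, 42]
print(is_anomalous(rtt_history, 44))
print(is_anomalous(rtt_history, 60))
```

A baseline that adapts to history avoids the fixed thresholds that tend to misfire when normal RTT differs widely between ISPs and regions.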

Best practices

The best practices are developed based on the preceding design principles. The following best practice consists of three steps:

Check alerts and fix issues

Check alerts on a daily basis. Make sure that high-severity alerts are pushed to your mobile phone in real time.

Run inspections to eliminate potential risks

Run inspections on a weekly basis.

Observe and optimize

Choose a suitable analysis tool.

Scenarios

Network O&M alerting

Notifications of risks and anomalies: When events related to resource availability or performance issues occur, Alibaba Cloud pushes the events to the event center in the NIS or CloudMonitor console. Such events include instance performance degradation caused by excessive resource usage, business unavailability caused by packet loss in Internet connections, and instance subscription expiration. We recommend that you handle these events at the earliest opportunity to prevent business interruptions.
Automatic O&M: Alibaba Cloud defines the status of the events that are displayed in the event center of the NIS console. This helps you understand the status of system O&M tasks. New events and status changes of events are reported to CloudMonitor, which allows you to build an event-driven automated O&M system to meet your business requirements.

Network O&M inspection

When you deploy or maintain networks or resources, your network configurations may not meet the requirement for best practices if you are unfamiliar with the cloud services that you use. After continuous network optimizations, you may need to manage an excessive number of network instances. Configuring, verifying, and inspecting these resources require large amounts of manpower. To meet this challenge, you can use the network inspection feature, which can help you diagnose the network architecture and resources deployed in the network and provide network optimization suggestions.

Network O&M observation

Network topology analysis: Network topologies provide comprehensive information about network architectures to help you identify and optimize the deployment and communication between network nodes. Network topologies display the connections and relationships between network resources in visualized charts. Network topologies help you quickly learn about the architecture of networks on Alibaba Cloud, verify network configurations, troubleshoot network issues, and perform centralized O&M on cloud network resources.
Network traffic monitoring and management: You can monitor the traffic in your networks in the same console, which facilitates operations for O&M engineers. You can also use the traffic analysis feature to monitor real-time network traffic and analyze historical network traffic.
Internet quality assessment: You can run periodic or continuous tests to assess the Internet quality based on key metrics, such as the latency, packet loss rate, and jitters. The assessment shows the overall service performance, and you can take measures to improve user experience.