Network Monitoring Technology Behind the Smooth Experience During Double 11

By Tang Tang, a member of the Network R&D Department of Alibaba and a former postgraduate tutor at BUPT. Tang Tang currently works on network stability R&D and holds several patents related to networks and algorithms.

Released by Hologres

Background

During the 2020 Double 11 Global Shopping Festival, Alibaba’s cloud-native real-time data warehouse was implemented in core data scenarios for the first time. This data warehouse is built on Hologres and Realtime Compute for Apache Flink, and it set a new record for the big data platform. This article focuses on the best practice of replacing Apache Druid with Hologres in the Alibaba Network Monitoring Department, where Hologres supported the real-time network monitoring screen with millisecond-level response times during Double 11.

Real-time Fault Discovery

At 00:00:00 on November 11, 2020, consumers entered their shopping carts, clicked the payment button, and paid for their orders. After one minute, there was an Alipay notification for the amount spent.

Hundreds of millions of people participated in the 2020 Double 11 Global Shopping Festival simultaneously, with a record peak of 580,000 transactions per second. Buyers’ shopping experiences were as smooth as silk during the entire transaction process, which could not have been achieved without Alibaba's network capabilities. With the development of technologies and the growth of the cloud and e-commerce businesses in recent years, network infrastructures have become increasingly large and complex. How can we ensure the stability of this expanded network and provide a smooth shopping experience for users on the cloud? This is a huge challenge for network system builders and operators.

Faults are inevitable, but the ultimate goal is to locate and fix faults quickly and prevent them if possible.

The ultimate goal of stability is to expose as few faults as possible to users. In 2015, Microsoft proposed the Pingmesh system, which became a de facto industry solution. However, due to some inherent defects, its fault discovery time is too long. Since 2017, the networking R&D department of Alibaba has been developing the world-leading Aliping detection system. The real-time Aliping system has shortened Alibaba's fault discovery time to only seconds. The shortest latency from data collection and processing to presentation on the screen is several seconds. Alerting and fault location take minutes. Aliping monitors Alibaba's network conditions 24/7.

The following figure shows the core architecture of Aliping:
[Figure: core architecture of Aliping]

As the core of fault discovery, the monitoring screen plays a vital role in the entire system by displaying network conditions in real time. Every fluctuation in a curve may represent an affected user business. The monitoring screen must therefore display network status quickly enough for faults to be alerted on and discovered in time, and it must help users address problems. The following list describes the difficulties that monitoring personnel may encounter while using the monitoring screen:

High Demand for Data Timeliness: The processed structured data, including alarm and monitoring data, needs to be displayed to users in real time, 24/7. This lets users such as GOC and monitoring personnel detect and handle the network faults of Alibaba and Ant Financial in time.
Complex Data Sources: There are numerous network data sources and business scenarios. Every minute, they generate hundreds of GB of traffic monitoring data, tens of KB of IDC network data, and other data of varying volumes. Consolidating such different business data into one monitoring system is a test for the end-to-end monitoring screen.
Multiple Data Metrics: Many data metrics need to be monitored, which makes for a complex Online Analytical Processing (OLAP) workload. How can we query the required business data from the monitoring screen with a quick response for each business scenario? This is a challenge for the OLAP framework that processes the back-end data.

Technology Selection

For the monitoring screen, users’ browsing behavior is unpredictable, so the structured data cannot be computed in advance. The screen relies on OLAP technology to analyze basic data in real time, combine it, and present the results to users. The Aliping system is an application of OLAP technology that presents fault data across different dimensions, such as IDC, region, DSW, ASW, PSW, department, and application, to users on the monitoring screen.
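As a rough illustration of why pre-computation does not fit here, the sketch below aggregates a handful of records over whichever dimensions the viewer happens to pick at query time. The record fields, the loss-rate metric, and the `loss_rate_by` helper are all assumptions made for this example; they are not taken from the Aliping system.

```python
from collections import defaultdict

# Hypothetical probe records; field names and values are illustrative only.
RECORDS = [
    {"idc": "idc-a", "region": "east", "department": "trade", "lost": 2, "sent": 100},
    {"idc": "idc-a", "region": "east", "department": "pay", "lost": 0, "sent": 100},
    {"idc": "idc-b", "region": "east", "department": "trade", "lost": 5, "sent": 100},
    {"idc": "idc-c", "region": "north", "department": "pay", "lost": 1, "sent": 100},
]


def loss_rate_by(records, dimensions):
    """Aggregate the loss rate over any combination of dimensions at query time."""
    totals = defaultdict(lambda: [0, 0])  # key -> [lost, sent]
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        totals[key][0] += rec["lost"]
        totals[key][1] += rec["sent"]
    return {key: lost / sent for key, (lost, sent) in totals.items()}


# The same raw data answers two different, unplanned questions.
by_region = loss_rate_by(RECORDS, ["region"])
by_idc_dept = loss_rate_by(RECORDS, ["idc", "department"])
```

Because the grouping key is built from the requested dimensions on each call, no result has to exist before the user asks for it. The cost is that every query scans and aggregates the raw records, which is the work an OLAP engine has to make fast.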

During the Aliping system implementation in 2017, we compared multiple OLAP databases. The section below lists several representative OLAP databases and their features:

Apache Hive: Hive stores its data in the Hadoop Distributed File System (HDFS) and splits SQL statements into MapReduce tasks to query the data, so its learning cost is low. Simple MapReduce statistics can be implemented quickly through SQL-like statements without developing dedicated MapReduce applications, which makes Hive suitable for statistical analysis in data warehouses. However, because the underlying layer is restricted by HDFS, create, update, and delete (CUD) operations cannot be performed. Hive also needs data to be synchronized from existing databases or logs into HDFS, which makes real-time incremental synchronization difficult. In addition, queries are slow, so the monitoring screen cannot respond within seconds.
Apache Kylin: Traditionally, OLAP databases are divided into relational OLAP (ROLAP) and multidimensional OLAP (MOLAP) based on how data is stored. ROLAP stores the data to be analyzed in a relational model. Its advantages are a small storage footprint and flexible queries, but every query requires aggregation at query time. To offset this weakness, ROLAP systems use column storage, parallel queries, query optimization, and bitmap indexes. Kylin instead builds a data cube, which trades space for time: given a series of defined dimensions, Kylin pre-calculates and stores the results for the combinations of those dimensions. With N dimensions there are 2^N combinations, so the number of dimensions must be kept under control or storage surges as dimensions are added. This is unacceptable for massive network data with non-deterministic dimensions.
ClickHouse: ClickHouse was developed by Yandex, a Russian company, and is designed specifically for online data analysis. According to official documents, ClickHouse processes billions of records per day. It uses column storage, data compression, sharding, and indexing, and it splits a computing task across shards for parallel execution, collecting the results once computing is complete. It supports SQL and table queries, real-time updates, and automatic multi-replica synchronization. On the whole, ClickHouse was a capable option but was not mature enough at the time. We ruled it out because of insufficient official support, multiple bugs, and the fact that it was not used within Alibaba Group.
Apache Druid: Druid is a data storage system that provides sub-second query response times for historical and real-time data. Druid supports low-latency data ingestion, flexible data exploration and analysis, high-performance data aggregation, and simple horizontal scaling. It suits analytical query systems with large data volumes and high scalability requirements. Druid keeps hotspot and real-time data in the memory of real-time nodes and historical data on the hard disks of historical nodes. This real-time plus historical structure ensures that queries are processed in milliseconds. Druid's high-speed ingestion and quick query response met our needs, and it had strong support from the general computing engine team. Therefore, Druid was initially selected as the OLAP system behind the monitoring screen.
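To make the storage concern about pre-computed cubes concrete, the following sketch enumerates every dimension combination for the seven dimensions mentioned earlier and prints how the count grows. It assumes a full cube with no pruning, and the `cuboids` helper is a hypothetical name used only for this illustration; it is not part of Kylin.

```python
from itertools import combinations


def cuboids(dimensions):
    """Return every subset of the dimensions, i.e. every combination a full cube pre-computes."""
    return [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


dims = ["idc", "region", "dsw", "asw", "psw", "department", "application"]
all_cuboids = cuboids(dims)
print(len(all_cuboids))  # 2 ** 7 = 128

# Each added dimension doubles the number of combinations to store.
for n in (7, 10, 20, 30):
    print(n, 2 ** n)
```

Seven dimensions already yield 128 combinations, and thirty yield more than a billion, which is why an open-ended set of dimensions is a poor fit for this approach.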

New OLAP Network Monitoring System

As the business became more complicated, a series of problems emerged with Druid:

Several critical faults occurred during data writing because of the rapid growth of data volume, which was caused by the huge traffic introduced by Alibaba Group's cloud migration.
As the business became more complicated and changeable, adding dimension data in Druid proved relatively complex.
Druid's query method is not user-friendly. It has its own query language and lacks SQL support, so a lot of time is spent learning the product.
High concurrency is not supported, which is a disaster for big promotion events. During the previous two Double 11 events, we had to log some users off to keep the monitoring screen available.

As more problems were exposed, we started looking for a product that could replace Druid and meet the needs of real-time, multi-dimensional OLAP analysis.

We learned about Hologres from the best practices accumulated by other departments in Alibaba Group. Hologres supports high-concurrency point queries on row-oriented storage and real-time OLAP analysis on column-oriented storage, which suits the network monitoring system well. End-to-end testing and verification with massive data showed that Hologres could meet the requirements of our scenario, so we deployed Hologres in the production environment.
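The difference between the two storage formats can be pictured with a small conceptual sketch. It is only an analogy built from plain Python containers and says nothing about how Hologres stores data internally; the table contents and the `point_query` and `max_latency` helpers are made up for the example.

```python
# Row-oriented layout: each record is kept together and is reachable by its key.
row_store = {
    "probe-1": {"idc": "idc-a", "latency_ms": 1.2},
    "probe-2": {"idc": "idc-a", "latency_ms": 3.4},
    "probe-3": {"idc": "idc-b", "latency_ms": 0.8},
}

# Column-oriented layout: each column is kept together across all records.
column_store = {
    "probe_id": ["probe-1", "probe-2", "probe-3"],
    "idc": ["idc-a", "idc-a", "idc-b"],
    "latency_ms": [1.2, 3.4, 0.8],
}


def point_query(key):
    """Fetch one whole record by key: a single lookup in the row layout."""
    return row_store[key]


def max_latency(idc):
    """Aggregate by reading only the two columns the query needs."""
    pairs = zip(column_store["idc"], column_store["latency_ms"])
    return max(latency for name, latency in pairs if name == idc)
```

A point query such as `point_query("probe-2")` touches one record and nothing else, while an analytical query such as `max_latency("idc-a")` reads two columns and skips the rest, which is the general reason each layout favors a different workload.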

The figure below shows the data flow of the transformed OLAP system.

Kafka collects network-related metric data in real time and delivers it to Apache Flink for summarization and processing.
Apache Flink performs basic processing on the real-time data and writes the results to Hologres in real time. Hologres provides a unified storage service.
Hologres is connected to the monitoring screen, which displays metric changes in real time. If the data does not meet expectations, alerts are sent in real time so that business personnel can troubleshoot and solve problems immediately.

[Figure: data flow of the transformed OLAP system]
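The three stages of this data flow can be sketched as a small in-memory simulation. No Kafka, Flink, or Hologres client is used: a list stands in for the message stream, one function stands in for the stream-processing step, and another for the alert check on the stored rows. The event shape, the one-minute window, and the 5% threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical raw events as they might arrive from a message queue:
# (epoch_seconds, idc, probes_sent, probes_lost)
EVENTS = [
    (60, "idc-a", 100, 0),
    (75, "idc-a", 100, 1),
    (80, "idc-b", 100, 9),
    (130, "idc-a", 100, 0),
    (135, "idc-b", 100, 0),
]

LOSS_ALERT_THRESHOLD = 0.05  # assumed threshold, not taken from the article


def summarize(events, window_seconds=60):
    """Stream-processing stand-in: roll events up per (window start, idc)."""
    acc = defaultdict(lambda: [0, 0])  # (window, idc) -> [sent, lost]
    for ts, idc, sent, lost in events:
        window = ts - ts % window_seconds
        acc[(window, idc)][0] += sent
        acc[(window, idc)][1] += lost
    return [
        {"window": window, "idc": idc, "loss_rate": lost / sent}
        for (window, idc), (sent, lost) in sorted(acc.items())
    ]


def alerts(rows, threshold=LOSS_ALERT_THRESHOLD):
    """Serving-side check: flag rows whose metric does not meet expectations."""
    return [row for row in rows if row["loss_rate"] > threshold]


table = summarize(EVENTS)  # stands in for the rows written to storage
fired = alerts(table)      # stands in for the alerts sent from the screen
```

With this sample data, only the first one-minute window of `idc-b` exceeds the threshold, so `fired` contains a single row while `table` holds all four summarized rows for display.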

Business Benefits

2020 was the first year that Hologres took part in monitoring AIS network faults during Double 11, and its performance met our expectations. The overall business benefits are described below:

Millisecond-Level Response to TB-Level Data

Time is the lifeblood of real-time monitoring; the sooner a fault is detected, the faster the bleeding can be stopped. How can we filter the relevant data out of TB-scale metrics based on the complex combinations of conditions that users enter? How can OLAP filtering complete within sub-second (millisecond-level) time? These are big challenges for many systems. With proper use of its indexing features and resource allocation, Hologres meets the timeliness needs of the monitoring business.

Support for High Concurrency

The Double 11 monitoring screen often needs to query historical data and make alarm predictions based on it. In the past, the system could only support queries from dozens of users over roughly ten days of data. Hologres supports large-scale parallel queries from hundreds of users and has not yet reached its upper limit. At 00:00 on Double 11 in 2020, with hundreds of times the usual data volume, the monitoring curves rendered as smoothly as usual, without any delay.

High Writing Performance

Druid did not perform well and was prone to data congestion when hundreds of thousands of records were written per second. Hologres handles real-time data ingestion at this rate easily.

Low Learning Cost

Hologres is compatible with PostgreSQL and fully supports SQL, so new users can work with it without learning a new syntax. Hologres is also compatible with existing BI tools and can connect to the monitoring screen without any modifications, which saves a lot of learning time.
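To give a sense of what this means in practice, the sketch below runs an ordinary SQL aggregate of the kind a monitoring panel might issue. SQLite from the Python standard library is used purely as a local stand-in so the example runs anywhere; it is not Hologres, and the `net_metrics` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# SQLite is only a local stand-in here; the point is the plain SQL, not the engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE net_metrics (ts INTEGER, region TEXT, idc TEXT, loss_rate REAL)"
)
conn.executemany(
    "INSERT INTO net_metrics VALUES (?, ?, ?, ?)",
    [
        (60, "east", "idc-a", 0.01),
        (60, "east", "idc-b", 0.09),
        (60, "north", "idc-c", 0.00),
        (120, "east", "idc-a", 0.02),
    ],
)

# A hypothetical panel query: worst loss rate per region since a given time.
QUERY = """
    SELECT region, MAX(loss_rate) AS worst_loss
    FROM net_metrics
    WHERE ts >= ?
    GROUP BY region
    ORDER BY worst_loss DESC
"""
rows = conn.execute(QUERY, (60,)).fetchall()
conn.close()
```

The `?` parameter marker is SQLite's style; PostgreSQL drivers typically use a different placeholder, but the `SELECT ... GROUP BY ... ORDER BY` statement itself is standard SQL that anyone familiar with a relational database can already read and write.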

The smooth shopping experience during 2020 Double 11 could not have been achieved without Alibaba's network capabilities. The monitoring screen serves as the eyes that watch Alibaba's network conditions, and Hologres, as its core, continuously powers it. However, Hologres is still immature in some respects and needs improvement in areas such as transparent upgrades and stability. We are willing to grow together with Hologres and look forward to better performance in the 2021 Double 11 shopping festival.
