How Hologres supports large-scale deployment and operation and maintenance

1. Challenges faced by ultra-large-scale deployment

With the development of the Internet, the amount of data has grown exponentially, and a stand-alone database can no longer meet business needs. Especially in the field of analysis, a query may need to process a large part or even the entire amount of data, and the pressure brought by massive data becomes particularly urgent. At the same time, with the acceleration of the digital transformation process of enterprises, the timeliness of data has become more and more important. How to use data to better empower businesses has become the key to digital transformation of enterprises.

Big data real-time data warehouse scenarios often double the size of databases compared to databases: increased data volume (TB level, PB level or even EB level), higher complexity of data processing, faster performance, and simultaneous service and analysis Satisfied and so on.

Users who have used open source OLAP systems, especially those who build their own clusters through open source OLAP, have some profound experience, that is, the difficulties in deployment and operation and maintenance, including ClickHouse, Druid, etc., are faced with the following problems:

* How to meet the fast delivery and elastic scaling of the cluster
*How to define service availability indicators and SLA system
*Integrated storage and computing, difficult model selection and capacity planning
* Weak monitoring ability, slow fault recovery, lack of self-healing ability

At the same time, with the increase in scale, scale advantages and pressure from high-performance throughput, the difficulty of deployment and operation and maintenance of real-time data warehouses increases exponentially, and the system faces many challenges in scheduling, deployment and operation and maintenance:

* How to solve the scheduling ability to meet the requirements of second-level pull-up and elastic scaling capabilities of service instances under the scale of 10,000 units in a single cluster;
* How to solve the capacity planning, stability guarantee, and machine self-healing of large-scale clusters, and improve related operation and maintenance efficiency;
*How to achieve the dual requirements of timeliness and accuracy of instance and cluster monitoring, including how to complete problem discovery and minute-level problem solving within minutes

Thanks to Alibaba Cloud's strong research and development capabilities in cloud-native basic services, the real-time data warehouse Hologres solves these challenges through the construction of multiple core capabilities such as excellent architecture design and Alibaba Cloud's big data intelligent operation and maintenance platform, and provides users with A real-time data warehouse product with powerful performance, excellent scalability, high reliability, and maintenance-free.

This article will start from the ultra-large-scale deployment and operation and maintenance system construction, analyze the challenges faced by the ultra-large-scale real-time data warehouse and the targeted design and solutions, so as to achieve high-performance while supporting high-load and high-throughput, and achieve production-level High availability.

2 Design of large-scale scheduling architecture based on cloud native

With the rise of cloud technology, more and more systems have just begun to use Kubernetes as a container application cluster management system, which provides automatic resource scheduling, container deployment, dynamic expansion, rolling upgrade, load balancing, service discovery, etc. for containerized applications. Function.

Hologres has been optimized in advance at the beginning of the design architecture. It adopts the cloud-native containerized deployment method and uses Kubernetes as the resource scheduling system to meet the needs of ultra-large-scale nodes and scheduling capabilities in real-time data warehouse scenarios. The cloud-native cluster that Hologres relies on can support more than 10,000 servers, and a single instance can reach 8,192 nodes or even larger.

1 Tens of thousands of Kubernetes scheduling

Kubernetes officially announced that the maximum cluster size is 5,000 units. In the Alibaba Cloud scenario, in order to meet business scale requirements and improve resource utilization, the cloud-native cluster size must reach 10,000 units. As we all know, Kubernetes is a central node service that strongly relies on etcd and kube-apiserver. This block is where the performance bottleneck lies. Breaking through the scale of 10,000 units requires in-depth optimization of related components. At the same time, it is necessary to solve the problem of single-point failover speed and improve the availability of cloud-native clusters.

Through stress testing and simulating the pressure under tens of thousands of nodes and millions of pods, serious response delay problems were found, including:

* etcd has a lot of read and write delays, and has caused a denial of service situation. At the same time, due to space limitations, it cannot carry Kubernetes to store a large number of objects;
*API Server query latency is very high, concurrent query requests may cause backend etcd oom;
*Controller has a high processing delay and takes a long time to recover from abnormalities. When an abnormal restart occurs, the recovery time of the service will take several minutes;
*Scheduler has high latency and low throughput, which cannot meet the needs of daily operation and maintenance of the business, let alone support extreme scenarios of big promotions

In order to break through the bottleneck of the k8s cluster scale, the relevant team conducted a detailed investigation and found the cause of the processing bottleneck:

*It is found that the performance bottleneck is kubelet, which reports its full amount of information every 10s as a heartbeat synchronization to k8s. The amount of data is as small as a few KB and as large as 10KB+. When the number of nodes reaches 5000, it will cause writing pressure on kube-apiserver and ETCD.
*The storage capacity recommended by etcd is only 2G, and the object storage requirements of the k8s cluster under the scale of 10,000 units far exceed this requirement, and the performance must not be degraded;
*In the deployment of multiple API Servers used to support high availability of the cluster, there will be unbalanced load, which will affect the overall throughput;
*The original scheduler has poor performance and weak capabilities, and cannot meet the capabilities for scenarios such as mixed departments and big promotions.

In view of this situation, the following optimizations are made to achieve the scale scheduling of 10,000 units:

* etcd designs a new memory free page management algorithm to greatly optimize etcd performance;
* Solved the performance bottleneck of APIServer by implementing Kubernetes lightweight heartbeat and improving the load balancing of multiple API Server nodes under HA cluster;
*The service interruption time of the controller/scheduler during active-standby switchover is greatly shortened by means of hot standby, and the availability of the entire cluster is improved;
* Improve the scheduling performance of Scheduler by supporting equivalence class processing and the introduction of random relaxation algorithm

Three Hologres operation and maintenance system construction

1 Hologres operation and maintenance system overview

Aiming at the problems and pain points encountered in the OLAP system, as well as the operation and maintenance challenges under the pressure of ultra-large-scale deployment, and relying on the Alibaba Cloud big data operation and maintenance platform, we designed the Hologres operation and maintenance system to solve automation problems such as resource and cluster delivery. , cluster and instance-level real-time observability issues and an intelligent self-healing system to improve the SLA of Hologres to the level of production availability.

2 Cluster automated delivery

Hologres is completely designed and implemented based on the cloud-native approach. It decouples computing resources and storage resources through the separation of storage and computing. The deployment of computing nodes is deployed and pulled up through the K8s cluster. Through the self-developed operation and maintenance management system ABM, in terms of cluster delivery, we abstract the cluster design and separate the concepts of resource clusters and business clusters; for the delivery of resource clusters, ABM and the underlying platform are connected to create and manage resource clusters. Capacity maintenance; on the business cluster, ABM provides a deployment template based on the K8s concept, and quickly pulls up nodes such as management and control on the resource cluster to complete the delivery.

3 Observability system

The observable performance of the system helps businesses better manage cluster water levels and troubleshoot problems, thereby improving enterprise-level management and control capabilities. In terms of observability, it is not only necessary to reveal more simple and easy-to-understand monitoring indicators, but also to have a mature log collection system, so as to achieve simpler operation and maintenance, and only need to be responsible for business problems. Based on Alibaba Cloud's monitoring products and the observability requirements of Hologres, we designed the real-time monitoring capability of Hologres.

Metric monitoring system

In order to support detailed system capability observation, performance monitoring, fast problem location and debugging, Hologres supports a very rich metric monitoring system, which also puts forward very high requirements for the collection, storage and query of the entire metric link. For the monitoring link, Hologres chose the Emon platform developed by Alibaba. In addition to supporting the writing of billion-level metrics per second, Emon also supports automatic downsample, aggregation optimization and other capabilities; The core metrics are exported to cloud monitoring, which is convenient for users to monitor and observe instances and locate problems by themselves.

Log collection and monitoring

In terms of log collection, Hologres adopts the mature cloud product SLS, which can support central log checking and filtering; at the same time, considering that the log volume of Hologres is also very large, a sub-module and grading mechanism is adopted in collection to control costs. At the same time, it can well solve the needs of troubleshooting and auditing. At the same time, SLS also provides a monitoring solution based on keywords, etc., which can issue alarms for key errors, so as to facilitate timely handling of problems.

Meta warehouse-based availability monitoring

In the collection and alarm of metrics and logs, more problems are reflected in a certain module, and the above methods cannot fully answer the availability of a certain instance. Based on this, we built a Hologres operation and maintenance data warehouse to comprehensively judge whether the instance is working properly through multi-dimensional events and states.
Multi-dimensional data will be collected and maintained in the meta warehouse, including the meta data of the instance, the availability judgment criteria of each module in Hologres, the status of each module of the instance, and the event center, including operation and maintenance events, customer events, system events, etc.; While judging the availability of instances, the meta warehouse also provides various data for instance diagnosis and instance inspection. At present, the ability of the meta warehouse has been released as a slow query log. Users can use the slow query log to perform self-service problem diagnosis and tuning.

4 Intelligent operation and maintenance to improve product SLA
On the basis of perfect observability, in order to improve the speed of problem location and shorten the instance recovery time, that is, to improve the MTTR of Hologres, based on the basic capabilities and intelligent operation and maintenance solutions provided by the Alibaba Cloud big data operation and maintenance platform, we built a complete Hologres SLA management system and fault diagnosis and self-healing system.

SLA system

Based on the data and instance availability definitions of the Hologres operation and maintenance warehouse, we have established a Hologres instance-level availability management system. The instance availability data will enter the ABM SLI database. SLI triggers instance availability monitoring based on data and conditions. When the monitoring is issued, The diagnosis of the instance will be triggered, and the system will judge whether to perform self-healing according to the diagnosis result. If it is known that automatic recovery is possible, it will trigger self-healing and perform automatic recovery of the fault; if it is an unknown situation, it will trigger the generation of a manual work order. The single system will be followed up by people and gradually form a self-healing action.

Intelligent inspection

Intelligent inspection solves some hidden and non-emergency problems of clusters or instances, and avoids the accumulation of minor problems that cause qualitative changes that affect online stability; in addition to some clearly defined inspection items, intelligent inspection also introduces Clustering algorithms, etc., analyze system indicators, which will also help us discover some discrete nodes in the cluster and deal with them in a timely manner to avoid affecting the availability of the entire instance caused by problem nodes.

Intelligent diagnosis and self-healing

Intelligent diagnosis not only relies on the data of the operation and maintenance warehouse, but also relies on the algorithm support related to diagnosis, including log clustering, root cause analysis, etc., to cluster error logs and mark the clustering results. With the support of algorithms and engineering capabilities provided by ABM, instance diagnosis is already helping businesses quickly locate problems, improve the efficiency of problem solving, and shorten the MTTR of instances.

Four Hologres product-level operation and maintenance capabilities

In addition to the operation and maintenance stability guarantee of the Hologres service itself introduced above. On the Hologres product side, system stability is improved in various ways:

1. High availability architecture
High-availability architecture design is adopted to stably support the traffic peaks of Ali Group’s big promotions such as Double 11 over the years, and has experienced large-scale production tests, including

Storage and Computing Separation Architecture Enhances System Expansion Flexibility
Multi-modal replication solves the separation of data reading and writing, mainly including multiple copies to improve throughput, single-instance resource group isolation, and multi-instance shared storage high availability
The scheduling system improves the fast recovery capability of node failover

2. Diversified system observability indicators

In addition to the design of Hologres's own architecture, it also provides users with diversified observation indicators, real-time monitoring of cluster status and post-event review, without complex operations, and only needs to be responsible for the business:

Multi-dimensional monitoring indicators: real-time query of CPU, memory, number of connections, IO and other monitoring indicators, real-time early warning;
Slow query log: Diagnose, analyze and take optimization measures for slow or failed queries that occur through time, plan, cpu consumption and other indicators to improve self-diagnosis capabilities;
Execution plan visualization: Through a variety of visual display methods, run and execute analysis on Query, interpret operators in detail, and guide optimization suggestions to avoid blind optimization, lower the threshold of performance tuning, and quickly achieve the purpose of performance tuning .

Five summary

Through the analysis and targeted optimization of scheduling performance bottlenecks faced by large-scale scheduling, Hologres can complete instance delivery and scaling of 8192 nodes or even larger scale. At the same time, based on the cloud-native Hologres intelligent operation and maintenance system construction, it solves the problems of operation and maintenance efficiency and stability improvement faced by large-scale clusters and instances. It achieves high performance while handling throughput, realizes high availability at the production level, better supports business, and provides good support for the digital transformation of enterprises.

Related Articles

Explore More Special Offers

  1. Short Message Service(SMS) & Mail Service

    50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00

phone Contact Us