Exploration and Practice of Database Governance

Background

In a distributed system architecture, business traffic flows end to end. Each request passes through many layers of processing: from the portal gateway to the web server, through service-to-service calls, and finally to storage such as caches or the database.

The database is a critical part of the system. Whether for stability governance or for improving development efficiency, database-related governance capabilities are therefore indispensable.

The following are some typical database related governance scenarios:

• A system exposes a query interface whose SQL joins multiple tables. Under certain conditions this triggers slow queries that take up to 30 seconds, eventually filling the DB connection pool and the Tomcat thread pool and making the whole application unavailable.

• An application has just started, and while the Druid connection pool is still initializing, a large number of requests pour in and quickly fill the Dubbo thread pool. Many requests are stuck waiting for database connections to be established, so a large number of business requests fail.

• In a full-link grayscale scenario, a new application version changed the content of a database table. Grayscale traffic corrupted data in the online database, and business engineers had to work overnight to manually correct the online data.

• SQL performance was not considered at the beginning of the project. As the business grew and the number of users increased, the SQL behind legacy online interfaces gradually became a performance bottleneck. Effective SQL insight is needed to identify such legacy SQL and optimize it in a timely manner.

• Long SQL execution times cause a large number of slow calls on online business interfaces. The problematic slow SQL must be located quickly and isolated through governance measures so that the business can recover. Real-time SQL insight at the data-access layer therefore helps us quickly locate slow SQL calls.



In fact, for most back-end applications the system bottleneck lies mainly in the database, and complex business logic cannot avoid database operations. Database issues therefore deserve the highest priority, and database governance is an essential part of microservice governance.

Common scenarios related to database governance

The following summarizes some common scenarios and capabilities in database governance when microservices access the database layer.

Overview of Database Governance in the OpenSergo Domain

Slow SQL governance

Slow SQL is one of the deadliest factors affecting system stability. Slow SQL in a system can cause CPU and load anomalies and exhaust system resources; in severe cases it may drag down the entire database and put online business at risk. Common causes of slow SQL in production environments include:

• Hardware reasons such as slow network speed, insufficient memory, low I/O throughput, and full disk space.

• No index or invalid index.

• The data volume is too large.

• SQL performance was not well considered at the beginning of the project.

MSE service governance provides a scenario-based solution to common online slow SQL problems.

• SQL Insight

MSE provides second-level monitoring of SQL calls:

We can observe real-time data (down to the second) in both the application and the resource/API dimensions, and MSE also provides a TopN list of SQL statements, so we can quickly identify SQL with high RT and locate the root cause of an application slowdown.

Through the SQL insight capability provided by MSE, we can effectively analyze whether SQL statements are written reasonably and whether their concurrency and RT meet the system's performance expectations. Based on this SQL insight data, we can evaluate overall system performance and obtain an important basis for configuring flow-control and degradation rules.

• SQL flow control and degradation

Based on the SQL statements automatically identified by MSE, we can configure flow-control or degradation rules in the thread-count dimension for applications affected by slow SQL. When slow SQL calls occur, we can limit how many of those SQL statements execute at the same time, preventing excessive slow SQL execution from exhausting resources.

Regarding MSE's SQL flow-control and degradation capability, MSE supports four kinds of rules: flow control, concurrency isolation, circuit breaking, and hotspot flow control. A code sketch of these rules follows the list below.

1. Flow control: Configure flow-control rules for the service interface so that requests within the capacity range pass while excess requests are rejected. This acts like an airbag, ensuring that the volume of SQL requests stays within the system's capacity threshold.

MSE will later provide SQL insight aggregated by database and table dimensions. Based on that capability, we will be able to limit the traffic of specific databases and tables to their estimated capacity.

2. Concurrency isolation: When traffic is roughly stable, the number of concurrent threads ≈ QPS × RT (in seconds); as RT increases, so does the number of concurrent threads, indicating that service calls are piling up. For example, at 200 QPS an RT of 0.05 s occupies about 10 threads, but if RT degrades to 1 s the same traffic occupies about 200 threads. Using the concurrency-isolation capability provided by traffic management to cap the number of concurrent threads for important service calls acts as a "soft" safeguard that prevents slow SQL or unstable services from crowding out resources needed by normal services.

3. Circuit breaking and degradation: During peak business hours, some downstream services hit performance bottlenecks when accessing large amounts of data, producing a large number of slow SQL statements and even affecting the business. We can configure automatic circuit-breaking rules for database access of non-critical services: when the slow-call ratio or error ratio within a time window reaches the configured condition, the circuit breaker trips and subsequent calls directly return a Mock result. This keeps the caller from being dragged down by piled-up data-access requests and keeps the overall business link running.

4. Hotspot flow control: The hotspot-parameter flow-control capability automatically identifies the TopN hottest parameter values among SQL request parameters and throttles them individually, preventing a single hot key from overloading the system. Separate flow-control thresholds can also be configured for special hotspots (such as extremely popular flash-sale items). The parameter can be any business-meaningful condition in the SQL, such as the value of the tid parameter below.

SELECT * FROM order WHERE tid = 1
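To make these four rule types more concrete, the following is a minimal sketch using the open-source Sentinel API, which MSE's flow-control and degradation capabilities are built on. The resource name, thresholds, and parameter index are hypothetical; in practice MSE identifies SQL resources automatically and delivers rules from the console rather than requiring hand-written code.

import java.util.Collections;

import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;
import com.alibaba.csp.sentinel.slots.block.flow.param.ParamFlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.param.ParamFlowRuleManager;

public class SqlGovernanceRuleSketch {

    // Hypothetical resource name for a slow SQL statement identified by SQL insight.
    private static final String SQL_RESOURCE = "SELECT * FROM order WHERE tid = ?";

    public static void main(String[] args) {
        // Flow control / concurrency isolation in the thread-count dimension:
        // at most 8 threads may execute this SQL at the same time, so slow
        // executions cannot exhaust the connection pool or the service thread pool.
        FlowRule concurrencyRule = new FlowRule(SQL_RESOURCE)
                .setGrade(RuleConstant.FLOW_GRADE_THREAD)
                .setCount(8);
        FlowRuleManager.loadRules(Collections.singletonList(concurrencyRule));

        // Circuit breaking: if calls slower than 500 ms exceed 60% of requests,
        // trip the breaker for 30 seconds so callers fall back to a Mock result.
        DegradeRule slowCallRule = new DegradeRule(SQL_RESOURCE)
                .setGrade(RuleConstant.DEGRADE_GRADE_RT)
                .setCount(500)
                .setSlowRatioThreshold(0.6)
                .setMinRequestAmount(10)
                .setTimeWindow(30);
        DegradeRuleManager.loadRules(Collections.singletonList(slowCallRule));

        // Hotspot flow control: throttle by the first bind parameter (e.g. tid),
        // limiting each distinct parameter value to roughly 100 QPS.
        ParamFlowRule hotKeyRule = new ParamFlowRule(SQL_RESOURCE)
                .setParamIdx(0)
                .setCount(100);
        ParamFlowRuleManager.loadRules(Collections.singletonList(hotKeyRule));
    }
}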

Connection Pool Governance

Connection pool governance is a very important aspect of database governance. Real-time connection-pool metrics help us identify risks in the system ahead of time. The following are some common connection-pool governance scenarios.

1. Establish connections in advance

During application release or elastic scale-out, if a newly started instance has not yet established its database connections but has already passed the Readiness check, a large amount of business traffic enters the new pod immediately. Many requests then block while acquiring connections from the pool, the service thread pool fills up, and a large number of business requests fail. If the application can establish connections in advance, ensuring that at least minIdle connections exist before traffic arrives, and combine this with small-traffic warm-up, the cold-start problem described above can be solved.
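As a rough sketch of this idea, assuming the Druid connection pool mentioned above, the pool can be initialized explicitly during startup so that minIdle connections already exist before the Readiness check passes; the pool sizes below are illustrative only.

import com.alibaba.druid.pool.DruidDataSource;

public class WarmConnectionPoolFactory {

    // Build a pool whose connections are created eagerly at startup rather than
    // lazily on the first business request. All sizes are illustrative.
    public static DruidDataSource create(String url, String user, String password) throws Exception {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl(url);
        ds.setUsername(user);
        ds.setPassword(password);

        ds.setInitialSize(10); // connections established eagerly by init()
        ds.setMinIdle(10);     // keep at least this many idle connections alive
        ds.setMaxActive(50);

        // Initialize the pool while the application is still starting, so the
        // Readiness probe only passes after connections have been established.
        ds.init();
        return ds;
    }
}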

2. "Bad" connection culling

Sometimes problematic connections appear in the pool, caused by jitter in the underlying network, or by slow business execution, deadlocks, and similar issues. If we can detect abnormal connections promptly from the connection-pool perspective and evict and recycle them in time, the pool as a whole stays stable and is not dragged down by individual problematic business executions or network jitter.
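A minimal sketch of this scenario, again assuming a Druid pool: idle connections are validated periodically, and connections held abnormally long (for example due to deadlocks or very slow executions) are reclaimed. The thresholds are assumptions, not recommendations.

import com.alibaba.druid.pool.DruidDataSource;

public class BadConnectionEvictionConfig {

    // Detect and recycle broken or leaked connections before business threads
    // pick them up from the pool.
    public static void apply(DruidDataSource ds) {
        ds.setValidationQuery("SELECT 1");
        ds.setTestWhileIdle(true);                   // validate connections while idle
        ds.setTimeBetweenEvictionRunsMillis(60_000); // run the evictor every minute
        ds.setMinEvictableIdleTimeMillis(300_000);   // recycle connections idle for > 5 minutes

        // Reclaim connections that business code has held for too long and log
        // where they were borrowed, to help locate slow or deadlocked call sites.
        ds.setRemoveAbandoned(true);
        ds.setRemoveAbandonedTimeout(180);           // seconds
        ds.setLogAbandoned(true);
    }
}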

3. Access Control

In principle, not every database table should be freely accessible. At certain times, some important tables may need to be write-prohibited or read-only for less important services. When the database jitters or the thread pool fills up, we may want to reduce the execution of time-consuming read SQL; and for tables containing sensitive data, only one application may be allowed read and write access. With dynamic access-control capabilities, access-control rules can be pushed in real time to enforce allow/deny lists at the level of individual methods, database instances accessed by application SQL, and table read/write bans.
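To illustrate the idea only, here is a deliberately simplified, hypothetical check that an agent or data-source filter could run before executing a statement. The rule shape, table names, and matching logic are invented for illustration and are not the MSE API; real rules would be delivered dynamically and matched against parsed SQL rather than raw text.

import java.sql.SQLException;
import java.util.Locale;
import java.util.Set;

public class SqlAccessControl {

    // Hypothetical in-memory rules: tables that are read-only for this application
    // and tables that are blocked entirely. In a real system these would be pushed
    // dynamically by the governance control plane.
    private final Set<String> readOnlyTables;
    private final Set<String> deniedTables;

    public SqlAccessControl(Set<String> readOnlyTables, Set<String> deniedTables) {
        this.readOnlyTables = readOnlyTables;
        this.deniedTables = deniedTables;
    }

    // Called before a statement is sent to the database; throws if the SQL
    // violates an access-control rule.
    public void check(String sql) throws SQLException {
        String normalized = sql.toLowerCase(Locale.ROOT);
        boolean isWrite = normalized.startsWith("insert")
                || normalized.startsWith("update")
                || normalized.startsWith("delete");

        for (String table : deniedTables) {
            if (normalized.contains(table)) {
                throw new SQLException("Access to table '" + table + "' is denied by a governance rule");
            }
        }
        if (isWrite) {
            for (String table : readOnlyTables) {
                if (normalized.contains(table)) {
                    throw new SQLException("Table '" + table + "' is read-only under the current governance rule");
                }
            }
        }
    }
}

For example, new SqlAccessControl(Set.of("user_credit"), Set.of("audit_log")).check(sql) would reject writes touching user_credit and any access touching audit_log (table names here are invented).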

Database grayscale

In a microservice architecture, dependencies between services are complex, and releasing a feature sometimes requires multiple services to be upgraded and launched at the same time. We want to verify the new versions of these services simultaneously with a small amount of grayscale traffic; this is the full-link grayscale scenario unique to microservice architectures. By building environment isolation from the gateway through the entire back end, multiple different service versions can be verified with grayscale traffic. Using the shadow-table approach, MSE lets users achieve full-link grayscale down to the database level without modifying any business code.

Summary

The above previews the database governance capabilities that MSE is about to launch. From the application's perspective, we have organized and abstracted our practical experience in stability governance, performance optimization, efficiency improvement, and other aspects of accessing and using databases. For every back-end application, the database is undoubtedly a top priority, and we hope our database governance capabilities can help everyone use database services better.

OpenSergo, the standard for service governance

Q: What is OpenSergo?

A: OpenSergo is an open, general-purpose set of service governance standards for distributed service architectures, covering the full link and heterogeneous ecosystems. It forms a common service-governance standard based on industry scenarios and practices. OpenSergo's biggest feature is that it defines service governance rules with a unified set of configurations/DSLs/protocols and targets multi-language heterogeneous architectures to achieve full-link ecosystem coverage. Whether the microservice language is Java, Go, Node.js, or something else, whether services are standard microservices or mesh-attached, from gateways to microservices, from databases to caches, and from service registration and discovery to configuration, developers can uniformly govern every layer through the same set of OpenSergo CRD standard configurations, without worrying about the differences between frameworks and languages, which reduces the complexity of heterogeneous, full-link service governance.
