Experience and Methods for Improving System Performance

1. Background

The emergency link for fund-verification data assembly and execution handles concurrency on the order of tens of millions of TPS. At the same time, because of the nature of the funds business, the system has very high requirements for availability and accuracy. In daily development we run into all kinds of high-availability problems and keep trying out system designs and performance optimizations. Along the way we have accumulated some experience and methods for performance optimization, which we share here and will continue to extend as new problems come up.

2. What is a high-performance system

First, let's clarify what high-availability and high-performance design mean. The usual definition is that the core goal of high availability (HA) is to ensure business continuity: from the user's perspective, the business always provides service normally and stably. The industry generally measures availability in "number of nines". A series of dedicated designs (redundancy, removing single points of failure, etc.) is adopted to reduce business downtime and keep the core services highly available.

High concurrency means the system can process many requests in parallel at the same time. It is generally measured by response time, throughput (TPS), number of concurrent users, and so on.

High performance means the program runs fast, uses little memory, and consumes little CPU. High-performance indicators are closely related to high-concurrency indicators: to improve performance, you need to improve the system's concurrency capability.

This article mainly introduces and shares the design of "high performance, high concurrency and high availability" services.

3. How to improve performance

Whenever we talk about high-performance design, a few terms always come up: IO multiplexing, zero copy, thread pools, redundancy, and so on. There are many articles about these techniques, but in essence this is a systemic problem that can be reasoned about from the underlying computer architecture. System optimization comes down to two dimensions: computing performance (CPU) and storage performance (IO). The approaches can be summarized as follows:

• How to design high-performance computing (CPU)

• Reduce the computing cost: optimize the time complexity of the code, e.g. O(N^2) -> O(N); use synchronous and asynchronous calls appropriately; apply rate limiting to reduce the number of requests

• Let more cores participate in the computation: multi-threading instead of single-threading, clusters instead of single machines, etc.

• How to improve system IO

• Accelerate IO: sequential reads and writes instead of random ones, SSDs at the hardware level, etc.

• Reduce the number of IO operations: use indexes/distributed computing instead of full table scans, use zero copy to reduce the number of IO copies, batch DB reads and writes, and shard databases and tables to increase the number of available connections

• Reduce IO storage: data expiration strategies, rational use of memory, cache, DB and other middleware, and proper message compression

4. High-performance optimization strategies

1. Computing performance optimization strategy

1.1 Reduce the computational complexity of the program

Take a quick look at this pseudocode (the business code has been desensitized):

boolean result = true;

// Loop through the requests: for each request, query the DB, then check whether it belongs to service A; return false if any service A request has not reached its final state, otherwise return true

There are several obvious problems in the code:

1. For every incoming request, the DB is queried first and only afterwards is the request filtered by business type, so the DB query (an expensive DAO call) is wasted on requests that do not belong to service A at all. We should first check whether the business belongs to A and only then query the DB.

2. The requirement is to return false as soon as any service A request has not reached its final state. Once the result becomes false, the loop can break immediately to avoid unnecessary iterations.

Optimized code

boolean result = true;

// Loop through the requests: check the business type first, query the DB only for service A requests, and break as soon as a non-final service A request is found
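
A minimal sketch of what the optimized loop might look like (Request, isServiceA, and the DAO call are hypothetical stand-ins for the desensitized business code):

boolean result = true;
for (Request request : requests) {
    // filter by business type first, so the expensive DAO query only runs for service A
    if (!isServiceA(request)) {
        continue;
    }
    // only service A requests reach the DB
    ServiceAStatus status = serviceADao.queryStatus(request.getId());
    if (!status.isFinal()) {
        result = false;
        break;   // one non-final service A request decides the answer; stop looping
    }
}
return result;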

After optimization, the average calculation time dropped from 270.75 ms to 40.5 ms.

In day-to-day optimization work, the Arthas tool can be used to analyze which calls in the program are time-consuming; time-consuming operations should be filtered out early whenever possible to reduce unnecessary system calls.
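
For example, a quick way to locate the slow call with Arthas might look like this (the class and method names are hypothetical):

# inside the Arthas console attached to the target Java process
trace com.example.FundCheckService checkFinalState -n 5        # show the call tree with per-call cost
monitor -c 10 com.example.FundCheckService checkFinalState     # sample success rate and average RT every 10s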

1.2 Rational use of synchronous and asynchronous calls

Analyze which business steps need to wait for a result synchronously and which do not. Calls on the core dependency path can stay synchronous, while calls to non-core dependencies should be made asynchronous wherever possible.

Scenario: on the existing link, system A calls system B, and system B calls system C to finish the calculation and return the conclusion to system A. System A's timeout is 400 ms; system A's call to system B normally takes 300 ms, and system B's call to system C takes 200 ms.

Now system C also needs to push the call conclusion to system D, which takes an extra 150 ms.

With that extra 150 ms added to a fully synchronous chain, the existing A - B - C link may start failing on timeout. Since updating the conclusion in system D is not a strong dependency, system C should change that call into an asynchronous one.
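
A minimal sketch, assuming system C notifies system D on a separate thread pool so that the synchronous A -> B -> C path no longer pays the extra 150 ms (client and method names are hypothetical):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

private final ExecutorService notifyPool = Executors.newFixedThreadPool(4);

public Conclusion conclude(Request req) {
    Conclusion conclusion = doCalculate(req);          // stays on the synchronous critical path
    CompletableFuture
        .runAsync(() -> systemDClient.updateConclusion(conclusion), notifyPool)  // non-core dependency
        .exceptionally(ex -> { log.warn("async notify of system D failed", ex); return null; });
    return conclusion;                                 // B gets the result without waiting for D
}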

1.3 Apply rate limiting protection

Fault scenario: system A calls system B to query abnormal data, normally at about 10 TPS or even less. One day, system A changed its scheduled-task trigger logic and, combined with a code bug, drove the call frequency up to 500 TPS. In addition, because of ID errors, system A bypassed the cache and queried the DB and HBase directly, creating an HBase read hotspot that dragged down the cluster and affected both storage and queries.

Afterwards, rate limiting was added to system A's queries to keep concurrency within 15 TPS. Core business services need proper query rate limiting and cache design.
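
A minimal sketch of such a guard, using Guava's RateLimiter as one possible implementation (the 15 TPS threshold follows the scenario above; the exception and method names are hypothetical):

import com.google.common.util.concurrent.RateLimiter;

// cap the abnormal-data query interface at roughly 15 permits per second
private final RateLimiter queryLimiter = RateLimiter.create(15.0);

public QueryResult queryAbnormalData(String id) {
    if (!queryLimiter.tryAcquire()) {
        // fail fast instead of letting a buggy caller flood the DB/HBase
        throw new IllegalStateException("abnormal-data query rate limit exceeded");
    }
    return doQueryWithCache(id);
}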

1.4 Multi-threading instead of single-threading

Scenario: in the emergency-locating scenario, system A calls system B to obtain a diagnosis conclusion, with a TR timeout of 500 ms. For one abnormal ID event, several diagnostic services have to run and the diagnosis flow has to be recorded; each diagnosis takes about 100 ms. As the business grew past five diagnostic items, the calculation started taking 500 ms+, and the service became temporarily unavailable during peak periods.

Change this code to execute the diagnostic items concurrently, so the total diagnosis time is bounded by the single most time-consuming diagnostic service.
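
A minimal sketch of running the diagnostic items in parallel so the overall latency approaches the slowest single item (diagnosticServices and DiagnosisResult are hypothetical names):

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

private final ExecutorService diagPool = Executors.newFixedThreadPool(8);

public List<DiagnosisResult> diagnose(String abnormalId) {
    List<CompletableFuture<DiagnosisResult>> futures = diagnosticServices.stream()
        .map(svc -> CompletableFuture.supplyAsync(() -> svc.diagnose(abnormalId), diagPool))
        .collect(Collectors.toList());
    // total cost is roughly the slowest single diagnosis (~100 ms), not the sum of all items
    return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
}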

1.5 Cluster computing instead of stand-alone computing

Here, the computing tasks can be partitioned with a three-layer distribution and executed across the cluster; Map-Reduce is one such idea for reducing the computing pressure on a single machine.

2. System IO performance optimization strategy

2.1 Common Full GC problems and solutions

Full GC problems are common in systems. First, a quick look at the JVM garbage collection mechanism. The heap is generational: it is divided into Eden, Survivor, and Tenured/Old spaces. Eden and Survivor belong to the young generation, while Tenured/Old is the old generation. A GC in the young generation is usually called a Minor GC, a GC in the old generation a Major GC, and a GC over the whole heap a Full GC.

Memory allocation strategy:
1. Objects are allocated in the Eden area first.
2. Large objects go directly into the old generation.
3. Objects that survive long enough are promoted to the old generation.
4. Dynamic object age determination: the JVM does not always require an object to reach MaxTenuringThreshold before promotion; if the total size of all objects of the same age in the Survivor space is greater than half of the Survivor space, objects of that age or older go directly into the old generation.
5. As long as the contiguous free space in the old generation is larger than the total size of all objects in the young generation (or the average size of previous promotions), a Minor GC is performed; otherwise a Full GC is performed.
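
For reference, a few commonly used JVM options related to the generational behaviour above (illustrative values only; real sizes must come from capacity evaluation and pressure testing):

-Xms4g -Xmx4g                        # fixed heap size, avoids GC caused by heap resizing
-Xmn1g                               # young generation size (Eden plus two Survivor spaces)
-XX:SurvivorRatio=8                  # Eden : Survivor size ratio
-XX:MaxTenuringThreshold=15          # maximum age before promotion to the old generation
-XX:+PrintGCDetails -Xloggc:gc.log   # JDK 8 style GC logging for later analysis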

Common cases that trigger Full GC in this kind of system:

(1) Querying large objects: historical patrol data in the business needs to be cleaned up regularly. The deletion strategy was to delete, every day, the data older than one month (the business marks it with a soft-delete flag), and to physically reclaim the data during the periodic database cleanup.

One day, the deletion policy was changed from "delete data older than one month" to "delete data older than one week". As a result, the data to be deleted grew from about 1,000 records to 150,000, and those data objects occupied more than 80% of the memory, which directly caused a Full GC and affected other tasks.

Much of the system code places no limit on the number of rows a query can return. As the business keeps growing and the system capacity is not upgraded, such queries return ever larger object lists, and these large objects trigger frequent GC.

(2) Static objects that are never reclaimed

System A defined a static List object used for reading DRM configuration, but some logic queried configuration information and then put it into this object. As the business grew, the static object kept getting larger; being a class-level reference, it could not be reclaimed, and the system ended up doing frequent GC.

Using an object as the key of a map was also unreasonable: the objects referenced through the keys could not be reclaimed either, which again led to GC pressure.
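
A minimal sketch of the anti-pattern and one possible mitigation, a size-bounded LRU map (ConfigKey/ConfigValue are hypothetical stand-ins for the DRM configuration types):

import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// anti-pattern: a static map that only grows; its entries are reachable from a class
// reference and will never be reclaimed by GC
private static final Map<ConfigKey, ConfigValue> CONFIG_CACHE = new HashMap<>();

// one possible mitigation: bound the size and evict the least recently used entry
private static final int MAX_ENTRIES = 1000;
private static final Map<ConfigKey, ConfigValue> BOUNDED_CACHE =
    Collections.synchronizedMap(new LinkedHashMap<ConfigKey, ConfigValue>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<ConfigKey, ConfigValue> eldest) {
            return size() > MAX_ENTRIES;
        }
    });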

When space is still insufficient after a Full GC, the error [java.lang.OutOfMemoryError: Java heap space] is thrown. To avoid Full GC caused by the two cases above, tune so that objects are reclaimed during the Minor GC stage as much as possible, let objects live out their lifetime in the young generation where possible, and do not create excessively large objects and arrays.

2.2 Sequential read and write instead of random read and write

On an ordinary mechanical hard disk, random writes perform very poorly and fragmentation appears over time, while sequential writes save most of the seek and rotational latency and greatly improve performance. This layer is usually implemented by the middleware itself; for example, Kafka's log files store messages by writing them sequentially, appending to the end of the file.

2.3 DB Index Design

When designing the table structure, consider the later query patterns on the table and design a reasonable index structure. Once an index is built, also pay attention to subsequent queries to avoid rendering the index unusable.

(1) Try not to index columns with few distinct values, i.e. columns with poor selectivity and a lot of duplicate data. For example, if we index the is_delete column and query 100,000 rows of which 90,000 have is_delete = 0, then with the extra cost of accessing the index blocks, a full scan of the data blocks would actually be cheaper.

(2) Avoid leading-wildcard patterns such as LIKE '%xxx' and LIKE '%xxx%': because the prefix is fuzzy, the index order cannot be used to locate data blocks, and the query degrades to a full table scan. LIKE 'A%' is not affected, because the scan of the column can stop as soon as values starting with "B" are reached. We hit exactly this kind of index failure when doing fuzzy queries on user information.

(3) Other scenarios can also defeat an index, such as OR conditions, multi-column indexes where the query does not use the leftmost column, query conditions that apply computation to the column, or cases where a full table scan is simply faster than using the index.

At present, tools such as AntMonitor and Tars already help us find SQL statements that consume a lot of time and CPU; the query logic can then be adjusted according to the execution plan. Frequent queries that touch a small amount of data benefit greatly from indexes. Of course, building too many indexes also has a storage cost, and for tables with frequent inserts and deletes you should also consider dropping unnecessary indexes.

2.4 Database and table sharding design

As the business grows, if the number of nodes in the application cluster gets too large, the database connection limit is eventually reached. The cluster size then becomes bounded by the number of database connections, nodes cannot be added to scale out, and the continuously growing business traffic cannot be handled. This is one of the reasons Ant built the LDC architecture: it splits and scales horizontally at the business layer, so the nodes in each cell only access the database corresponding to that cell.

2.5 Avoid JOINs across many tables

The Alibaba coding guidelines prohibit JOINs over more than three tables, because the Cartesian product of several tables makes the computation grow geometrically. When joining multiple tables, make sure the join fields are indexed.

If the business needs to combine data from several tables, it is often better to run nested queries by primary key and do the assembly in memory. For flow tables that are operated on very frequently, it is recommended to denormalize some fields, trading space complexity for time complexity.

2.6 Reduce time-consuming calculations on business flow tables

Business flow records sometimes require COUNT operations. For statistics that do not need to be real-time, it is recommended to compute them with scheduled tasks during the business low-peak period and store the results in the cache.

When multiple tables have to be joined, it is recommended to do the Map-Reduce calculation on offline tables and then write the results back to online tables for display.
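
A minimal sketch of pre-computing such a count during the low-peak window and serving it from cache (the DAO, the cache client, and the delayUntilLowPeakSeconds helper are illustrative assumptions):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

public void scheduleDailyCount() {
    // run the heavy COUNT once a day off-peak instead of on every request
    scheduler.scheduleAtFixedRate(() -> {
        long count = businessFlowDao.countByStatus("PENDING");
        cacheClient.set("pending_flow_count", String.valueOf(count), 24 * 3600);  // cache for a day
    }, delayUntilLowPeakSeconds(), TimeUnit.DAYS.toSeconds(1), TimeUnit.SECONDS);
}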

2.7 Data expiration policy

When a table grows too large and queries do not prune by index and date, full table scans occur and DB query performance suffers badly. It is recommended to design a reasonable data expiration policy: regularly move historical data into a history table, or copy it to offline tables, to reduce the volume of data stored online.

2.8 Rational use of memory

As we all know, a relational DB query ultimately hits disk storage, which is slower than a memory cache; and calls from the business system to a remote cache take network time, which is slower than local memory. In terms of capacity, however, local memory holds less data than the cache, which in turn holds less than the DB. Long-lived persistent data should therefore be stored in the DB on disk, and the design has to balance cost against query performance.

Once memory is involved, data consistency becomes a problem: how to keep DB data and in-memory data consistent, whether strong or weak consistency is required, how to control write ordering and transactions, and how to keep all of this invisible to users as far as possible.
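
A minimal sketch of one common compromise: a local read-through cache with a short TTL in front of the DB, trading bounded staleness for fewer DB round trips (ConfigDao, the CachedValue holder, and the 20-minute TTL are illustrative assumptions):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

private static final long TTL_MILLIS = 20 * 60 * 1000L;   // tolerate up to 20 minutes of staleness
private final Map<String, CachedValue> localCache = new ConcurrentHashMap<>();

public Config getConfig(String key) {
    CachedValue cached = localCache.get(key);
    if (cached != null && System.currentTimeMillis() - cached.loadedAt < TTL_MILLIS) {
        return cached.config;                              // served from local memory
    }
    Config fresh = configDao.queryByKey(key);              // fall back to the DB
    localCache.put(key, new CachedValue(fresh, System.currentTimeMillis()));
    return fresh;
}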

2.9 Data compression

Many middleware systems compress and decompress data for storage and transmission to reduce bandwidth cost; we will not go into compression itself here. One point worth mentioning for high-concurrency businesses is to control log printing reasonably. For the sake of troubleshooting we cannot simply print JSON.toJSONString(Object) everywhere: the disk fills up easily, and under the log capacity expiration policy the logs are also recycled quickly, which actually makes troubleshooting harder. So use logs sensibly: error logs can be kept concise, the core business logic can print summary (digest) logs, and structured data also makes later monitoring and data analysis easier.

When printing a log, ask a few questions: will anyone actually read this log, what can it be used for, does every field really need to be printed, and will it improve troubleshooting efficiency when a problem occurs?

2.10 HBase hot key problems

HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system, i.e. a non-relational database. Its storage characteristics are as follows:

1. Columns can be added dynamically; if a column is empty, no data is stored, saving storage space.

2. HBase splits data automatically, so data storage scales out horizontally.

3. HBase supports highly concurrent reads and writes; in its distributed architecture, the probability of waiting on read/write locks is greatly reduced.

4. Conditional queries are not supported; only queries by Rowkey are supported.

5. Failover of the master server is not yet supported; when the master goes down, the entire storage system fails.

HBase's storage structure is as follows: a table is divided into multiple HRegions along the row direction. The HRegion is the smallest unit of distributed storage and load balancing in HBase, meaning different HRegions can live on different HRegionServers, but a single HRegion is never split across HRegionServers. HRegions are split by size: each table initially has only one HRegion, and as data is inserted the HRegion grows; when one column family of the HRegion reaches a threshold (256 MB by default), it is split into two new HRegions.

Rows in HBase are sorted in the lexicographic (dictionary) order of the Rowkey. This design optimizes scans: related rows, and rows that will be read together, sit close to one another. However, this very property of the Rowkey is also the source of hotspot failures. A hotspot means a large number of clients directly access one node or a few nodes of the cluster (the access may be reads, writes, or other operations).

The heavy traffic pushes the single machine hosting the hot region beyond its capacity, degrading performance or even making the region unavailable. Other regions on the same RegionServer are also affected, because the host can no longer serve their requests. This is what we call a data hotspot (data skew).

Therefore, when inserting data into HBase, the RowKey should be designed so that data is written to multiple regions of the cluster rather than a single one; records should be distributed as evenly as possible across regions to balance the load on each region.

Common ways to avoid hot keys: reversing, salting, and hashing.

• Reversing: for example, user IDs share the 2088 prefix, and some business keys share a fixed leading string such as BBCRL; these fixed prefixes can be reversed or moved toward the back of the key as appropriate.

• Salting: add a prefix in front of the RowKey, such as a hash of the timestamp. The more distinct prefixes there are, the more evenly the rows are spread across regions according to the randomly generated prefix, avoiding hotspots; but the convenience of scans must also be considered.

• Hashing: if the business needs to be able to reconstruct the full RowKey, the prefix cannot be random. Instead, compute a hash from the original RowKey (or part of it) and use that hash value as the prefix.

In short, RowKey design should follow the length principle, the uniqueness principle, the sorting principle and the hashing (even distribution) principle; a minimal sketch of the hashing approach is shown below.
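
In this sketch, the bucket count and key layout are assumptions, not the production design:

private static final int BUCKETS = 16;

// prefix the original rowkey with a hash-derived bucket so writes spread across regions,
// while the prefix stays deterministic and a point read can still rebuild the key
public String buildRowKey(String userId, long timestamp) {
    int bucket = (userId.hashCode() & 0x7fffffff) % BUCKETS;
    return String.format("%02d_%s_%d", bucket, userId, timestamp);
}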

5. Design of a real-world emergency link system

To ensure high availability of the overall service, the high-availability design has to be viewed from the perspective of the full link. Here we briefly share a business scenario in which multiple upstream systems call an exception handling system to perform emergency response, and analyze its performance optimization and transformation.

Take the fund emergency system as an example and analyze the performance optimization in its design. As shown in the figure below, the exception handling system involves multiple upstream apps (1-N), which send "difference log data" to a message queue. The exception handling system subscribes to and consumes this error log data from the message queue, then parses, processes and aggregates it to complete alerting and emergency handling.

• High availability design in the sending phase

• Message production stage: a local queue buffers the exception details, and a daemon thread pulls them periodically and sends them in batches (this fixed the performance problem of reporting items one by one; see the sketch after this list)

• Message compression and sending: an assembled model is used for exception-rule reuse, and the payload is aggregated and compressed according to the rules before reporting (optimizing data compression and reuse at the business layer)

• The middleware provides an efficient message serialization mechanism and zero-copy sending

• Storage phase

• Middleware such as Kafka uses IO multiplexing plus sequential disk writes to ensure IO performance

• Partitioned and segmented storage is used to improve storage performance

• Consumption stage

• Periodically pull a batch of data, process it in bulk, and then commit the consumption offset before the subsequent calculation

• Idempotency control is applied to the data internally, so that jitter or single-machine failures during a release do not cause data to be recalculated

• To improve the DB count performance, exception counts are first accumulated in HBase, and a timed thread then fetches the data and updates the DB in batches

• To improve the DB configuration-query performance, configuration is cached in local memory for 20 minutes after the first query, and the cache entry is invalidated when the data is updated

• Explorer is used to store statistical calculations, HBase stores unstructured exception details, and OB stores structured, highly reliable exception data
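
A minimal sketch of the local buffering queue with daemon-thread batch sending mentioned in the message production item above (queue capacity, batch size, the pull interval, and messageClient/ExceptionDetail are illustrative assumptions):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

private final BlockingQueue<ExceptionDetail> buffer = new LinkedBlockingQueue<>(10_000);

public void startBatchSender() {
    Thread sender = new Thread(() -> {
        List<ExceptionDetail> batch = new ArrayList<>(200);
        while (!Thread.currentThread().isInterrupted()) {
            try {
                buffer.drainTo(batch, 200);                  // collect up to 200 buffered details
                if (!batch.isEmpty()) {
                    messageClient.sendBatch(batch);          // one batched send instead of N single sends
                    batch.clear();
                }
                Thread.sleep(500);                           // pull period
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }, "exception-batch-sender");
    sender.setDaemon(true);
    sender.start();
}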

1. Pressure-test the system and evaluate its capacity; when drill data reaches 3-5 times the volume of real abnormal data, isolate the traffic, split the flow records, and isolate the thread pool of the consumption link

2. Add redundancy and failover for single-point calculation modules, and apply rate limiting measures

Rate limiting capability: the reporting side uses switches to control rate limiting and circuit breaking

Failover capability

3. For areas inside the system that can still be improved, refer to the high-availability and performance optimization strategies above and tackle them one by one

6. High-performance design summary

1. Architecture design

1.1 Redundancy

Maintain active replication across three or even five replicas of the cluster, and require all replicas to succeed before a task continues to execute. If availability requirements are higher, the number of replicas and the constraints on task submission can be reduced.

Redundancy is easy to understand: if the availability of one machine is 90%, the availability of two redundant machines is 1 - 0.1 × 0.1 = 99%; the more machines, the higher the availability. For the DB, which has a bottleneck in the number of connections, databases and tables need to be sharded; this is also a form of redundant horizontal scaling.

1.2 Failover capability

Some business scenarios depend heavily on the DB. If the DB becomes unavailable, can the traffic be switched to a failover (FO) database, or can the work be suspended by saving the context: write the current business context to a delay queue, and consume and compute the data after the failure is recovered?

Some force majeure events and third-party problems can seriously affect the availability of the whole business, so remote multi-active deployment, redundant disaster recovery and regular drills are necessary.

1.3 System resource isolation

In exception handling, the queue is often blocked by a flood of upstream data reports, which hurts timeliness. Core and non-core business resources can therefore be isolated; for flash-sale (seckill) scenarios, an independent cluster can even be deployed to support the business.

If the availability of system A is 90% and that of system B is 40%, and a service of system A strongly depends on system B, then the availability of that service is roughly the product 0.9 × 0.4 = 36% (assuming independent failures), which is drastically lower.

2. Defense in advance

2.1 Monitoring

Set reasonable monitoring thresholds for the system's CPU, thread usage, IO, service call TPS, DB computation time, etc., and respond quickly when a problem is detected.

2.2 Rate limiting / circuit breaking / degradation

When upstream traffic surges, the system needs some self-protection and circuit-breaking mechanisms, the premise being to avoid strong business dependencies and single points. On the exception consumption link, DRM switches control the upstream, and the downstream also has a certain fast flood-discharge capability, to prevent a single business exception from making the whole cluster unavailable.

Instantaneous traffic spikes easily cause failures, so pressure testing and circuit breaking must be in place. Flash-sale businesses should reduce their strong dependencies on core systems, contingency plans should be managed and controlled in advance, and cache avalanches also need warm-up and protection mechanisms.

At the same time, some businesses expose unreasonable interfaces that are hit by large volumes of crawler-like web requests; the system should also be able to identify and throttle such traffic.

2.3 Improve code quality

During major promotion periods, changes to the core business should be frozen, fund-safety active verification code should be deployed and checked for reliability in advance, and the code should conform to the specifications; these are all defensive measures against online problems.

Full GC and memory leaks in code can make the system unavailable. They may not be obvious during the business low-peak period, but performance deteriorates when traffic is high, so do pressure testing and code review in advance.

3. Defense and recovery after the fact

Monitor and release in grayscale (canary) in advance, and make sure any fault scenario can be rolled back afterwards.

Other defensive capabilities include: how to release code smoothly during deployment and quickly drain traffic away from a machine running problematic code; how to ensure the dependency order when upstream and downstream systems are released; and, during a release, when normal business has already executed on the published code while the reverse operation runs on machines that have not yet been published, how to ensure business consistency. All of these must be fully considered.
