Ten Years of Change in Alibaba's Databases

As the tenth Double 11 approaches, Alibaba Technology has launched the "Ten Years of Shepherding Code" series, inviting the core technical experts who prepared for Double 11 over the years to look back on how Alibaba's technology has changed.

Today, Zhang Rui, a researcher in Alibaba's Database Division, tells the little-known story of Double 11 database technology. Behind the ever-rising zero-hour transaction numbers lie not only breakthroughs in database technology, but also the never-say-die spirit of Alibaba's engineers and their unremitting pursuit of technology.

Zhang Rui, Researcher of Alibaba Database Division

In a few days we will usher in the tenth Double 11. Over the past ten years, the role of Alibaba's technology has changed: Double 11 once drove technology forward; now technology creates new business. Much of that technology is now exported through cloud computing as inclusive technology, serving many industries and genuinely advancing social productivity.

In the past ten years, the Alibaba database team has always had a mission: to promote the transformation of China's database technology. From commercial databases to open source databases to self-developed databases, we have been working hard for this mission.

If we divide the development history of Alibaba database into three stages, they are:

● The first stage (2005-2009): the era of commercial databases;

● The second stage (2010-2015): the era of open source databases;

● The third stage (2016-present): the era of self-developed databases.

The era of commercial databases is also known as the IOE era. Then came the landmark "de-IOE" campaign: with the distributed database middleware TDDL, the open source database AliSQL (Alibaba's MySQL branch), high-performance x86 servers and SSDs, and the joint efforts of DBAs and business developers, we successfully replaced the commercial Oracle database, IBM minicomputers, and EMC high-end storage, and entered the era of open source databases.

De-IOE brought three significant benefits:

The first is scalability: the database gained the ability to scale out horizontally (elasticity), laying a solid foundation for the Double 11 transaction peaks of the many years that followed.

The second is self-controllability. We added many features to AliSQL, such as inventory hotspot patches, SQL throttling protection, and thread pools. Many of these features grew out of Double 11's demands on the database and would have been completely impossible in the commercial database era.

The third is stability. In the commercial database era, we had put all our eggs in one basket (the minicomputer). After de-IOE, we not only eliminated single-machine failure as a risk but, through a multi-site active-active architecture upgrade, took the database beyond the limits of a single city, achieving cross-city active-active deployment and disaster recovery and greatly improving system availability.

In 2016, we began developing our own database, codenamed X-DB. Everyone will surely ask: why build your own database? There are several reasons:

First, we needed a natively distributed database that could be deployed globally, similar to Google Spanner.

Second, the Double 11 scenario places extremely high demands on the database:

● Performance: the database must deliver extremely high read and write throughput at the zero hour of Double 11;

● Storage cost: the database stores data on SSDs, and the data shows a clear hot/cold pattern. Large volumes of cold historical data sit alongside hot online data and, over time, eat up valuable storage space, so the pressure of storage cost keeps growing. After careful evaluation, we concluded that continuing to improve on top of open source databases could no longer meet business needs.

Third, new hardware technologies were emerging. If the large-scale adoption of SSDs and the great leap in x86 server performance drove the de-IOE process, then NVM non-volatile memory, FPGA heterogeneous computing, RDMA high-speed networking, and other technologies will drive a second transformation of database technology.

In each year's Double 11 preparations, provisioning machine resources is a crucial part, and reducing their cost has long been a hard problem on which Alibaba's engineers keep challenging themselves. The first solution is to use cloud resources. Since early 2016 the database team has tried using high-performance ECS to solve the Double 11 machine-resource problem. After several years of Double 11 drills, for Double 11 2018 we could use public cloud ECS directly, forming a hybrid cloud with Alibaba Group's internal environment over a VPC network and achieving elastic scale-out for the Double 11 promotion.

The second solution is online/offline co-location: on ordinary days, offline tasks run on the online (application and database) servers, and during the Double 11 promotion the offline computing resources are borrowed for online applications. To achieve this elasticity, the database had to solve one core technical problem: the separation of storage and computing. Once storage and computing are separated, the database can use offline computing resources during Double 11 and achieve extreme elasticity. By combining cloud resources with co-location technology, the cost of the Double 11 transaction peak can be minimized.

Another important technology trend in the Double 11 preparations is intelligence. Combining databases with intelligence is a direction we have long been exploring, such as the self-driving database. In 2017 we used intelligent technology to optimize SQL automatically for the first time. In 2018 we plan to roll out automatic SQL optimization and automatic space optimization across the whole network, hoping these technologies will reduce DBA workload, improve developer efficiency, and effectively improve stability. I believe that in future Double 11 preparations, more and more work will be done by machines.

I have taken part in the Double 11 preparations since 2012, serving many times as captain of the database team and as captain for the whole technical support department. Over those years I have lived through many interesting stories, and I will share some of them here.

2012: My first Double 11

2012 was my first Double 11. Before that I had been on the B2B database team. At the beginning of 2012, the group's entire infrastructure team was merged into the technical support department, led by Zhenfei. I had never participated in Double 11 before, yet in my first year Hou Yi (the head of the database team) handed me the role of captain; the pressure can be imagined. In those days preparation for Double 11 was very long, starting in May or June, and the most important work was identifying risks and accurately assessing the pressure on every database.

We had to convert the ingress traffic into the pressure (QPS) on each business system, and then convert each business system's QPS into database QPS. In 2012 there was no full-link stress-testing technology; we could only rely on each business system's own offline tests, and on round after round of review meetings among the line captains, to find potential risks.
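The capacity-estimation arithmetic described above can be sketched as a simple fan-out calculation. All names and ratios below are illustrative assumptions, not Alibaba's real numbers:

```python
def estimate_db_qps(ingress_rps, fanout):
    """Convert ingress traffic into per-database QPS.

    ingress_rps: expected requests/sec at the front door at peak.
    fanout: {system: {"calls_per_request": n, "db_queries_per_call": m}}.
    Returns the estimated QPS each system puts on its database.
    """
    db_qps = {}
    for system, f in fanout.items():
        system_qps = ingress_rps * f["calls_per_request"]
        db_qps[system] = system_qps * f["db_queries_per_call"]
    return db_qps

# Illustrative numbers only.
fanout = {
    "trade":     {"calls_per_request": 1.0, "db_queries_per_call": 5},
    "inventory": {"calls_per_request": 0.8, "db_queries_per_call": 2},
}
print(estimate_db_qps(100_000, fanout))
```

In 2012 these multipliers came from each team's offline tests and the captains' experience; full-link stress testing later replaced the guesswork with measurement.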

You can imagine the enormous uncertainty involved. No one wanted the business they were responsible for to become the weak link, yet machine resources were always limited, so everything depended on the captains' experience, and every captain was under enormous pressure. I remember the overall Double 11 team leader that year was Li Jin; it was said he was so stressed he could not sleep, and could only drive to the top of Longjing Mountain at night, open the car window, and take a nap.

As for me, it was my first Double 11, I had zero experience, and I was anxious to death. Fortunately I had a group of very reliable brothers with me. They had just been through the baptism of de-IOE and had long been embedded with the business developers, so they knew the business architecture and database performance inside out. With their help I basically figured out the structure of the entire trading system, which proved very helpful for my later work.

After months of intense preparation, the day of Double 11 finally arrived. We were fully prepared, yet everything was so uncertain that we were deeply uneasy. When midnight struck, the worst still happened: the inventory database was completely overwhelmed, and the IC (commodity) database's network cards were saturated too. I remember very clearly staring at the database monitoring dashboards, helpless. One small detail: because we had not anticipated anywhere near that much pressure, the database pressure indicator on the monitoring screen read over 100%.

At that moment the technical commander-in-chief Li Jin shouted: "Everyone, don't panic!" We looked up, saw the transaction count still hitting new highs, and our hearts calmed a little. In truth, with the IC database's network cards saturated and the inventory database beyond our expectations, in the end we executed no contingency plan at all and simply rode out the midnight peak.

For these reasons, Double 11 in 2012 produced a large number of oversold orders, causing great losses to the company. After that Double 11, colleagues in inventory, commodities, refunds, and the corresponding database teams worked overtime day and night for two weeks to deal with the problems caused by overselling. Many friends around me also complained that the user experience at peak time was really poor. We resolved to fix these problems in the next year's Double 11.

2013: Inventory hotspot optimization and the humble shadow table

After Double 11 in 2012 we set out to improve the performance of the inventory database. Inventory deduction is a classic hotspot problem: many users compete to deduct the inventory of the same product (to the database, a product's inventory is a single row), and concurrent updates to the same row are controlled by row locks. We found that when a single thread updates a row (queuing), performance is very high, but when many threads update the same row concurrently, the performance of the whole database collapses to nearly zero.

At the time, the database kernel team produced two different implementations: a queuing scheme and a concurrency-control scheme. Each had its own strengths and weaknesses, but the core idea of both was to turn a disorderly scramble into an orderly queue, thereby improving the performance of hotspot inventory deduction. Through continuous improvement and head-to-head comparison, both solutions matured and stabilized, meeting the business's performance requirements. In the end, to be absolutely safe, we integrated both into AliSQL (Alibaba's MySQL branch) with a switch to control which one was active. After a whole year of hard work, we solved the inventory hotspot problem for Double 11 in 2013; this was the first round of inventory performance improvement. After Double 11 in 2016 we made another major optimization, improving inventory-deduction performance tenfold over the 2013 baseline, known as the second round of inventory performance optimization.
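The queuing idea above can be illustrated with a toy sketch: deductions against the same hot SKU are serialized by a per-SKU lock, so threads line up instead of thrashing on one row. The real AliSQL work happens inside the storage engine; this only demonstrates the ordering principle, and all names are illustrative.

```python
import threading

class InventoryStore:
    """Toy in-memory store with per-SKU queued deduction."""

    def __init__(self, stock):
        self.stock = dict(stock)
        self.locks = {sku: threading.Lock() for sku in stock}

    def deduct(self, sku, n=1):
        with self.locks[sku]:          # queue: one deduction at a time per SKU
            if self.stock[sku] >= n:   # check-and-deduct is atomic here,
                self.stock[sku] -= n   # so the item can never oversell
                return True
            return False

store = InventoryStore({"iphone": 100})
threads = [threading.Thread(target=store.deduct, args=("iphone",))
           for _ in range(150)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store.stock["iphone"])  # 0 -- never negative, despite 150 attempts
```

Without the atomic check-and-deduct, concurrent threads could interleave the read and the write and drive the stock negative, which is exactly the overselling seen in 2012.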

2013 was a milestone year in Double 11 history, because a breakthrough technology appeared: full-link stress testing. I admire Li Jin, who first proposed the idea. He asked us: is it possible to run a fully simulated test in the production environment? Everyone's answer was: impossible! I thought it was even more impossible for the database; my biggest worry was how to handle the data generated by the stress-test traffic. I had never heard of any company daring to stress-test its production systems, and if the data went wrong, the consequences would be very serious.

I remember a hot afternoon in 2013: while I was still struggling with the inventory database, Uncle Tong (the technical lead of the full-link stress test) came to discuss the database side of the solution. That afternoon the two of us sketched out a "shadow table" scheme: build a parallel set of "shadow tables" in the production system to store and process all stress-test data, with the system guaranteeing that the schemas of the two sets of tables stay in sync. Even so, we had little confidence, and I believe that a few weeks before Double 11 few people thought full-link stress testing could actually land; most of our preparations were still based on manual review plus offline stress tests. Yet through everyone's efforts, a breakthrough came in the two weeks before Double 11. When the first full-link stress test succeeded, nobody could believe it.
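The shadow-table routing idea can be sketched as follows: requests tagged as stress-test traffic have their table names rewritten to a shadow copy, so test writes never touch production rows. The prefix, the tagging mechanism, and the string-level rewrite are all simplifying assumptions; real middleware rewrites SQL at parse time.

```python
SHADOW_PREFIX = "__shadow_"  # illustrative naming convention

def route_table(table, is_stress_traffic):
    """Pick the production table or its shadow copy based on the traffic tag."""
    return SHADOW_PREFIX + table if is_stress_traffic else table

def rewrite_sql(sql, table, is_stress_traffic):
    # Toy stand-in for parse-time rewriting in the data-access middleware.
    return sql.replace(table, route_table(table, is_stress_traffic))

print(rewrite_sql("INSERT INTO orders (id) VALUES (1)", "orders", True))
# -> INSERT INTO __shadow_orders (id) VALUES (1)
```

The hard part in practice is not the rewrite itself but propagating the traffic tag through every hop of the call chain and keeping the shadow schema in lockstep with production.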

In the end, on the last few nights before Double 11, a full-link stress test ran almost every night, and everyone relished it; it left a deep impression on me. The process was not all smooth sailing, though. We went through many failures, wrote bad data many times, and even affected the next day's reports; the long hours of high-pressure testing even shortened the life of machines and SSDs. Even with so many problems, everyone kept pressing forward. I think this is what sets Alibaba apart: we believe, therefore we see. The facts proved it: full-link stress testing became the most effective weapon in the Double 11 preparations.

Today, full-link stress-testing technology has become a product on Alibaba Cloud, an inclusive technical service for many more enterprises.

2015: The story behind the big screen

In 2015 I moved from being captain of the database team to captain of the entire technical support department, responsible for Double 11 preparations across all technical infrastructure: IDC, network, hardware, database, CDN, applications, and more. Covering so many specialist areas was a new challenge for me. This time, however, I was stumped by a different problem: the big screen.

In 2015 we held the Tmall Double 11 Gala for the first time, and for the first time the gala and the media center were not in the Hangzhou campus but at the Water Cube in Beijing. The scale of the day required close coordination between Beijing and Hangzhou; the difficulties and challenges can be imagined! The two most important moments for the media broadcast on the big screen are the start of Double 11 at 0:00 and its end at 24:00, and throughout, the GMV number ticking upward on the big screen must be shown with as little delay as possible. That year, to enrich the experience at the Water Cube and the interaction with the Hangzhou command center, there was a countdown segment before 0:00 connecting live to the Guangmingding operations command room in Hangzhou, where Xiaoyaozi would declare Double 11 2015 open, after which the feed cut straight to our media big screen. The GMV number therefore had to appear with essentially zero delay; the challenge speaks for itself! Yet the first full-link stress test was deeply unsatisfactory, with delays of tens of seconds. The commander-in-chief at the time, Zhenfei, declared flatly that the first tick of the GMV number must appear within 5 seconds. Producing an accurate real-time transaction number within 5 seconds seemed an impossible task to everyone. The director team even proposed a fallback: after 0:00, first show a special effect of jumping digits (not real numbers), then switch to the real transaction figure once we were ready.

For the big screen, however, every number shown must be true and correct (that is the Ali value system), so we could not use that fallback; everyone had to figure out how to get the delay under 5 seconds. That very evening, all the relevant teams formed a big-screen task force and went into closed-door development. Don't underestimate one small number: it spans real-time computation over application and database logs, plus every link of computation, storage, and display, a genuinely cross-team technical effort. In the end we lived up to expectations: at the zero hour of Double 11, the first tick of the GMV number appeared within 3 seconds, strictly inside the 5-second limit, which was extremely hard to achieve! Beyond that, to make the big-screen display absolutely fail-safe, we built dual-link redundancy, like a twin-engine aircraft: both links computed simultaneously and could be switched in real time.

I imagine most people never realized how many stories and technical challenges lie behind one small number on the big screen. That is what Double 11 is: countless small links, and behind each of them the dedication of Ali people.

2016: Eat your own dog food

Anyone who has run large-scale systems knows that the monitoring system is our eyes: without it, we have no idea what is happening. Our databases have a monitoring system too. Agents deployed on each host periodically collect key host and database metrics, including CPU and IO utilization, database QPS, TPS, response time, slow-SQL logs, and so on, and store them in a database for analysis and display. Originally that database was MySQL.
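The agent loop described above can be sketched roughly as follows: periodically sample host and database metrics and buffer them for the metrics store. The metric names, the callback shape, and the 10-second interval are illustrative assumptions.

```python
import time

def collect_once(read_metric):
    """Take one timestamped sample of the key indicators."""
    return {
        "ts": int(time.time()),
        "cpu_util": read_metric("cpu"),
        "qps": read_metric("qps"),
        "rt_ms": read_metric("rt"),
    }

def agent_loop(read_metric, sink, interval_s=10, rounds=3):
    """Sample `rounds` times and append each sample to `sink`."""
    for _ in range(rounds):
        sink.append(collect_once(read_metric))
        time.sleep(0)  # placeholder; a real agent would sleep interval_s

buf = []
agent_loop(lambda name: 1.0, buf)
print(len(buf))  # 3 samples buffered
```

The scaling problem the article goes on to describe is exactly this loop multiplied across the whole fleet: more hosts and finer granularity multiply the write load on whatever stores the samples.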

As Alibaba's database fleet grew larger and larger, the monitoring system itself became a bottleneck. Collection granularity, for instance, was limited by system capacity: at first we could only sample once a minute, and after years of optimization we improved it to once every 10 seconds. Most embarrassingly, before 00:00 every Double 11 we usually had a contingency plan to degrade the monitoring system itself: reducing collection granularity, switching off certain monitoring items, and so on. The peak pressure was simply too great; it was a last resort.

Another challenge came from the security department. They asked us to capture every SQL statement running on the databases and send it to the big data computing platform for real-time analysis. To us this sounded even more impossible: the volume of SQL running at any moment is enormous, and the usual approach can only sample. Capturing every statement without loss, and being able to analyze it all, was extremely challenging.

For Double 11 in 2016 we started a project to redesign the whole monitoring system, with two goals: second-level monitoring plus full SQL capture and computation, with no degradation at the Double 11 peak. The first task was storing and computing over the massive monitoring data. We chose TSDB, Alibaba's self-developed time-series database, designed specifically for massive time-series data in scenarios such as IoT and APM. The second task was full SQL capture and computation. We built a real-time SQL capture interface into AliSQL: after a statement executes, it is sent straight to the stream-computing platform through a message queue, without writing logs, for real-time processing, achieving analysis over the full SQL stream. With these two problems solved, on Double 11 in 2016 we met the business goals of second-level monitoring and full SQL capture.
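The full-SQL pipeline above can be sketched with a queue standing in for the message queue and a counter standing in for the stream-computing job: each executed statement is pushed onto the queue (no log file written) and a consumer normalizes and aggregates it. The hook point and the crude normalization are illustrative assumptions.

```python
from collections import Counter
from queue import Queue

sql_queue = Queue()  # stand-in for the real message queue

def on_sql_executed(sql):
    """Hook point inside the database: publish each executed statement."""
    sql_queue.put(sql)

def stream_consumer(counter):
    """Stand-in for the stream-computing job: normalize and count templates."""
    while not sql_queue.empty():
        sql = sql_queue.get()
        template = sql.split("WHERE")[0].strip()  # crude normalization
        counter[template] += 1

on_sql_executed("SELECT * FROM item WHERE id = 1")
on_sql_executed("SELECT * FROM item WHERE id = 2")
stats = Counter()
stream_consumer(stats)
print(stats["SELECT * FROM item"])  # 2
```

The key design choice the article describes is skipping the log file entirely: publishing directly to the queue keeps the hot path cheap enough to capture every statement rather than a sample.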

Later, this monitoring data and full SQL stream became a huge treasure trove to mine. By analyzing the data and combining it with AI techniques, we launched CloudDBA, a database intelligent-diagnosis engine. We believe the future of the database is the self-driving database, with four traits: self-diagnosis, self-optimization, self-decision, and self-recovery. As mentioned above, we have already made some progress in this intelligent direction.

Today TSDB is a product on Alibaba Cloud, and CloudDBA not only serves tens of thousands of engineers inside Alibaba but also provides database optimization services to users on the cloud. We not only eat our own dog food and solve our own problems, but also use Alibaba's scenarios to keep honing the technology and serve more cloud users. This is how Double 11 drives technology forward.

2016-2017: Databases and caches

In Double 11 history, Alibaba's self-developed cache, Tair, is a very important product. It is thanks to Tair that the database has withstood such enormous access volumes on Double 11. While using Tair at scale, developers also hoped the database itself could offer a high-performance KV interface, with the data returned by the KV and SQL interfaces guaranteed consistent, which would greatly simplify business development. Hence X-KV, the KV component of X-DB: by bypassing SQL parsing and accessing data in memory directly, it achieves very high performance and response times several times lower than the SQL interface. X-KV was first applied on Double 11 in 2016, user feedback was very good, and QPS reached the hundreds of thousands. For Double 11 in 2017 we added another piece of black technology: the middleware TDDL automatically converts between SQL and KV, so developers no longer need to build two sets of interfaces. They simply access the database with SQL, and TDDL transparently converts eligible SQL into KV calls in the background, further improving development efficiency and reducing database load.
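The TDDL-style conversion described above can be sketched as a simple dispatcher: a primary-key point query is recognized and served through a KV interface, bypassing SQL parsing and execution, while anything else falls back to the SQL engine. The regex, the store shape, and the function names are illustrative assumptions, not TDDL's real implementation.

```python
import re

# Only a narrow class of statements is eligible for the KV fast path.
PK_QUERY = re.compile(r"SELECT \* FROM (\w+) WHERE id = (\d+)$")

def query(sql, kv_store, sql_engine):
    """Route a statement to the KV fast path or the SQL slow path."""
    m = PK_QUERY.match(sql)
    if m:                                  # fast path: KV get by (table, pk)
        return kv_store.get((m.group(1), int(m.group(2))))
    return sql_engine(sql)                 # slow path: full SQL execution

kv = {("item", 42): {"id": 42, "title": "phone"}}
print(query("SELECT * FROM item WHERE id = 42", kv, lambda s: None))
# -> {'id': 42, 'title': 'phone'}
```

Because both paths read the same underlying data, the consistency property the developers asked for comes for free; the middleware's job is only to recognize which statements are simple enough to take the fast path.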

On Double 11 in 2016, Tair ran into an industry-wide hard problem: hotspots. A given key in a cache system lives on exactly one machine, but during Double 11 hotspots are highly concentrated and traffic is huge, easily exceeding a single machine's capacity, with the CPU or network card becoming the bottleneck. Because hotspots are unpredictable (they may be traffic hotspots or frequency hotspots), on Double 11 in 2016 we ran around like firefighters putting out blazes and were exhausted. In 2017 the Tair team set out to crack this problem and innovatively proposed an adaptive hotspot solution. The first part is intelligent identification: internally, Tair uses a multi-level LRU structure, assigning different weights based on the frequency and size of accessed keys and placing them on different LRU levels, so that high-weight keys survive eviction; keys that survive and exceed a threshold are judged to be hot keys. The second part is dynamic hashing: when a hotspot is detected, the application servers and Tair servers coordinate and, according to a preset access model, dynamically hash the hot data into HotZone storage areas on other Tair data nodes for access.
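A greatly simplified sketch of the hotspot-identification idea: weight each key's accesses by frequency and size, and flag keys whose accumulated weight crosses a threshold as hot. Tair's real multi-level LRU is far more elaborate; the weighting formula and threshold here are assumptions for illustration.

```python
from collections import Counter

def find_hot_keys(accesses, size_of, threshold):
    """accesses: iterable of accessed keys, in order.
    size_of: key -> value size in bytes.
    Returns the set of keys judged hot."""
    weight = Counter()
    for key in accesses:
        weight[key] += 1 + size_of(key) / 1024  # frequency + size weighting
    return {k for k, w in weight.items() if w >= threshold}

accesses = ["sku:1"] * 1000 + ["sku:2"] * 3
hot = find_hot_keys(accesses, lambda k: 512, threshold=100)
print(hot)  # {'sku:1'}
```

In the real system, a key flagged this way would then be dynamically hashed into HotZone areas on other nodes so the load spreads across the cluster instead of melting one machine.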

Hotspot hashing delivered very significant results on Double 11 in 2017: by spreading hotspots across the whole cluster, it brought every cluster's water level down below the safety line. Without this capability, many Tair clusters might have run into trouble on Double 11 in 2017.

As you can see, the database and the cache are partners that depend on each other, each learning from the other's strengths, jointly supporting the storage and access of massive data on Double 11.

2016-2017: How the Silky Smooth Trading Curve Was Made

Once full-link stress testing was in place, we hoped that every year's zero-hour trading curve would be as smooth as silk, but things rarely go as planned. After 0:00 on Double 11 in 2016, the trading curve wobbled briefly before climbing to its peak. In the post-mortem we found the main problem was that databases such as the shopping cart's were "cold" at 0:00: the data was not in the buffer pool, so when the flood of requests arrived, the database first had to "warm up" by reading data from SSD into the buffer pool. This momentarily lengthened the response time of a large number of requests and hurt the user experience.

Knowing the cause, in 2017 we proposed "preheating": before Double 11, fully warm up every system, including Tair, the databases, and the applications. We built a dedicated preheating system for this, split into two parts: data preheating and application preheating. Data preheating covers database and cache warm-up: the preheating system simulates application access and, through that access, loads data into the cache and database, ensuring good cache and database buffer pool (BP) hit rates. Application preheating covers connection pre-establishment and JIT warm-up. We establish database connections before 0:00 on Double 11 to avoid paying connection-setup costs at the peak. And because the business logic is very complex and Java code starts out interpreted, JIT compilation during the peak would burn a lot of CPU and lengthen response times; JIT preheating compiles the code fully in advance.
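Two of the warm-up steps above can be sketched together: pre-establishing database connections before the peak, and touching hot rows so they are resident in the buffer pool when traffic arrives. The connection and page objects are toy stand-ins, and all names are illustrative.

```python
class WarmupPlan:
    """Toy preheating driver for connections and the buffer pool."""

    def __init__(self, connect, fetch_page):
        self.connect = connect        # opens one database connection
        self.fetch_page = fetch_page  # reads one page/row from storage

    def preheat_connections(self, n):
        # Pay the connection-setup cost before 00:00, not during the peak.
        return [self.connect() for _ in range(n)]

    def preheat_buffer_pool(self, hot_keys, cache):
        # Simulated application access loads hot data into the cache/BP.
        for key in hot_keys:
            cache[key] = self.fetch_page(key)

plan = WarmupPlan(connect=lambda: "conn", fetch_page=lambda k: f"page:{k}")
pool = plan.preheat_connections(4)
bp = {}
plan.preheat_buffer_pool(["cart:1", "cart:2"], bp)
print(len(pool), sorted(bp))  # 4 ['cart:1', 'cart:2']
```

The crucial point is that the warm-up traffic must mimic real access patterns; loading the wrong rows warms nothing that the zero-hour flood will actually touch.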

On Double 11 in 2017, with every system fully warmed up, the trading curve traced a perfect line at the zero hour.

2017-2018: Technological breakthrough in the separation of storage and computing

At the beginning of 2017, the group's senior engineers kicked off a technical discussion: should we separate storage and computing? It sparked a long-running debate, including one in Dr. Wang's class; the two sides were so evenly matched that in the end neither persuaded the other. For databases, storage-compute separation is a very sensitive topic. We all know that in the IOE era, the minicomputer connected to storage over a SAN was essentially a storage-compute-separated architecture; going back to it, would that be a technological regression? Moreover, for a database, IO latency directly affects performance. How would we solve the network-latency problem? Question after question plagued us, with no conclusion.

At that time the database could already scale out elastically at large scale on cloud ECS and was deployed in containers. But one problem could not be avoided: with computing and storage bound together, extreme elasticity is impossible, because migrating a compute node means "moving house" with its data. Moreover, when we studied the growth curves of compute and storage demand, we found that during the Double 11 peak the demand for compute soars while the demand for storage barely changes. If storage and computing could be separated, then at the peak we would only need to scale out compute nodes. In short, storage-compute separation was the only road up the mountain: it had to be done.

In mid-2017, to verify feasibility, we chose to optimize on top of the open source distributed storage Ceph; at the same time, Alibaba's self-developed high-performance distributed storage, Pangu 2.0, was in full swing. The database kernel team also pitched in, reducing the impact of network latency on database performance through kernel optimizations. Through everyone's joint efforts, the storage-compute separation solution based on Pangu 2.0 went live for Double 11 in 2017, validating the elasticity scheme of having offline machines mount shared storage. After that Double 11, we had proven that separating database storage and computing is entirely feasible.

The success of storage-compute separation owes much to an unsung hero: a high-performance, low-latency network. For Double 11 in 2017 we used a 25G TCP network; to cut latency further, for Double 11 in 2018 we deployed RDMA at large scale, reducing network latency by an order of magnitude, an RDMA deployment of a scale unique in the industry. To cut IO latency, we also built a big weapon at the file-system layer, DBFS, which uses user-space techniques to bypass the kernel and achieve zero-copy on the I/O path. With these technologies we achieved latency and throughput close to local storage.

Double 11 in 2018, with storage-compute separation in large-scale use, marked the database's entry into a new era.


In the six years from 2012 to 2018, I witnessed, again and again, the zero-hour transaction numbers climb, the database technology breakthroughs behind them, and the never-say-die spirit of Ali people. The journey from "impossible" to "possible" is Alibaba engineers' relentless pursuit of technology.
