
5-year Evolution of Ele.me's Transaction System – Part 2

In this blog, Wanqing talks in depth about the evolution of Ele.me, Alibaba's online food delivery service platform, drawing from his personal experience in the team.


By Wanqing

Database and Table Sharding

In September or October 2015, we began to implement database and table sharding. The sharding plan had almost been finalized when I joined the project. The implementation was led by the DAL team from the CI department.

Why Did We Shard Databases and Tables?

First, the original system could not support high concurrency. At that time, our order system ran on one primary MySQL database, five secondary MySQL databases, and one additional MySQL database reserved for master high availability (MHA). The databases could not absorb concurrent request spikes and had little headroom against risks. If the business department did not inform us of promotional events in advance and a secondary database went down, all we could do was fail over to the remaining databases and throttle a large amount of traffic to prevent a system crash. Moreover, at that time, we prayed every day that the order system of Meituan Takeout would not crash during peak hours: if it crashed, more traffic would flow to Ele.me and we would begin to get nervous. Similarly, if our order system crashed, the Meituan Takeout system would have a hard time handling the extra traffic. We were both at a similar stage of development.

Second, DDL costs were too high while our business was fighting its decisive battle. At that time, Ele.me averaged millions of orders per day. To meet certain business requirements, we needed to add fields to the order table. However, when we asked the database administrator (DBA) to assess the risk, the answer was that the operation would take the order system out of service for three to five hours and would need approval from the CEO. Neither our technical team nor the business team could afford that risk. Therefore, we came up with a workaround: we packed the required fields into the reserved JSON extension fields. This solution solved the problem for a time, but it also created many potential risks.
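The workaround can be sketched as follows. This is only an illustration of the pattern, not the actual Ele.me code: the column name `extra` and the field names are assumptions.

```python
import json


def set_extra_field(row: dict, key: str, value) -> dict:
    """Pack a new business attribute into a reserved JSON column
    (here called `extra`) instead of running an ALTER TABLE on the
    live order table."""
    extra = json.loads(row.get("extra") or "{}")
    extra[key] = value
    row["extra"] = json.dumps(extra)
    return row


def get_extra_field(row: dict, key: str, default=None):
    """Read an attribute back out of the reserved JSON column."""
    return json.loads(row.get("extra") or "{}").get(key, default)
```

The trade-off is exactly the risk mentioned above: fields hidden inside a JSON blob are invisible to the schema, cannot be indexed, and are never validated by the database.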

In addition, some special business scenarios and coarse-grained open APIs generated SQL statements with extremely low performance. All these risks could lead to system crashes.

The following figure shows the physical structure after sharding.


The following figure shows the logic of an update operation.


Actually, we implemented two-dimensional sharding, with 120 shards for each dimension. Requests could be routed to these shards by user ID, merchant ID, or order ID, and write requests were preferentially routed by user ID. Due to resource limitations, user shards and merchant shards were deployed in hybrid mode.
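As a rough illustration of how such routing can work (the actual Ele.me ID rules are not documented in this post, so embedding the shard number in the order ID is an assumption):

```python
import random
import time

SHARD_COUNT = 120  # shards per dimension, as described above


def shard_for(entity_id: int) -> int:
    """Route a user ID or merchant ID to one of the 120 shards."""
    return entity_id % SHARD_COUNT


def make_order_id(user_id: int) -> int:
    """Hypothetical order-ID rule: a time-based prefix plus the user's
    shard number in the low three digits, so the order ID alone carries
    enough information to route a query without an extra lookup."""
    millis = int(time.time() * 1000)
    seq = random.randint(0, 999)  # disambiguates IDs in the same millisecond
    return (millis * 1000 + seq) * 1000 + shard_for(user_id)


def shard_from_order_id(order_id: int) -> int:
    """Recover the routing shard from the low digits of the order ID."""
    return order_id % 1000
```

New order ID generation rules are the first stage of the migration described below precisely because reads by order ID must stay routable after the cutover.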

Stages of Table Sharding

I will not elaborate on the technical details of database and table sharding here. The sharding process mainly included the following stages:

  1. Develop new order ID generation rules and deploy the rules in the system.
  2. Write data to both the old and new databases simultaneously, serve reads from the old data, and compare the two copies.
  3. Modify incompatible SQL statements, such as statements used for cross-shard sorting, statements used to collect statistics, and statements without shard keys.
  4. Continue writing to both databases but serve reads from the new data. This step was partially synchronous with step 3.
  5. Switch over to the new databases, writing and reading only the new data.
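Steps 2 and 4 describe a double-write migration. Below is a minimal sketch of that pattern, assuming the two targets are the old database and the new sharded database (represented here as plain dicts); the divergence log stands in for whatever comparison tooling was actually used.

```python
import logging

log = logging.getLogger("shard-migration")


class DualWriteStore:
    """Minimal sketch of the migration pattern: write to both stores,
    read from the authoritative one, and record any divergence."""

    def __init__(self, old_store: dict, new_store: dict, read_new: bool = False):
        self.old, self.new, self.read_new = old_store, new_store, read_new
        self.mismatches = []

    def write(self, key, value):
        # Steps 2 and 4: every write goes to both the old and the new store.
        self.old[key] = value
        self.new[key] = value

    def read(self, key):
        primary = self.new if self.read_new else self.old
        shadow = self.old if self.read_new else self.new
        value = primary.get(key)
        if shadow.get(key) != value:  # compare and record, never fail the request
            self.mismatches.append(key)
            log.warning("store mismatch for key %s", key)
        return value
```

Flipping `read_new` to `True` corresponds to moving from step 2 to step 4; once the mismatch log stays empty, the old store can be retired (step 5).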

During this period of time, as a business team, we actually spent most of our time on step 3 and had to work overtime until 3:00 or 4:00 a.m. several times.

Before the 2016 Spring Festival, to withstand the holiday business peak and ensure system stability, we went so far as to archive the historical data and retain only the orders generated within the last 15 days.

I remember the final day of migration, probably in the middle of March 2016. Several team members and I arrived at the company at about 5:00 a.m., when dawn was just breaking. We took the whole Ele.me system offline and blocked write requests, then completed and verified the database configurations. Finally, we resumed write requests, verified the businesses, gradually released the frontend traffic, and restored service. The critical process took about 10 minutes, and service was completely restored half an hour after it was stopped.

On the next day, we were finally able to import historical orders generated within the last three months.

After this change, we got rid of most of the bottlenecks and pain points related to databases.

Message Broadcast

Around July 2015, influenced by some articles about architectures and instructions from JN, we decided to implement message broadcast in the order system to take decoupling a step further.

After researching RabbitMQ, NSQ, RocketMQ, Kafka, and ActiveMQ, we ultimately selected RabbitMQ. In fact, I thought RocketMQ was more suitable for the system at that time, especially its ordered messages, which offered natural support in some transaction business scenarios. However, the O&M team had more experience in maintaining RabbitMQ, and both the framework team and the O&M team were very confident in it: it had never caused any problems since we built the order system. By choosing RabbitMQ, we gained natural support from the O&M team, and our business team avoided many risks.

Therefore, the framework team performed a rigorous performance test on RabbitMQ and provided some performance metrics. After this test, a cluster consisting of three brokers was built to separately serve the order system. Previously, only one MQ node had been provided to serve asynchronous message tasks in the Zeus system.

To ensure that the cluster would not affect the main transaction processes, we carried out a series of fault tolerance optimizations on the SOA framework on the client side. Our fault tolerance work focused on the timeout period for requests sent to connect to the RabbitMQ cluster, disconnection from the RabbitMQ cluster, asynchronous message sending, and the maximum number of retries. Ultimately, we built a brand new RabbitMQ cluster that consisted of three nodes and received all order messages.
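The client-side fault tolerance described above can be sketched as follows. This is not the actual SOA framework code; the class name, parameters, and the fail-open policy shown here are illustrative.

```python
import queue
import threading


class SafePublisher:
    """Sketch of the client-side fault tolerance described above:
    sends are asynchronous and bounded by a retry limit, so a broker
    failure can never block the main transaction flow."""

    def __init__(self, send, max_retries: int = 3):
        self.send = send          # transport callable, e.g. a channel publish
        self.max_retries = max_retries
        self.dropped = []         # messages given up on, kept for inspection
        self.pending = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def publish(self, message):
        # The order transaction only enqueues; it returns immediately.
        self.pending.put(message)

    def _drain(self):
        while True:
            message = self.pending.get()
            if message is None:   # shutdown sentinel
                return
            for _ in range(self.max_retries):
                try:
                    self.send(message)
                    break
                except Exception:
                    continue
            else:
                self.dropped.append(message)  # fail open: record and move on

    def close(self):
        self.pending.put(None)
        self.worker.join()
```

The key design choice is failing open: when the broker is unreachable past the retry budget, the message is recorded and the transaction proceeds, matching the requirement that broadcasting must never break the main process.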

During this period, I actually made a small mistake. Although the framework team had already implemented fault tolerance for exceptional cases, the time spent sending broadcast messages was closely tied to the status of the main processes. Careful as always, I added a message sending switch before the initial release, at about 8:00 p.m. With what I know now, I would have spent more time on a canary release and observation. Shortly after I released all the code, the monitoring results showed that many API operations started to time out. (We used the framework's default timeout of 30s, which was itself risky.) Obviously, something had slowed down the API operations. After the transaction curve dropped sharply, I immediately got a call from the NOC. I disabled the message sending switch right away, and the curve recovered instantly. Then, I asked the architecture team to help find the cause of the problem. The real cause was my own lack of skills.

That evening, we enabled and disabled the switch repeatedly and gradually increased the traffic from 5% to 10%, 30%, and more. After various tries and verifications, we concluded that the problem was related to the HAProxy configuration: HAProxy would disconnect from the RabbitMQ cluster while the service client still sent requests over the previous connection, and the client had no fault tolerance for this timeout scenario. The problem disappeared after we adjusted the connection timeout configuration of HAProxy, although the logs showed that some potential risks remained.
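For reference, this class of fix maps to HAProxy's timeout directives for long-lived AMQP connections. The directives below are real HAProxy options, but the values and server addresses are purely illustrative; the exact configuration used at the time is not recorded in this post.

```
# Illustrative HAProxy settings for proxying long-lived AMQP connections.
listen rabbitmq
    bind *:5672
    mode tcp
    balance roundrobin
    timeout client  3h     # keep idle AMQP connections open instead of cutting them
    timeout server  3h
    option clitcpka        # TCP keepalives so dead peers are eventually detected
    server rabbit1 10.0.0.1:5672 check
    server rabbit2 10.0.0.2:5672 check
    server rabbit3 10.0.0.3:5672 check
```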

Under this design, each connected service had to apply for a topic and could determine the number of queues for that topic based on its service requirements.


This physical architecture ran stably for less than a year and had many problems, which will be explained in the next section.

At the time, we set the following usage rules:

  1. The status of orders must not be exposed directly. Instead, orders must be exposed as events. A state is a description, while an event represents an action and decouples the details of the order status from the connected party.
  2. The message broadcast feature was used only to broadcast events, not to synchronize data. If a consumer needed more data, it could call the reverse lookup API to query order data. The timestamps in the message recorded when the event was generated and when it was sent (the latter was added later). The message body carried header information that only explained the event, plus the primary keys of both parties to the transaction and certain fields usable for general filtering or secondary routing.
  3. Consumers had to be idempotent and stateless when consuming messages. If ordered consumption was required, they could rely on Redis or similar solutions.
  4. When a consumer connected to the order system, its topics and queues had to comply with certain naming rules. In addition, a queue could accumulate a maximum of 10,000 messages; queues that exceeded this limit had to be discarded. Consumers had to specify whether they could receive damaged messages and had to ensure their own consumption performance. According to an evaluation at the time, when 1 million messages were accumulated, the performance of the entire cluster deteriorated by 10%. Following suggestions from the global architecture review, we also provided Redis as a medium storing a mirror of order events, though the experience was not elegant.
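Rules 1 to 3 can be sketched as follows. The payload fields and the in-memory deduplication set are assumptions for illustration; in production a shared store such as Redis (for example, SET with the NX option and a TTL) would back the idempotency check.

```python
import json
import time
import uuid


def make_event(event_type: str, order_id: int, user_id: int, merchant_id: int) -> str:
    """An order *event* (rule 1): an action with its own identity and
    timestamps, not a dump of order state (rule 2). Field names here are
    illustrative, not the actual Ele.me schema."""
    now = int(time.time() * 1000)
    return json.dumps({
        "header": {
            "event_id": str(uuid.uuid4()),  # dedup key for consumers
            "event_type": event_type,       # e.g. "order.paid"
            "generated_at": now,
            "sent_at": now,
        },
        # Primary keys of both parties, usable for filtering or secondary
        # routing; anything more must come from the reverse lookup API.
        "order_id": order_id,
        "user_id": user_id,
        "merchant_id": merchant_id,
    })


class IdempotentConsumer:
    """Rule 3: consumption must be idempotent. A local set keeps this
    sketch self-contained; a shared store would be used in production."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()

    def consume(self, raw: str) -> bool:
        event = json.loads(raw)
        event_id = event["header"]["event_id"]
        if event_id in self.seen:  # duplicate delivery: acknowledge and skip
            return False
        self.seen.add(event_id)
        self.handler(event)
        return True
```

Because the broker may redeliver a message after a consumer crash, the dedup check runs before the handler; a handler that is itself idempotent makes the whole path safe either way.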

This logical architecture for message broadcast is still in use and has produced huge benefits in terms of decoupling.

Initial Exploration

From the middle of 2015 to the beginning of 2016, we had to process more than a million orders per day and the number of orders grew rapidly.

Order Service Center (OSC)

During that period, we also read about many architectures, including ESB, SOA, microservices, CQRS, and EventSource. We actively discussed how to reconstruct the order system to support higher concurrency. At that time, the Order Fulfillment Center (OFC) of JD.com, Inc. (JD) was widely discussed. In particular, we bought and read a book entitled Secrets of Jingdong Technologies (京东技术解密), but quickly concluded that it was not relevant to our order system. The OFC of JD was obviously determined by the characteristics of the retail business. Many concepts used by the OFC were difficult to understand for beginners like us and we could not see how to apply them to O2O. However, we were still deeply influenced by the OFC and we gave our group a similar abbreviation, OSC, which stands for Order Service Center.

Since the order system had been in service for more than three years and most engineers on the main language stacks had shifted from Python to Java, we planned to rewrite the order system. Therefore, I designed an architecture for it, using osc as the domain prefix of its applications. The core ideas of this order system were to store snapshots of transactions, simplify the system, reduce dependencies on other parties, and reduce the system's role as a data tunnel.

We chose Java as the language stack; in other words, we planned to convert from Python to Java. (Unfortunately, the shift to Java was only completed in 2019.)

It was September. Coincidentally, the company began to set up an architecture review system for new services. My plan was probably the first or second plan that was reviewed and it was quickly rejected.

In fact, looking back from one year later, I quite appreciated this architecture review, precisely because my plan failed to pass the review.

Actually, the review was quite interesting, and I vaguely remember the reviewers who participated: a representative of the database architect (DA) team, a representative of the basic OPS team, and an architect who had just joined the company.

The architect's questions focused on whether the architecture could be used for one year or three years, while the questions from the representative of the basic OPS team were especially interesting. His first question was whether this system was in a critical path. I thought, isn't this question unnecessary? I answered right away, "Yes. It is right in the middle of a critical path."

Then, he asked whether this application could be downgraded if an exception occurred. I thought, isn't this question superfluous as well? Of course this pipeline could not be downgraded: it was the core and most fundamental part of the system, and transactions were central to the company's core business. Maybe we didn't think alike.

He concluded that the system was in a critical path and involved orders, which represented the core business of the company. However, because it could not be downgraded, if the system failed, the entire company would be affected. Therefore, the review ended and my plan was rejected.

To be continued...
