5-year Evolution of Ele.me's Transaction System – Part 6

In this blog, Wanqing talks in depth about the evolution of Ele.me, Alibaba's online food delivery service platform, drawing from his personal experience in the team.

By Wanqing

Service and Business Governance

So far, I have mostly discussed improvements in technical details. However, I now want to look at two stories concerning major evolution in the application architecture. These events would determine the future trend of the application architecture.

Sales and After-sales in Negative Transactions

In the middle of 2016, we set out to improve the user experience in dissatisfied-customer scenarios and to shorten the settlement period. The validity period of a negative transaction was seven days, so the settlement system was strongly dependent on this time period.

Responding to JN's proposal, we separated negative transactions from the original order system and split the original order team into two teams. I recommended one engineer to be the leader of the new team.

For positive transactions, our main responsibility was to ensure convenience. Therefore, the positive transaction system must focus on high performance, high concurrency, and stability. In this context, distinct information about the primary and secondary nodes helps us quickly locate problems and resume services.

The concurrency of negative transactions was much lower than that of positive transactions. Only 1% of orders needed to shift to the negative transaction system. However, the branches and hierarchical relationship in the business logic for negative transactions were much more complex than those for positive transactions. In addition, negative transactions required more powerful business abstraction. Although stability and performance were important for negative transactions as well, they were less essential.

Due to the differences in core problem domains and service requirement levels between negative and positive transactions, splitting the team was a rational decision.

The actual splitting process was quite painful and involved many heated discussions.

The following figure shows the final flowchart from that time. Actually, the flowchart still has a problem. I took charge of the negative transaction team for the next few years and subsequently merged the sales process and the after-sales process.

Our first step was to add an order state to indicate the completion of an order. This step approximated goods receipt, but was conceptually different. We spent nearly three months adding this state and encouraging the upstream and downstream teams to adapt to the change and upgrade the applications.

Our second step was to set up a return system. After the order completion state feature was released in phases, this state marked the end of an order's lifecycle, and the return system took charge of all subsequent processes. In this way, accounts receivable became independent of refunds in the clearing and settlement system.

Our third step was to switch the logic related to the sales service from the order system to the sales service system. I will provide more details about the evolution of sales and after-sales systems later.

One of the mistakes that we made at that time was that we did not completely separate states from upper-layer events. This mistake ultimately resulted in many problems on the business boundaries and in distributed transactions.

The backbone logic of the order system had actually been relatively simple. The main task of the backbone logic was to define the relationship between states, such as A->C, B->C, and A->B. In particular, the letters A, B, and C and the arrows indicating switching directions were all defined by the order system. The business meanings on this layer were lightweight. The point was that we considered *->C as a scenario where the upper layer took control.

For example, state C indicates that an order is invalid. Orders in any state except the completed state can enter the invalid state under certain conditions. These conditions are determined by the business format. Orders in this state can be retained in the sales service system, which determines whether to trigger the orders to reverse their states. Similarly, orders in the received state can also be retained in the sales service system.
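
The backbone described above, in which the order system owns the states and the allowed arrows between them, can be sketched as a small transition table. This is only a minimal illustration, not the team's implementation: the state names `PENDING` and `RECEIVED` and the exact set of arrows are assumptions made for the example. The article itself only says that any state except the completed state can move to the invalid state.

```python
from enum import Enum


class OrderState(Enum):
    # Hypothetical state names chosen for illustration.
    PENDING = "pending"
    RECEIVED = "received"
    COMPLETED = "completed"
    INVALID = "invalid"


# Allowed transitions, owned by the order system (the A->B, A->C, B->C backbone).
TRANSITIONS = {
    OrderState.PENDING: {OrderState.RECEIVED, OrderState.INVALID},
    OrderState.RECEIVED: {OrderState.COMPLETED, OrderState.INVALID},
    # A completed order can no longer become invalid.
    OrderState.COMPLETED: set(),
    # Simplification: reversing an invalid order is left to the upper layer.
    OrderState.INVALID: set(),
}


def can_transition(current, target):
    """Return True if the backbone allows moving from current to target."""
    return target in TRANSITIONS[current]


def transition(current, target):
    """Return the new state, or raise ValueError for an illegal move."""
    if not can_transition(current, target):
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

The table answers only "is this arrow legal?". Deciding when an order should become invalid stays with the upper layer, such as the sales service system.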

At that time, the state machine already existed. We will describe the implementation of the state machine when we get to early 2017.

Actually, the process indicated by the red line is a compromise design for this latency-sensitive transaction scenario. Its most important task was tagging an order to indicate whether the after-sales service was enabled. At that time, e-commerce systems such as Taobao.com and JD.com showed this process vertically on their client pages, which made for a much simpler system design. However, our business pattern meant that we could not do this: merchants must receive orders in a very short time and keep an eye on abnormal cases. After weighing the trade-offs across many pages, we decided to prioritize user experience. That is, although we had split the system, we didn't split the business at the top layer. Actually, many colleagues complained that they did not understand why they had to bother identifying, differentiating, and connecting two systems for a simple refund function. Ultimately, we wrote some of the data back to the order system.

At this stage, we were guided by two sayings:

Attack problems, not people. Despite fierce discussions, everyone just wants to do things better. The bottom line is that we should not attack people. Actually, sometimes we can resolve conflicts by having a cup of tea together.
Insist on making things better because nothing is ever perfect. We need to make our best effort and then work to solve any problems instead of complaining about past decisions. Another important mantra is "know when to fold them." These two sayings do not contradict each other. Both require decisiveness.

Integration with Logistics

At the beginning of August, I took over the service logic of RabbitMQ. The first thing I did was to reconstruct the RabbitMQ service logic because both the design concepts and language stacks were different from those in the order system.

Let's first look at two outdated architecture designs.

ToC, ToB, and ToD

At the beginning of 2016, a term was popular that is now unknown to most people: BOD.

This was Ele.me's self-delivery pattern in the early stage. This service coupled orders, stores, delivery, and settlement. Ele.me's large logistics system had been built up from the middle of 2015. At the beginning of 2016, we started the project of decoupling BOD.

After this decoupling, service packages, ToB orders, and ToD orders were introduced.

Allow me to briefly explain the business background. At that time, the platform packaged and sold services to merchants and signed contracts with them. These services included the delivery service. Subsequently, whether a merchant used the delivery service affected the merchant's commissions and accounts receivable. The innovation here was that, when a merchant receives an order, the merchant is informed of the estimated amount that they will receive upon the completion of the transaction. In other words, the merchant can see a preview bill that should approximate the actual bill if no exceptions occur during sales. In addition, the merchant is reminded that the details on the final bill will prevail.
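
As a rough illustration of such a preview bill, the sketch below estimates what a merchant would receive for one order. The formula, the function name `preview_income`, and the parameters `commission_rate` and `delivery_fee` are all assumptions made for this example. The article does not describe how the real calculation worked.

```python
from decimal import Decimal, ROUND_HALF_UP


def preview_income(order_amount, commission_rate, delivery_fee=Decimal("0")):
    """Estimate the merchant's income for one order (hypothetical formula).

    The platform commission is rounded to cents; the delivery fee applies
    only if the merchant's service package includes platform delivery.
    """
    commission = (order_amount * commission_rate).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return order_amount - commission - delivery_fee


# A 100.00 order with a 15% commission and a 5.00 delivery fee.
estimate = preview_income(Decimal("100.00"), Decimal("0.15"), Decimal("5.00"))
```

Because this is only a preview, the final bill from the clearing and settlement system would still prevail if the two differ.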

This actually involves the logic of account splitting and profit splitting. In other words, the clearing and settlement system was introduced to the transaction pipeline. However, the clearing and settlement system generally deals with non-real-time services. After arguing for several days, the order team ultimately took over the task of calculating the expected income of merchants. At that time, we worked with many engineers who left Ctrip. Ctrip's business pattern was as follows: A user placed an order on the platform and then the platform placed an order from the supplier. Therefore, the terms ToC, ToB, and ToD were introduced.

I was assigned the task of developing a ToB order system. At that time, I felt that this pattern was incorrect because transactions in Ele.me were different from those in Ctrip. I expressed my opposition to this plan to my director, but was overruled. After all, I was a new graduate and was unable to give clear and persuasive reasons. Some other people also had doubts. Nevertheless, the ToB order system officially began its phased release in early March.

From the preceding figure, you can see several obvious problems:

A transaction was split into several segments. However, users and merchants actually need to perceive all the segments. In addition, each stage has certain requirements for timeliness and consistency.
The platform and the logistics system interacted with each other only through the channel indicated by the red line. Therefore, this channel had to deal with a heavy load.
Data was synchronized offline by using a formula.

ToD

After we implemented the preceding architecture, the ToD part became the only channel between the platform and the logistics system by July. This meant that the ToD part was under a heavy load. However, the business had not developed to that magnitude. As a result, the ToD part did more harm than good. The merchant client delivery team, the logistics team, and the order team were all unhappy.

It was at this time that the order team was adding the completed state feature. We thought that we should extend the controlled order lifecycle to the delivery stage and, as a sub-lifecycle, make the delivery stage a part of a transaction. Therefore, I was happy to go back to refactoring after the ToD system was handed over to me at the end of July.

From the perspective of external personnel who worked with the merchant client, the ToD design was not user-friendly at all.

When we actually took over the system, we found that the application architecture on the merchant client looked like the following figure.

The architecture included a public infrastructure layer that packaged public operations on databases and Redis. In other words, the business logic and data in the same domain were divided into services at different layers according to the layering rules of the system. To perform operations on its own data, the business layer in a domain had to call specified APIs. This architecture may be reasonable, and some companies actually use such an architecture. However, after this architecture was released, it caused a lot of problems. Due to the complex coupling, we had to try to turn a complicated maze into a simple and independent pipeline.

Then, we transformed the architecture to that shown in the following figure.

We merged ToB and ToD into one layer in the osc.blink service and eliminated these two concepts by using them as extended data of the order system, instead of as parts of a transaction.
Data interaction between the platform and the logistics system does not necessarily pass through this connecting layer. We recommended that this pipeline carry only the data required for delivery in real-time pipelines. The logistics service Apollo can obtain required data from other locations on the platform. Actually, some unsolved problems remained. The osc.blink service and the Apollo service defined each other differently. As the waybill center, Apollo collected all the data used to connect to the platform.
The interaction between nodes was simplified, and the robustness of nodes was intrinsically ensured. Originally, orders were pushed by using messages. After transformation, orders were pushed by using RPC. Producers can actively push orders by using a token to ensure idempotence while consumers pull orders in a compensatory manner.
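
The push-and-pull arrangement in the last point can be sketched as follows. This is a simplified in-memory model, not the osc.blink or Apollo code: the class names, the token format, and the `deliver` flag (used here to simulate a lost RPC call) are all hypothetical.

```python
class DeliveryReceiver:
    """Hypothetical consumer side that accepts pushed orders."""

    def __init__(self):
        self._seen = {}  # token -> order_id

    def push_order(self, token, order_id):
        """Accept an order once; a repeated token is a no-op."""
        if token in self._seen:
            return False
        self._seen[token] = order_id
        return True

    def known_order_ids(self):
        return set(self._seen.values())


class OrderProducer:
    """Hypothetical producer side that pushes orders over RPC."""

    def __init__(self, receiver):
        self._receiver = receiver
        self._outbox = {}  # order_id -> token

    def push(self, order_id, deliver=True):
        # The token is deterministic per order, so retries reuse it.
        token = f"push-{order_id}"
        self._outbox[order_id] = token
        if deliver:  # deliver=False simulates a failed RPC call
            self._receiver.push_order(token, order_id)

    def pending_for(self, known_ids):
        """Return orders the consumer has not yet acknowledged."""
        return {o: t for o, t in self._outbox.items() if o not in known_ids}


def compensate(producer, receiver):
    """Consumer-side pull: fetch orders that were sent but never received."""
    missing = producer.pending_for(receiver.known_order_ids())
    for order_id, token in missing.items():
        receiver.push_order(token, order_id)
    return sorted(missing)
```

With this shape, a retried push cannot create a duplicate order on the consumer side, and a push that never arrived is recovered the next time the consumer runs its compensating pull.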

Pipeline 3.1 shown in the preceding figure was packaged by this service because the data center of the takeout platform and the data center of the logistics platform were deployed in different cities and processing requests across data centers multiple times adversely impacted the user experience.

By the end of August, the order call part had been released. In September, we started restructuring the data.

Summary

By the end of 2016, the overall transaction system was as shown in the following figure.

At that time, we developed some good habits and approaches:

Divide permissions and responsibilities. This included reclaiming permissions for the code repository, reclaiming release permissions, and controlling access to the connection strings of databases and message queues.
Keep it clean. Regularly clear out useless logic. For example, every one or two months, we would clear unused APIs and check APIs with abnormal traffic growth. This was necessary because the downstream owners can perform operations at will.
- Promptly clear out useless configurations. Otherwise, no one will dare to clear these configurations after the system is maintained by people who were not involved in writing them.
- Promptly handle exceptions and error logs to reduce distractions from alerts and troubleshooting.
Pursue perfection but keep your feet on the ground.
Adhere to the standards and execution mechanisms of testing systems.
- Adhere to automation construction.
- Adhere to performance testing.
- Adhere to fault drills.
Keep learning, communicating, and thinking.
Keep it simple and easy.
Attack problems, not people.
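
The periodic API cleanup mentioned under "Keep it clean" could be supported by a small report such as the one below. This is only a sketch under stated assumptions: the function `audit_apis`, the per-period call counts it takes as input, and the `growth_threshold` of 2.0 are hypothetical and are not taken from the team's tooling.

```python
def audit_apis(previous, current, growth_threshold=2.0):
    """Return (unused, surging) API names.

    previous and current map an API name to its call count in two
    consecutive periods, for example the last two months.
    """
    all_names = previous.keys() | current.keys()
    # Unused: no calls at all in the current period.
    unused = sorted(name for name in all_names if current.get(name, 0) == 0)
    # Surging: traffic grew by at least growth_threshold times.
    surging = sorted(
        name
        for name, count in current.items()
        if previous.get(name, 0) > 0
        and count / previous[name] >= growth_threshold
    )
    return unused, surging


# Hypothetical API names and call counts.
last_month = {"get_order": 1000, "legacy_refund": 3, "query_bill": 200}
this_month = {"get_order": 1100, "legacy_refund": 0, "query_bill": 900}
unused, surging = audit_apis(last_month, this_month)
```

In this example, `legacy_refund` is reported as a candidate for removal, and `query_bill` is flagged for a conversation with the downstream owner.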

It is best that architecture evolution be driven by businesses. It should be forward-looking, not accident-driven. Looking back, we found that half of the evolution actually resulted from accidents. Fortunately, at the time, we could freely manage technologies in most cases.

If you have read this article and have similar thoughts but are reluctant to write them down, that means you are already starting to organize your thoughts in your head.

During the six-month internship, I had a new experience every month. During the first year and a half after graduation, I was always ashamed when I thought back to how weak my skills had been just three months earlier. However, the first two years after graduation were also the most precious time I have had at Ele.me.
