By Bi Xuan.
As a system architect at Alibaba, I have always felt that teaching system design is far more difficult than teaching Java programming skills. In fact, I have long felt that a class on system design can easily turn into a purely theoretical discussion that yields disappointing results.
I've often heard people say that anyone, even someone without much experience in system design, could easily teach how to design a system, but this is far from the truth. Still, because of this saying, I always felt a bit discouraged about being a trainer in this field. Some situations over this past year gave me the courage to run a private training program internally, which turned out to be popular. I think that I was the one who got the most out of this training program: I got to organize my ideas, learn a lot from my interaction with trainees, and better abstract and summarize methodologies of system design. Given all the great feedback and input I received, it can be said that the system design courses were jointly created by me and my trainees.
I set the following goals for system design training:
To achieve these goals, I will analyze the following issues before giving the class:
In the past, I did not think carefully about the conceptual framework of system design. In fact, many system design templates are a type of conceptual framework, but they are difficult to apply if you do not understand them. In this blog, I hope to help you not make the same mistakes I made in the past.
I reviewed my past system designs and found that my best work tends to follow a routine. In that routine, my thinking and design work proceed in the following sequence:
Purpose of system design > goals of system design > goal-centered core design > design principles formulated based on core design > detailed design of each subsystem and module
The purpose of system design is what the system is intended to be used for. When engaging in system design, many people are unclear about why they are designing a new system, or why they are restructuring or evolving an existing one. This is a problem. Aimless system design may result in a series of deviations, and may even lead to results that run contrary to the original intent, for example when a new system is designed, or an existing system is restructured or upgraded, merely to solve a single issue. Bear in mind, of course, that a great architect should be capable of explaining system design to people of different skill levels in the field as well as to people of different backgrounds. A team can implement a system design well only after the system architect has understood its purpose and made that purpose clear to every member of the team.
Next, it's important to consider the feasibility of formulating measurable goals based on the purpose of system design to minimize the deviation between the final system implementation and the original purpose of the system design. I believe that many of you may encounter situations where the final system implementation deviates greatly from the design. This is mainly because the whole team who implemented the system failed to formulate measurable goals for system design and to reflect these goals in system design.
This step is about how to achieve your system design goals through your technical expertise, perspective, comprehensive consideration, subjective trade-off principles, and ideas for solving problems, which together are the critical elements of the final core design. During core design, you will formulate new goals to measure the final implementation of your design. You need to add these goals to the system design so that you can visualize the deviations between the final implementation and the original design.
After completing the core design, you can formulate design principles and ensure that these principles are observed and reflected in the subsequent detailed design of subsystems and modules. Doing so can make the overall system design more consistent and better integrated.
You will find it less complex to design subsystems and modules in detail, because after the preceding steps this task is a matter of narrowing down the scope rather than redesigning the main system. Programmers are usually good at solving problems, so I think that good mathematical skills are the primary requirement for programmers because mathematics is a typical problem-solving discipline.
At its core, my idea for helping you master and apply the conceptual framework of system design is to share my past errors and practical experience as I elaborate on each step. I think that only architects with a lot of practical experience are capable of providing training courses in system design.
In the application part of the overall process, the method adopted here is to let everyone elaborate on his or her own system by following the same conceptual framework, to align with each other through interaction, and to make this practice a habit.
After a brief description of each step, I would like to elaborate on the first three steps.
Determining the purpose of system design is the first and also the foremost step of system design. Without setting a purpose from the start, you will most definitely go wrong in the subsequent steps. A typical problem of many system designs is the lack of deep analysis into the purpose and intent of the system construction.
System design is the process of designing a new system, or significantly transforming or upgrading the architecture of an existing one, to serve a definite purpose. By analyzing the purpose of system construction, you can avoid problems from the get-go. System construction should respond to the challenges or problems at the service layer or system user layer, rather than meet your own personal requirements. Through purpose analysis, you can also ensure that the purpose is achieved in subsequent steps throughout the system design process.
By analyzing the purpose of system construction, you can easily identify the specific pattern and depth of the system. These two concepts seem shallow but are actually rather down-to-earth, because they show the scope of impact of the job that you have done. The impact may extend from your own team to the department, the business unit (BU), the business group (BG), a cross-BG business module, the entire group, or even society. The norm here is to seek the truth behind the facts. That is, I do not recommend that you indulge in talk about theories while being at a loss for what to do in system design. Rather, I recommend that you get your hands dirty in the design itself.
I conclude from my early experience in developing the high-speed service framework (HSF) that my lack of a clear purpose of system construction led me to make several serious errors. A typical error occurred during my effort to restructure HSF for a dynamic system architecture. If I had analyzed the purpose and intent of my effort at that time, I would have found that my work was solely out of my own personal intention to delve into the technology, rather than a need to solve the service challenges or problems facing HSF users. This error of mine relates to the issue at the beginning of the entire process that I talked about earlier. It is also found in many technical engineers when they initiate major system restructuring solely to meet technical demands. I used to be under the mentorship of an Alibaba senior executive who advised me to always clarify the purpose and cause of a job before engaging in it. This helped me realize that I can get rid of many complex technical details by clearly understanding the motivation before starting the work.
Due to my experience in HSF and Alibaba HBase development, I was able to better grasp the purposes of my subsequent work in developing Alibaba Cloud's Container Service, as well as scheduling services and active geo-redundancy applications. At the same time, I was also able to orient my work toward the need to solve the service challenges of Alibaba.
In most cases, the need for system design is raised by people other than the system designer. The architect should therefore be able to convert that need into a design decision, that is, to determine whether to build a new system or to restructure or upgrade the old one based on the request, and should also understand the purpose of system construction. Only then can the architect explain the motivation of system design to the technical team so that they understand its value and meaning.
In my opinion, we need to clearly analyze the purpose of system construction before conducting system design to ensure that system construction is of value and meaning and that the goals of system design can be achieved.
After clearly analyzing the purpose of system construction, we need to convert the purpose into a series of measurable goals to ensure that the system implemented in the end meets the purpose of system construction. I believe that many of you may encounter inconsistencies between the designed system and the implemented system, which result from a lack of measurable goals.
Let me provide two examples:
With clear and measurable goals of system construction, we can:
That is, you want to build a system to monitor the system construction effect. This is the most important piece of the puzzle of system design, but it can be easily ignored. For example, we displayed the number of servers supporting services in containerized clusters and the volume of services supported, and compared these two figures with those of uncontainerized clusters. We also built a control system for active geo-redundancy to show the system deployment status and traffic switching status. Only by building a system that monitors whether the goals of system construction are met can we ensure that the constructed system satisfies our initial motivation. Otherwise, we may run counter to the purpose of system construction after the system is implemented. Therefore, the monitoring system must be built during system construction.
The process of formulating goals based on a purpose is not theoretically complex but is easily ignored, resulting in problems during system design. The key is to formulate measurable goals and build a system to monitor the progress towards achieving the goals.
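As a minimal sketch of what "measurable goals plus monitoring" can look like in code, the snippet below compares measured values against targets and reports the goals that are not yet met. The goal names, the 60-second target, and the measured readings are hypothetical and chosen only for illustration; this is not the monitoring system described in this article.

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Goal:
    """A measurable goal: a metric name, a target, and the direction that counts as good."""
    name: str
    target: float
    higher_is_better: bool

    def is_met(self, measured: float) -> bool:
        compare = operator.ge if self.higher_is_better else operator.le
        return compare(measured, self.target)


def unmet_goals(goals, measurements):
    """Return the names of goals that have no measurement or miss their target."""
    failing = []
    for goal in goals:
        measured = measurements.get(goal.name)
        if measured is None or not goal.is_met(measured):
            failing.append(goal.name)
    return failing


# Hypothetical goals, for illustration only.
GOALS = [
    Goal("apps_per_server", target=20, higher_is_better=True),
    Goal("traffic_switch_seconds", target=60, higher_is_better=False),
]

# Hypothetical readings that a monitoring system might report.
MEASURED = {"apps_per_server": 14, "traffic_switch_seconds": 45}

print(unmet_goals(GOALS, MEASURED))
```

A goal with no measurement is treated as unmet on purpose: if nothing reports on a goal, there is no way to tell whether the implementation has drifted from the purpose.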
To achieve the measurable goals of system design, we need to identify core issues and solve them through system design.
I would like to talk about this topic based on my experience.
We had a clear purpose of HSF development from the start, with the measurable goal of supporting hundreds of millions of service calls per day. However, due to our limited technical skills at that time, the core issues that we refined differed greatly from the actual situation, which led us to constantly restructure and modify HSF. Therefore, I will never believe that a person with poor technical skills can be a good architect, because it takes solid technical skills for an architect to draw the boxes of a structure chart in a rational way.
The process of drawing may appear simple to people who only pay attention to the boxes. During the initial HSF design, we identified the core issue of how to implement a user-friendly, service-defined Remote Procedure Call (RPC) framework, but we ignored the issues of how to support hundreds of millions of interactive calls and what problems (such as complex troubleshooting) would occur in service R&D after servitization. For example, the issue of intermediate load balancing was identified only after HSF was launched and led to a redesign of the HSF structure. An architect with broader knowledge would have identified this issue from the start.
In hindsight, to achieve the goals of HSF design, we need to solve the following core issues:
In hindsight, we formulated a clear purpose and related goals and properly refined issues in the T4 (containerization) phase, during which we solved the core issue of how to run 20 applications on a server. Most of the problems that occurred during T4 were related to the design scheme for the core issue. I will discuss this in the next topic about system design based on core issues.
We also formulated a clear purpose and related goals when developing active geo-redundancy. Although we tried to simultaneously carry traffic in multiple cities in China and to switch traffic within dozens of seconds, we could not eliminate the network latency that results from the physical distance present in active geo-redundancy applications. To enable dynamic traffic switching in active geo-redundancy applications, we need to solve the following core issues:
The conceptual framework of system design had been maturing over the recent years of our effort in unified scheduling, so we have a clear purpose and goals in this aspect. In light of the actual situation we encountered, we needed to solve the following core issues to implement unified scheduling:
The preceding cases indicate that technical skills are required to solve the core issues when measurable goals are mapped to the technical level. For projects and products of the engineering type, engineering experience is also essential in solving these core issues and is a direct measurement of an architect's competence.
We can start designing to solve core issues after we have set the purpose and measurable goals of system construction and determined the core issues. Many technical engineers tend to jump straight to the design part, and expect the most from it, without any purpose, goals, or core issues in mind. As a result, the system that is finally implemented fails to solve service challenges or does not function as intended, so I strongly recommend that system designers follow the specified procedure.
I will talk about how to design to solve core issues based on my cases. My previous experience reveals that I made quite a few errors and encountered many complex trade-offs in system design, but my experience also allowed me to gradually understand the capabilities that a competent architect should have.
The first core issue to solve in the initial stage of HSF design is to build an easy-to-use RPC framework able to support hundreds of millions of service calls per day.
I brushed aside the issue of ease of use and thus made an error in the initial version, but fortunately it didn't cost much to correct. In this version, to publish a Spring bean as an HSF service or to call an HSF service, I needed to write a file that described the service to publish or call, and put the file in a JBoss directory. Although this method does not seem to intrude into the coding process, it raises a series of complex issues, such as where to write the file and how to automatically put the completed file in the directory during deployment. In the second version, the method was changed to publish and call services by using a Spring bean. Although the modified method makes service code depend on HSF, it standardizes the maintenance and deployment processes. Therefore, to design properly, we need to consider not only how to implement a system but also how the system will be used, run, and maintained.
The second error that I made in HSF development concerned the goal of an RPC framework able to support hundreds of millions of service calls per day. This error taught me the biggest lesson of my coding career and completely changed how I select technologies in subsequent design projects. Before engaging in HSF development, I had never built a system with more than 1 million visits per day and had no clue how a system with hundreds of millions of visits per day would be any different. JBoss Remoting was selected as the communications framework in the initial version of HSF simply because we used JBoss as the web container. However, this version experienced a serious fault when a major system went online, greatly slowing down the response of the entire website. We could not identify the cause after a whole day of troubleshooting and had to roll back, although we were sure that the fault had been caused by the HSF launch. We identified the cause one week after the rollback: JBoss Remoting set the default timeout for remote calls to 60 seconds, while the backend system was slow in processing some services, so the shared processing thread pool filled up and slowed down the website response. This error made me realize that, to design a system with a high access volume, I need to be clear about the processing mechanism of the system, because minor problems may escalate uncontrollably and cause faults under heavy access. Therefore, a system with a high access volume has high requirements for controllability, which sets it apart from systems with average access volumes. Controllability does not mean that you have to write all the code yourself, but that you must clearly understand the logic of any open-source code you use. To correct this error, I wrote a dedicated Mina-based communications framework for HSF that handles connections and the thread pool in a special way.
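The effect of a long default timeout on a shared thread pool can be estimated with simple arithmetic. The sketch below applies Little's law (threads held = arrival rate of slow calls × time each call holds a thread) and assumes, as a worst case, that every slow call hangs until the timeout. The 60-second figure comes from the incident above; the pool size, the rate of slow calls, and the 3-second alternative are hypothetical.

```python
import math


def threads_held(slow_calls_per_second, hold_seconds):
    """Steady-state number of threads blocked on slow calls (Little's law: L = rate * time)."""
    return math.ceil(slow_calls_per_second * hold_seconds)


def pool_is_exhausted(pool_size, slow_calls_per_second, timeout_seconds):
    """True if slow calls alone would occupy every thread of a shared pool.

    Assumes each slow call holds its thread for the full timeout (worst case).
    """
    return threads_held(slow_calls_per_second, timeout_seconds) >= pool_size


POOL_SIZE = 200            # hypothetical size of the shared pool
SLOW_CALLS_PER_SECOND = 5  # hypothetical rate of calls that hit the slow backend

for timeout in (60, 3):
    held = threads_held(SLOW_CALLS_PER_SECOND, timeout)
    exhausted = pool_is_exhausted(POOL_SIZE, SLOW_CALLS_PER_SECOND, timeout)
    print(f"timeout={timeout}s threads_held={held} exhausted={exhausted}")
```

Under these assumed numbers, a 60-second timeout would pin 300 threads, more than the whole pool, while a 3-second timeout would pin only 15. The slow backend is the same in both cases; only the timeout decides whether it can take the rest of the site down with it.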
I have followed the principle of technical controllability during my subsequent projects of HSF transformation and other technical transformations.
When I talked about core issues previously, I mentioned that the refinement of core issues during HSF design was problematic, causing extensive rework on load balancing and post-servitization troubleshooting. These errors can be avoided and are no longer made by developers of service-oriented frameworks.
Load balancing was implemented by hardware load balancers in earlier versions, which caused several problems. One problem was that the virtual IP address of the service to call had to be configured, for example, by using a central configuration server. Another problem was that HSF uses persistent connections, which caused further complications when the virtual IP address was used to connect to a backend cluster. For example, when the backend cluster was restarted for a release, the distribution of connections could become highly uneven, thus causing faults.
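The uneven distribution after a restart can be reproduced with a small simulation. The sketch below assumes a rolling restart in which each dropped client reconnects immediately and the balancer sends it to the least-loaded backend that is still up; the least-connections policy, the function name, and the backend and client counts are assumptions for illustration, not details from HSF.

```python
def rolling_restart(num_backends, num_clients):
    """Simulate persistent connections during a rolling restart behind one virtual IP.

    Returns the number of connections per backend after every backend
    has been restarted once, in order. Requires at least two backends.
    """
    # Start with the persistent connections spread as evenly as possible.
    conns = [num_clients // num_backends] * num_backends
    for i in range(num_clients % num_backends):
        conns[i] += 1

    for restarting in range(num_backends):
        dropped = conns[restarting]
        conns[restarting] = 0
        candidates = [b for b in range(num_backends) if b != restarting]
        for _ in range(dropped):
            # A dropped client reconnects at once and lands on the
            # least-loaded backend that is currently up.
            target = min(candidates, key=lambda b: conns[b])
            conns[target] += 1
        # The restarted backend comes back empty: existing persistent
        # connections stay where they are and never move back to it.

    return conns


print(rolling_restart(num_backends=4, num_clients=12))
```

With 4 backends and 12 clients, the simulation ends at `[4, 4, 4, 0]`: the backend restarted last receives no traffic at all until some connection is dropped and re-established, even though the balancer made a locally sensible choice at every step.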
In addition to the preceding two problems, another problem that led us to transform HSF is that the hardware load balancers that were then used had reached their maximum traffic capacity, which was bound to happen and would crash the website. To remove this high risk and solve the preceding two problems, we decided to thoroughly transform HSF by designing software-based service registration, discovery, and addressing, which is the typical structure of a service framework system.
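To make the idea of software-based registration, discovery, and addressing concrete, here is a minimal in-memory sketch. The class names, the service name, and the addresses are hypothetical, and the round-robin choice on the consumer side is only one possible addressing policy; none of this reflects the actual HSF implementation.

```python
import itertools
from collections import defaultdict


class ServiceRegistry:
    """Keeps track of which provider addresses currently offer each service."""

    def __init__(self):
        self._providers = defaultdict(list)

    def register(self, service, address):
        if address not in self._providers[service]:
            self._providers[service].append(address)

    def unregister(self, service, address):
        if address in self._providers[service]:
            self._providers[service].remove(address)

    def lookup(self, service):
        # Return a copy so callers cannot mutate the registry's state.
        return list(self._providers[service])


class Consumer:
    """Discovers providers through the registry and balances calls on the client side."""

    def __init__(self, registry, service):
        self._registry = registry
        self._service = service
        self._counter = itertools.count()

    def pick_address(self):
        providers = self._registry.lookup(self._service)
        if not providers:
            raise LookupError(f"no provider registered for {self._service}")
        # Round-robin over whatever providers are registered right now.
        return providers[next(self._counter) % len(providers)]


registry = ServiceRegistry()
registry.register("OrderService", "10.0.0.1:12200")
registry.register("OrderService", "10.0.0.2:12200")

consumer = Consumer(registry, "OrderService")
first_three = [consumer.pick_address() for _ in range(3)]
print(first_three)
```

Because the consumer connects to providers directly, no hardware device sits in the traffic path, and a provider that unregisters simply disappears from the next lookup.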
In hindsight, the defective load balancing function resulted from our incomplete consideration of what a system with a high access volume requires.
Our initial design did not take into account post-servitization troubleshooting. As a result, we had to invest a lot of manpower in fixing problems, but with low efficiency. To improve troubleshooting, we tried Dapper of Google but spent a lot of time on implementation.
HSF had other design problems as well. For example, the earliest communications protocol did not carry a version number, which complicated compatibility processing during subsequent upgrades. A tougher problem was multi-language support.
HSF was the first core system with a high access volume that I had ever designed. I made numerous errors due to my insufficient technical skills at that time, causing countless revisions, faults, and repairs, but I improved greatly along the way. In retrospect, I am grateful to my supervisor, who gave me great tolerance and support. My experience in HSF design made me realize that, to solve the core issues of design, an architect must have deep technical skills in technology selection, extensive knowledge of design schemes, and comprehensive consideration of development, deployment, operation, and O&M.
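The value of a version field is easy to see in a small framing sketch. The header layout below (a magic number, a one-byte version, and a body length) and the two body encodings are hypothetical and are not the HSF wire format; the point is only that a receiver can pick the right decoder, or reject a frame cleanly, once the version travels with every message.

```python
import struct

MAGIC = 0xCAFE
# Hypothetical header: magic (2 bytes), version (1 byte), body length (4 bytes), big-endian.
HEADER = struct.Struct(">HBI")

# One decoder per protocol version; a new version adds an entry instead of breaking old peers.
DECODERS = {
    1: lambda body: body.decode("ascii"),
    2: lambda body: body.decode("utf-8"),
}


def encode_frame(version, body):
    """Prefix the body with a header that carries the protocol version."""
    return HEADER.pack(MAGIC, version, len(body)) + body


def decode_frame(frame):
    """Return (version, decoded body), or raise ValueError for a frame we cannot handle."""
    magic, version, length = HEADER.unpack_from(frame)
    if magic != MAGIC:
        raise ValueError("not a frame of this protocol")
    body = frame[HEADER.size:HEADER.size + length]
    if len(body) != length:
        raise ValueError("truncated frame")
    decoder = DECODERS.get(version)
    if decoder is None:
        raise ValueError(f"unsupported protocol version {version}")
    return version, decoder(body)


print(decode_frame(encode_frame(1, b"ping")))
```

Without the version byte, a receiver would have to guess the body format from its content, which is the kind of compatibility handling that becomes painful during upgrades.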
I did not have many difficulties in refining the core issues of T4, but made stark errors in the design that aimed to solve the core issues.
To run more application processes on a server than on a virtual machine, we isolated processes by means of hacking. Though applications were still functional, the hacking method made enumeration difficult after the applications were launched within a small scope with an established user base. We solved this problem only after we found LXC.
We also encountered many similar problems in the T4 phase, such as how to control disk space limits, which were handled by using the image method at first. However, the image method was not friendly in the case of disk space overselling, so we resorted to the dir quota method, which took us more than a month to implement because we had to write .cp files for online applications.
The errors I made in HSF design were mostly due to my lack of deep technical skills, while the errors I made in T4 design were mostly due to my narrow technical perspective at that time. I believe that a competent architect should have an open perspective in the technical field of interest, with sufficient knowledge in engineering and academics in this field, so as to select proper technologies to achieve the purpose and goals of design under certain constraints. I once wrote an article about how to expand our technical perspective.
In retrospect, my design of active geo-redundancy was more of a process of making choices based on my previous experience, with a controlled error rate. Therefore, I will talk about how to make trade-offs under constraints to solve the core issues of active geo-redundancy design.
The core issues of active geo-redundancy design are request closure and data consistency. To solve them, we referred to some engineering cases but found that our situation was quite different.
Here I will present some of the choices that had to be made in designing active geo-redundancy, so that you can think about and discuss them. I will not go into the reasoning behind my own selections.
If you are interested in active geo-redundancy design, you can search for more information on the Internet.
In the future, I will elaborate on my work over the past few years in designing unified scheduling systems and cloud-based systems.
Let me summarize the following capabilities that an architect must possess to define the purpose and measurable goals of system design, refine design-related core issues, and solve these issues:
With all of this said, I think that "architect" is not a generalized title. It takes a lot of effort as well as long-term practice, and also an accumulation of much experience before someone can become a qualified architect, especially an engineering architect.
For me, system design is the most complex topic to give training in. I am grateful to my colleagues who have attended my system design training courses and have helped me with my writing and discussion on system design. Through all of this work, I've realized that system design is actually a rather down-to-earth job that requires skills which can be trained and learned over time.