Today, I will walk you through the evolution of Toutiao's architecture. We have previously shared articles on specific technical details, but in this article I will focus on the infrastructure and architecture side. Our mission is to help our engineers iterate better and faster by providing better infrastructure.
From an architectural point of view, the pressure on the technical team stems primarily from three aspects:
Toutiao is growing by leaps and bounds. Although the company was established only four years ago, its headcount and the size of its business have increased tremendously, and so has the pressure to launch new business quickly while keeping services stable and available. The engineering team has run into many downtime issues, including poor server performance during large-scale campaigns and failures of core services. The question is how to cope with such issues.
First, let me explain the evolution of architecture. Companies face different challenges and business pressures over time. Small companies deal with low QPS and low business volume. As companies grow, the main problem is no longer the servers but stability: they need to fine-tune their systems, absorb the additional pressure that growth brings, and provide a more stable environment. In that sense, the evolution of architecture is a never-ending process.
Why is Toutiao under such massive pressure? Because it is growing extremely fast. As you can see in the figure above, although it has been in business for just four years, its number of DAUs doubled year over year from 2014 to 2016. This created significant challenges on the business side. As we scaled up, it was difficult to scale our original architecture linearly, and even the services that could scale linearly ran into a variety of problems. Because the business grew so quickly, the backend architecture came under considerable pressure.
How did Toutiao's architecture develop? No architecture stays perfect forever; an architecture must be dynamic and keep changing. Because quantitative changes eventually cause qualitative changes, different stages of development require different architectures.
When do you need to transform your architecture? In general, when a company sees more system problems, incidents, and alarms, or declining communication efficiency, the underlying architecture very likely has issues.
A small problem in the architecture can take a lot of time to fix.
Moreover, as the business scales up, this burden grows. Many people will agree that it is easy to come up with product ideas but very difficult to actually implement them and build good software. Technical transformation is a long journey that can take years, so as a starting point, focus on increasing the speed of iteration. An architecture will inevitably deteriorate, so rather than aiming for a perfect architecture from the start, it is better to aim for agility. Agility and speed of evolution matter more.
Toutiao started out as a simple web application whose goal was to build a database and get the business running. Its initial advantages lay in its recommendation engine, data mining, and offline computation capabilities. The online service was relatively straightforward and needed just three layers. When the business launched, there were no issues, and as the volume of requests grew, scaling horizontally was enough.
The architectural evolution of most companies looks very similar: whenever the previous version ran into performance problems, they split it, and during optimization the code of the most heavily loaded parts was split out. In the figure above, A, B, and C are different businesses. Their code was combined at the beginning, but as it evolved, it was split inconsistently, which became very painful after one or two years of iteration.
The first-stage architecture did not take into account growth in staff or business scale. At first, nobody was specifically responsible for architectural optimization, and most employees focused on the business itself by adding features. For example, if recommendation results did not meet expectations, the team would improve the recommendation engine. There were no dedicated resources or separate DevOps team to organize the overall architecture.
By the end of last year, the resources budgeted for each quarter were fully used up as early as the second month of the quarter, and peak-time pressure reached 60% to 70%. This revealed two problems: first, specific components suffered performance degradation; second, the pressure on the business was too high.
The architecture team had to find ways to work more efficiently and keep services running even in the face of access problems, high pressure, and insufficient resources. As the business grew, the cost and burden of transformation increased drastically. Because of these problems, we proposed microservices as the solution for our next stage.
Our current plan is to build a new architecture based on microservices: breaking systems into subsystems, splitting large applications into small ones, and reusing code across layers.
The layering of systems is quite common. We are focusing on infrastructure and hope to improve iteration speed, disaster recovery, and other tasks through infrastructure. We hope that all the business teams can make business iterations and structural adjustments more efficiently.
The most critical aspects of microservices in our opinion are as follows:
In practice, the key to microservices is autonomy. Although microservices are autonomous and self-contained, they still need a hierarchy. For example, if a service you depend on comes from a third-party company such as Weibo, you cannot simply ask Weibo to change it for you. Within a company, microservices need boundaries but cannot be made too independent; otherwise, communication costs rise. It is best to build reusable infrastructure and specifications.
What are microservices?
Finally, I will talk about how Toutiao implements the concept of servitization and how we provide services to developers across various business teams.
Outlined below are Toutiao's main servitization ideas:
The service center provides information about each service and indicates how it is used, which makes it easy for other people to find and call these services. The owner can manage the information about the quality of the service it provides. For example, services can be connected through the service center while Redis is used for load balancing.
Service authorization is a crucial concept between services. Usually, when we start a service, other services can connect to it via its IP address, and a database authorizes access via username or IP address. However, many intranet services have few restrictions, so services do not record their relationships with one another in a global topology, thereby ensuring that they are
The service (SA) provides an interface through which we can specify an owner for authorization, so that the service is only accessible to other services after they have been authorized.
Descriptors: What does this service look like? What is its maximum QPS? Descriptors answer these questions. For example, when the user information service cannot cope with incoming requests, it can reject them and leave the resources to other services so that more work gets done. Descriptors can also store the d.
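The talk does not describe the service center's internals. As a rough illustration only, here is a minimal in-memory sketch, assuming a registry that records each service's owner, description, and instances, and hands out instances round-robin. `ServiceCenter`, `register`, `describe`, and `pick` are hypothetical names; the original mentions Redis for load balancing, whereas this sketch keeps everything in process memory.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ServiceInfo:
    """Metadata the (hypothetical) service center keeps for one service."""
    name: str
    owner: str
    description: str
    instances: list[str] = field(default_factory=list)


class ServiceCenter:
    """Hypothetical in-memory service center: registration, lookup, round-robin pick."""

    def __init__(self) -> None:
        self._services: dict[str, ServiceInfo] = {}
        self._next: dict[str, int] = {}

    def register(self, name: str, owner: str, description: str, instance: str) -> None:
        info = self._services.get(name)
        if info is None:
            info = ServiceInfo(name, owner, description)
            self._services[name] = info
        elif info.owner != owner:
            # Only the recorded owner may add instances to an existing service.
            raise PermissionError(f"{owner} does not own {name}")
        if instance not in info.instances:
            info.instances.append(instance)

    def describe(self, name: str) -> ServiceInfo:
        return self._services[name]

    def pick(self, name: str) -> str:
        """Return the next instance address in round-robin order."""
        instances = self._services[name].instances
        if not instances:
            raise LookupError(f"no instances registered for {name}")
        i = self._next.get(name, 0)
        self._next[name] = i + 1
        return instances[i % len(instances)]
```

Recording the owner next to the instances is what lets callers see who is responsible for a service; a production registry would also need health checks and persistence, which this sketch omits.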
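The owner-controlled authorization described above can be sketched as a per-service allow list that only the owner may change. This is an illustration under assumptions, not Toutiao's implementation: `AuthorizationTable`, `grant`, `revoke`, and `is_allowed` are hypothetical names, and the table denies by default, matching the idea that a service is only accessible after authorization.

```python
class AuthorizationTable:
    """Hypothetical per-service ACL that only the service owner may modify."""

    def __init__(self, service: str, owner: str) -> None:
        self.service = service
        self.owner = owner
        self._allowed: set[str] = set()

    def _check_owner(self, actor: str) -> None:
        if actor != self.owner:
            raise PermissionError(f"{actor} is not the owner of {self.service}")

    def grant(self, actor: str, caller: str) -> None:
        """Owner authorizes `caller` to call this service."""
        self._check_owner(actor)
        self._allowed.add(caller)

    def revoke(self, actor: str, caller: str) -> None:
        """Owner withdraws a previous authorization."""
        self._check_owner(actor)
        self._allowed.discard(caller)

    def is_allowed(self, caller: str) -> bool:
        # Default deny: unknown callers are rejected.
        return caller in self._allowed
```

Because every grant names a caller and a callee, the same table doubles as a record of which services depend on which, which is the kind of relationship data that is otherwise missing on the intranet.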
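One way a descriptor's maximum QPS could be enforced is a token bucket that rejects requests once the budget is spent. The talk does not say which limiting algorithm Toutiao uses; the token bucket, `ServiceDescriptor`, `TokenBucket`, and `limiter_for` below are assumptions for illustration. The clock is injectable so the behavior can be checked deterministically.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ServiceDescriptor:
    """Hypothetical descriptor: what the service is and how much load it accepts."""
    name: str
    max_qps: float


class TokenBucket:
    """Refills `rate` tokens per second up to `capacity`; each request costs one token."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over budget: reject instead of queueing


def limiter_for(desc: ServiceDescriptor,
                clock: Callable[[], float] = time.monotonic) -> TokenBucket:
    # Allow bursts of up to one second's worth of requests.
    return TokenBucket(rate=desc.max_qps, capacity=desc.max_qps, clock=clock)
```

Rejecting early, rather than queueing, is what frees resources for other services when one service is overloaded.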
Service authorization ideas:
Let's use Thrift as an example. In the figure, there are dotted lines on both sides. The service center scales horizontally, so a service can ask it for the necessary authorization information. You might ask whether you are allowed to call this service. The default answer is yes, and it is just a Thrift package. The service knows who you are, and your policies are carried into the call through the service package. When the request arrives, the call is analyzed to check whether there are any problems. This is also part of the specification, and developers don't need to worry about how the framework implements it.
The other call also goes through the service center. However, because QPS is already under pressure, the service cannot support the call and rejects it. The upside is that no resources are wasted. Now consider virtualization with Docker. The previous idea was to authorize by IP, apply controls per IP, and provide an anonymous service based on the node's IP. It is ambiguous to get an identifier using . Moreover, it is also tricky to use the network layer. However, the internal network environment has a certain degree of credibility.
Outlined below is a MySQL scheme we are currently working on. Unlike with Redis, a MySQL call can carry the identity of the requester. An important database needs security authorization, and what I just described applies to normal circumstances. These methods are layered on top to carry the original identity information, and Redis is used for permission checks.
Redis can't do that at the protocol layer, but adding the information mentioned above in the MySQL call doesn't affect the semantics. If our servers provide an HTTP interface, it is viable to add this information in the HTTP header for authentication and authorization.
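For the HTTP case, a minimal sketch of header-based authorization might look like the following. The header name `X-Caller-Service`, the function `authorize_request`, and the status-code choices are hypothetical; a real framework would perform this check in middleware. Note that a self-declared header is only as trustworthy as the network it travels over, which fits the point above that the internal network has a certain degree of credibility.

```python
CALLER_HEADER = "X-Caller-Service"  # hypothetical header carrying the caller's identity


def authorize_request(headers: dict[str, str], allowed_callers: set[str]) -> tuple[int, str]:
    """Return (HTTP status, detail) for a request based on its caller header."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {k.lower(): v for k, v in headers.items()}
    caller = normalized.get(CALLER_HEADER.lower())
    if caller is None:
        return 401, "missing caller identity"
    if caller not in allowed_callers:
        return 403, f"{caller} is not authorized"
    return 200, caller
```

Because the identity travels in a header, adding it does not change the meaning of the request body, which is the property the text relies on for MySQL and HTTP but which Redis's protocol cannot offer.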
A call is allowed only once an authorization relationship exists: to call a service, you must be authorized in advance. Once the real topology is available online, we can optimize alerting. When Redis or MySQL raises an alarm, having this topology speeds up problem identification.
With this topological information, we know the global meta-information of each service and can better assess the impact of service changes.
Figure 14
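To make the alarm example concrete, here is a sketch of impact analysis over a call topology, assuming the authorization relationships are available as (caller, callee) pairs. `impacted_services` and the sample service names are hypothetical. A breadth-first search over reversed edges finds every service that directly or transitively depends on the failing component.

```python
from collections import defaultdict, deque


def impacted_services(calls: list[tuple[str, str]], failing: str) -> set[str]:
    """Return every service that directly or transitively calls `failing`."""
    callers_of: dict[str, set[str]] = defaultdict(set)
    for caller, callee in calls:
        callers_of[callee].add(caller)

    seen: set[str] = set()
    queue = deque([failing])
    while queue:
        node = queue.popleft()
        for caller in callers_of[node]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    seen.discard(failing)  # in case of a cycle back to the failing node
    return seen
```

For example, if a MySQL instance alarms, the result tells you which upstream services to check first instead of waiting for their own alarms to fire.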
We have developed our own RPC framework, which also generates code for us. We have dedicated a lot of resources to it. The main features of the framework include:
For example, in terms of observability, every service can expose its internal state. This is useful because, once a service is launched, the framework provides analysis of its internal state or service ports by default. The framework automatically analyzes service states based on the topological relationships, so we can analyze performance. Developers get these capabilities naturally without having to build them themselves.
In addition, we have a platform for managing service relationships and degradation, which supports disaster recovery and overload protection.
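The framework's actual instrumentation is not described. As a hedged illustration of "services expose their internal state by default", the sketch below wraps handlers with a decorator that counts calls, errors, and latency, and renders a JSON snapshot of the kind a status port could serve. `ServiceStats`, `instrument`, and `snapshot` are hypothetical names.

```python
import json
import time
from collections import defaultdict
from functools import wraps


class ServiceStats:
    """Hypothetical per-process counters a framework could attach to every handler."""

    def __init__(self) -> None:
        self.calls: dict[str, int] = defaultdict(int)
        self.errors: dict[str, int] = defaultdict(int)
        self.total_seconds: dict[str, float] = defaultdict(float)

    def instrument(self, method_name: str):
        def decorator(fn):
            @wraps(fn)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                self.calls[method_name] += 1
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    self.errors[method_name] += 1
                    raise
                finally:
                    self.total_seconds[method_name] += time.perf_counter() - start
            return wrapper
        return decorator

    def snapshot(self) -> str:
        """JSON view of internal state, e.g. for a status endpoint."""
        return json.dumps({
            name: {
                "calls": self.calls[name],
                "errors": self.errors[name],
                "avg_ms": 1000 * self.total_seconds[name] / self.calls[name],
            }
            for name in self.calls
        }, sort_keys=True)
```

Putting this in the framework rather than in each service is what lets developers get observability without writing it themselves.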
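The degradation platform is not described in detail. A minimal sketch, assuming operators can flip a per-dependency switch that makes callers return a cheaper fallback, might look like this. `DegradationSwitches`, `degrade`, `restore`, and `call` are hypothetical names, and the "personalized" versus "hot" results are sample data only.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class DegradationSwitches:
    """Hypothetical switch board: degraded dependencies are skipped in favor of a fallback."""

    def __init__(self) -> None:
        self._degraded: set[str] = set()

    def degrade(self, dependency: str) -> None:
        self._degraded.add(dependency)

    def restore(self, dependency: str) -> None:
        self._degraded.discard(dependency)

    def call(self, dependency: str, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        # When the switch is on, never touch the dependency; serve the fallback instead.
        if dependency in self._degraded:
            return fallback()
        return fn()
```

A manual switch like this complements automatic overload protection: it lets operators shed an expensive dependency during a large campaign before it fails.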
Below is a schematic diagram of the approximate module. The modular approach used here reduces as compared to it into the framework.
As with Docker, we also build on containers. However, simply running services in containers is not enough: we also open up the servitization system to them, which acts as a business enabler.
Finally, how are we planning the virtualized PaaS platform?
We implement it through three layers and manage it using the PaaS platform. We provide a universal SaaS service while offering a common app execution engine. The bottom layer is the IaaS layer.
IaaS manages all the machines and integrates the public clouds. Toutiao has many hot topics to publish and push across the country, causing high bandwidth requirements. We, therefore, use public clouds for assistance and uniformly abstract the required computing resources. The infrastructure combines the concepts of servitization, such as logging and monitoring, so the business teams can enjoy the capabilities provided by the infrastructure without worrying about the details.
Q: You just talked about splitting standalone services into microservices. Won't that increase costs? How do you approach this?
Xia Xuhong: Previously, I would build a database and run it directly, and every upgrade meant upgrading the entire database; now I can upgrade just a small part of it. When the business is relatively simple and small, standalone services cost less, but as the company grows and the number of machines increases, standalone services become a bottleneck. If I standardize the microservices, I can manage them with automated tools and platforms without manual intervention, so costs go down.
Q: Have you considered running the services in the containers and registering their corresponding container and IP information in Consul to update the authorized ACL?
Xia Xuhong: That is indeed an option. We use Consul for decentralization, but it would also add an extra layer. If you need to control access to and security of the microservices, you need to organize the container nodes hierarchically. For example, I would assign them to small clusters that are physically isolated, which ensures security. Consul alone is not enough.
Q: What is the RPC service discovery? Is RPC self-implemented?
Xia Xuhong: Service discovery uses Consul, the RPC is our own implementation built on Thrift, and a circuit-breaker (fuse) mechanism is also implemented for service calls.
Q: Why didn't you use open-source tools when selecting your technology? We are preparing to transform our entire platform architecture into microservices and need to choose the service architecture. You built yours yourselves, but we have to choose between open-source tools and building our own. Can you give an example?
Xia Xuhong: It depends on the scenario; you can't evaluate every open-source solution up front. We also have some unique situations. When integrating open-source tools with internal services, we have to consider the costs of integration and maintenance. Open-source projects often include many features because they aim to be universal, which makes their code more complex. Some of those features are useless to us, so we have to adapt them to reduce the overall complexity.
We also base our authorization standard on the service identifiers in our servers, and we did not consider interconnection scenarios.
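The talk does not describe how the fuse mechanism works internally. The sketch below shows the standard circuit-breaker pattern (closed, open, half-open) that such a mechanism usually follows; `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are assumptions, and the class is not thread-safe.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # allow a trial call through
        return "open"

    def call(self, fn, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open keeps callers from piling requests onto a service that is already struggling, which is the same goal as the QPS-based rejection discussed earlier.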
Q: If we are currently undertaking a microservice platform transformation, will there be significant changes in the business system development model? Our platforms will change regarding development model and design. What experience have you gained since starting the transformation?
Xia Xuhong: We have not completed the transformation yet, as it is challenging. First, you have to set a general direction, communicate it, and reach agreement on how to make adjustments that reduce volatility. Alternatively, you can first get the most out of the existing functions so that only a small migration is required, which reduces the cost of migration.
More Posts by Alibaba Clouder