By Xiweng
Nacos 2.0 improves performance roughly tenfold by upgrading the communication protocol, framework, and data model, and it solves the performance problems exposed after the release of Nacos 1.0. This article stress tests Nacos 1.0, upgrades it to Nacos 2.0 under load, and compares the performance of the two versions throughout the upgrade to show the improvement that Nacos 2.0 brings.
We purchased a three-node Nacos cluster with a 2-core CPU plus 4 GB of RAM from Alibaba Cloud Microservices Engine (MSE).
We increased the stress gradually to show how the system performs at different scales. The stress cluster was divided into three batches that were started one after another, and we observed the performance of the Nacos cluster after each batch. At the same time, we deployed a demo Dubbo service outside the stress cluster and used JMeter to call it continuously at 100 TPS, simulating the possible impact on business calls under each level of stress.
During stress testing, we upgraded the server and the client at the appropriate points. The server was upgraded with the one-click upgrade function provided by MSE, while the client was upgraded by restarting the stress cluster in batches.
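The ramp-up can be written down as a small schedule. The sketch below is illustrative only: the batch sizes (6,000, 4,000, and 4,000 Providers) are the ones reported later in this article, while the names `BATCH_SIZES` and `cumulative_load` are hypothetical and not part of any Nacos or MSE tooling.

```python
from itertools import accumulate

# Providers started by each batch of the stress cluster (figures from this article).
BATCH_SIZES = [6_000, 4_000, 4_000]


def cumulative_load(batch_sizes):
    """Return the total number of Providers registered after each batch starts."""
    return list(accumulate(batch_sizes))


if __name__ == "__main__":
    for batch, total in enumerate(cumulative_load(BATCH_SIZES), start=1):
        print(f"after batch {batch}: {total} Providers")
```

Each step of the schedule corresponds to one observation point in the test: the cluster is allowed to settle after a batch starts before the next batch is added.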
First, start the first batch of stress clusters to exert stress on MSE Nacos 1.2.1. Under the stress of 6,000 Providers, CPU utilization is about 25% once the cluster stabilizes, and the cluster holds 6,000 instances stably.
Next, start the second batch of stress clusters, adding 4,000 Providers for a total of 10,000 Providers. At this point, CPU utilization peaks at 60% and settles at about 45%, so the cluster can still run stably.
Under the stress from the first two batches, no stability problems occurred in the cluster, and the Dubbo calls remained normal without any errors.
After the third batch of stress clusters is started, the total stress on the cluster is 14,000 Providers. The cluster briefly registered 13,000 instances, and then the number of instances dropped quickly while the CPU was fully utilized. Narrowing the time range shows that the falling instance count also keeps jittering on a small scale.
Meanwhile, Dubbo calls began to fail. The Consumer log indicates that the Dubbo Provider was removed because of the stress on the server, so a No provider error occurs when the interface is called.
Since instances are double-written during the server upgrade, the number of instances stored on the server during the upgrade is twice the actual number. According to the results above, it is therefore necessary either to roll the load back to the first batch of 6,000 instances or to upgrade the configuration and scale out the machines before upgrading. This article takes the rollback approach: we stop the two batches of stress clusters that were started last and let the cluster return to normal before executing the upgrade.
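The reasoning behind the rollback is simple arithmetic, sketched below. The two thresholds are the data points observed in this test (stable at 10,000 Providers, failing at 14,000); the function names and the three risk labels are hypothetical and only illustrate the comparison, they are not an MSE or Nacos feature.

```python
# Figures taken from the stress test described in this article.
LAST_STABLE_INSTANCES = 10_000    # second batch: cluster ran stably
FIRST_FAILING_INSTANCES = 14_000  # third batch: instances dropped, CPU saturated
DOUBLE_WRITE_FACTOR = 2           # each instance is stored twice during the upgrade


def effective_instances(registered, factor=DOUBLE_WRITE_FACTOR):
    """Instances the server has to hold while double-write is active."""
    return registered * factor


def upgrade_risk(registered):
    """Classify the upgrade load against the two observed data points."""
    load = effective_instances(registered)
    if load <= LAST_STABLE_INSTANCES:
        return "within observed stable range"
    if load < FIRST_FAILING_INSTANCES:
        return "near the limit"
    return "beyond observed failure point"


if __name__ == "__main__":
    for registered in (6_000, 10_000, 14_000):
        print(registered, effective_instances(registered), upgrade_risk(registered))
```

With 6,000 registered instances, the server holds the equivalent of 12,000 during the upgrade, which sits between the last stable load and the first failing load. Upgrading at 10,000 or 14,000 would push the effective load well past the point where this cluster already failed.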
As shown in the monitoring chart, after the two batches of stress clusters were stopped, the cluster quickly returned to normal and operated stably, and Dubbo calls also returned to normal. After that, you can upgrade with the upgrade function of MSE. The CPU jitter during the upgrade is caused by the performance loss of the double-write operation. In addition, because double-write doubles the number of instances, the server is effectively under the limit stress of 12,000 instances, so it still jitters somewhat and some Dubbo calls fail. This effect does not occur when the upgrade is performed below the stress limit.
Once the server upgrade completes and double-write stops, the associated performance loss disappears: CPU utilization decreases and stabilizes, the number of instances no longer jitters, and Dubbo calls recover fully. As with Server 1.X, we then put stress on the cluster in two batches to compare the performance of the two versions under the same stress.
Since the client is still version 1.X, the load on the server remains very high, with CPU utilization at almost 100% after all the stress clusters are started. Although there was no massive instance drop like in Server 1.X, a small amount of instance jitter still occurred after running for a while. This means that upgrading only the Nacos server to version 2.0 offers some improvement but does not solve the performance problem completely.
The client of the stress cluster also needs to be upgraded to version 2.0 or later to use the full performance capabilities of Nacos 2.0. We replaced the clients in three batches. During this process, it is normal for the number of instances on the server to fall and then recover as the Providers are restarted. As the stress clusters were upgraded, CPU utilization decreased significantly. Once the cluster stabilized, CPU utilization had dropped from nearly 100% to about 20%, and the cluster ran 14,000 instances stably.
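The observations reported so far can be collected into a small data set for comparison. The sketch below only restates figures from the walkthrough above; CPU values are the approximate steady-state percentages quoted in the text, and the `Observation` type and `max_stable_providers` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    server: str          # Nacos server version
    client: str          # Nacos client version
    providers: int       # Providers started by the stress cluster
    approx_cpu_pct: int  # approximate CPU utilization quoted in the text
    stable: bool         # True if the instance count held without jitter


OBSERVATIONS = [
    Observation("1.X", "1.X", 6_000, 25, True),
    Observation("1.X", "1.X", 10_000, 45, True),
    Observation("1.X", "1.X", 14_000, 100, False),  # instances dropped quickly
    Observation("2.0", "1.X", 14_000, 100, False),  # small instance jitter
    Observation("2.0", "2.0", 14_000, 20, True),
]


def max_stable_providers(server, client):
    """Largest load reported as stable for a version pair.

    Returns None when the text reports no stable data point for that pair.
    """
    stable = [
        o.providers
        for o in OBSERVATIONS
        if o.server == server and o.client == client and o.stable
    ]
    return max(stable, default=None)


if __name__ == "__main__":
    for pair in (("1.X", "1.X"), ("2.0", "1.X"), ("2.0", "2.0")):
        print(pair, max_stable_providers(*pair))
```

A `None` result for Server 2.0 with Client 1.X only means that the article quotes no fully stable figure for that combination, not that the combination cannot run stably at lower loads.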
As described above, we can obtain the performance differences of the three-node cluster with 2-core CPU plus 4 GB of RAM in different versions:
From the table, we can see how Nacos 2.0 has improved its performance significantly. It is recommended that new users adopt Nacos 2.0 directly. Older users are advised to upgrade the Server first and then upgrade the client gradually. From the perspective of stress testing, we can learn about the performance of different versions at different stages: