Assistant Engineer
Evolving an architecture that supports ten million users on Alibaba Cloud

Posted time: Mar 29, 2017 14:50
A good architecture is the product of evolution, not of design alone. It is difficult to account for every requirement, such as high performance, high scalability, and strong security, in all aspects at the start of architecture design. As business demands and access pressure keep growing, the architecture evolves step by step into a mature, stable, large-scale system. The architectures of large websites such as Taobao and Facebook all grew from small beginnings into today's large-scale systems through constant evolution.
With the arrival of cloud computing, the IT era is transforming into the DT era. How can we build an architecture for tens of millions of users? Based on Alibaba Cloud best practices, this article walks through how a small website gradually evolves into an architecture serving tens of millions.

The preliminary stage: a universal stand-alone server
At the preliminary stage of architecture building, a single ECS server can solve all problems. One ECS instance is enough for a traditional official website or BBS: the web server, database, static files, and other resources can all be deployed on the same instance. With tuning of kernel parameters, web application performance parameters, and the database, stable operation can generally be ensured for 50,000 to 300,000 PV.
Architecture with one ECS server:

The fundamental stage: physical separation of web and database
When the access pressure reaches 500,000 to 1,000,000 PV, the web application and the database service deployed on one server begin to compete for system resources such as CPU, memory, disk, and bandwidth. The stand-alone server has clearly become a bottleneck. We can then deploy the web application and the database on physically separate servers to resolve the performance problems. The architecture here adopts ECS + RDS:

The stage of separated static and dynamic files: static cache + file storage
When the access pressure reaches 1 million to 3 million PV, the front-end web server becomes the performance bottleneck: a large number of web requests are blocked while the server's CPU, disk I/O, and bandwidth all come under pressure at the same time. At this point, on the one hand, we can store images, JS, CSS, HTML, and other application files on OSS; on the other hand, we can distribute and cache static resources on CDN edge nodes to realize "nearby access". With dynamic and static requests served separately ("dynamic/static separation"), the disk I/O and bandwidth pressure on the servers is relieved effectively.
The architecture adopts CDN + ECS + OSS + RDS:
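As a rough sketch of dynamic/static separation at the application level, static asset URLs can be rewritten to a CDN domain while dynamic paths stay on the origin servers. The domain names and suffix list below are hypothetical placeholders, not part of any real deployment:

```python
# Sketch of "dynamic/static separation": static assets are served from a
# CDN domain backed by OSS, while dynamic requests go to the origin (ECS).

STATIC_SUFFIXES = (".js", ".css", ".html", ".png", ".jpg", ".gif")
CDN_DOMAIN = "https://static.example-cdn.com"   # hypothetical CDN domain
ORIGIN_DOMAIN = "https://www.example.com"       # hypothetical origin domain

def resolve_url(path: str) -> str:
    """Send static files to the CDN edge, dynamic requests to the origin."""
    if path.lower().endswith(STATIC_SUFFIXES):
        return CDN_DOMAIN + path
    return ORIGIN_DOMAIN + path
```

In practice the rewrite usually happens in templates or at the web server, but the routing decision is the same: file suffix (or path prefix) determines whether the CDN or the origin serves the request.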

The stage of distributed architecture: load balancing
When the access pressure reaches 3 million to 5 million PV, even though the static-request pressure has been offloaded through dynamic/static separation, the dynamic requests become too much for a single server. The most obvious symptoms are blocked front-end access, high latency, a growing number of server processes, 100% CPU usage, and frequent 502/503/504 errors. A single web server clearly cannot meet the demand any more; multiple web servers must be added behind load balancing (the ECS instances can be placed in different availability zones to further improve availability). The age of the stand-alone server is over, and the architecture enters its distributed stage.
The architecture adopts CDN + Server Load Balancer + ECS + OSS + RDS:

The stage of data caching: database cache
When the access pressure reaches 5 million to 10 million PV, although multiple web servers working behind the Server Load Balancer absorb the dynamic-request load, the database becomes the bottleneck: RDS connections pile up and block, CPU usage hits 100%, and IOPS surges. At this point, a database cache can effectively reduce the access pressure on the database and further improve performance.
The architecture adopts CDN + Server Load Balancer + ECS + OSS + ApsaraDB for Memcache + RDS:
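The caching described above is the classic cache-aside read path: check the cache first, fall back to the database on a miss, then populate the cache. A minimal sketch, with plain dicts standing in for the Memcache and RDS clients (the data and function name are illustrative):

```python
# Cache-aside sketch: reads hit the cache first and only fall through to
# the database on a miss, which is what relieves the pressure on RDS.

cache = {}                                              # stands in for Memcache
database = {1: {"name": "Alice"}, 2: {"name": "Bob"}}   # stands in for RDS

def get_user(user_id):
    user = cache.get(user_id)
    if user is not None:        # cache hit: the database is not touched
        return user
    user = database[user_id]    # cache miss: query the database
    cache[user_id] = user       # populate the cache for later reads
    return user
```

Writes would also need to update or invalidate the cached entry, otherwise readers can see stale data; that invalidation policy is the main design decision in this pattern.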

The stage of extended architecture: vertical expansion
When the traffic reaches 10 million to 50 million PV, the distributed file system OSS has solved file storage and CDN has solved access to static resources, but as the pressure rises further, bottlenecks reappear on the web servers and the database. We therefore split the load on the web servers and the database through vertical expansion to solve these performance problems.
"What is vertical expansion? Deploying different businesses (or databases) to different servers (or database instances) is called vertical expansion."

The first step of vertical expansion: separation of business
At the business layer, different functional modules can be split out and deployed independently on different servers. For example, the user module, order module, and commodity module can each be deployed on their own servers.

The second step of vertical expansion: read-write splitting
At the database layer, if the database is still under heavy pressure even with caching, read-write splitting can further divide and reduce the load on the database.
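Read-write splitting can be sketched as a small router that sends writes to the master and spreads reads over replicas. The endpoint names below are hypothetical placeholders for RDS instances:

```python
# Read-write splitting sketch: SELECTs round-robin over read replicas,
# everything else (INSERT/UPDATE/DELETE/DDL) goes to the master.
import itertools

MASTER = "rds-master.example.internal"                    # hypothetical endpoint
REPLICAS = ["rds-replica-1.example.internal",
            "rds-replica-2.example.internal"]             # hypothetical endpoints
_replica_cycle = itertools.cycle(REPLICAS)

def route(sql: str) -> str:
    """Pick a database endpoint based on the statement type."""
    if sql.lstrip().upper().startswith("SELECT"):
        return next(_replica_cycle)   # reads: spread over replicas
    return MASTER                     # writes: always the master
```

Real middleware also has to handle replication lag (a read immediately after a write may need to go to the master), which a statement-only router like this one ignores.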

The third step of vertical expansion: database partitioning
Building on business separation and read-write splitting, at the database layer we can store the tables of the user, order, and commodity modules in separate databases (a user database, an order database, and a commodity database), and then deploy those databases on different servers.
The architecture adopts CDN + Server Load Balancer + ECS + OSS + ApsaraDB for Memcache + RDS with read-write splitting:
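Vertical database partitioning amounts to a static mapping from business module to database. A minimal sketch, with hypothetical connection strings and an assumed table-naming convention (module name as table prefix):

```python
# Vertical partitioning sketch: each business module's tables live in
# their own database, which can then sit on its own server.

MODULE_DATABASES = {
    "user":      "mysql://user-db.example.internal/user_db",
    "order":     "mysql://order-db.example.internal/order_db",
    "commodity": "mysql://commodity-db.example.internal/commodity_db",
}

def database_for(table: str) -> str:
    """Route a table to its module's database, e.g. 'order_items' -> order_db."""
    module = table.split("_", 1)[0]   # assumes module-prefixed table names
    return MODULE_DATABASES[module]
```

The price of this split is that cross-module joins are no longer possible in SQL; they have to be done at the application layer.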

The stage of distributed architecture + big data: horizontal expansion
When the traffic exceeds 50 million PV, tens of millions of accesses in the real sense, the vertically expanded architecture runs out of headroom. For example, read-write splitting only relieves the "read" pressure; the "write" pressure on the database can exceed what it can bear, and a bottleneck emerges. Besides, although the pressure is divided across different databases through database partitioning, a single table can still grow to the TB level, which clearly reaches the limit of a traditional relational database.

The first step of horizontal expansion: increased web servers
After the businesses have been vertically split and deployed on separate servers, when the pressure keeps rising, more web servers can be added for horizontal expansion.

The second step of horizontal expansion: increased Server Load Balancer
A single Server Load Balancer is still a single point of failure, and it also has performance limits, for example a maximum of 50,000 QPS. Through DNS polling, requests are forwarded in turn to Server Load Balancers in different availability zones, realizing horizontal expansion of the load-balancing layer.
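DNS polling can be pictured as round-robin resolution over the Server Load Balancer addresses. A toy sketch (the addresses are hypothetical, and in reality the rotation is done by the authoritative DNS server, not application code):

```python
# DNS round-robin sketch: each resolution returns the next Server Load
# Balancer address, spreading clients across availability zones.
import itertools

SLB_ADDRESSES = ["47.0.0.10", "47.0.0.20"]   # hypothetical SLBs in two zones
_rotation = itertools.cycle(SLB_ADDRESSES)

def resolve(hostname: str) -> str:
    """Return the next SLB address in rotation for the site's hostname."""
    return next(_rotation)
```

One caveat of plain DNS polling is that resolver caching delays failover: a dead SLB address keeps being handed out until the record's TTL expires.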

The third step of horizontal expansion: using distributed cache
Although ApsaraDB for Memcache is itself distributed internally, a single entry point is still a single point of failure, and performance limits also exist, such as a maximum throughput of 512 Mbps. Therefore, we deploy several ApsaraDB for Memcache instances and, at the code level, distribute cached data across the instances using hash algorithms.
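Distributing keys across several cache instances at the code level can be sketched with a stable hash: the same key always maps to the same instance, so lookups find the data they stored. The instance endpoints below are hypothetical:

```python
# Hash-based key distribution sketch: each key is mapped to one of several
# cache instances, so no single instance carries the whole load.
import zlib

CACHE_INSTANCES = ["memcache-1.example.internal",
                   "memcache-2.example.internal",
                   "memcache-3.example.internal"]   # hypothetical endpoints

def instance_for(key: str) -> str:
    """Pick a cache instance with a stable hash (crc32 modulo N)."""
    index = zlib.crc32(key.encode("utf-8")) % len(CACHE_INSTANCES)
    return CACHE_INSTANCES[index]
```

A simple modulo scheme remaps most keys when an instance is added or removed; production systems usually prefer consistent hashing to limit that churn.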

The fourth step of horizontal expansion: sharding + NoSQL
The traditional relational database cannot meet the demands of high concurrency and big data. Therefore, the distributed databases DRDS (a distributed MySQL sharding solution) and OTS (a distributed database based on column stores) are used to solve the problem at the root.
The architecture adopts CDN + DNS polling + Server Load Balancer + ECS + OSS + ApsaraDB for Memcache + DRDS + OTS:
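Sharding in the spirit of DRDS routes rows of one logical table to physical shards by a shard key, so write load and table size are divided horizontally. A minimal sketch with a hypothetical modulo rule and endpoints (DRDS itself hides this routing behind ordinary SQL):

```python
# Horizontal sharding sketch: one logical "orders" table is spread over
# several physical MySQL shards, chosen by a shard key (here user_id).

SHARDS = ["mysql://shard-0.example.internal/orders",
          "mysql://shard-1.example.internal/orders",
          "mysql://shard-2.example.internal/orders",
          "mysql://shard-3.example.internal/orders"]   # hypothetical shards

def shard_for(user_id: int) -> str:
    """Route all of one user's orders to the same shard via user_id % N."""
    return SHARDS[user_id % len(SHARDS)]
```

Choosing the shard key is the critical decision: queries that filter on it hit one shard, while queries that do not must fan out to all shards and merge the results.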