Common principles for high-performance architecture - Well-Architected Framework

Performance is a critical metric for any system, as failure to meet user expectations can result in significant user attrition. Often, performance issues are closely tied to the initial architectural design of the system. Consequently, any architectural design must take into consideration potential performance concerns.

Given the pervasive nature of performance issues, there are numerous approaches for performance optimization. From user browsers to databases, all application-related components that impact user requests can undergo performance optimization. As cloud computing becomes more popular, more and more users are running their core business systems in the cloud. Therefore, the architectural design and selection of cloud solutions play a crucial role in performance.

Based on the characteristics of cloud environments, the following considerations and fundamental principles pertain to performance design in cloud architecture:

Effective Cloud Resource Selection

A software system offers many opportunities for performance optimization and architectural design at the application layer. However, the foundation of a system's performance is the performance of its underlying computing and storage resources, which are the system's atomic capabilities. Choosing computing and storage resources with better performance makes tuning and optimization at the application layer easier.

As large-scale distributed systems continue to develop, computing scenarios are becoming more diverse and the demand for computing power keeps growing. For example, in many general-purpose computing scenarios, such as short bursts of high-volume computational requests, the scale of the computing cluster matters more than the capability of any individual node. By contrast, large-scale AI model training requires bare-metal machines with specialized computing cards such as the A100 GPU.

Additionally, cloud resources are typically deployed per availability zone. Once a significant amount of resources has been deployed in a particular availability zone, migration and reconfiguration become costly, so selecting the appropriate availability zone up front is also crucial. When choosing an availability zone, consider factors such as latency, inventory, and the resource types offered.
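The availability zone criteria above (latency, inventory, resource types) can be sketched as a simple filter-then-rank step. This is a minimal illustration, not a provider API: the `Zone` class, the `pick_zone` function, and all zone names and numbers are hypothetical, and real values would come from your own measurements and the provider's console or APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """Hypothetical description of one availability zone."""
    name: str
    latency_ms: float            # measured latency from the main user base
    available_instances: int     # remaining inventory of the needed type
    instance_types: frozenset    # resource types the zone offers


def pick_zone(zones, required_type, min_instances):
    """Return the lowest-latency zone that offers the required type
    with enough inventory, or None if no zone qualifies."""
    candidates = [
        z for z in zones
        if required_type in z.instance_types
        and z.available_instances >= min_instances
    ]
    return min(candidates, key=lambda z: z.latency_ms, default=None)


# Illustrative data only.
zones = [
    Zone("zone-a", 12.0, 5, frozenset({"general", "gpu"})),
    Zone("zone-b", 8.0, 0, frozenset({"general", "gpu"})),   # fastest, but sold out
    Zone("zone-c", 20.0, 50, frozenset({"general"})),        # no GPU instances
]

best = pick_zone(zones, required_type="gpu", min_instances=4)
print(best.name if best else "no suitable zone")
```

Filtering on resource type and inventory before ranking by latency reflects the point that the fastest zone is not useful if it cannot supply the resources the workload needs.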

As computing scenarios grow more diverse, the initial selection of cloud resources can significantly affect final system performance. In practice, resource selection is therefore one of the most important steps in cloud architecture design. The guiding principle is to evaluate candidate cloud services against the specific requirements of the system and choose those that fit best.

Scalable and Expandable Cloud Architecture

In large-scale systems, it is not feasible to handle highly concurrent user requests and store massive amounts of data with a fixed number of servers. Such an approach is neither cost-effective nor able to keep up with changing business requirements. A cluster that pools computing and storage resources can adjust those resources dynamically to relieve the computational and storage pressure caused by high concurrency. This allows the system to provide stable and efficient services during peak periods, and to release unneeded resources or run at a reduced scale during periods of low demand to optimize IT expenditure.
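As a rough illustration of dynamic adjustment, the sketch below computes a target cluster size from average utilization using a simple proportional rule. The function name, the utilization target, and the node limits are assumptions made for this example; they do not describe any particular cloud auto scaling service.

```python
def desired_nodes(current_nodes, avg_utilization_pct, target_utilization_pct,
                  min_nodes=2, max_nodes=20):
    """Proportional scaling rule (illustrative assumption):
    desired = ceil(current * observed / target), clamped to [min, max].
    Utilization values are integer percentages to avoid float rounding."""
    if current_nodes <= 0 or target_utilization_pct <= 0:
        raise ValueError("current_nodes and target_utilization_pct must be positive")
    # Integer ceiling division: -(-a // b) == ceil(a / b) for positive b.
    raw = -(-current_nodes * avg_utilization_pct // target_utilization_pct)
    return max(min_nodes, min(max_nodes, raw))


# Peak period: 4 nodes running at 90% against a 60% target, so scale out.
peak = desired_nodes(4, 90, 60)
# Quiet period: 6 nodes running at 15%, so scale in to the configured minimum.
quiet = desired_nodes(6, 15, 60)
print(peak, quiet)
```

The minimum and maximum bounds keep the cluster from shrinking below a safe baseline or growing without a cost ceiling.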

A well-designed system is made up of different types of functional nodes, such as application server clusters, cache server clusters, and database clusters. In stateless scenarios, where data is stored on separate nodes, scaling the application server cluster is relatively straightforward. For cache server clusters, however, newly added nodes may require cache refreshing or preheating before the data they serve is accessible. Scaling a database cluster in real time is more challenging and requires proactive measures such as data backup, data synchronization, and auxiliary routing techniques to preserve the overall availability and performance of the database cluster.
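Cache preheating can be sketched as loading a known set of hot keys into a new node before it receives traffic. The example below is a minimal in-memory illustration: the `preheat` function, the dictionary-based cache, and the sample keys are hypothetical stand-ins for a real cache client and backing store.

```python
def preheat(cache, hot_keys, load_from_store):
    """Populate a new cache node with hot keys before it serves traffic.
    Returns the number of entries that were loaded."""
    loaded = 0
    for key in hot_keys:
        if key not in cache:
            cache[key] = load_from_store(key)
            loaded += 1
    return loaded


# Illustrative backing store and an empty cache for a newly added node.
store = {"user:1": "alice", "user:2": "bob", "user:3": "carol"}
new_node_cache = {}

count = preheat(new_node_cache, ["user:1", "user:2"], store.__getitem__)
print(count, sorted(new_node_cache))
```

Only after preheating completes would the node be added to the routing pool, so its first requests do not all fall through to the database.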

Selected Best Practices for Cloud Architecture Design

As cloud computing has matured, a large number of core business systems now run on public clouds. These systems are designed to take advantage of the inherent strengths of cloud computing. In turn, public cloud providers absorb the requirements of these business systems, continuously optimize their product designs, and launch cloud products that better meet business needs. As a result, best practices have emerged for certain specific scenarios. Applying them early in cloud architecture design improves the quality of the design and the overall performance of the system. For more information, refer to the best practice content in the documentation of each cloud product.

Focus on Architecture Design Precautions

Performance does not need to be pursued blindly to the extreme. When designing a high-performance architecture, it is important to understand the challenges and precautions involved in performance design, which helps avoid wasted resources and unnecessary development investment. For more details, refer to challenges and precautions.