Well-Architected Framework: Resource Usage Optimization

Last Updated: Nov 01, 2023

Resource usage optimization is a necessary and effective method of cost optimization. Poor utilization of cloud resources is often caused by a lack of cloud experience or by technical debt. For example, traditional enterprises with a low level of digitalization and limited cloud experience often lack effective cost insights and cost control methods when they adopt cloud-native architectures, while improper use of cloud-native technologies creates technical debt that drives costs up. In addition to the resources that run in a steady state, new machines are added every year, so as the business grows, the scale and cost of resources increase year by year. Faced with this growth, resource management needs to focus on overall efficiency: improving overall resource utilization and saving costs through efficient cluster and resource management.

Typical resource optimization scenarios are as follows:

Infrastructure Cloud-Native Transformation

Before cloud-native transformation, traditional IT architectures are migrated to the cloud or rebuilt as cloud-native architectures. This stage is a major change to the infrastructure and requires cost planning and governance at key points. Focus on the following:

  • Use stress testing tools together with ACK cost insights to evaluate the capacity of the business system.

  • Select proper instance types for the infrastructure.

  • When purchasing infrastructure, take advantage of available discounts to save costs. For more information, see Savings plans.

Application Cloud-Native Transformation

Applications that have gone through cloud-native transformation can leverage the elastic scaling and deployment mechanisms provided by Kubernetes and gain high availability and stability. When business traffic fluctuates, containers can be scaled in and out in units of the standard deployment unit that carries the business, so intelligent strategies can be applied to meet resource cost requirements.

Stable Operation of Cloud-Native Business

After the cloud-native transformation, as the business continues to operate, cost governance strategies need to be formulated based on how the business changes. Common scenarios include the following:

  • Periodic business: The business shows obvious periodic fluctuations, for example traffic peaks during office hours. In this scenario, it is recommended to use the cost insights feature to observe the pattern and apply suitable cost optimization methods such as elasticity. For more information, see Cost Analysis Function Description.

  • Frequent launch and retirement of new and old businesses: This is common in emerging business areas. In the early stage of a business, capacity planning for the application's resource cost is challenging. It is recommended to use the cost insights feature to observe the application's resource cost and the resource profiling feature to plan the application's configuration. For more information, see Cost Analysis Function Description and Resource Profiling.

Through measures such as resource status assessment and resource utilization optimization, resource usage can be optimized and costs can be further reduced.

Resource Status Assessment

As cloud computing increasingly becomes the default infrastructure, enterprises usually face resource management challenges at the application architecture level when using cloud services. Tools that periodically review and visualize the overall cloud resource estate can efficiently guide optimization and help avoid the waste that accumulates during continuous use, for example: orphaned resources; idle or low-utilization resources; resources that serve external traffic without a bound public IP address or gateway; instances without mounted disks; and databases that are not deployed across multiple availability zones.

Cloud users can use Alibaba Cloud's free Cloud Architect Design Tool (CADT) for visual insight into the current state of resources. CADT provides self-service cloud resource management capabilities that simplify architecture management and reduce its difficulty and time cost. It offers a large number of prebuilt application architecture templates, resource status exploration, and automatically generated visual architecture diagrams; supports drag-and-drop architecture definition; can automatically configure and manage cloud services; and can manage the full lifecycle of cloud resources from deployment through operation to deletion.

Selecting Suitable Cloud Products and Specifications

Enterprises should select computing instances and storage types that are suitable for application and resource requirements, and use the latest generation of instances and technologies for new computing power demand scenarios.

Compute Instance Selection

How do you choose the appropriate instance type and size for the application and its resource requirements? At the micro level, users who run self-built or open-source deployments on Elastic Compute Service (ECS) need to select different instance types for different business types: for example, memory-optimized (large-memory) instances are recommended for caching and big data tasks, while heterogeneous computing instances should be selected for AI and deep learning scenarios. For different application types, Alibaba Cloud provides scenario-based ECS instance selection best practices and recommends different instance families based on finer-grained characteristics such as CPU, memory, and IOPS, to help users get the best cost-effectiveness from compute services.

Focus on the Latest Generation of Instances

The latest generation of instances can deliver the same performance as the current environment with fewer instances or lower specifications, which yields further cost optimization. Alibaba Cloud continuously upgrades the underlying infrastructure for new computing power scenarios, so pay attention to intergenerational differences when choosing the latest instance specifications. At the same specifications, Alibaba Cloud's latest generation of ECS instances usually offers higher CPU frequency, higher network throughput, and other performance advantages, along with an improved user experience. The cost benefits of these technology improvements are passed on to users, providing better cost-effectiveness.

Storage Selection

To match storage to the application and the cloud service, you need to answer questions about business and technical indicators so that you clearly understand the application environment the storage must support.

Block Storage is the block device product that Alibaba Cloud provides for Elastic Compute Service (ECS). It offers high performance and low latency, supports random read and write operations, and meets the data storage requirements of most general business scenarios. You can use Block Storage like a physical disk by formatting it and creating a file system on it. Alibaba Cloud NAS is a distributed file system that provides shared access, elastic scaling, high reliability, and high performance. It can be shared by thousands of Elastic Compute Service (ECS), Container Service for Kubernetes (ACK), and other compute nodes, and it allows you to migrate business systems to the cloud without modifying the application.

Other business indicators include: the scale of users, whether millions or tens of millions; overall storage capacity, data compression ratio, estimated data size, and average daily increment; read/write preference, that is, whether the data is read more or written more; whether the workload is strongly transactional or analytical; the data service engines the storage must support, such as relational, non-relational, key-value, row or column, graph, or text storage; and performance requirements, such as business concurrency and estimated peaks and valleys. Different indicators affect the selection of storage types and specifications to different degrees and need to be matched to each scenario.

Database Selection

Different business scenarios call for different database products and specifications. For example, ApsaraDB for RDS serves typical online transaction processing (OLTP) workloads and is suitable for transaction systems: rows are inserted, deleted, and updated in a disk-based database management system, with high real-time performance and strong stability that keep the data up to date. Similarly, NoSQL databases address the performance and cost problems that arise when business data volumes grow into the hundreds of billions of records. ApsaraDB for MongoDB uses a schema-free approach that suits the data storage needs of start-up businesses: you can keep structured data with fixed schemas in a relational database and data with flexible schemas in MongoDB. Hot data can be stored in ApsaraDB for Redis or ApsaraDB for Memcache to access business data efficiently and reduce the investment in data storage. Database selection needs to balance the three system properties of consistency, availability, and partition tolerance, and also consider product-level characteristics such as performance, elastic scalability, ease of maintenance, data security, and disaster recovery and backup, so that you choose the database product that best matches the business characteristics and technical indicators and achieves excellent cost-effectiveness.

Design a Reasonable Resource Architecture

Close Unused Resources

Virtual machines that are not needed outside working hours, or that are used only for temporary testing, can be shut down automatically during non-working hours or deleted once testing is complete. When identifying idle resources, refer to the performance data collected by Alibaba Cloud CloudMonitor over the past 30 days, using CPU utilization, disk I/O, and network utilization as reference indicators. When peak CPU utilization is below 1%, disk I/O is below 10, and network utilization is below 1%, the server can be judged idle. Because memory stays allocated regardless of activity, it is not included in these indicators.
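
The rule above can be expressed as a short check. The following Python sketch is illustrative only: the metric record and the sample fleet are placeholders, and in practice the peak values would come from CloudMonitor's 30-day monitoring data.

```python
from dataclasses import dataclass

@dataclass
class InstanceMetrics:
    """Peak metrics observed over the past 30 days (illustrative fields)."""
    instance_id: str
    peak_cpu_util_percent: float   # peak CPU utilization, %
    peak_disk_iops: float          # peak disk I/O
    peak_net_util_percent: float   # peak network utilization, %

def is_idle(m: InstanceMetrics) -> bool:
    # Thresholds from the guideline: CPU < 1%, disk I/O < 10, network < 1%.
    # Memory is excluded because it stays allocated regardless of activity.
    return (m.peak_cpu_util_percent < 1.0
            and m.peak_disk_iops < 10
            and m.peak_net_util_percent < 1.0)

# Flag candidates for shutdown or release.
fleet = [
    InstanceMetrics("i-web-01", 35.0, 420, 12.0),
    InstanceMetrics("i-test-07", 0.4, 3, 0.2),
]
print([m.instance_id for m in fleet if is_idle(m)])  # ['i-test-07']
```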

Optimize Snapshot Usage Costs

Snapshots are frequently used in data backup and disaster recovery design and provide a low-cost, general-purpose way to protect data reliability. Users sometimes configure a single unified O&M plan for all ECS instances under an account. Because snapshots are billed by snapshot capacity, the more snapshots you keep, the more capacity they occupy and the higher the snapshot cost. The best practice is to set snapshot policies per scenario according to actual business needs and retain a reasonable number of snapshots: for example, back up core applications once a day and non-core applications once a week, regularly delete old snapshots, and avoid storing application data on system disks.
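
As an illustration of such a retention policy, the following Python sketch flags snapshots that have outlived the retention window of their tier. The tier names, retention periods, and snapshot records are assumptions made for the example, not an Alibaba Cloud API.

```python
from datetime import datetime, timedelta

# Illustrative policy: core applications keep daily snapshots for 30 days,
# non-core applications keep weekly snapshots for 8 weeks.
RETENTION = {
    "core": timedelta(days=30),
    "non-core": timedelta(weeks=8),
}

def snapshots_to_delete(snapshots, now):
    """Return IDs of snapshots older than the retention window of their tier.

    Each snapshot is a dict like
    {"id": "...", "tier": "core", "created": datetime(...)}.
    """
    return [s["id"] for s in snapshots if now - s["created"] > RETENTION[s["tier"]]]

snaps = [
    {"id": "s-0001", "tier": "core", "created": datetime(2023, 9, 1)},
    {"id": "s-0002", "tier": "non-core", "created": datetime(2023, 10, 20)},
]
print(snapshots_to_delete(snaps, now=datetime(2023, 11, 1)))  # ['s-0001']
```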

Optimize Storage Resource Usage

When selecting storage resources, pay attention to the different billing methods that different storage products support and the different ways cloud storage can be created. Regularly evaluate whether your storage resources are configured appropriately for the business, develop the habit of periodically cleaning up data disks, and delete content that is no longer used. In scenarios with large data volumes, storage capacity packages can reduce overall usage costs. At the same time, optimize how storage resources are used, for example by optimizing the storage structure and storage content when using Log Service (SLS).

1. Optimize Storage Structure: Suppose you continuously collect logs from an application with a daily write volume of 100 GB, store them for 30 days, and build a full-text index. At this data volume, the Log Service cost is high. If you are mainly interested in a certain type of pod logs, such as operation logs and error logs, and these account for about 20% of the volume that you want to keep for 30 days, the following scheme is recommended and can save nearly 25% of the cost:

  • Create the source Logstore, store data for 3 days, and do not build an index.

  • Create the target Logstore1 for operation logs and error logs, store them for 30 days, and build an index.

  • Create the target Logstore2 for general logs, store them for 7 days, and build an index.

2. Optimize Storage Content: Suppose you only care about certain fields in the original logs. You can use data processing to store the fields you care about for 30 days with an index and keep the other, redundant fields for only 3 days. This saves approximately 30% of the cost compared with the unprocessed scheme.

Reasonable Network Planning

From an architecture optimization perspective, use the internal network as much as possible for application-to-application communication. When connecting traffic across accounts or VPCs, plan the cross-region and cross-border network products carefully. Re-evaluate the planning and design of public network egress, and use services such as NAT Gateway to centrally manage inbound and outbound traffic and monitor network traffic usage in real time, to prevent sudden cost increases caused by large-scale data transfers due to human error or accidents.

Database Service Optimization Analysis

Make good use of the tools provided by database services. Analyze database instances across multiple dimensions and pay attention to instance utilization. Based on how the service actually runs, carefully evaluate instance status and adjust instances dynamically to reach a reasonable utilization level while ensuring stability, watching indicators such as peak CPU utilization, disk space, memory usage, connection count, QPS, and IOPS. Database Autonomy Service (DAS) is a cloud service that provides self-awareness, self-repair, self-optimization, self-O&M, and self-security for databases based on machine learning and expert experience. It helps users eliminate the complexity of database management and the service failures caused by manual operations, and effectively ensures the stability, security, and efficiency of database services. It can also automatically diagnose and optimize SQL statements and create indexes: when a slow SQL problem occurs on a database instance, timely diagnosis and optimization keep the database system running in its best state.

Introduce Elastic Mechanisms for Application Load

Introducing elastic services for the application's computing resources reduces resource waste during business troughs and lowers operational costs.

ECS Elastic Mechanism

Elastic scaling of compute is one of the core capabilities of the cloud: elastic compute resources are adjusted automatically according to the user's business needs and policies. When business demand increases, ECS instances can be added seamlessly; when demand decreases, instances are removed automatically to save costs. Elastic scaling (ESS) itself is a free Alibaba Cloud service, but the instances it creates are billed at the standard ECS pay-as-you-go price. With ESS, compute resources can be adjusted without a large investment of manpower, without preparing compute resources in advance, and without worrying about failing to release redundant resources in time; scaling tasks run at the appropriate moments to reduce the cost of owning resources.
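
The idea behind this kind of elasticity can be sketched as a simple target-tracking rule: scale the instance count in proportion to how far the observed load is from the target, within configured bounds. The Python sketch below is a generic illustration rather than the ESS API; the metric, target, and bounds are assumed values.

```python
import math

def desired_instances(current_instances: int, observed_cpu: float,
                      target_cpu: float, min_size: int, max_size: int) -> int:
    """Target-tracking style scaling decision (simplified).

    observed_cpu and target_cpu are average CPU utilization percentages.
    """
    if observed_cpu <= 0:
        return min_size
    desired = math.ceil(current_instances * observed_cpu / target_cpu)
    return max(min_size, min(max_size, desired))

# 4 instances running at 85% average CPU with a 50% target -> scale out to 7.
print(desired_instances(4, 85, 50, min_size=2, max_size=10))
```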

Containerization Transformation to Improve Resource Utilization

Container technology isolates the different processes running on a host from one another and from the host operating system, and gives each container its own file system and child processes. Containers do not carry the overhead of an extra management layer and share the underlying operating system with the host, so they deliver better performance and lower system load; under the same conditions, more application instances can run, making more efficient use of system resources. At the same time, containers have strong resource isolation and limiting capabilities and can allocate CPU, memory, and other resources to applications precisely, ensuring that applications do not interfere with each other. Smaller computing overhead means lower overall cost: containers can significantly reduce the number of virtual machines you run and manage, and by removing the need for a separate virtual machine per application, overall computing costs fall. This reduction in waste and duplicated operations and resources can lead to significant savings. Over the past few years, many Internet companies have containerized their applications. Containerization not only improves development and O&M efficiency, but also, with capabilities such as Alibaba Cloud's Elastic Container Instance (ECI), significantly increases the utilization of computing resources and saves substantial cost. For more container elasticity strategies, see the following content.

Optimize Resource Utilization

Improving resource utilization essentially means meeting computing power requirements with the minimum amount of resources, while comprehensively considering factors such as business layout, disaster recovery and stability, machine failure rates, and reserved buffer space. These factors are intertwined and jointly determine resource efficiency. In summary, the following need to be considered: clarify the statistical criteria for resource utilization; optimize business layout and cluster architecture deployment; drive resource operations based on allocation rate and utilization rate; unify resource pools and node management; improve resource data monitoring; unify resource scheduling; and apply fine-grained isolation and watermark control to individual instances. When the utilization of production-environment resources, which account for the largest share of cost, is significantly improved, the overall cloud services gain the most from comprehensive cost control while service quality is guaranteed.

Clarify Resource Utilization Criteria through Cloud Monitoring

Use CloudMonitor to track the metrics of the various cloud service resources, check the availability of cloud services, and set alerts on specific metrics. CloudMonitor gives a comprehensive view of resource usage on Alibaba Cloud and of the running status of the business, so that faulty resources can be replaced in time, high-load resources can be scaled up to ensure business continuity, and low-load resources can be scaled down to reduce waste.

Cloud-Native Elastic Scaling Achieves Unified Resource and Node Management

Elastic scaling is one of the most widely used features of cloud container services: it automatically adjusts elastic compute resources according to business needs and policies. It applies broadly to scenarios such as elastic online business, large-scale computing and training, GPU or shared-GPU deep learning training and inference, and periodic load variations.

Elastic scaling works in two dimensions. The first is elasticity at the scheduling layer, which changes the scheduling capacity consumed by workloads. The Horizontal Pod Autoscaler (HPA) is a typical scheduling-layer component: it adjusts the number of application replicas, and the adjusted replica count changes the application's capacity, achieving elasticity at the scheduling layer. The second is elasticity at the resource layer, which applies when the cluster's capacity plan cannot satisfy the cluster's scheduling demand; it supplements scheduling capacity by provisioning additional resources.

These two elastic components can be used separately or in combination; they are decoupled from each other through the capacity status of the scheduling layer.
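
As a concrete example of the scheduling-layer dimension, the sketch below creates a Kubernetes HorizontalPodAutoscaler with the official Kubernetes Python client. The Deployment name, namespace, and thresholds are placeholder assumptions; resource-layer elasticity (adding nodes) would be configured separately, for example through node auto scaling in ACK.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

# Scheduling-layer elasticity: keep average CPU around 60% by adjusting
# the replica count of a hypothetical "web" Deployment between 2 and 10.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="web-hpa", namespace="default"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="web"),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=60,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa)
```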

Cloud-Native Resource Scheduling Optimizes Resources Based on Application Loads

To scale and place instances precisely and in real time, application load characteristics must be the basis of resource scheduling. Use elastic scheduling policies to manage the compute resources applications need: adjust them as application workloads grow to keep performance stable, and reclaim resources when workloads shrink so that resources circulate between tenants and utilization improves. ACK fine-grained scheduling provides a more real-time, proactive, and intelligent approach that delivers a good user experience; elastic scheduling of computing resources covers metric collection, online decision-making, offline analysis, and decision optimization.

Storage Lifecycle Management

As applications and business systems run over long periods, enterprises accumulate large amounts of data. At the same time, as the business volume handled by business teams grows, the required data sources become more and more diverse.

Under normal circumstances, recently written data is accessed far more frequently than data written long ago; such data can be considered "hot". Over time, the initially hot data is accessed less often. When it is read only a few times a week or a few times a month, it can be classified as "warm". In the following 3 to 6 months, if the data is not accessed at all or is read only a few times a month, it can be defined as "cold". Finally, data that is used only once or twice a year can be considered "frozen" for the rest of its lifecycle. Growing volumes of cold and frozen data put increasing pressure on the storage capacity and cost of existing clusters. The cost of storing cold data should not be neglected, and the design of storage lifecycle management also needs to leave room to optimize performance for frequently accessed hot data.
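
One possible way to operationalize these temperature tiers is to classify data by access recency and frequency. The Python sketch below uses illustrative thresholds that loosely mirror the description above; real thresholds should be tuned to the business.

```python
from datetime import datetime, timedelta

def data_temperature(last_access: datetime, accesses_last_90d: int,
                     now: datetime) -> str:
    """Classify data as hot, warm, cold, or frozen (illustrative thresholds)."""
    age = now - last_access
    if age <= timedelta(days=7):
        return "hot"                      # accessed within the last week
    if age <= timedelta(days=90) and accesses_last_90d >= 3:
        return "warm"                     # still read a few times a month
    if age <= timedelta(days=365):
        return "cold"                     # untouched for months
    return "frozen"                       # barely used within a year

print(data_temperature(datetime(2023, 10, 30), 12, now=datetime(2023, 11, 1)))  # hot
```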

The long-term storage cost of cold data needs to be lower than that of hot data, while cold data still needs to be easy to read and analyze (for example, by the creation time distribution of files in a distributed file system) so that end-to-end lifecycle management can be achieved. By placing data on storage media with different cost profiles, data storage costs can be reduced. In the cloud, storage types can be distinguished by data access frequency, comprehensively covering data storage scenarios from hot to cold, so that data achieves the optimal storage cost across its lifecycle while meeting daily business needs.

Alibaba Cloud provides a variety of cloud-native storage services for hot and cold storage scenarios:

  • Object Storage Service (OSS) provides multiple storage classes, including Standard, Infrequent Access (IA), Archive, and Deep Archive, which fully cover data storage scenarios from hot to cold.

  • PolarDB-X provides cold data archiving. If some tables in a cluster have almost no update, insert, or modify operations and are read very infrequently, and you need to reduce costs, you can use the cold data archive feature provided by PolarDB for MySQL to move that data to low-cost OSS storage and reduce storage costs.

  • AnalyticDB for MySQL provides three types of optimization recommendations: hot and cold data optimization, index optimization, and distribution key optimization. Through intelligent analysis of statistics, it helps users reduce cluster usage costs and improve cluster efficiency.
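
For OSS, the tiering described above can be automated with lifecycle rules. The following sketch assumes the oss2 Python SDK; the credentials, endpoint, bucket, prefix, and cut-off days are placeholder values chosen for illustration.

```python
import oss2
from oss2.models import BucketLifecycle, LifecycleRule, StorageTransition

auth = oss2.Auth('<access_key_id>', '<access_key_secret>')
bucket = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', 'examplebucket')

# Move objects under logs/ to Infrequent Access after 30 days
# and to Archive after 180 days.
rule = LifecycleRule(
    'tier-down-logs', 'logs/',
    status=LifecycleRule.ENABLED,
    storage_transitions=[
        StorageTransition(days=30, storage_class=oss2.BUCKET_STORAGE_CLASS_IA),
        StorageTransition(days=180, storage_class=oss2.BUCKET_STORAGE_CLASS_ARCHIVE),
    ])

bucket.put_bucket_lifecycle(BucketLifecycle([rule]))
```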

Resource Governance in Cloud-Native

Cloud-native technologies provide shared, isolated, and elastic resource management capabilities, which can significantly improve resource utilization efficiency and reduce enterprise resource usage costs in a simple and efficient manner. However, the reality is that for most enterprises, the cost of using cloud-native containerized elastic compute resources has increased to a certain extent.

There may be two reasons for this phenomenon. First, the cost increase is caused by improper use of some cloud-native technologies, which leads to technical debt. Second, traditional enterprises are still mainly engaged in offline business, with low levels of digitalization and lack of experience. When faced with the introduction of cloud-native architectures, they often lack effective cost insights and cost control methods, making it difficult to analyze the reasons for cost increase. When introducing advanced technologies, the complexity of cost management is also introduced. How to manage resources well at the container layer is a practical problem that every enterprise faces.

Through the cost governance practices in cloud-native scenarios, cost optimization capabilities can be integrated into the container management platform, and aggregation analysis can be performed from both the physical and logical dimensions. The physical dimension includes clusters, nodes, and resource groups, and the logical dimension includes Pods, application loads, and namespaces. The costs of the physical and logical dimensions are connected, and a complete resource cost profile is established to achieve accurate and reasonable governance work.
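
A simplified sketch of connecting the two dimensions: each pod's share of its node's hourly cost is estimated from the pod's resource requests and then aggregated by namespace. The node prices and pod data below are placeholders; in practice they would come from the cluster API and billing or cost-insight data, and memory and other resources would be weighted in as well.

```python
from collections import defaultdict

# Physical dimension: node hourly prices and allocatable CPU (placeholder data).
nodes = {
    "node-a": {"hourly_price": 0.50, "cpu_capacity": 8.0},
    "node-b": {"hourly_price": 0.25, "cpu_capacity": 4.0},
}

# Logical dimension: pods with CPU requests (in cores), grouped by namespace.
pods = [
    {"name": "web-1", "namespace": "frontend", "node": "node-a", "cpu_request": 2.0},
    {"name": "api-1", "namespace": "backend",  "node": "node-a", "cpu_request": 1.0},
    {"name": "job-1", "namespace": "batch",    "node": "node-b", "cpu_request": 3.0},
]

def hourly_cost_by_namespace(nodes, pods):
    """Attribute each node's cost to its pods in proportion to CPU requests."""
    costs = defaultdict(float)
    for pod in pods:
        node = nodes[pod["node"]]
        costs[pod["namespace"]] += (pod["cpu_request"] / node["cpu_capacity"]) * node["hourly_price"]
    return dict(costs)

print(hourly_cost_by_namespace(nodes, pods))
# {'frontend': 0.125, 'backend': 0.0625, 'batch': 0.1875}
```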

Automation

Online resources are managed through automatic elasticity and automation tools, which keeps resource usage rational, reduces manpower and operational costs, and avoids operational errors.

  • Auto Scaling provides continuous maintenance of instance clusters across payment methods, availability zones, and instance specifications. It is suitable for scenarios where business loads experience peak and valley fluctuations.

  • Elastic Supply provides one-click deployment of instance clusters across payment methods, availability zones, and instance specifications. It is suitable for scenarios where stable computing power needs to be delivered quickly and cost reduction requirements are met through preemptible instances.

  • OOS defines a set of operations through a template, and efficiently executes O&M tasks. It is suitable for event-driven O&M, scheduled O&M, batch O&M, and cross-region O&M.

  • ROS deploys and maintains resource stacks that include multiple cloud resources and dependencies in one click. It is suitable for delivering overall systems, cloning environments, etc.