Best practices for cost optimization - Elastic Compute Service

This topic describes the cost components and benefits of Elastic Compute Service (ECS) and provides cost management solutions that maximize cost-effectiveness and accelerate business development.

Cost components

The total cost of traditional enterprise IT infrastructure, also known as total cost of ownership (TCO), includes the procurement price and the expenses related to deployment, operation, and maintenance. When you evaluate IT infrastructure, the actual metric that you evaluate is the TCO per unit of IT infrastructure. To calculate TCO, you must consider the actual business deployment environment variables, such as rack rental fees, electricity costs for racks, and server brands and prices, and whether measures such as dual devices and dual uplinks can be used to prevent single points of failure (SPOFs). You can calculate TCO by using the following formula: TCO = Server expenditure + Network expenditure + Data center expenditure + Other expenses (including the labor cost, public network cost, and additional taxes).
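
The TCO formula above can be sketched in a few lines of Python. The class name, field names, and sample amounts are hypothetical and serve only to illustrate the arithmetic; substitute figures from your own deployment environment.

```python
from dataclasses import dataclass


@dataclass
class TcoInputs:
    """Cost inputs for one deployment, all in the same currency (hypothetical names)."""
    server: float       # server expenditure
    network: float      # network expenditure
    data_center: float  # data center expenditure, such as rack rental and electricity
    other: float        # labor cost, public network cost, and additional taxes


def total_cost_of_ownership(costs: TcoInputs) -> float:
    """TCO = server + network + data center + other expenses."""
    return costs.server + costs.network + costs.data_center + costs.other


def tco_per_unit(costs: TcoInputs, units: int) -> float:
    """TCO per unit of IT infrastructure, for example per server."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost_of_ownership(costs) / units


# Illustrative amounts only, not real prices.
example = TcoInputs(server=600_000, network=150_000, data_center=200_000, other=50_000)
print(total_cost_of_ownership(example))
print(tco_per_unit(example, units=100))
```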

Among the four TCO components, server procurement and network construction costs are categorized as capital expenditure (CAPEX), which must be depreciated over a specific period of time after purchase. Data center expenditures, such as rent and electricity costs, and other expenses are categorized as operational expenditure (OPEX), which is continuously incurred based on the resource usage duration. From a business operation perspective, CAPEX involves a significant one-time investment with a high degree of business uncertainty and leaves little room for adjustment. Demand changes may result in unnecessary costs. In contrast, OPEX is more adaptable to business variations. Converting all CAPEX into OPEX is a better solution for dealing with demand uncertainty.
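
To make the demand-uncertainty argument concrete, the following sketch estimates how much CAPEX remains undepreciated if demand ends before the depreciation period does. It assumes straight-line depreciation, which the text does not specify, and the function name and amounts are hypothetical.

```python
def stranded_capex(capex: float, depreciation_months: int, months_used: int) -> float:
    """Undepreciated CAPEX left when demand ends after months_used.

    Assumes straight-line depreciation over depreciation_months.
    """
    if depreciation_months <= 0:
        raise ValueError("depreciation_months must be positive")
    # Clamp usage to the range [0, depreciation_months].
    used = min(max(months_used, 0), depreciation_months)
    return capex * (depreciation_months - used) / depreciation_months


# Illustrative: 360,000 of hardware depreciated over 36 months,
# but the workload is retired after 12 months.
remaining = stranded_capex(capex=360_000, depreciation_months=36, months_used=12)
print(remaining)
```

With OPEX-style resources, the equivalent cost simply stops when the resources are released.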

Alibaba Cloud ECS provides cloud computing resources. You can replace the traditional IT infrastructure of your enterprise with ECS to reduce CAPEX and increase the proportion of OPEX. This improves the cash flow and risk resistance capabilities of your enterprise. ECS costs consist of the following components:

Ownership cost: involves the costs of resources and resource plans, including:
- Instance type fees
- Disk capacity fees
- Image fees
- Public bandwidth fees
- Snapshot fees
O&M cost: involves the labor costs generated when you use ECS, which may include:
- Labor costs for system management and maintenance
- Labor costs for security monitoring and protection
- Labor costs for troubleshooting and repair
- Labor costs for software update and configuration

Cost benefits of cloud migration

To build a data center, consider the direct costs of hardware, networking, electricity, machine rooms, and O&M. You must also consider the scale costs from upgrades and capacity expansions and the risk costs associated with data backup and high-availability implementations. When you scale up your data center to meet growing business requirements, the cost per unit of resources and complexity of the data center increase and the fault tolerance decreases. If you select business models that do not meet your business requirements, additional costs are generated.

Compared with self-managed data centers, cloud resources eliminate the need to invest upfront in hardware, physical environments, and labor. The unit cost of cloud resources is relatively linear. You can create or release cloud resources based on your business requirements. Cloud resources support multiple billing methods to allow cost optimizations.

Cost optimization suggestions

Optimize resources

If you find high-cost resources, you can monitor the resources across different aspects to determine the reasons for the high costs and take targeted optimization measures.

Monitor resource usage.
1. Monitor the usage of resources, such as CPU, memory, disks, and bandwidth. Assess whether the current configuration is higher than the required configuration.
2. Monitor idle resources to prevent waste. Idle resources include instances that are upgraded but not restarted, reserved instances that are not matched to pay-as-you-go instances, disks that are not attached to instances, and elastic IP addresses (EIPs) that are not associated with instances.
3. Monitor resource usage cycles. If you require resources such as instances and disks for long-term use, we recommend that you purchase subscription resources or purchase resource plans to reduce costs.
4. Monitor the lifecycle of resources. Take note of the expiration dates of subscription resources, such as subscription instances, reserved instances, and storage capacity units. Renew resources at the earliest opportunity.
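
The idle-resource check in step 2 can be sketched as a filter over a resource inventory. The record layout below is hypothetical and is not an ECS API response; in practice, the inventory would come from your own export or monitoring tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    """One inventory record (hypothetical layout, not an ECS API type)."""
    resource_id: str
    kind: str                   # "disk" or "eip" in this sketch
    attached_to: Optional[str]  # instance ID, or None if not attached or associated


def find_idle(resources: List[Resource]) -> List[str]:
    """Return IDs of disks and EIPs that are not attached to any instance."""
    return [
        r.resource_id
        for r in resources
        if r.kind in {"disk", "eip"} and r.attached_to is None
    ]


inventory = [
    Resource("d-001", "disk", "i-aaa"),
    Resource("d-002", "disk", None),
    Resource("eip-001", "eip", None),
    Resource("eip-002", "eip", "i-bbb"),
]
print(find_idle(inventory))
```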
Select instance types based on your business scenarios.
Instance types have significant impacts on ECS costs. Select the most cost-effective instance type and adjust the number of instances based on your business scenarios. This way, you can maximize resource utilization and minimize costs while meeting your business requirements.
For example, you use 10 d1ne.14xlarge instances for short-form videos. Monitoring shows that memory usage is appropriate but CPU utilization is low. To resolve the issue, perform the following operation:
Switch to an instance type that provides fewer vCPUs per GiB of memory to increase CPU utilization without affecting your business. The CPU-to-memory ratio of d1ne.14xlarge instances is 1:4. The CPU-to-memory ratio of d2s instances is 1:4.4. Replace the 10 d1ne.14xlarge instances with 13 d2s.10xlarge instances to reduce costs by approximately 18%.
For information about how to select instance types, see Instance type selection.
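
The instance count in the d1ne-to-d2s example can be reproduced by sizing the replacement fleet on memory, the resource that is already used appropriately. The per-instance specifications below (56 vCPUs and 224 GiB for d1ne.14xlarge, 40 vCPUs and 176 GiB for d2s.10xlarge) are assumptions chosen to match the stated 1:4 and 1:4.4 ratios; confirm them against the instance family documentation before relying on them.

```python
import math


def instances_needed(total_memory_gib: float, memory_per_instance_gib: float) -> int:
    """Smallest number of instances whose combined memory covers the requirement."""
    if memory_per_instance_gib <= 0:
        raise ValueError("memory_per_instance_gib must be positive")
    return math.ceil(total_memory_gib / memory_per_instance_gib)


# Assumed specifications (vCPUs, memory in GiB); verify before use.
current_count, current_vcpus, current_mem = 10, 56, 224  # d1ne.14xlarge
target_vcpus, target_mem = 40, 176                       # d2s.10xlarge

required_mem = current_count * current_mem
target_count = instances_needed(required_mem, target_mem)

print(target_count)                   # number of d2s.10xlarge instances
print(current_count * current_vcpus)  # total vCPUs before the change
print(target_count * target_vcpus)    # total vCPUs after the change
```

Under these assumptions the memory requirement is met with fewer total vCPUs, which is why CPU utilization rises. The cost saving itself depends on current prices and is not computed here.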
Combine multiple billing methods.
Different types of business have different requirements for resource usage cycles. Select a billing method for each type of business and combine billing methods to optimize costs.
- Use subscription instances and reserved instances for stable business workloads.
- Use pay-as-you-go instances for stateful and dynamic business workloads.
- Use spot instances for stateless and fault-tolerant business workloads.
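
The mapping above can be expressed as a simple decision rule. This is a minimal sketch: the function and its three boolean inputs are hypothetical simplifications, and real workloads usually need a finer classification.

```python
def suggest_billing_method(stable: bool, stateless: bool, fault_tolerant: bool) -> str:
    """Map coarse workload traits to the billing methods listed above."""
    if stable:
        # Long-running, predictable load: commit in advance for a lower price.
        return "subscription or reserved instance"
    if stateless and fault_tolerant:
        # Can tolerate instances being reclaimed.
        return "spot"
    # Stateful or dynamic load that must not be interrupted.
    return "pay-as-you-go"


print(suggest_billing_method(stable=True, stateless=False, fault_tolerant=False))
print(suggest_billing_method(stable=False, stateless=True, fault_tolerant=True))
print(suggest_billing_method(stable=False, stateless=False, fault_tolerant=False))
```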
Use dedicated hosts to allow the reuse of ECS instance resources.
In scenarios in which the absolute stability of CPUs is not a strict requirement, such as development and test environments, you can use CPU-overprovisioned dedicated hosts to deploy additional similar-sized ECS instances to reduce the cost per unit of deployments.
Stopped ECS instances that are deployed on dedicated hosts do not consume resources. During off-peak hours, you can stop specific ECS instances in the production environment and use idle resources to run test tasks that have predictable cycles, such as offline computing and automated tests.

Upgrade instance types

ECS and hardware such as processors are continuously upgraded to improve performance and reduce costs. In most cases, later instance types are more cost-effective than earlier instance types.

The following table describes the differences between the g5.2xlarge and g6.2xlarge instance types in terms of performance and price.

Performance	Price
The integer computation performance is improved by 40%. The floating-point computation performance is improved by 30%. The memory bandwidth is increased by 15%. The memory idle latency is decreased by 40%. The internal bandwidth is increased by 220%.	The annual subscription price is reduced by 6%. The pay-as-you-go price is reduced by 43%.
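
One way to combine the two columns is to compute the change in performance per unit price. The sketch below uses the integer computation figure as a stand-in for overall performance, which is an assumption; a real comparison should use a benchmark that reflects your own workload.

```python
def price_performance_gain(perf_gain: float, price_cut: float) -> float:
    """Relative change in performance per unit price.

    perf_gain=0.40 means 40% higher performance;
    price_cut=0.06 means a 6% lower price.
    """
    if not 0 <= price_cut < 1:
        raise ValueError("price_cut must be in [0, 1)")
    return (1 + perf_gain) / (1 - price_cut) - 1


# Figures from the g5.2xlarge versus g6.2xlarge table above.
subscription_gain = price_performance_gain(perf_gain=0.40, price_cut=0.06)
pay_as_you_go_gain = price_performance_gain(perf_gain=0.40, price_cut=0.43)

print(f"{subscription_gain:.1%}")
print(f"{pay_as_you_go_gain:.1%}")
```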

To ensure that you have access to the next-generation instance types at the earliest opportunity, we recommend that you perform the following operations:

Design robust applications that can run on different instance types.
Stay updated on the new instance types that are released on the official Alibaba Cloud website and determine whether to upgrade instance types.

Examples of instance type upgrade

You can use one of the following upgrade schemes to improve business performance and reduce costs by at least 15% without changing CPU and memory specifications.

Current instance family	Recommended compatible instance family	Recommended alternative instance family
sn1 and sn2	c6, g6, r6	c5 and sn1ne; g5 and sn2ne; r5 and se1ne
c4	hfc6 and c6	hfc5 and c5
ce4	r6	r5 and se1ne
cm4	hfc6	hfc5 and g5
n1, n2, and e3	c6, g6, r6	c5 and sn1ne; g5 and sn2ne; r5 and se1ne
t1; s1, s2, and s3; m1 and m2; c1 and c2	c6, g6, r6	c5 and sn1ne; g5 and sn2ne; r5 and se1ne

Regular cost saving measures

You can use cloud resources based on your business requirements and save on the investment and cost of setting up and operating self-managed data centers. However, you must constantly optimize costs in your daily work to improve cost performance. You can refine the following common operations to create a practical scheme:

Hold regular cost meetings. Review budget implementation with cost-related parties, such as finance and R&D teams, evaluate optimization results, and improve optimization strategies on a regular basis.
Enforce the use of tags. Tag resources by business, environment, and owner to track daily costs.
Classify resources and select appropriate usage methods. For example, pay-as-you-go instances are recommended for deploying development and testing environments for short-term projects and can be released immediately after the projects are complete.
Avoid idle resources. Check resource usage on a regular basis and determine the notification and disposal workflows of idle resources.
Renew resources at the earliest opportunity. Apply for a budget for subscription resources in advance to avoid the additional cost of purchasing and deploying new resources after existing resources are released upon expiration.
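
As an illustration of how tags support cost tracking, the following sketch groups billing records by one tag key. The record layout is hypothetical and does not correspond to a specific billing export format; resources that lack the tag are grouped under "untagged" so that gaps in tagging stay visible.

```python
from collections import defaultdict
from typing import Dict, List


def cost_by_tag(records: List[dict], tag_key: str) -> Dict[str, float]:
    """Sum cost per value of tag_key; records without the tag go to 'untagged'."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        tag_value = record.get("tags", {}).get(tag_key, "untagged")
        totals[tag_value] += record["cost"]
    return dict(totals)


# Illustrative records only.
billing_records = [
    {"resource_id": "i-001", "cost": 12.5, "tags": {"env": "prod", "owner": "team-a"}},
    {"resource_id": "i-002", "cost": 7.5, "tags": {"env": "prod", "owner": "team-b"}},
    {"resource_id": "i-003", "cost": 3.0, "tags": {"env": "test", "owner": "team-a"}},
    {"resource_id": "d-004", "cost": 1.5, "tags": {}},
]

print(cost_by_tag(billing_records, "env"))
print(cost_by_tag(billing_records, "owner"))
```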

Automate O&M

Alibaba Cloud provides various O&M services to help you improve O&M efficiency and reduce O&M labor costs. Examples:

Auto Scaling: allows you to maintain instance clusters across different billing methods, instance types, and zones. This service is suitable for scenarios in which business workloads fluctuate.
Auto Provisioning: allows you to deploy instance clusters across different billing methods, instance types, and zones. This service is suitable for scenarios in which consistent compute capacity must be promptly provisioned and spot instances are used to reduce costs.
CloudOps Orchestration Service: allows you to define a series of O&M operations in a template to perform O&M tasks in an efficient manner. This service is suitable for scenarios in which event-driven, scheduled, batch, or cross-region O&M is required.
Resource Orchestration Service: allows you to deploy and maintain stacks that contain multiple cloud resources and dependencies among the resources. This service is suitable for scenarios in which delivery of an integrated system or environment clone is required.