Container Service for Kubernetes:ACK cluster cost management

Last Updated: Apr 27, 2025

Cluster cost optimization aims to use cluster resources economically and efficiently and reduce unnecessary expenses. This topic provides best practices to help you optimize cluster costs while balancing the expenses required to keep workloads and clusters stable and reliable and to maintain clusters. After reading these best practices, you will learn how to configure clusters at low cost, how to use the scaling capabilities of workloads and nodes, and how to monitor cluster costs in real time.

About this topic

  • This topic is intended for administrators of Container Service for Kubernetes (ACK) clusters. This topic provides general suggestions on cluster cost reduction. The suggestions in this topic are not sorted in a particular order. You can choose and apply the suggestions based on your business requirements. We recommend that you learn about the following Kubernetes terms before you read this topic: pods and container resources, namespaces, auto scaling (workload scaling and node scaling), and scheduling.

  • To ensure the stability of applications deployed in an ACK cluster, we recommend that you also apply the suggestions in the Suggested configurations for creating HA clusters and Recommended workload configurations topics.

  • If your cluster is an ACK Pro cluster and contains more than 500 nodes or 10,000 pods, we recommend that you follow the guides in the Suggestions on using large-scale clusters topic.

  • This topic aims to help you build a FinOps system and team and set up FinOps strategic objectives. FinOps is short for "Finance + DevOps". It is a combination of culture and practices for enterprise cloud financial management. FinOps can help enterprises estimate the costs of cloud resources and make the use of cloud resources transparent. FinOps can efficiently control and optimize costs to facilitate business development and innovation. For more information, see Cost Suite.

Configure clusters with low costs

Before you deploy a cluster, assess the resource demand of the applications to be deployed in the cluster, and choose proper instance types and cluster billing methods to reduce the overall cost.

Choose proper ECS instance types

When you choose instance types for a node pool, consider performance, price, and workload factors to ensure stability and cost-effectiveness. In most cases, Elastic Compute Service (ECS) instance types with high specifications (CPU and memory) or instance types specialized for certain sectors (such as GPU-accelerated or other heterogeneous instances) are offered at higher prices.

Choose instance types based on business scenarios

Choose the most cost-effective instance types based on your business scenarios. For example, distributed caching scenarios usually run memory-heavy applications. Compared with other instance types, memory-optimized instance types with a vCPU-to-memory ratio of 1:8 can efficiently improve CPU utilization and reduce costs. In deep learning scenarios, deep learning tasks usually require large amounts of GPU resources and also occupy CPU resources to perform data preprocessing and I/O operations. Therefore, the suggested GPU-to-vCPU ratio ranges from 1:8 to 1:12. For more information about suggestions on choosing ECS instance types, see Suggestions on choosing ECS specifications for ACK clusters.

Note

To avoid resource bottlenecks and resource fragments, we recommend that you do not use instance types with low specifications in production environments, such as instance types with 2 vCPUs and 4 GB of memory or less. For more information, see ECS specification recommendations for ACK clusters.

Use shared instance families

We recommend that you use shared instance families if you are an individual developer or you want to build small or medium-sized website applications. Compared with enterprise-level instance families, shared instance families do not guarantee exclusive compute resources and may experience performance fluctuations, so they are offered at lower prices. Shared instance families are suitable for small and medium-sized websites, web applications, development environments, lightweight databases, and lightweight enterprise-class applications.

For more information about the introduction to shared instance families and how to choose instance types, see Shared instance families.

Choose a proper billing method

Different business types have different requirements on the lifecycle of resources. You need to select a proper billing method to optimize costs.

  • Subscription

    The subscription billing method allows you to continuously use resources at a discounted price. If your business has the following characteristics, we recommend that you use the subscription billing method:

    • Predictable resource lifecycle.

    • Stable workloads without fluctuations.

    • Long-term resource demand.

    For example, to continuously provide web or database services, we recommend that you use the subscription billing method.

  • Pay-as-you-go

    Pay-as-you-go is a flexible billing method, which allows you to pay for the actual resources that you use. You do not need to purchase large amounts of resources in advance. Pay-as-you-go resources are more cost-effective than self-managed data centers. If your business has the following characteristics, we recommend that you use the pay-as-you-go billing method:

    • Your workloads periodically fluctuate or traffic occasionally spikes. The resource demand is hard to predict.

    • The resource demand fluctuates. You need to frequently deploy and release resources.

    For example, to temporarily expand workloads for tests, business development, or e-commerce promotional events, we recommend that you use the pay-as-you-go billing method.

  • Preemptible instances

    Preemptible instances are pay-as-you-go instances whose prices change in real time based on inventory. Compared with regular pay-as-you-go instances, preemptible instances can reduce the total cost by up to 90%. However, preemptible instances may be reclaimed at any time after their protection period ends. Therefore, preemptible instances are suitable only for stateless workloads with strong fault tolerance, such as batch processing and machine learning workloads, big data extract, transform, and load (ETL) jobs (such as Apache Spark), queued transaction processing applications, and applications that use REST APIs. Preemptible instances are not suitable for long-term businesses or applications that require high stability. For more information, see Best practices for preemptible instance-based node pools.

Use savings plans

If you need to use ECS instances or elastic container instances for long-term businesses, you can purchase savings plans to enjoy large discounts. Savings plans offer discounted pay-as-you-go prices in exchange for a commitment to a consistent amount of spending over a one-year, three-year, or five-year term. For more information, see Overview of savings plans and Purchase and apply savings plans.

Choose a proper region for your cluster

The price of ECS instances may vary based on the region. In most cases, your users experience lower network latency and higher transmission speeds if your cluster is deployed in a region close to them. If your business can tolerate a higher network latency, you can deploy your cluster in a region that offers lower instance prices. For more information about the prices of ECS instances in different regions, see Elastic Compute Service.

Use ACK managed clusters

The control plane of an ACK managed cluster is created and hosted by ACK. You only need to create worker nodes, and you do not need to manage the control plane (master nodes) or pay resource fees for it. Therefore, ACK managed clusters are more cost-effective than ACK dedicated clusters.

If you want to run large-scale businesses or your businesses require high stability or security, we recommend that you use ACK Pro clusters. ACK Pro clusters are covered by SLAs that include compensation clauses and provide enhanced reliability, security, and schedulability. For more information, see Overview of ACK Pro clusters.

Optimize resource allocation for workloads

Configure appropriate resource requests and limits

You must configure resource requests and limits properly. Excessively high requests and limits waste resources, and excessively low requests and limits compromise the stability of the cluster during peak hours. In most cases, you can refer to statistics such as the historical resource utilization of containers and the stress test results of applications to assess the stability and resource utilization of an application. This helps you adjust resource allocation based on the status of containers.
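For example, the following is a minimal sketch of a Deployment that sets explicit requests and limits for a container. The workload name, image, and values are placeholders that you would replace with figures derived from your own utilization data:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app                # placeholder workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: nginx:1.25      # placeholder image
        resources:
          requests:            # what the scheduler reserves for the container
            cpu: 500m
            memory: 512Mi
          limits:              # hard caps enforced at runtime
            cpu: "1"
            memory: 1Gi
```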

We recommend that you use resource configurations suggested by resource profiling. Resource profiling analyzes the historical resource usage data in a cluster to generate suggested container resource specifications. Resource profiling not only reduces the complexity of resource request and limit settings, but also allows you to set resource specifications for each container to improve resource utilization. For more information, see Resource profiling.

Manage namespaces and resource quotas

In multi-tenant scenarios, you may need to deploy applications in different namespaces for different businesses or teams. You can create namespaces to isolate resources and set resource quotas for different namespaces to limit the amount of resources that can be used by each application or team. You can configure CPU, memory, storage, and pod quotas for namespaces. For more information, see Manage namespaces and resource quotas.
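As an illustration, the following is a minimal sketch of a ResourceQuota that caps the total requests, limits, and pod count of a hypothetical team namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota        # hypothetical quota for one team
  namespace: team-a         # hypothetical namespace
spec:
  hard:
    requests.cpu: "10"      # total CPU that pods in this namespace can request
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"              # maximum number of pods in this namespace
```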

Use proper scaling capabilities

If your workloads fluctuate greatly, we recommend that you configure auto scaling based on your business requirements. Auto scaling can quickly scale out pods during peak hours to handle traffic spikes and scale in pods during off-peak hours to reduce resource costs. You pay only for the resources that you actually use instead of provisioning and paying for resources based on peak demand, which greatly reduces the cluster cost.

  • Workload scaling: This scheduling-layer solution operates at the pod level by dynamically adjusting the number of pods or the amount of resources allocated to pods as workloads change. For example, HPA can automatically adjust the number of application pods based on traffic changes, which in turn adjusts the amount of resources occupied by the workload.

  • Compute resource scaling: This resource layer solution consists of node scaling and virtual node scaling. You can use this solution to increase or decrease the amount of resources allocated to your applications based on pod scheduling results and resource usage.

Workload scaling

Horizontal Pod Autoscaling (HPA)

HPA automatically scales pods based on CPU usage, memory usage, or custom metrics. HPA can scale out pods during peak hours to handle traffic spikes and scale in pods during off-peak hours to reduce resource costs. HPA is suitable for scenarios where you need to deploy large numbers of services and frequently scale resources for workloads that fluctuate greatly.
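For example, the following is a minimal sketch of an autoscaling/v2 HPA policy for a hypothetical web-app Deployment that scales between 2 and 10 replicas based on average CPU utilization:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app           # hypothetical workload; CPU requests must be set for utilization-based scaling
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU usage exceeds 70% of requests
```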

We recommend that you take note of the following items when you configure HPA to ensure the stability of applications:

  • Configure proper resource requests and limits.

  • Configure pod health check and pod auto recovery to ensure that pods start to receive traffic only after they enter the Ready state.

  • Make sure that pods can be quickly started and stopped.

  • Make sure that Metrics Server runs normally.

Cron Horizontal Pod Autoscaling (CronHPA)

CronHPA periodically scales resources based on crontab-like policies. CronHPA is suitable for scenarios where you need to run tasks or handle traffic spikes at a scheduled time.
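As a sketch, assuming the CronHorizontalPodAutoscaler CRD provided by the open-source kubernetes-cronhpa-controller that ACK uses, a policy that scales a hypothetical workload out before a daily peak and back in afterward might look like the following (the schedule uses a six-field, seconds-first crontab format):

```yaml
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: web-app-cronhpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app             # hypothetical workload
  jobs:
  - name: scale-out-before-peak
    schedule: "0 30 8 * * *"  # 08:30:00 every day
    targetSize: 10
  - name: scale-in-after-peak
    schedule: "0 0 22 * * *"  # 22:00:00 every day
    targetSize: 3
```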

When you configure CronHPA, we recommend that you take note of the following items to ensure the stability of applications:

  • Configure proper resource requests and limits.

  • Configure pod health check and pod auto recovery to ensure that pods start to receive traffic only after they enter the Ready state.

  • Make sure that pods can be quickly started and stopped.

  • If you have configured both HPA and CronHPA, make sure that they do not conflict during resource scaling. For more information, see Make CronHPA compatible with HPA.

Vertical Pod Autoscaling (VPA)

VPA generates suggested CPU and memory specifications based on the historical resource usage data of pods, and adjusts the resource configuration to meet resource demand. VPA is suitable for stateful applications that require stable resource supply.
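As a sketch, assuming the open-source VerticalPodAutoscaler CRD, the following policy lets VPA automatically apply its recommended requests to a hypothetical Deployment:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app            # hypothetical workload
  updatePolicy:
    updateMode: "Auto"       # VPA evicts and re-creates pods to apply updated requests
```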

When you configure VPA, we recommend that you take note of items similar to those listed for HPA, such as proper resource requests and limits, to ensure the stability of applications.

Adaptive Horizontal Pod Autoscaling (AHPA)

AHPA automatically identifies the scaling cycle and estimates the required resource capacity by analyzing the historical resource statistics of applications, and dynamically scales out pods to ensure that resources are provisioned before traffic spikes occur. In addition, AHPA can scale in pods promptly during off-peak hours.

When you configure AHPA, we recommend that you take note of items similar to those listed for HPA, such as proper resource requests and limits and reliable health checks, to ensure the stability of applications.

Kubernetes Event-driven Autoscaling (KEDA)

KEDA periodically polls event sources such as Kafka, MySQL, PostgreSQL, RabbitMQ, and MongoDB and scales workloads based on the number of pending events. KEDA is suitable for video and audio offline transcoding, event-driven jobs, and data streaming scenarios.
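As a sketch, assuming the standard KEDA ScaledObject CRD with a hypothetical Kafka topic and consumer Deployment, an event-driven scaling policy might look like the following:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: transcode-scaler
spec:
  scaleTargetRef:
    name: transcode-worker          # hypothetical Deployment that consumes the topic
  minReplicaCount: 0                # scale to zero when the topic is drained
  maxReplicaCount: 20
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.default.svc:9092   # hypothetical broker address
      consumerGroup: transcode-group
      topic: transcode-jobs
      lagThreshold: "10"            # add a replica for roughly every 10 pending messages
```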

When you configure KEDA, we recommend that you take note of items similar to those listed for HPA, such as proper resource requests and limits and fast pod startup and shutdown, to ensure the stability of applications.

Node scaling

When you use workload scaling capabilities, we recommend that you also enable node scaling so that nodes can be added when pods fail to be scheduled due to insufficient node resources. For more information about how to choose between node auto scaling and node instant scaling, see Scaling solutions: node auto scaling and node instant scaling.

Node auto scaling

You can use node auto scaling to automatically scale nodes when the resources in your Container Service for Kubernetes (ACK) cluster cannot fulfill pod scheduling. The node auto scaling feature applies to scenarios with limited scaling requirements, such as clusters that contain fewer than 20 node pools with auto scaling enabled or node pools with fewer than 100 nodes each. Node auto scaling is optimal for workloads with stable traffic patterns, periodic or predictable resource demands, and operations where single-batch scaling meets business requirements.

When you configure node auto scaling, we recommend that you take note of the corresponding stability considerations, such as setting proper resource requests on pods so that scale-out is triggered as expected.

Node instant scaling

Compared with node auto scaling, node instant scaling can scale clusters at a larger scale and at a faster pace. This feature lowers the technical skill requirements for developers, improves resource scaling efficiency, and reduces manual O&M efforts.

Node instant scaling has certain limits. For more information, see Limits of node instant scaling.

Virtual nodes

When you use an ACK cluster, you may need to launch a large number of pods within a short period of time. If you choose to create ECS instances for the pods, the creation process can be time-consuming. If you choose to reserve ECS instances, the instances are idle before pod creation and after pod termination, resulting in resource waste. You can create ACK virtual nodes to quickly schedule pods to elastic container instances. This way, you do not need to purchase or manage ECS instances. For more information, see Introduction to virtual node scheduling and solution comparison and Schedule pods to elastic container instances.
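As a sketch, assuming the alibabacloud.com/eci pod label described in the linked topics, a pod can be scheduled to an elastic container instance as shown below; the workload itself is hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: burst-job
  labels:
    alibabacloud.com/eci: "true"   # request scheduling to a virtual node (elastic container instance)
spec:
  containers:
  - name: worker
    image: busybox:1.36            # placeholder image
    command: ["sh", "-c", "echo processing && sleep 60"]
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
```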

Optimize pod scheduling

Dynamic resource overcommitment

If pods whose QoS classes are Guaranteed and Burstable are colocated in a cluster, you can configure dynamic resource overcommitment to use resources that are allocated but not in use.

To handle workload fluctuations in upstream and downstream services in an ACK cluster, the application administrator usually needs to configure a resource buffer for each Guaranteed or Burstable pod. Consequently, the amount of resources actually used is much lower than the requested amount. To use resources that are allocated but not in use, you can configure dynamic resource overcommitment to improve resource utilization.

Dynamic resource overcommitment allows you to define a resource redundancy rate and dynamically overcommit the resources that exceed this rate. The amount of resources that can be overcommitted on a node changes with the node's actual resource usage, and Best Effort (BE) pods can preferentially use the overcommitted resources when they are scheduled to the node. For more information, see Enable dynamic resource overcommitment.

For example, in colocation scenarios, you need to deploy Latency Sensitive (LS) pods and resource-heavy BE pods in a cluster or on a node. To meet the QoS requirements of LS pods, we recommend that you configure resource overcommitment for BE pods to conduct fine-grained CPU and memory management. This helps improve the overall resource utilization. For more information, see Getting started.
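As a sketch, assuming the koordinator.sh QoS label and kubernetes.io/batch-* extended resources used by ACK's colocation solution, a BE pod that consumes overcommitted capacity might look like the following:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: be-batch-job
  labels:
    koordinator.sh/qosClass: "BE"        # mark the pod as Best Effort in the colocation model
spec:
  containers:
  - name: batch-worker
    image: busybox:1.36                  # placeholder image
    command: ["sh", "-c", "sleep 3600"]  # placeholder batch workload
    resources:
      requests:
        kubernetes.io/batch-cpu: "1000"    # overcommitted CPU, in milli-cores
        kubernetes.io/batch-memory: "2Gi"  # overcommitted memory
      limits:
        kubernetes.io/batch-cpu: "1000"
        kubernetes.io/batch-memory: "2Gi"
```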

GPU sharing

To run multiple containers on one GPU for GPU cost optimization, you can use GPU sharing.

Single GPU sharing and multiple GPU sharing modes are available. In single GPU sharing mode, each pod requests one GPU and occupies part of its resources. This mode is suitable for model inference scenarios. In multiple GPU sharing mode, each pod requests multiple GPUs, and the same amount of resources is allocated from each GPU. This mode is suitable for distributed model development and training. You can also configure GPU sharing and isolation policies. For example, you can configure multiple pods to preferentially share one GPU or spread the pods across multiple GPUs. For more information, see GPU sharing.
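As a sketch, assuming the aliyun.com/gpu-mem extended resource that ACK's GPU sharing exposes, an inference pod that occupies 4 GiB of memory on a shared GPU might request resources as follows:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker              # hypothetical inference workload
spec:
  containers:
  - name: model-server
    image: registry.example.com/model-server:v1   # placeholder image
    resources:
      limits:
        aliyun.com/gpu-mem: 4         # occupy 4 GiB of memory on a shared GPU
```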

Set up resource and cost monitoring

Use the cost insights feature to monitor the costs of departments or applications

IT spending administrators usually need to understand cluster resource usage and cost trends from different dimensions to reduce resource costs and improve resource utilization. You can use the cost insights feature provided by ACK clusters to view the costs and resource usage of clusters, departments, or applications within a specified cost governance cycle. For more information, see Cost Insights.

Pods are the smallest deployable units in ACK clusters, and pod costs are an important factor in calculating the cluster cost. Different pods may have different resource specifications, scheduling policies, and lifecycles, so estimating the cost of a pod is complex. The cost insights feature uses a cost data model to estimate the cost and cost ratio of each pod and then allocates the total cost to different business units. You can also use multi-dimensional cost dashboards to analyze historical resource usage trends and cost details to locate the causes of unexpected costs.

Periodically scan for idle resources

You can periodically scan for and release idle resources in ACK clusters, such as CPU, memory, storage, and network resources, to reduce resource costs.

You can also enable the cost insights feature to view idle resources in pods, locate the cause, and then optimize the resource allocation policy. For more information, see Billing methods and pod usage.

ACK clusters also provide features that help you identify cluster-related idle resources, such as ECS instances, Elastic Block Storage (EBS) resources, Classic Load Balancer (CLB) instances, and elastic IP addresses (EIPs). For more information, see Idle resource optimization.