Cluster cost optimization means using cluster resources economically without compromising workload stability. This guide covers instance selection, billing methods, auto scaling, pod scheduling, and cost monitoring — giving you a practical framework for building a FinOps practice on Alibaba Cloud.
This guide is intended for administrators of Container Service for Kubernetes (ACK) clusters. The recommendations are not ordered by priority — apply them based on your business requirements. Before you begin, familiarize yourself with the following Kubernetes concepts: pods and container resources, namespaces, auto scaling (workload scaling and node scaling), and scheduling.
Related topics:
To ensure application stability, combine these recommendations with Suggested configurations for creating HA clusters and Recommended workload configurations.
If your ACK Pro cluster has more than 500 nodes or 10,000 pods, follow Suggestions on using large-scale clusters.
To build a FinOps system and set strategic cost objectives, see Cost Suite. FinOps (Finance + DevOps) is a set of practices for cloud financial management that helps teams estimate, track, and optimize cloud resource costs.
Choose instance types and billing methods
Before creating a cluster, assess the resource requirements of your workloads and select instance types and billing methods that balance performance and cost.
Match instance types to workload profiles
The following table summarizes common workload profiles and the recommended instance types and billing methods for each.
| Workload profile | Recommended instance type | Recommended billing method | Notes |
|---|---|---|---|
| Web services and databases (stable, long-running) | General-purpose (e.g., ecs.g series) | Subscription or savings plan | Predictable lifecycle; subscription and savings plans reduce unit cost |
| Distributed caching (memory-intensive) | Memory-optimized (1:8 vCPU-to-memory ratio) | Subscription or pay-as-you-go | Memory-optimized instances improve CPU utilization at lower cost for memory-heavy apps |
| Deep learning and model training | GPU-accelerated | Pay-as-you-go or preemptible | GPU-to-vCPU ratio: 1:8 to 1:12; use preemptible instances for fault-tolerant training jobs |
| Batch processing, ETL, event-driven jobs | Any | Preemptible | Up to 90% cheaper than pay-as-you-go; suitable for stateless, fault-tolerant workloads |
| Dev/test and small websites | Shared instance families | Pay-as-you-go | Lower price; may have performance fluctuations; not suitable for production |
| Traffic spikes and e-commerce promotions | General-purpose | Pay-as-you-go | Flexible; no advance commitment required |
Avoid instance types with 2 vCPUs and 4 GB of memory or less in production environments to prevent resource bottlenecks and fragmentation. See ECS specification recommendations for ACK clusters for details.
For general guidance on instance selection, see Suggestions on choosing ECS specifications for ACK clusters.
Use preemptible instances
Preemptible instances are pay-as-you-go instances priced based on real-time inventory. They can reduce total costs by up to 90% compared to regular pay-as-you-go instances.
Preemptible instances have a protection period (typically one hour); after it expires, they may be reclaimed at any time based on price and inventory changes. Use them only for stateless, fault-tolerant workloads such as:
Batch processing and machine learning
Big data ETL jobs (for example, Apache Spark)
Queued transaction processing
Applications that use REST APIs
For best practices, see Best practices for preemptible instance-based node pools.
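As a sketch of how to steer only fault-tolerant work onto preemptible capacity, the Job below uses a node selector and a matching toleration. The label and taint key `workload-type: preemptible`, the Job name, and the image are hypothetical — they assume you have labeled and tainted your preemptible node pool yourself.

```yaml
# Hypothetical fault-tolerant Job pinned to preemptible-instance nodes.
# Assumes the preemptible node pool carries the custom label and taint
# workload-type=preemptible (names are illustrative, not ACK built-ins).
apiVersion: batch/v1
kind: Job
metadata:
  name: etl-job
spec:
  backoffLimit: 6              # retry pods lost when a node is reclaimed
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        workload-type: preemptible
      tolerations:
      - key: workload-type
        value: preemptible
        effect: NoSchedule
      containers:
      - name: etl
        image: registry.example.com/etl:v1   # hypothetical image
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
```

Pairing the taint with the toleration keeps non-fault-tolerant pods off reclaimable nodes, while `backoffLimit` lets the Job survive a reclaim.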
Use shared instance families
Shared instance families are suitable for individual developers and small-to-medium websites, web applications, development environments, lightweight databases, and lightweight enterprise-class applications. Because CPU resources are shared, performance may fluctuate under heavy load. For details, see Shared instance families.
Use savings plans
For long-running ECS instances or elastic container instances, purchase savings plans to get discounted pay-as-you-go pricing. Savings plans require a commitment of 1, 3, or 5 years. See Overview of savings plans and Purchase and apply savings plans.
Choose a region
ECS instance prices vary by region. If your workloads can tolerate higher network latency, deploying in a lower-cost region can reduce expenses. For pricing by region, see Elastic Compute Service.
Use ACK managed clusters
ACK managed clusters host the control plane (master nodes) on Alibaba Cloud — you only provision worker nodes and pay no resource fees for the control plane. This makes managed clusters more cost-effective than dedicated clusters.
For large-scale workloads requiring high stability and security, use ACK Pro clusters. ACK Pro clusters are covered by SLAs with compensation clauses and offer enhanced reliability, security, and schedulability. See Overview of ACK Pro clusters.
Right-size workload resource allocation
Configure resource requests and limits
Set resource requests and limits accurately. Requests that are too high waste capacity; requests that are too low risk instability during peak load. Use historical container utilization data and stress test results to calibrate these values.
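For reference, a minimal container spec with requests and limits might look like the fragment below. The values and image are illustrative only; calibrate them against your own utilization and stress-test data.

```yaml
# Illustrative resource settings for one container (values are examples,
# not recommendations — derive yours from historical usage data).
containers:
- name: app
  image: registry.example.com/app:v1   # hypothetical image
  resources:
    requests:
      cpu: "500m"       # capacity the scheduler reserves for the pod
      memory: 512Mi
    limits:
      cpu: "1"          # hard ceiling; CPU is throttled above this
      memory: 1Gi       # exceeding this triggers an OOM kill
```

Requests drive scheduling and bin-packing (and therefore cost); limits protect node stability.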
Use resource profiling to generate suggested container resource specifications from historical usage data. Resource profiling reduces the complexity of tuning resource settings and improves overall utilization. See Resource profiling.
Continually revisit resource requests and limits as your application evolves — configurations that were accurate at deployment may become outdated over time.
Manage namespaces and resource quotas
In multi-tenant clusters, use namespaces to isolate resources for different teams or workloads and set resource quotas to cap consumption per namespace. Configurable quota types include CPU, memory, storage, and pod count. See Manage namespaces and resource quotas.
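A standard Kubernetes ResourceQuota illustrating a per-namespace cap might look like this; the namespace name and quota values are hypothetical.

```yaml
# Hypothetical quota capping total consumption in the team-a namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"        # sum of CPU requests across all pods
    requests.memory: 64Gi
    limits.cpu: "40"          # sum of CPU limits across all pods
    limits.memory: 128Gi
    requests.storage: 500Gi   # total PersistentVolumeClaim storage
    pods: "100"               # maximum pod count in the namespace
```

Once a quota exists for a resource, pods in that namespace must declare requests and limits for it, or admission is rejected.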
Use auto scaling to reduce idle capacity
Auto scaling is one of the most effective ways to cut cluster costs. Scale out pods during peak hours to handle traffic, and scale in during off-peak hours so you only pay for what you use — without over-provisioning for peak demand.
ACK supports two layers of auto scaling:
Workload scaling: adjusts pod count or pod resource allocation at the workload level
Compute resource scaling: adjusts node count or provisions virtual nodes based on pod scheduling demand
Workload scaling
| Solution | When to use | Key considerations |
|---|---|---|
| Horizontal Pod Autoscaling (HPA) | Services with fluctuating traffic; scale based on CPU usage, memory usage, or custom metrics | Configure resource requests and limits; configure pod health checks and auto recovery; make sure Metrics Server is running |
| Cron Horizontal Pod Autoscaling (CronHPA) | Predictable traffic patterns; scheduled scale-out at fixed times | Configure resource requests and limits; configure pod health checks and auto recovery; if using HPA and CronHPA together, prevent conflicts — see Make CronHPA compatible with HPA |
| Vertical Pod Autoscaling (VPA) | Stateful applications with stable resource demand; right-size pod CPU and memory based on historical usage | Configure pod disruption budgets (PDBs); make sure Metrics Server is running; review VPA precautions |
| Adaptive Horizontal Pod Autoscaling (AHPA) | Workloads with recurring traffic cycles; scale out proactively before spikes based on historical patterns | Configure resource requests and limits; configure pod health checks and auto recovery |
| Kubernetes Event-driven Autoscaling (KEDA) | Event-driven workloads consuming from Kafka, MySQL, PostgreSQL, RabbitMQ, or MongoDB; video/audio transcoding, data streaming | Configure resource requests and limits; configure pod health checks and auto recovery |
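As a sketch of the first row in the table above, a standard `autoscaling/v2` HorizontalPodAutoscaler targeting CPU utilization could look like this; the Deployment name and replica bounds are hypothetical.

```yaml
# Hypothetical HPA that keeps average CPU utilization near 70%,
# scaling the target Deployment between 2 and 10 replicas.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                 # hypothetical Deployment
  minReplicas: 2              # floor retained during off-peak hours
  maxReplicas: 10             # ceiling for peak traffic
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Utilization is computed against each container's CPU request, which is why the table lists configuring requests and limits as a prerequisite.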
Node scaling
Enable node scaling alongside workload scaling to prevent pod scheduling failures when cluster node resources are insufficient. To choose between node auto scaling and node instant scaling, see Scaling solutions: node auto scaling and node instant scaling.
| Solution | When to use | Key considerations |
|---|---|---|
| Node auto scaling | Clusters with fewer than 20 auto scaling-enabled node pools, or fewer than 100 nodes per node pool; stable or predictable traffic patterns | Configure resource requests and limits; configure pod disruption budgets (PDBs) |
| Node instant scaling | Large-scale or rapid scaling needs that exceed node auto scaling limits | Review limits of node instant scaling before enabling |
| Virtual nodes | Burst workloads requiring many pods in a short window without provisioning ECS instances | See Introduction to virtual node scheduling and solution comparison and Schedule pods to elastic container instances |
Optimize pod scheduling
Dynamic resource overcommitment
When Guaranteed and Burstable pods share a cluster, application administrators typically configure a resource buffer for each pod to absorb workload fluctuations — leaving a gap between requested and actual resource usage. Dynamic resource overcommitment lets you reclaim that unused capacity for Best Effort (BE) pods.
Configure a resource redundancy rate to define how much buffer to preserve. Capacity beyond the redundancy rate is dynamically made available to BE pods, and the overcommittable amount on each node adjusts in real time based on actual usage. You can also control how BE pods are prioritized when they are scheduled to a node.
For colocation scenarios where Latency Sensitive (LS) pods and resource-heavy BE pods share a node, configure resource overcommitment for BE pods to enable fine-grained CPU and memory management. See Enable dynamic resource overcommitment and Getting started.
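As an illustrative sketch only: with the ack-koordinator component installed, a BE pod typically requests the reclaimed capacity through batch extended resources. The QoS label and the resource names below (`koordinator.sh/qosClass`, `kubernetes.io/batch-cpu` in millicores, `kubernetes.io/batch-memory`) follow the Koordinator convention — verify them against your cluster's documentation before use.

```yaml
# Hypothetical BE pod consuming overcommitted (reclaimed) resources.
# Assumes ack-koordinator is installed and dynamic resource
# overcommitment is enabled on the node.
apiVersion: v1
kind: Pod
metadata:
  name: be-batch-worker
  labels:
    koordinator.sh/qosClass: BE    # mark the pod as Best Effort
spec:
  containers:
  - name: worker
    image: registry.example.com/batch:v1   # hypothetical image
    resources:
      requests:
        kubernetes.io/batch-cpu: "1000"    # 1 core of reclaimed CPU
        kubernetes.io/batch-memory: 2Gi
      limits:
        kubernetes.io/batch-cpu: "1000"
        kubernetes.io/batch-memory: 2Gi
```

Because the batch resources track real-time node usage, such pods may be throttled or evicted when LS pods need their capacity back.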
GPU sharing
Run multiple pods on a single GPU to reduce GPU costs.
Two modes are available:
Single GPU sharing: each pod requests one GPU and occupies a portion of its resources. Suitable for model inference scenarios.
Multiple GPU sharing: each pod requests multiple GPUs, with the same resource amount allocated from each. Suitable for distributed model development and training.
Configure GPU sharing and isolation policies — for example, pack multiple pods onto one GPU or spread them across GPUs. See GPU sharing.
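As an illustrative sketch only: with the ACK GPU sharing component installed, a pod requests a slice of one GPU's memory through an extended resource. The resource name `aliyun.com/gpu-mem` (in GiB) and the pod details below are assumptions — confirm the exact resource name and units in the GPU sharing documentation for your cluster.

```yaml
# Hypothetical inference pod requesting 4 GiB of a shared GPU.
# Assumes the ACK GPU sharing (cGPU) component is installed.
apiVersion: v1
kind: Pod
metadata:
  name: model-inference
spec:
  containers:
  - name: model-server
    image: registry.example.com/infer:v1   # hypothetical image
    resources:
      limits:
        aliyun.com/gpu-mem: 4    # GiB of GPU memory on one shared GPU
```

Requesting GPU memory instead of a whole GPU lets several inference pods pack onto one device, which is the single-GPU-sharing mode described above.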
Monitor costs and identify waste
Use Cost Insights
Cost Insights lets you view resource usage and costs for clusters, departments, or applications within a specified cost governance cycle. The feature uses a cost data model to estimate the cost of each pod — the smallest deployable unit in Kubernetes — and allocates the total cost to business units. Multi-dimensional dashboards let you analyze historical usage trends and pinpoint the source of unexpected charges. See Cost Insights.
Scan for idle resources
Periodically scan for and release idle resources — including CPU, memory, storage, and network resources — to avoid paying for capacity that isn't being used.
Use Cost Insights to identify idle pods and adjust resource allocation policies accordingly. See Billing methods and pod usage.
ACK also provides tools to detect idle cluster-related resources such as Elastic Compute Service (ECS) instances, Elastic Block Storage (EBS) volumes, Classic Load Balancer (CLB) instances, and elastic IP addresses (EIPs). See Idle resource optimization.