Cluster cost optimization means using cluster resources economically without compromising workload stability. This guide covers instance selection, billing methods, auto scaling, pod scheduling, and cost monitoring — giving you a practical framework for building a FinOps practice on Alibaba Cloud.
This guide is intended for administrators of Container Service for Kubernetes (ACK) clusters. The recommendations are not ordered by priority — apply them based on your business requirements. Before you begin, familiarize yourself with the following Kubernetes concepts: pods and container resources, namespaces, auto scaling (workload scaling and node scaling), and scheduling.
Related topics:
To ensure application stability, combine these recommendations with Suggested configurations for creating HA clusters and Recommended workload configurations.
If your ACK Pro cluster has more than 500 nodes or 10,000 pods, follow Suggestions on using large-scale clusters.
To build a FinOps system and set strategic cost objectives, see Cost Suite. FinOps (Finance + DevOps) is a set of practices for cloud financial management that helps teams estimate, track, and optimize cloud resource costs.
Choose instance types and billing methods
Before creating a cluster, assess the resource requirements of your workloads and select instance types and billing methods that balance performance and cost.
Match instance types to workload profiles
The following table summarizes common workload profiles and the recommended instance types and billing methods for each.
| Workload profile | Recommended instance type | Recommended billing method | Notes |
|---|---|---|---|
| Web services and databases (stable, long-running) | General-purpose (e.g., ecs.g series) | Subscription or savings plan | Predictable lifecycle; subscription and savings plans reduce unit cost |
| Distributed caching (memory-intensive) | Memory-optimized (1:8 vCPU-to-memory ratio) | Subscription or pay-as-you-go | Memory-optimized instances improve CPU utilization at lower cost for memory-heavy apps |
| Deep learning and model training | GPU-accelerated | Pay-as-you-go or preemptible | GPU-to-vCPU ratio: 1:8 to 1:12; use preemptible instances for fault-tolerant training jobs |
| Batch processing, ETL, event-driven jobs | Any | Preemptible | Up to 90% cheaper than pay-as-you-go; suitable for stateless, fault-tolerant workloads |
| Dev/test and small websites | Shared instance families | Pay-as-you-go | Lower price; may have performance fluctuations; not suitable for production |
| Traffic spikes and e-commerce promotions | General-purpose | Pay-as-you-go | Flexible; no advance commitment required |
Avoid instance types with 2 vCPUs and 4 GB of memory or less in production environments to prevent resource bottlenecks and fragmentation. See ECS specification recommendations for ACK clusters for details.
For general guidance on instance selection, see Suggestions on choosing ECS specifications for ACK clusters.
Use preemptible instances
Preemptible instances are pay-as-you-go instances priced based on real-time inventory. They can reduce total costs by up to 90% compared to regular pay-as-you-go instances.
Preemptible instances have a protection period (typically one hour); after it expires, they may be reclaimed at any time based on price and inventory changes. Use them only for stateless, fault-tolerant workloads such as:
Batch processing and machine learning
Big data ETL jobs (for example, Apache Spark)
Queued transaction processing
Applications that use REST APIs
For best practices, see Best practices for preemptible instance-based node pools.
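As a sketch of how to steer only fault-tolerant work onto preemptible capacity, the Job below uses a node selector and a matching toleration. The label and taint key `workload-type: preemptible`, the Job name, and the image are hypothetical — they assume you have labeled and tainted your preemptible node pool yourself.

```yaml
# Hypothetical fault-tolerant Job pinned to preemptible-instance nodes.
# Assumes the preemptible node pool carries the custom label and taint
# workload-type=preemptible (names are illustrative, not ACK built-ins).
apiVersion: batch/v1
kind: Job
metadata:
  name: etl-job
spec:
  backoffLimit: 6              # retry pods lost when a node is reclaimed
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        workload-type: preemptible
      tolerations:
      - key: workload-type
        value: preemptible
        effect: NoSchedule
      containers:
      - name: etl
        image: registry.example.com/etl:v1   # hypothetical image
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
```

Pairing the taint with the toleration keeps non-fault-tolerant pods off reclaimable nodes, while `backoffLimit` lets the Job survive a reclaim.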
Use shared instance families
Shared instance families are suitable for individual developers and small-to-medium websites, web applications, development environments, lightweight databases, and lightweight enterprise-class applications. Because CPU resources are shared, performance may fluctuate under heavy load. For details, see Shared instance families.
Use savings plans
For long-running ECS instances or elastic container instances, purchase savings plans to get discounted pay-as-you-go pricing. Savings plans require a commitment of 1, 3, or 5 years. See Overview of savings plans and Purchase and apply savings plans.
Choose a region
ECS instance prices vary by region. If your workloads can tolerate higher network latency, deploying in a lower-cost region can reduce expenses. For pricing by region, see Elastic Compute Service.
Use ACK managed clusters
ACK managed clusters host the control plane (master nodes) on Alibaba Cloud — you only provision worker nodes and pay no resource fees for the control plane. This makes managed clusters more cost-effective than dedicated clusters.
For large-scale workloads requiring high stability and security, use ACK Pro clusters. ACK Pro clusters are covered by SLAs with compensation clauses and offer enhanced reliability, security, and schedulability. See Overview of ACK Pro clusters.
Right-size workload resource allocation
Configure resource requests and limits
Set resource requests and limits accurately. Requests that are too high waste capacity; requests that are too low risk instability during peak load. Use historical container utilization data and stress test results to calibrate these values.
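For reference, a minimal container spec with requests and limits might look like the fragment below. The values and image are illustrative only; calibrate them against your own utilization and stress-test data.

```yaml
# Illustrative resource settings for one container (values are examples,
# not recommendations — derive yours from historical usage data).
containers:
- name: app
  image: registry.example.com/app:v1   # hypothetical image
  resources:
    requests:
      cpu: "500m"       # capacity the scheduler reserves for the pod
      memory: 512Mi
    limits:
      cpu: "1"          # hard ceiling; CPU is throttled above this
      memory: 1Gi       # exceeding this triggers an OOM kill
```

Requests drive scheduling and bin-packing (and therefore cost); limits protect node stability.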
Use resource profiling to generate suggested container resource specifications from historical usage data. Resource profiling reduces the complexity of tuning resource settings and improves overall utilization. See Resource profiling.
Continually revisit resource requests and limits as your application evolves — configurations that were accurate at deployment may become outdated over time.
Manage namespaces and resource quotas
In multi-tenant clusters, use namespaces to isolate resources for different teams or workloads and set resource quotas to cap consumption per namespace. Configurable quota types include CPU, memory, storage, and pod count. See Manage namespaces and resource quotas.
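A standard Kubernetes ResourceQuota illustrating a per-namespace cap might look like this; the namespace name and quota values are hypothetical.

```yaml
# Hypothetical quota capping total consumption in the team-a namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"        # sum of CPU requests across all pods
    requests.memory: 64Gi
    limits.cpu: "40"          # sum of CPU limits across all pods
    limits.memory: 128Gi
    requests.storage: 500Gi   # total PersistentVolumeClaim storage
    pods: "100"               # maximum pod count in the namespace
```

Once a quota exists for a resource, pods in that namespace must declare requests and limits for it, or admission is rejected.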
Use auto scaling to reduce idle capacity
Auto scaling is one of the most effective ways to cut cluster costs. Scale out pods during peak hours to handle traffic, and scale in during off-peak hours so you only pay for what you use — without over-provisioning for peak demand.
ACK supports two layers of auto scaling:
Workload scaling: adjusts pod count or pod resource allocation at the workload level
Compute resource scaling: adjusts node count or provisions virtual nodes based on pod scheduling demand
Workload scaling
| Solution | When to use | Key considerations |
|---|---|---|
| Horizontal Pod Autoscaling (HPA) | Services with fluctuating traffic; scale based on CPU usage, memory usage, or custom metrics | Configure resource requests and limits; configure pod health checks and auto recovery; make sure Metrics Server is running |
| Cron Horizontal Pod Autoscaling (CronHPA) | Predictable traffic patterns; scheduled scale-out at fixed times | Configure resource requests and limits; configure pod health checks and auto recovery; if using HPA and CronHPA together, prevent conflicts — see Make CronHPA compatible with HPA |
| Vertical Pod Autoscaling (VPA) | Stateful applications with stable resource demand; right-size pod CPU and memory based on historical usage | Configure pod disruption budgets (PDBs); make sure Metrics Server is running; review VPA precautions |
| Adaptive Horizontal Pod Autoscaling (AHPA) | Workloads with recurring traffic cycles; scale out proactively before spikes based on historical patterns | Configure resource requests and limits; configure pod health checks and auto recovery |
| Kubernetes Event-driven Autoscaling (KEDA) | Event-driven workloads consuming from Kafka, MySQL, PostgreSQL, RabbitMQ, or MongoDB; video/audio transcoding, data streaming | Configure resource requests and limits; configure pod health checks and auto recovery |
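As a sketch of the first row in the table above, a standard `autoscaling/v2` HorizontalPodAutoscaler targeting CPU utilization could look like this; the Deployment name and replica bounds are hypothetical.

```yaml
# Hypothetical HPA that keeps average CPU utilization near 70%,
# scaling the target Deployment between 2 and 10 replicas.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                 # hypothetical Deployment
  minReplicas: 2              # floor retained during off-peak hours
  maxReplicas: 10             # ceiling for peak traffic
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Utilization is computed against each container's CPU request, which is why the table lists configuring requests and limits as a prerequisite.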
Node scaling
Enable node scaling alongside workload scaling to prevent pod scheduling failures when cluster node resources are insufficient. To choose between node auto scaling and node instant scaling, see Scaling solutions: node auto scaling and node instant scaling.
| Solution | When to use | Key considerations |
|---|---|---|
| Node auto scaling | Clusters with fewer than 20 auto scaling-enabled node pools, or fewer than 100 nodes per node pool; stable or predictable traffic patterns | Configure resource requests and limits; configure pod disruption budgets (PDBs) |
| Node instant scaling | Large-scale or rapid scaling needs that exceed node auto scaling limits | Review limits of node instant scaling before enabling |
| Virtual nodes | Burst workloads requiring many pods in a short window without provisioning ECS instances | See Introduction to virtual node scheduling and solution comparison and Schedule pods to elastic container instances |
Optimize pod scheduling
Dynamic resource overcommitment
When Guaranteed and Burstable pods share a cluster, application administrators typically configure a resource buffer for each pod to absorb workload fluctuations — leaving a gap between requested and actual resource usage. Dynamic resource overcommitment lets you reclaim that unused capacity for Best Effort (BE) pods.
Configure a resource redundancy rate to define how much buffer to preserve. Capacity beyond the redundancy rate is dynamically made available to BE pods, and the overcommittable amount on each node adjusts in real time based on actual usage. You can also control how BE pods are prioritized when they are scheduled to a node.
For colocation scenarios where Latency Sensitive (LS) pods and resource-heavy BE pods share a node, configure resource overcommitment for BE pods to enable fine-grained CPU and memory management. See Enable dynamic resource overcommitment and Getting started.
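As an illustrative sketch only: with the ack-koordinator component installed, a BE pod typically requests the reclaimed capacity through batch extended resources. The QoS label and the resource names below (`koordinator.sh/qosClass`, `kubernetes.io/batch-cpu` in millicores, `kubernetes.io/batch-memory`) follow the Koordinator convention — verify them against your cluster's documentation before use.

```yaml
# Hypothetical BE pod consuming overcommitted (reclaimed) resources.
# Assumes ack-koordinator is installed and dynamic resource
# overcommitment is enabled on the node.
apiVersion: v1
kind: Pod
metadata:
  name: be-batch-worker
  labels:
    koordinator.sh/qosClass: BE    # mark the pod as Best Effort
spec:
  containers:
  - name: worker
    image: registry.example.com/batch:v1   # hypothetical image
    resources:
      requests:
        kubernetes.io/batch-cpu: "1000"    # 1 core of reclaimed CPU
        kubernetes.io/batch-memory: 2Gi
      limits:
        kubernetes.io/batch-cpu: "1000"
        kubernetes.io/batch-memory: 2Gi
```

Because the batch resources track real-time node usage, such pods may be throttled or evicted when LS pods need their capacity back.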
GPU sharing
Run multiple pods on a single GPU to reduce GPU costs.
Two modes are available:
Single GPU sharing: each pod requests one GPU and occupies a portion of its resources. Suitable for model inference scenarios.
Multiple GPU sharing: each pod requests multiple GPUs, with the same resource amount allocated from each. Suitable for distributed model development and training.
Configure GPU sharing and isolation policies — for example, pack multiple pods onto one GPU or spread them across GPUs. See GPU sharing.
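As an illustrative sketch only: with the ACK GPU sharing component installed, a pod requests a slice of one GPU's memory through an extended resource. The resource name `aliyun.com/gpu-mem` (in GiB) and the pod details below are assumptions — confirm the exact resource name and units in the GPU sharing documentation for your cluster.

```yaml
# Hypothetical inference pod requesting 4 GiB of a shared GPU.
# Assumes the ACK GPU sharing (cGPU) component is installed.
apiVersion: v1
kind: Pod
metadata:
  name: model-inference
spec:
  containers:
  - name: model-server
    image: registry.example.com/infer:v1   # hypothetical image
    resources:
      limits:
        aliyun.com/gpu-mem: 4    # GiB of GPU memory on one shared GPU
```

Requesting GPU memory instead of a whole GPU lets several inference pods pack onto one device, which is the single-GPU-sharing mode described above.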
Monitor costs and identify waste
Use Cost Insights
Cost Insights lets you view resource usage and costs for clusters, departments, or applications within a specified cost governance cycle. The feature uses a cost data model to estimate the cost of each pod — the smallest deployable unit in Kubernetes — and allocates the total cost to business units. Multi-dimensional dashboards let you analyze historical usage trends and pinpoint the source of unexpected charges. See Cost Insights.
Scan for idle resources
Periodically scan for and release idle resources — including CPU, memory, storage, and network resources — to avoid paying for capacity that isn't being used.
Use Cost Insights to identify idle pods and adjust resource allocation policies accordingly. See Billing methods and pod usage.
ACK also provides tools to detect idle cluster-related resources such as Elastic Compute Service (ECS) instances, Elastic Block Storage (EBS) volumes, Classic Load Balancer (CLB) instances, and elastic IP addresses (EIPs). See Idle resource optimization.