Kubernetes-native scaling is reactive by design: resources scale only after a metric threshold is crossed, so pods spin up after the traffic spike has already arrived. ACS addresses this with predictive scaling based on the Advanced Horizontal Pod Autoscaler (AHPA), which learns from historical metric data to predict, minute by minute, how many pods a workload will need over the next 24 hours and provisions them before demand peaks.
## Why existing scaling approaches fall short
| Method | Limitation |
|---|---|
| Manual pod count | Idle pods waste resources during off-peak hours |
| Horizontal Pod Autoscaler (HPA) | Reacts only after a metric threshold is crossed: scale-out fires when resource usage exceeds the threshold and scale-in when it drops below, so capacity always lags demand |
| CronHPA | Requires a separate schedule for every time slot (up to 1,440 entries to cover 24 hours at per-minute granularity) and manual updates whenever traffic patterns change |
AHPA addresses these limitations through predictive scaling: it provisions pods ahead of demand, eliminating the gap between a traffic spike and available capacity.
The diagram below contrasts the two approaches. With traditional HPA, pods spin up after a workload spike. With AHPA, pods reach the Ready state before the spike.
## How it works
AHPA combines three mechanisms to guarantee sufficient resources for applications: proactive prediction, passive prediction, and service degradation.
The diagram above shows how the AHPA controller coordinates with HPA and Deployments to schedule pods ahead of demand.
| Mechanism | Behavior |
|---|---|
| Proactive prediction | Analyzes historical metric data to forecast workload trends and pre-schedules pods. Best suited for workloads with periodic or predictable fluctuation patterns |
| Passive prediction | Detects and responds to workload fluctuations in real time, deploying resources as fluctuations occur |
| Service degradation | Allows you to specify the maximum and minimum numbers of pods within one or more time periods, giving operators a guaranteed floor and ceiling regardless of prediction results |
Supported metrics: CPU, GPU, memory, queries per second (QPS), response time (RT), and external metrics.
Scaling execution: AHPA can execute scaling either through HPA, which simplifies HPA configuration and compensates for its built-in scaling delay, or directly through Deployments.
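As a concrete sketch, the three mechanisms above typically come together in a single AHPA custom resource. The `apiVersion`, `kind`, and field names below (`prediction`, `instanceBounds`, and so on) are illustrative assumptions based on common AHPA deployments; verify the exact schema against your cluster's CRD before using it:

```yaml
# Illustrative AHPA manifest -- field names are assumptions, not authoritative.
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: AdvancedHorizontalPodAutoscaler
metadata:
  name: demo-ahpa
spec:
  scaleTargetRef:                 # the Deployment AHPA manages
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  minReplicas: 2                  # hard floor, regardless of predictions
  maxReplicas: 100                # hard ceiling
  metrics:                        # metric driving the prediction (CPU here)
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 40
  prediction:
    quantile: 95                  # how conservatively to size for predicted demand
    scaleUpForward: 180           # seconds to provision pods ahead of a predicted peak
  instanceBounds:                 # service degradation: per-period floor and ceiling
  - startTime: "2024-01-01 00:00:00"
    endTime: "2034-01-01 00:00:00"
    bounds:
    - cron: "* 0-8 ? * MON-FRI"   # weekday off-peak window
      minReplicas: 4
      maxReplicas: 15
```

In this sketch, `scaleUpForward` implements proactive prediction (pods are scheduled ahead of the forecast peak), while `instanceBounds` implements service degradation by pinning a floor and ceiling for specific time windows.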
## Performance
| Metric | Value |
|---|---|
| Workload fluctuation detection | Within milliseconds |
| Resource provisioning | Within seconds |
| Prediction accuracy | Greater than 95% when proactive and passive prediction are combined |
| Scheduling granularity | Pod counts specified down to the minute |
## Use cases
Periodic workloads: Applications with predictable traffic cycles — live streaming, online education, and gaming — benefit most from proactive prediction, which pre-warms capacity before each peak.
Fixed baseline with burst traffic: If your deployment already runs a fixed pod count for steady-state traffic, AHPA handles unexpected surges without requiring you to manage multiple scaling policies.
Custom autoscaling integrations: If your platform team needs to integrate scaling recommendations into internal tooling or CI/CD pipelines, AHPA exposes a standard Kubernetes API so you can retrieve prediction results and act on them programmatically.
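Because prediction results are published on the AHPA object itself, internal tooling can read them with standard Kubernetes clients. The resource name and the idea that results live under `.status` are assumptions following usual CRD conventions; check the actual names in your cluster:

```shell
# Inspect the full AHPA object, including its predicted replica counts.
# The resource name "advancedhorizontalpodautoscaler" is an assumption;
# run `kubectl api-resources` to confirm it in your cluster.
kubectl get advancedhorizontalpodautoscaler demo-ahpa -o yaml

# Extract just the status block (where controllers conventionally publish
# results) for consumption by internal tooling or a CI/CD step:
kubectl get advancedhorizontalpodautoscaler demo-ahpa -o jsonpath='{.status}'
```

Piping the `jsonpath` output into a script is usually enough for dashboards or pipeline gates; anything richer can use a Kubernetes client library against the same custom resource API.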