Kubernetes-native scaling is reactive by design: resources scale only after a metric threshold is crossed, so pods spin up after the traffic spike has already arrived. ACS addresses this with predictive scaling based on the Advanced Horizontal Pod Autoscaler (AHPA), which learns from historical metric data to predict, minute by minute, how many pods a workload will need over the next 24 hours and provisions them before demand peaks.
## Why existing scaling approaches fall short
| Method | Limitation |
|---|---|
| Manual pod count | Idle pods waste resources during off-peak hours |
| Horizontal Pod Autoscaler (HPA) | Reacts only after a metric threshold is crossed: scale-out fires when resource usage exceeds the threshold and scale-in when it drops below, so capacity always lags demand |
| CronHPA | Requires a separate schedule for every time slot (up to 1,440 entries to cover 24 hours at per-minute granularity) and manual updates whenever traffic patterns change |
AHPA addresses these limitations through predictive scaling: it provisions pods ahead of demand, eliminating the gap between a traffic spike and available capacity.
The diagram below contrasts the two approaches. With traditional HPA, pods spin up after a workload spike. With AHPA, pods reach the Ready state before the spike.
## How it works
AHPA combines three mechanisms to guarantee sufficient resources for applications: proactive prediction, passive prediction, and service degradation.
The diagram above shows how the AHPA controller coordinates with HPA and Deployments to schedule pods ahead of demand.
| Mechanism | Behavior |
|---|---|
| Proactive prediction | Analyzes historical metric data to forecast workload trends and pre-schedules pods. Best suited for workloads with periodic or predictable fluctuation patterns |
| Passive prediction | Detects and responds to workload fluctuations in real time, deploying resources as fluctuations occur |
| Service degradation | Allows you to specify the maximum and minimum numbers of pods within one or more time periods, giving operators a guaranteed floor and ceiling regardless of prediction results |
Supported metrics: CPU, GPU, memory, queries per second (QPS), response time (RT), and external metrics.
Scaling execution: AHPA can execute scaling either through HPA, which simplifies HPA configuration and compensates for its built-in scaling delay, or directly through Deployments.
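As a concrete sketch, the three mechanisms above typically come together in a single AHPA custom resource. The `apiVersion`, `kind`, and field names below (`prediction`, `instanceBounds`, and so on) are illustrative assumptions based on common AHPA deployments; verify the exact schema against your cluster's CRD before using it:

```yaml
# Illustrative AHPA manifest -- field names are assumptions, not authoritative.
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: AdvancedHorizontalPodAutoscaler
metadata:
  name: demo-ahpa
spec:
  scaleTargetRef:                 # the Deployment AHPA manages
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  minReplicas: 2                  # hard floor, regardless of predictions
  maxReplicas: 100                # hard ceiling
  metrics:                        # metric driving the prediction (CPU here)
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 40
  prediction:
    quantile: 95                  # how conservatively to size for predicted demand
    scaleUpForward: 180           # seconds to provision pods ahead of a predicted peak
  instanceBounds:                 # service degradation: per-period floor and ceiling
  - startTime: "2024-01-01 00:00:00"
    endTime: "2034-01-01 00:00:00"
    bounds:
    - cron: "* 0-8 ? * MON-FRI"   # weekday off-peak window
      minReplicas: 4
      maxReplicas: 15
```

In this sketch, `scaleUpForward` implements proactive prediction (pods are scheduled ahead of the forecast peak), while `instanceBounds` implements service degradation by pinning a floor and ceiling for specific time windows.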
## Performance
| Metric | Value |
|---|---|
| Workload fluctuation detection | Within milliseconds |
| Resource provisioning | Within seconds |
| Prediction accuracy | Greater than 95% when proactive and passive prediction are combined |
| Scheduling granularity | Pod counts specified down to the minute |
## Use cases
Periodic workloads: Applications with predictable traffic cycles — live streaming, online education, and gaming — benefit most from proactive prediction, which pre-warms capacity before each peak.
Fixed baseline with burst traffic: If your deployment already runs a fixed pod count for steady-state traffic, AHPA handles unexpected surges without requiring you to manage multiple scaling policies.
Custom autoscaling integrations: If your platform team needs to integrate scaling recommendations into internal tooling or CI/CD pipelines, AHPA exposes a standard Kubernetes API so you can retrieve prediction results and act on them programmatically.
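Because prediction results are published on the AHPA object itself, internal tooling can read them with standard Kubernetes clients. The resource name and the idea that results live under `.status` are assumptions following usual CRD conventions; check the actual names in your cluster:

```shell
# Inspect the full AHPA object, including its predicted replica counts.
# The resource name "advancedhorizontalpodautoscaler" is an assumption;
# run `kubectl api-resources` to confirm it in your cluster.
kubectl get advancedhorizontalpodautoscaler demo-ahpa -o yaml

# Extract just the status block (where controllers conventionally publish
# results) for consumption by internal tooling or a CI/CD step:
kubectl get advancedhorizontalpodautoscaler demo-ahpa -o jsonpath='{.status}'
```

Piping the `jsonpath` output into a script is usually enough for dashboards or pipeline gates; anything richer can use a Kubernetes client library against the same custom resource API.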