For applications with slow startup times — such as Java services or AI inference workloads — ACK Serverless Knative's reserved instance feature keeps a low-cost instance running at all times. This ensures the first request after an idle period is served immediately, without waiting through a cold start.
How it works
By default, Knative scales a Service down to zero pods when traffic stops. When a new request arrives, the system must schedule resources, pull the image, and start the application before it can respond — a process that can take several seconds.
With a reserved instance enabled, the scale-to-zero behavior changes:
- Traffic stops: The Knative Service scales in, but one reserved instance stays online.
- First request arrives: Two actions happen simultaneously:
  - The request is routed to the reserved instance for immediate handling.
  - Knative creates standard-specification instances to take over ongoing traffic.
- Traffic handover: Once the first standard instance is ready, all subsequent requests are routed to it.
- Cleanup: After handling the initial request, the reserved instance is automatically terminated.
Prerequisites
Before you begin, ensure that you have:
- Knative
- (Required for ECI or ACS reserved instances) ACK Virtual Node installed. See Components for details.
Enable reserved instances
Add the following annotations to your Knative Service manifest to enable the feature:
| Annotation | Description |
|---|---|
| knative.aliyun.com/reserve-instance: "enable" | Enables the reserved instance feature |
| knative.aliyun.com/reserve-instance-type: <type> | Sets the instance type. Supported values: eci (default), ecs, acs |
Reserved instances run continuously and incur charges even when no traffic is being processed. See Billing for details.
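As a minimal sketch, the manifest below applies both annotations from the table to a Knative Service. The Service name hello-reserve is hypothetical, the image is borrowed from the ECI examples in this topic, and the type is set explicitly to eci even though that is the documented default. This topic does not state whether a size annotation is also required, so treat the manifest as a starting point rather than a complete configuration.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve  # hypothetical name, for illustration only
spec:
  template:
    metadata:
      annotations:
        # Turn on the reserved instance feature
        knative.aliyun.com/reserve-instance: "enable"
        # Optional in this case: eci is the documented default type
        knative.aliyun.com/reserve-instance-type: "eci"
    spec:
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
```

Note that the annotations sit under spec.template.metadata.annotations (the revision template), not under the top-level metadata of the Service, which matches the placement used in the other examples in this topic.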
Choose an instance type
Use the following table to select the instance type that fits your workload:
| Instance type | Best for | Requires |
|---|---|---|
| ECI (default) | General-purpose serverless workloads; no node management needed | ACK Virtual Node |
| ACS | Workloads that need fine-grained control over compute class and quality tiers | ACK Virtual Node |
| ECS | GPU workloads or workloads that require a specific ECS instance type | A compatible ECS node |
If you are not sure which type to use, start with ECI. It covers most general-purpose scenarios without requiring additional configuration beyond ACK Virtual Node.
Configure ECI reserved instances
Elastic Container Instance (ECI) is the default reserved instance type. Use the knative.aliyun.com/reserve-instance-eci-use-specs annotation to specify the resource size — either by instance type or by CPU and memory.
Specify by instance type
The following example specifies ecs.t6-c1m1.large and ecs.t5-lc1m2.small as candidate instance types:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-1
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
Specify by CPU and memory
If you don't need a specific instance type, define the required CPU and memory. The following example specifies a 1-core, 2 GiB instance:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-2
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "1-2Gi"
    spec:
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
Configure ACS reserved instances
Alibaba Cloud Container Compute Service (ACS) reserved instances let you specify compute class and quality tiers. To use ACS, install ACK Virtual Node and add the knative.aliyun.com/reserve-instance-type: acs annotation.
Specify by compute class and quality
Use knative.aliyun.com/reserve-instance-acs-compute-class to set the compute class and knative.aliyun.com/reserve-instance-acs-compute-qos to set the compute quality. Both annotations are optional.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        # (Optional) Compute class for the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-class: "general-purpose"
        # (Optional) Compute quality for the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-qos: "default"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"
Specify by CPU and memory
The following example requests 1 CPU core and 2 GiB of memory for the ACS reserved instance:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"
Configure ECS reserved instances
Use a lower-cost ECS instance type to reduce idle costs. ECS reserved instances are particularly useful for GPU workloads.
GPU workloads
The following example configures ecs.gn6i-c4g1.xlarge as a low-specification GPU reserved instance for a Qwen inference service. The standard instance uses a full GPU; the reserved instance uses a smaller GPU spec to handle the first request at lower cost.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  labels:
    release: qwen
  name: qwen
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        # Enable and configure an ECS reserved instance. You can configure one or more instance types.
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-ecs-use-specs: ecs.gn6i-c4g1.xlarge
      labels:
        release: qwen
    spec:
      containers:
      - command:
        - sh
        - -c
        - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code
          --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization
          0.95 --quantization gptq --max-model-len=6144
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1
        imagePullPolicy: IfNotPresent
        name: vllm-container
        resources:
          # Resource configuration for the standard instance
          limits:
            cpu: "16"
            memory: 60Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "8"
            memory: 36Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /mnt/models/Qwen-7B-Chat-Int8
          name: qwen-7b-chat-int8
      volumes:
      - name: qwen-7b-chat-int8
        persistentVolumeClaim:
          claimName: qwen-7b-chat-int8-dataset
Specify by CPU and memory
The following example specifies a 1-core, 2 GiB ECS reserved instance:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-cpu-resource-limit: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
        knative.aliyun.com/reserve-instance-memory-resource-limit: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"
Configure a reserved instance pool
A single reserved instance handles the first request after an idle period. For services that regularly receive sudden bursts of traffic, expand this to a pool by setting the number of replicas with knative.aliyun.com/reserve-instance-replicas.
The following example creates a pool of 3 low-specification instances:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-pool
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-replicas: "3"
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
Apply in production
- Choose the lowest-cost instance type that reliably runs your application and serves at least one request. The reserved instance only needs to handle traffic for the brief window before standard instances are ready, so it doesn't need to match the standard instance's specification.
- Use a reserved pool for high-burst services. If your service receives sudden traffic spikes, configure multiple replicas using knative.aliyun.com/reserve-instance-replicas to absorb the initial load more effectively.
- Account for continuous billing when sizing reserved instances. Reserved instances incur charges at all times, including during idle periods with no traffic. Use the smallest instance type that can handle a single request to keep idle costs low.
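The recommendations above can be combined in a single manifest. The following is a sketch only: it assumes that the CPU-and-memory form of knative.aliyun.com/reserve-instance-eci-use-specs can be used together with knative.aliyun.com/reserve-instance-replicas, a combination this topic demonstrates only with instance-type specs. The Service name and the replica count of 2 are illustrative, so size both to your own burst pattern and budget.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-prod  # hypothetical name, for illustration only
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        # Small pool to absorb an initial burst (illustrative count)
        knative.aliyun.com/reserve-instance-replicas: "2"
        # Smallest size shown in this topic: 1 core, 2 GiB
        # (assumption: this form is accepted alongside a replica count)
        knative.aliyun.com/reserve-instance-eci-use-specs: "1-2Gi"
    spec:
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
```

Because every replica in the pool is billed continuously, each additional replica multiplies the idle cost; it may be worth starting with one and increasing the count only if the first requests in a burst are visibly delayed.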
Billing
Reserved instances run continuously and incur charges, including during periods with no traffic. See the billing documentation for the instance type you use (ECI, ACS, or ECS).
What's next
- Use cost-effective spot instances in Knative
- Use HPA in Knative for horizontal pod autoscaling