For applications with slower startup times, such as those written in Java, community Knative's default scale-to-zero policy can lead to high cold-start latency. ACK Serverless Knative solves this with the reserved instance feature. By maintaining a low-cost, always-on instance, this feature ensures immediate responsiveness while still optimizing resource costs.
How it works
To save costs, community Knative scales services down to zero pods when there is no traffic. When a new request arrives, the system must go through a cold start process—including resource scheduling, image pulling, and application startup—which can cause significant latency for the first request.
ACK Serverless Knative's reserved instance feature modifies this behavior by keeping one or more low-specification instances running even during idle periods.
Workflow:
1. When traffic ceases: the Service's pods scale in, but at least one reserved instance remains online to handle new requests.
2. When the first request arrives, two actions are triggered in parallel:
   - Immediate response: the request is instantly routed to the active reserved instance, avoiding cold-start latency.
   - Scale-out trigger: Knative immediately creates standard-specification instances to handle the traffic.
3. Traffic handover: once the first standard-specification instance is ready, all subsequent traffic is routed to it.
4. Resource cleanup: after finishing its initial request, the original reserved instance is automatically terminated.
How to use reserved instances
After deploying Knative in your cluster, you can configure the reserved instance feature by adding the following annotations to your Knative Service manifest:
| Annotation | Description |
| --- | --- |
| knative.aliyun.com/reserve-instance | Enables the reserved instance feature. Set the value to enable. |
| knative.aliyun.com/reserve-instance-type | Specifies the resource type for the reserved instance. Supported values are ecs, eci, and acs. |
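For example, a minimal Service manifest that only enables the feature could look like the following sketch. The service name helloworld-min is a placeholder; the image is the sample image used elsewhere in this topic:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-min   # placeholder name
spec:
  template:
    metadata:
      annotations:
        # Keep a reserved instance online when the service scales in
        knative.aliyun.com/reserve-instance: enable
        # (Optional) Choose the resource type for the reserved instance
        knative.aliyun.com/reserve-instance-type: ecs
    spec:
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
```

The following sections show the full set of per-type options.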
Configure ECI reserved instances
To use Elastic Container Instances for reserved instances, first install ACK Virtual Node. For more information, see Components.
Specify by instance type
To use specific instance types, add the knative.aliyun.com/reserve-instance-eci-use-specs annotation.
The following example specifies the ecs.t6-c1m1.large and ecs.t5-lc1m2.small instance types:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-1
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8

Specify by CPU and memory
If you are unsure which instance types to use, specify the required CPU and memory resources instead.
The following example specifies a 1-core, 2 GiB instance:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-2
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "1-2Gi"
    spec:
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8

Configure ACS reserved instances
To use Alibaba Cloud Container Compute Service (ACS) for reserved instances, first install ACK Virtual Node (for more information, see Components), then add the knative.aliyun.com/reserve-instance-type: acs annotation.
Specify by compute class and quality
The following is a basic configuration for an ACS reserved instance. You can specify the compute class (compute-class) and compute quality (compute-qos).
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        # (Optional) Configure the compute class for the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-class: "general-purpose"
        # (Optional) Configure the compute quality for the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-qos: "default"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Specify by CPU and memory
The following example requests 1 vCPU and 2 GiB of memory for the reserved instance:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Configure ECS reserved instances
Specify a lower-cost Elastic Compute Service (ECS) instance type for your reserved instance to reduce costs during idle periods.
GPU
The following example configures a low-specification GPU-accelerated instance as a reserved instance for a GPU inference service:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  labels:
    release: qwen
  name: qwen
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        # Enable and configure an ECS reserved instance. You can configure one or more instance types.
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-ecs-use-specs: ecs.gn6i-c4g1.xlarge
      labels:
        release: qwen
    spec:
      containers:
      - command:
        - sh
        - -c
        - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code
          --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization
          0.95 --quantization gptq --max-model-len=6144
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1
        imagePullPolicy: IfNotPresent
        name: vllm-container
        resources:
          # Resource configuration for the standard instance
          limits:
            cpu: "16"
            memory: 60Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "8"
            memory: 36Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /mnt/models/Qwen-7B-Chat-Int8
          name: qwen-7b-chat-int8
      volumes:
      - name: qwen-7b-chat-int8
        persistentVolumeClaim:
          claimName: qwen-7b-chat-int8-dataset

CPU
The following example specifies a 1-core, 2 GiB instance:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-cpu-resource-limit: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
        knative.aliyun.com/reserve-instance-memory-resource-limit: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Configure a reserved instance pool
To handle high burst traffic, expand a single reserved instance into a resource pool by specifying the number of replicas with the knative.aliyun.com/reserve-instance-replicas annotation.
The following example creates a reserved pool of three low-specification instances:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-pool
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-replicas: "3"
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8

Apply in production
- Choose the right specification: select the lowest-cost instance type for your reserved instance that can reliably run your application and serve at least one request.
- Use a reserved pool for high bursts: if your service is likely to experience sudden, high-traffic events, configure a reserved instance pool to better absorb the initial load.
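Putting these recommendations together, a production-style configuration might combine low-cost instance types with a small reserved pool. This is an illustrative sketch only; the service name, replica count, and instance types are example values you should adjust for your workload:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-production   # placeholder name
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        # Absorb burst traffic with a small pool of reserved instances
        knative.aliyun.com/reserve-instance-replicas: "2"
        # Prefer low-cost burstable instance types during idle periods
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
```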
Billing
Reserved instances run continuously and incur charges. For details, see the billing documentation for the underlying resource type (ECS, ECI, or ACS).
References
Use cost-effective spot instances in Knative.
To implement automatic workload scaling in Knative, see Use HPA in Knative, Automatically scale Services based on the number of traffic requests, and Use AHPA to implement scheduled auto scaling.