Configure reserved instances for Knative Services to reduce cold start latency

For applications with slow startup times, such as those written in Java, the default scale-to-zero policy of community Knative can lead to high cold start latency. ACK Serverless Knative addresses this with the reserved instance feature: a low-cost, always-on instance stays available to serve the first request while standard instances start, so the service remains responsive without paying for full-size capacity during idle periods.

How it works

To save costs, community Knative scales services down to zero pods when there is no traffic. When a new request arrives, the system must go through a cold start process—including resource scheduling, image pulling, and application startup—which can cause significant latency for the first request.

ACK Serverless Knative's reserved instance feature modifies this behavior by keeping one or more low-specification instances running even during idle periods.

Workflow:

1. Traffic ceases: The Service scales in its pods, but at least one reserved instance remains online to handle potential new requests.
2. The first request arrives: Two actions are triggered in parallel:
   - Immediate response: The request is routed to the running reserved instance for processing, which avoids cold start latency.
   - Scale-out trigger: Knative immediately starts creating standard-specification instances to handle the traffic.
3. Traffic handover: Once the first standard-specification instance is ready, all subsequent traffic is routed to the standard-specification instances.
4. Resource cleanup: After it finishes its initial request, the original reserved instance is automatically terminated.
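
The ordering of these steps can be made concrete with a toy model. The sketch below is only an illustration and does not reflect how Knative or ACK implement routing. The `Instance` and `ReservedRouter` names are hypothetical, and the explicit `standard_ready()` call stands in for whatever readiness detection the platform performs. It also assumes that a reserved instance is brought back when the Service scales in again.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    kind: str  # "reserved" or "standard"
    running: bool = True


@dataclass
class ReservedRouter:
    """Toy model of the reserved-instance workflow (illustration only)."""

    reserved: Instance = field(
        default_factory=lambda: Instance("reserved-0", "reserved")
    )
    standard: list = field(default_factory=list)
    scale_out_requested: bool = False

    def handle(self, request: str) -> str:
        """Route one request and return the name of the instance that serves it."""
        ready = [inst for inst in self.standard if inst.running]
        if ready:
            # Traffic handover: standard instances take all traffic once ready.
            return ready[0].name
        # No standard instance yet: the reserved instance answers right away,
        # and a scale-out is requested at the same time.
        self.scale_out_requested = True
        return self.reserved.name

    def standard_ready(self, name: str) -> None:
        """Record that a standard-specification instance has become ready."""
        self.standard.append(Instance(name, "standard"))
        # Resource cleanup: the reserved instance is terminated after handover.
        self.reserved.running = False

    def scale_in(self) -> None:
        """Traffic has ceased: drop standard instances, keep one reserved instance."""
        self.standard.clear()
        self.reserved = Instance("reserved-0", "reserved")
        self.scale_out_requested = False


router = ReservedRouter()
print(router.handle("req-1"))  # reserved-0
router.standard_ready("standard-0")
print(router.handle("req-2"))  # standard-0
```

In this model the first request never waits for a standard instance, which is the property the reserved instance feature is meant to provide.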

How to use reserved instances

After deploying Knative in your cluster, you can configure the reserved instance feature by adding the following annotations to your Knative Service manifest:

| Annotation | Description |
| --- | --- |
| `knative.aliyun.com/reserve-instance: "enable"` | Enables the reserved instance feature. |
| `knative.aliyun.com/reserve-instance-type: <type>` | Specifies the resource type for the reserved instance. Supported values are `eci` (default), `ecs`, and `acs`. |
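
If you generate manifests from a script, a small helper can reject unsupported resource types before the manifest reaches the cluster. The following is a minimal sketch: the `reserve_annotations` function and the `SUPPORTED_TYPES` constant are hypothetical names, and only the two annotation keys and the three values come from the annotations listed above.

```python
SUPPORTED_TYPES = ("eci", "ecs", "acs")  # values documented for reserve-instance-type


def reserve_annotations(instance_type: str = "eci") -> dict:
    """Return the two annotations that enable a reserved instance.

    Raises ValueError if instance_type is not one of the documented values.
    """
    if instance_type not in SUPPORTED_TYPES:
        raise ValueError(
            f"unsupported reserve-instance-type {instance_type!r}; "
            f"expected one of {SUPPORTED_TYPES}"
        )
    return {
        "knative.aliyun.com/reserve-instance": "enable",
        "knative.aliyun.com/reserve-instance-type": instance_type,
    }
```

The returned dictionary is meant to be merged into `spec.template.metadata.annotations` of the Knative Service, which is where the examples in this topic place these annotations.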

Configure `eci` reserved instances

To use Elastic Container Instances for reserved instances, first install ACK Virtual Node. For more information, see Components.

Specify by instance type

To use specific instance types, add the `knative.aliyun.com/reserve-instance-eci-use-specs` annotation and set its value to a comma-separated list of instance types.

The following example specifies the `ecs.t6-c1m1.large` and `ecs.t5-lc1m2.small` instance types:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-1
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8

Specify by CPU and memory

If you are unsure which instance types to use, set the `knative.aliyun.com/reserve-instance-eci-use-specs` annotation to the required CPU and memory instead of an instance type name.

The following example specifies a 1-core, 2 GiB instance:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-2
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "1-2Gi"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8

Configure `acs` reserved instances

To use Alibaba Cloud Container Compute Service (ACS) for reserved instances, first install ACK Virtual Node (for more information, see Components). Then add the `knative.aliyun.com/reserve-instance-type: acs` annotation.

Specify by compute class and quality

The following is a basic configuration for an ACS reserved instance. You can optionally specify the compute class (`knative.aliyun.com/reserve-instance-acs-compute-class`) and the compute quality (`knative.aliyun.com/reserve-instance-acs-compute-qos`).

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        # (Optional) Configure the compute class for the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-class: "general-purpose"
        # (Optional) Configure the compute quality for the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-qos: "default"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Specify by CPU and memory

The following example requests 1 core and 2 GiB of memory for the ACS reserved instance:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Configure `ecs` reserved instances

Specify a lower-cost Elastic Compute Service (ECS) instance type for your reserved instance to reduce costs during idle periods. To do so, add the `knative.aliyun.com/reserve-instance-type: ecs` annotation.

GPU

The following example configures a low-specification GPU-accelerated instance as a reserved instance for a GPU inference service:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  labels:
    release: qwen
  name: qwen
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency" 
        # Enable and configure an ECS reserved instance. You can configure one or more instance types.
        knative.aliyun.com/reserve-instance: enable 
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-ecs-use-specs: ecs.gn6i-c4g1.xlarge 
      labels:
        release: qwen
    spec:
      containers:
      - command:
        - sh
        - -c
        - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code
          --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization
          0.95 --quantization gptq --max-model-len=6144
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1
        imagePullPolicy: IfNotPresent
        name: vllm-container
        resources:
          # Resource configuration for the standard instance
          limits:
            cpu: "16"
            memory: 60Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "8"
            memory: 36Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /mnt/models/Qwen-7B-Chat-Int8
          name: qwen-7b-chat-int8
      volumes:
      - name: qwen-7b-chat-int8
        persistentVolumeClaim:
          claimName: qwen-7b-chat-int8-dataset

CPU

The following example specifies a 1-core, 2 GiB instance:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-cpu-resource-limit: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
        knative.aliyun.com/reserve-instance-memory-resource-limit: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Configure a reserved instance pool

To handle high burst traffic, expand a single reserved instance into a resource pool by specifying the number of replicas with the `knative.aliyun.com/reserve-instance-replicas` annotation.

The following example creates a reserved pool of 3 low-specification instances:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-pool
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-replicas: "3"
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
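
When the pool size differs between environments, the same manifest can be generated rather than edited by hand. The sketch below builds the manifest shown above as a Python dictionary and prints it as JSON. The `reserve_pool_service` function and its parameters are hypothetical, and the annotation keys and sample values are taken from the example above.

```python
import json


def reserve_pool_service(name: str, image: str, replicas: int, eci_specs: list) -> dict:
    """Build a Knative Service manifest with a reserved instance pool."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    annotations = {
        "knative.aliyun.com/reserve-instance": "enable",
        # Kubernetes annotation values must be strings, so convert the count.
        "knative.aliyun.com/reserve-instance-replicas": str(replicas),
        "knative.aliyun.com/reserve-instance-eci-use-specs": ",".join(eci_specs),
    }
    return {
        "apiVersion": "serving.knative.dev/v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "metadata": {"annotations": annotations},
                "spec": {"containers": [{"image": image}]},
            }
        },
    }


if __name__ == "__main__":
    manifest = reserve_pool_service(
        name="hello-reserve-pool",
        image="registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8",
        replicas=3,
        eci_specs=["ecs.t6-c1m1.large", "ecs.t5-lc1m2.small"],
    )
    print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON manifests as well as YAML, the output can be piped to `kubectl apply -f -`.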

Apply in production

- Choose the right specification: Select the lowest-cost instance type for your reserved instance that can reliably run your application and serve at least one request.
- Use a reserved pool for high bursts: If your service is likely to experience sudden, high-traffic events, configure a reserved instance pool to better absorb the initial load.

Billing

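Reserved instances run continuously and incur charges for as long as they are running. For pricing details, see the billing documentation for the resource type that backs the reserved instance (ECI, ECS, or ACS).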

References

Use cost-effective spot instances in Knative.
To implement automatic workload scaling in Knative, see Use HPA in Knative, Automatically scale Services based on the number of traffic requests, and Use AHPA to implement scheduled auto scaling.
