Container Service for Kubernetes: Configure reserved instances for Knative Services to reduce cold start latency

Last Updated: Mar 26, 2026

For applications with slow startup times — such as Java services or AI inference workloads — ACK Serverless Knative's reserved instance feature keeps a low-cost instance running at all times. This ensures the first request after an idle period is served immediately, without waiting through a cold start.

How it works

By default, Knative scales a Service down to zero pods when traffic stops. When a new request arrives, the system must schedule resources, pull the image, and start the application before it can respond — a process that can take several seconds.
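For reference, the default behavior can be written out explicitly in a manifest. This is a minimal sketch, assuming your Knative version accepts the upstream autoscaling.knative.dev/min-scale annotation. The Service name hello-scale-to-zero is hypothetical, and the image is the sample image used elsewhere on this page.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-scale-to-zero   # hypothetical name, for illustration only
spec:
  template:
    metadata:
      annotations:
        # A lower bound of 0 is Knative's default; it is spelled out here
        # only to make the scale-to-zero behavior visible.
        autoscaling.knative.dev/min-scale: "0"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
```

With this configuration, no pod is running while the Service is idle, so the first request pays the full cold start.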

With a reserved instance enabled, the scale-to-zero behavior changes:

  1. Traffic stops — The Knative Service scales in, but one reserved instance stays online.

  2. First request arrives — Two actions happen simultaneously:

    • The request is routed to the reserved instance for immediate handling.

    • Knative creates standard-specification instances to take over ongoing traffic.

  3. Traffic handover — Once the first standard instance is ready, all subsequent requests are routed to it.

  4. Cleanup — After handling the initial request, the reserved instance is automatically terminated.

Prerequisites

Before you begin, ensure that you have:

Enable reserved instances

Add the following annotations to your Knative Service manifest to enable the feature:

| Annotation | Description |
| --- | --- |
| knative.aliyun.com/reserve-instance: "enable" | Enables the reserved instance feature. |
| knative.aliyun.com/reserve-instance-type: <type> | Sets the instance type. Supported values: eci (default), ecs, acs. |

Note: Reserved instances run continuously and incur charges even when no traffic is being processed. See Billing for details.
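
A minimal sketch that puts both annotations together might look like the following. The Service name hello-reserve is hypothetical, and the image is the sample image used in the later examples. Because eci is the default, the reserve-instance-type line could be left out.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve   # hypothetical name, for illustration only
spec:
  template:
    metadata:
      annotations:
        # Turn on the reserved instance feature
        knative.aliyun.com/reserve-instance: enable
        # Optional here: eci is the default instance type
        knative.aliyun.com/reserve-instance-type: eci
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
```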

Choose an instance type

Use the following table to select the instance type that fits your workload:

| Instance type | Best for | Requires |
| --- | --- | --- |
| ECI (default) | General-purpose serverless workloads; no node management needed | ACK Virtual Node |
| ACS | Workloads that need fine-grained control over compute class and quality tiers | ACK Virtual Node |
| ECS | GPU workloads or workloads that require a specific ECS instance type | A compatible ECS node |

If you are not sure which type to use, start with ECI. It covers most general-purpose scenarios without requiring additional configuration beyond ACK Virtual Node.

Configure ECI reserved instances

Elastic Container Instance (ECI) is the default reserved instance type. Use the knative.aliyun.com/reserve-instance-eci-use-specs annotation to specify the resource size — either by instance type or by CPU and memory.

Specify by instance type

The following example specifies ecs.t6-c1m1.large and ecs.t5-lc1m2.small as candidate instance types:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-1
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8

Specify by CPU and memory

If you don't need a specific instance type, define the required CPU and memory. The following example specifies a 1-core, 2 GiB instance:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-2
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "1-2Gi"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8

Configure ACS reserved instances

Alibaba Cloud Container Compute Service (ACS) reserved instances let you specify compute class and quality tiers. To use ACS, install ACK Virtual Node and add the knative.aliyun.com/reserve-instance-type: acs annotation.

Specify by compute class and quality

Use knative.aliyun.com/reserve-instance-acs-compute-class to set the compute class and knative.aliyun.com/reserve-instance-acs-compute-qos to set the compute quality. Both annotations are optional.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        # (Optional) Compute class for the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-class: "general-purpose"
        # (Optional) Compute quality for the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-qos: "default"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Specify by CPU and memory

The following example requests 1 vCPU and 2 GiB of memory for the ACS reserved instance:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Configure ECS reserved instances

Use a lower-cost ECS instance type to reduce idle costs. ECS reserved instances are particularly useful for GPU workloads.

GPU workloads

The following example configures ecs.gn6i-c4g1.xlarge as a low-specification GPU reserved instance for a Qwen inference service. The standard instance uses a full GPU; the reserved instance uses a smaller GPU spec to handle the first request at lower cost.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  labels:
    release: qwen
  name: qwen
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        # Enable and configure an ECS reserved instance. You can configure one or more instance types.
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-ecs-use-specs: ecs.gn6i-c4g1.xlarge
      labels:
        release: qwen
    spec:
      containers:
      - command:
        - sh
        - -c
        - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code
          --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization
          0.95 --quantization gptq --max-model-len=6144
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1
        imagePullPolicy: IfNotPresent
        name: vllm-container
        resources:
          # Resource configuration for the standard instance
          limits:
            cpu: "16"
            memory: 60Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "8"
            memory: 36Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /mnt/models/Qwen-7B-Chat-Int8
          name: qwen-7b-chat-int8
      volumes:
      - name: qwen-7b-chat-int8
        persistentVolumeClaim:
          claimName: qwen-7b-chat-int8-dataset
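
The comment in the manifest above notes that more than one ECS instance type can be configured. The fragment below is a sketch of that, assuming the value is a comma-separated list in the same format as knative.aliyun.com/reserve-instance-eci-use-specs. The second instance type, ecs.gn6i-c8g1.2xlarge, is only an illustrative candidate. Verify the separator and the instance types available in your region before relying on it.

```yaml
# Fragment to merge into the Service above; only the annotations are shown.
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        # Assumption: candidates are comma-separated, as with eci-use-specs.
        # ecs.gn6i-c8g1.2xlarge is an illustrative second candidate.
        knative.aliyun.com/reserve-instance-ecs-use-specs: "ecs.gn6i-c4g1.xlarge,ecs.gn6i-c8g1.2xlarge"
```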

Specify by CPU and memory

The following example specifies a 1-core, 2 GiB ECS reserved instance:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-cpu-resource-limit: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
        knative.aliyun.com/reserve-instance-memory-resource-limit: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Configure a reserved instance pool

A single reserved instance handles the first request after an idle period. For services that regularly receive sudden bursts of traffic, expand this to a pool by setting the number of replicas with knative.aliyun.com/reserve-instance-replicas.

The following example creates a pool of 3 low-specification instances:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-pool
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-replicas: "3"
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
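
A pool can presumably also be sized by CPU and memory rather than by instance type. The following sketch combines the replicas annotation with the 1-2Gi specification from the ECI section. The combination and the Service name hello-reserve-pool-resource are assumptions, not a tested configuration.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-pool-resource   # hypothetical name, for illustration only
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        # Keep two reserved instances online while the Service is idle
        knative.aliyun.com/reserve-instance-replicas: "2"
        # Assumption: the CPU-memory form is accepted together with replicas
        knative.aliyun.com/reserve-instance-eci-use-specs: "1-2Gi"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
```

Each replica in the pool is billed continuously, so a smaller pool of small instances keeps idle cost down.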

Apply in production

Choose the lowest-cost instance type that reliably runs your application and serves at least one request. The reserved instance only needs to handle traffic for the brief window before standard instances are ready — it doesn't need to match the standard instance's specification.

Use a reserved pool for high-burst services. If your service receives sudden traffic spikes, configure multiple replicas using knative.aliyun.com/reserve-instance-replicas to absorb the initial load more effectively.

Account for continuous billing when sizing reserved instances. Reserved instances incur charges at all times, including during idle periods with no traffic. Use the smallest instance type that can handle a single request to keep idle costs low.

Billing

Reserved instances run continuously and incur charges, including during periods with no traffic. For pricing, see the billing documentation for the instance type you use: ECI, ACS, or ECS.
