Configure Knative Reserved Instances to Eliminate Cold Starts - ACK

For applications that start slowly, such as Java services, the default scale-to-zero policy of community Knative causes high cold start latency. ACK Knative addresses this with reserved instances: low-cost, always-on pods that serve requests immediately while standard instances start up.

How it works

When a Knative Service scales to zero, the first incoming request triggers a cold start: resource scheduling, image pulling, and application startup must all complete before the request is served.

Reserved instances change this behavior by keeping one or more low-specification pods running during idle periods:

1. Scale-in with a safety net. When traffic ceases, the Service scales in, but at least one reserved instance stays online to handle potential new requests.
2. Parallel response and scale-out. When a request arrives, two things happen simultaneously:
   - The request is routed to the active reserved instance for immediate processing -- no cold start.
   - Knative creates standard-specification instances to handle ongoing traffic.
3. Traffic handover. Once the first standard instance is ready, all subsequent traffic is routed to it.
4. Cleanup. After the reserved instance finishes processing its initial request, it is automatically terminated.

Prerequisites

Before you begin, make sure that you have:

Knative deployed in your ACK cluster.
ACK Virtual Node installed (required for ECI and ACS reserved instances; not required for ECS). For more information, see Components.

Annotation reference

Enable reserved instances by adding annotations to your Knative Service manifest. The following table lists all supported annotations.

| Annotation | Description | Default |
| --- | --- | --- |
| `knative.aliyun.com/reserve-instance` | Enable or disable reserved instances. Set to `"enable"`. | Disabled |
| `knative.aliyun.com/reserve-instance-type` | Resource type for the reserved instance. Supported values: `eci`, `ecs`, `acs`. | `eci` |
| `knative.aliyun.com/reserve-instance-replicas` | Number of reserved instances to maintain. | `1` |
| `knative.aliyun.com/reserve-instance-eci-use-specs` | Instance type or CPU-memory specification for ECI. Accepts instance type names (e.g., `ecs.t6-c1m1.large`) or a CPU-memory format (e.g., `1-2Gi`). Separate multiple instance types with commas. | - |
| `knative.aliyun.com/reserve-instance-ecs-use-specs` | ECS instance type for the reserved instance (e.g., `ecs.gn6i-c4g1.xlarge`). | - |
| `knative.aliyun.com/reserve-instance-acs-compute-class` | Compute class for the ACS pod (e.g., `general-purpose`). | - |
| `knative.aliyun.com/reserve-instance-acs-compute-qos` | Compute quality for the ACS pod (e.g., `default`). | - |
| `knative.aliyun.com/reserve-instance-cpu-resource-request` | CPU request for the reserved instance (e.g., `"1"`). | - |
| `knative.aliyun.com/reserve-instance-cpu-resource-limit` | CPU limit for the reserved instance (e.g., `"1"`). | - |
| `knative.aliyun.com/reserve-instance-memory-resource-request` | Memory request for the reserved instance (e.g., `"2Gi"`). | - |
| `knative.aliyun.com/reserve-instance-memory-resource-limit` | Memory limit for the reserved instance (e.g., `"2Gi"`). | - |
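
As a minimal sketch of where these annotations go, the manifest below places them under `spec.template.metadata.annotations` and writes out the documented defaults for the resource type and replica count explicitly. The Service name `hello-reserve-defaults` is hypothetical, and the image is the sample image used in the ECI examples in this topic.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-defaults   # hypothetical name, for illustration only
spec:
  template:
    metadata:
      annotations:
        # Reserved instances are disabled unless this is set to "enable".
        knative.aliyun.com/reserve-instance: "enable"
        # Documented default (eci), written out explicitly.
        knative.aliyun.com/reserve-instance-type: "eci"
        # Documented default (1), written out explicitly.
        knative.aliyun.com/reserve-instance-replicas: "1"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
```

Because the last two annotations only restate the defaults listed in the table, omitting them should give the same result.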

Configure ECI reserved instances

Elastic Container Instance (ECI) is the default resource type for reserved instances. ECI pods run on serverless infrastructure, so no node capacity planning is required.

Specify by instance type

To use specific ECS instance types for the underlying ECI, add the `knative.aliyun.com/reserve-instance-eci-use-specs` annotation. You can specify multiple instance types separated by commas.

The following example specifies the `ecs.t6-c1m1.large` and `ecs.t5-lc1m2.small` instance types:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-1
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8

Specify by CPU and memory

If you do not know which instance type to use, set the same `knative.aliyun.com/reserve-instance-eci-use-specs` annotation to the required CPU and memory in CPU-memory format instead.

The following example specifies a 1-core, 2 GiB instance:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-2
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "1-2Gi"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8

Configure ACS reserved instances

To use Alibaba Cloud Container Compute Service (ACS) for reserved instances, first install ACK Virtual Node (for more information, see Components), then add the `knative.aliyun.com/reserve-instance-type: acs` annotation.

Specify by compute class and quality

The following is a basic configuration for an ACS reserved instance. You can specify the compute class and compute quality:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        # (Optional) Compute class for the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-class: "general-purpose"
        # (Optional) Compute quality for the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-qos: "default"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Specify by CPU and memory

The following example requests 1 CPU core and 2 GiB of memory for an ACS reserved instance:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Configure ECS reserved instances

Specify a lower-cost Elastic Compute Service (ECS) instance type for your reserved instance to reduce costs during idle periods.

GPU workloads

The following example configures a low-specification GPU-accelerated instance as a reserved instance for a GPU inference service:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  labels:
    release: qwen
  name: qwen
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        # Enable and configure an ECS reserved instance. You can configure one or more instance types.
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-ecs-use-specs: ecs.gn6i-c4g1.xlarge
      labels:
        release: qwen
    spec:
      containers:
      - command:
        - sh
        - -c
        - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code
          --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization
          0.95 --quantization gptq --max-model-len=6144
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1
        imagePullPolicy: IfNotPresent
        name: vllm-container
        resources:
          # Resource configuration for the standard instance
          limits:
            cpu: "16"
            memory: 60Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "8"
            memory: 36Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /mnt/models/Qwen-7B-Chat-Int8
          name: qwen-7b-chat-int8
      volumes:
      - name: qwen-7b-chat-int8
        persistentVolumeClaim:
          claimName: qwen-7b-chat-int8-dataset

CPU workloads

The following example specifies a 1-core, 2 GiB instance:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-cpu-resource-limit: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
        knative.aliyun.com/reserve-instance-memory-resource-limit: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Configure a reserved instance pool

To handle high burst traffic, expand a single reserved instance into a resource pool by specifying the number of replicas with the `knative.aliyun.com/reserve-instance-replicas` annotation.

The following example creates a reserved pool of 3 low-specification instances:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-pool
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-replicas: "3"
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8

Verify the configuration

After applying the Service manifest, verify that the reserved instance is running:

# Check if the reserved instance pod is running
kubectl get pods -l serving.knative.dev/service=<your-service-name>

# View the annotations on the Knative Service
kubectl get ksvc <your-service-name> -o yaml | grep reserve-instance

A successfully configured reserved instance remains in the Running state even when the Service receives no traffic.

Apply in production

Choose the right specification. Select the lowest-cost instance type for your reserved instance that can reliably run your application and serve at least one request.
Use a reserved pool for high bursts. If your service is likely to experience sudden, high-traffic events, configure a reserved instance pool to better absorb the initial load.
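
The two recommendations can be combined. The following sketch pairs a small CPU-memory specification with a two-replica pool. It assumes that `reserve-instance-replicas` works with the CPU-memory form of `reserve-instance-eci-use-specs` in the same way as with the instance type names in the pool example above, which this topic does not state explicitly. The Service name and the replica count are illustrative only.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-small-pool   # hypothetical name, for illustration only
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: "enable"
        # Pool size chosen for illustration; size it to your expected initial burst.
        knative.aliyun.com/reserve-instance-replicas: "2"
        # Small ECI specification in CPU-memory format (1 core, 2 GiB).
        # Assumption: this format can be combined with a replica count above 1.
        knative.aliyun.com/reserve-instance-eci-use-specs: "1-2Gi"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
```

Each additional replica is billed while idle, so a larger pool trades higher idle cost for more capacity before standard instances are ready.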

Billing

Reserved instances run continuously and incur charges. For details, see the billing documentation for the resource type that backs your reserved instances (ECI, ECS, or ACS).

References

Use cost-effective spot instances in Knative
To implement automatic workload scaling in Knative, see Use HPA in Knative, Automatically scale Services based on the number of traffic requests, and Use AHPA to implement scheduled auto scaling.