
Container Service for Kubernetes:Implement phased releases for generative AI inference services using Gateway with Inference Extension

Last Updated:Mar 07, 2026

The Gateway with Inference Extension component lets you replace or upgrade foundation models, or perform phased updates across multiple LoRA models, in a generative AI inference service with minimal service interruption. This topic describes how to use Gateway with Inference Extension to implement progressive phased releases for generative AI inference services.

Important

Before you begin, make sure you understand the concepts of InferencePool and InferenceModel.

Prerequisites

  • The Gateway with Inference Extension component is installed in the cluster.

Preparations

Before walking through the phased release scenarios, deploy and validate the sample inference service.

  1. Deploy a sample inference service based on the Qwen-2.5-7B-Instruct foundation model.

    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: custom-serving
        release: qwen
      name: qwen
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: custom-serving
          release: qwen
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: custom-serving
            release: qwen
        spec:
          containers:
          - command:
            - sh
            - -c
            - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --trust-remote-code --served-model-name mymodel --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
            imagePullPolicy: IfNotPresent
            name: custom-serving
            ports:
            - containerPort: 8000
              name: http
              protocol: TCP
            readinessProbe:
              failureThreshold: 3
              initialDelaySeconds: 30
              periodSeconds: 30
              successThreshold: 1
              tcpSocket:
                port: 8000
              timeoutSeconds: 1
            resources:
              limits:
                nvidia.com/gpu: "1"
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 30Gi
            name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: custom-serving
        release: qwen
      name: qwen
    spec:
      ports:
      - name: http-serving
        port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: custom-serving
        release: qwen
    EOF
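    The model image is large, so the pod can take a while to become ready. As an optional check (assuming the default namespace), you can wait for the rollout to finish before continuing:

    ```shell
    # Wait for the Deployment to finish rolling out, then list its pods.
    kubectl rollout status deployment/qwen --timeout=15m
    kubectl get pods -l app=custom-serving,release=qwen
    ```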
  2. Deploy the InferencePool and InferenceModel resources.

    kubectl apply -f- <<EOF
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: mymodel-pool-v1
      namespace: default
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: mymodel-v1-ext-proc
      selector:
        app: custom-serving
        release: qwen
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: mymodel-v1
    spec:
      criticality: Critical
      modelName: mymodel
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: mymodel-pool-v1
      targetModels:
      - name: mymodel
        weight: 100
    EOF
  3. Deploy the gateway and its routing rules.

    kubectl apply -f- <<EOF
    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: inference-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: inference-gateway
      listeners:
        - name: llm-gw
          protocol: HTTP
          port: 8080
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
        sectionName: llm-gw
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: mymodel-pool-v1
          weight: 1
        matches:
        - path:
            type: PathPrefix
            value: /v1/completions
        - path:
            type: PathPrefix
            value: /v1/chat/completions
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: ClientTrafficPolicy
    metadata:
      name: client-buffer-limit
    spec:
      connection:
        bufferLimit: 20Mi
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: inference-gateway
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
    EOF
  4. Obtain the gateway IP address.

    export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
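
    If the variable comes back empty, the gateway address has likely not been assigned yet. One optional guard (assuming the gateway controller reports the Gateway API `Programmed` condition) is to wait for the gateway before reading its address:

    ```shell
    # Block until the gateway is programmed, then re-read its address.
    kubectl wait --for=condition=Programmed gateway/inference-gateway --timeout=120s
    export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
    echo "${GATEWAY_IP}"
    ```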
  5. Validate the inference service.

    curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
        "model": "mymodel",
        "temperature": 0,
        "messages": [
          {
            "role": "user",
            "content": "Who are you?" 
          }
        ]
    }'

    Expected output:

    {"id":"chatcmpl-6bd37f84-55e0-4278-8f16-7b7bf04c6513","object":"chat.completion","created":1744364930,"model":"mymodel","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"I am Qwen, a large language model created by Alibaba Cloud. I am designed to assist with a wide range of tasks, from answering questions and providing information to helping with creative projects and more. How can I assist you today?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":32,"total_tokens":74,"completion_tokens":42,"prompt_tokens_details":null},"prompt_logprobs":null}

    The expected output indicates that the inference service is functioning as expected through Gateway with Inference Extension.

Scenario 1: Use InferencePool updates for phased releases of infrastructure and foundation models

In a real-world scenario, you can update an InferencePool to implement a phased release for a model service. For example, you can configure two InferencePools based on the same InferenceModel definition and model name, but run them on different computing configurations, GPU card types, or foundation models. This method applies to the following scenarios.

  • Phased infrastructure updates: Create a new InferencePool that uses a new GPU card type or a new model configuration. Gradually migrate the workload in phases. This lets you upgrade node hardware, update drivers, or resolve security issues without interrupting inference request traffic.

  • Phased foundation model updates: Create a new InferencePool to load a new model architecture or fine-tuned model weights. Gradually release the new inference model in phases to improve inference service performance or resolve issues related to the foundation model.

The following describes the main flow of a phased release.


You can create a new InferencePool for the new foundation model and configure an HTTPRoute to allocate traffic between different InferencePools. This lets you gradually shift traffic to the new foundation model's inference service, represented by the new InferencePool, to achieve a zero-interruption update. The following example demonstrates how to perform a phased update from the deployed Qwen-2.5-7B-Instruct foundation model service to the DeepSeek-R1-Distill-Qwen-7B service. You can update the traffic ratio in the HTTPRoute to achieve a complete switch of the foundation model.

  1. Deploy the inference service based on the DeepSeek-R1-Distill-Qwen-7B foundation model.

    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: custom-serving
        release: deepseek-r1
      name: deepseek-r1
    spec:
      progressDeadlineSeconds: 600
      replicas: 1 
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: custom-serving
          release: deepseek-r1
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: custom-serving
            release: deepseek-r1
        spec:
          containers:
          - command:
            - sh
            - -c
            - vllm serve /models/DeepSeek-R1-Distill-Qwen-7B --port 8000 --trust-remote-code --served-model-name mymodel --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ds-r1-qwen-7b-without-lora:v0.1
            imagePullPolicy: IfNotPresent
            name: custom-serving
            ports:
            - containerPort: 8000
              name: restful
              protocol: TCP
            readinessProbe:
              failureThreshold: 3
              initialDelaySeconds: 30
              periodSeconds: 30
              successThreshold: 1
              tcpSocket:
                port: 8000
              timeoutSeconds: 1
            resources:
              limits:
                nvidia.com/gpu: "1"
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 30Gi
            name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: custom-serving
        release: deepseek-r1
      name: deepseek-r1
    spec:
      ports:
      - name: http-serving
        port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: custom-serving
        release: deepseek-r1
    EOF
  2. Configure the InferencePool and InferenceModel for the new inference service. The InferencePool mymodel-pool-v2 uses a new label selector to match the inference service based on the DeepSeek-R1-Distill-Qwen-7B foundation model, and declares an InferenceModel with the same model name, mymodel.

    kubectl apply -f- <<EOF
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: mymodel-pool-v2
      namespace: default
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: mymodel-v2-ext-proc
      selector:
        app: custom-serving
        release: deepseek-r1
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: mymodel-v2
    spec:
      criticality: Critical
      modelName: mymodel
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: mymodel-pool-v2
      targetModels:
      - name: mymodel
        weight: 100
    EOF
  3. Configure the traffic splitting policy.

    Configure the HTTPRoute to split traffic between the existing InferencePool (mymodel-pool-v1) and the new one (mymodel-pool-v2). The `weight` field in backendRefs controls the share of traffic allocated to each InferencePool. The following example sets a 90:10 split, so 10% of requests are forwarded to the DeepSeek-R1-Distill-Qwen-7B service behind mymodel-pool-v2.

    kubectl apply -f- <<EOF
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
        sectionName: llm-gw
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: mymodel-pool-v1
          port: 8000
          weight: 90
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: mymodel-pool-v2
          weight: 10
        matches:
        - path:
            type: PathPrefix
            value: /
    EOF
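    For intuition, the weight-based split behaves like a per-request weighted random choice over the backendRefs, so the 90:10 ratio holds only on average across many requests. A minimal Python sketch of that behavior (illustrative only; this is not Envoy's actual load-balancing code):

    ```python
    import random

    def pick_backend(backends, rng):
        """Weighted random choice over backendRef names, like the 90:10 split."""
        names = [b["name"] for b in backends]
        weights = [b["weight"] for b in backends]
        return rng.choices(names, weights=weights, k=1)[0]

    backends = [
        {"name": "mymodel-pool-v1", "weight": 90},
        {"name": "mymodel-pool-v2", "weight": 10},
    ]

    rng = random.Random(0)  # fixed seed so the tally is reproducible
    tally = {b["name"]: 0 for b in backends}
    for _ in range(10_000):
        tally[pick_backend(backends, rng)] += 1
    print(tally)  # roughly 9,000 requests to v1 and 1,000 to v2
    ```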
  4. Validate the phased release of the foundation model.

    Repeatedly run the following command and check the model output to confirm the traffic split:

    curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
        "model": "mymodel",
        "temperature": 0,
        "messages": [
          {
            "role": "user",
            "content": "Who are you?" 
          }
        ]
    }'

    Expected output for most requests:

    {"id":"chatcmpl-6e361a5e-b0cb-4b57-8994-a293c5a9a6ad","object":"chat.completion","created":1744601277,"model":"mymodel","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"I am Qwen, a large language model created by Alibaba Cloud. I am designed to assist with a wide range of tasks, from answering questions and providing information to helping with creative projects and more. How can I assist you today?","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":32,"total_tokens":74,"completion_tokens":42,"prompt_tokens_details":null},"prompt_logprobs":null}

    Expected output for about 10% of requests:

    {"id":"chatcmpl-9e3cda6e-b284-43a9-9625-2e8fcd1fe0c7","object":"chat.completion","created":1744601333,"model":"mymodel","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Hello! I am DeepSeek-R1, an intelligent assistant developed by DeepSeek, a company based in China. If you have any questions, I will do my best to assist you.\n</think>\n\nHello! I am DeepSeek-R1, an intelligent assistant developed by DeepSeek, a company based in China. If you have any questions, I will do my best to assist you.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":8,"total_tokens":81,"completion_tokens":73,"prompt_tokens_details":null},"prompt_logprobs":null}

    Most inference requests are still served by the legacy Qwen-2.5-7B-Instruct foundation model, while a small portion are served by the new DeepSeek-R1-Distill-Qwen-7B foundation model.
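
    To continue the rollout, re-apply the HTTPRoute with adjusted weights, or patch the existing route in place. A sketch using a JSON patch (the paths assume the exact rule and backendRef order from the HTTPRoute in step 3):

    ```shell
    # Shift the split from 90:10 to 50:50 on the route created earlier.
    kubectl patch httproute inference-route --type=json -p='[
      {"op": "replace", "path": "/spec/rules/0/backendRefs/0/weight", "value": 50},
      {"op": "replace", "path": "/spec/rules/0/backendRefs/1/weight", "value": 50}
    ]'
    ```

    Repeating this with weights 0 and 100 completes the switch to mymodel-pool-v2.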

Scenario 2: Use InferenceModel configurations for phased releases of LoRA models

In a multi-LoRA scenario, you can use Gateway with Inference Extension to serve multiple versions of a LoRA model on the same foundation large language model (LLM), and flexibly allocate traffic between them for phased testing of each version's performance optimizations, bug fixes, or feature iterations.

This example uses two LoRA versions fine-tuned from Qwen-2.5-7B-Instruct to describe how to implement a phased release for LoRA models using an InferenceModel.

Before you implement the phased release for the LoRA models, ensure that the new model version is successfully deployed to the inference service instance. The foundation service in this example already loads two LoRA adapters, travel-helper-v1 and travel-helper-v2, through the --lora-modules flag in its deployment.


By updating the traffic ratio between different LoRA models in the InferenceModel, you can gradually increase the traffic weight of the new LoRA model version. This allows for a progressive update to the new LoRA model without interrupting traffic.

  1. Deploy an InferenceModel configuration that defines multiple versions of the LoRA model and the traffic ratio between them. After the configuration takes effect, requests for the `travelhelper` model are split between the backend LoRA versions. The following example sets a 90:10 ratio: 90% of the traffic goes to the travel-helper-v1 model and 10% to the travel-helper-v2 model.

    kubectl apply -f- <<EOF
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: loramodel
    spec:
      criticality: Critical
      modelName: travelhelper
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: mymodel-pool-v1
      targetModels:
      - name: travel-helper-v1
        weight: 90
      - name: travel-helper-v2
        weight: 10
    EOF
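    Conceptually, the InferenceModel maps the client-facing model name to one of the weighted target adapters for each request, and the gateway rewrites the request's `model` field before forwarding it. A minimal sketch of that mapping (illustrative only; this is not the inference extension's actual code):

    ```python
    import random

    # Client-facing model name -> weighted LoRA targets, mirroring the
    # InferenceModel above (illustrative; not the extension's actual code).
    TARGET_MODELS = {
        "travelhelper": [("travel-helper-v1", 90), ("travel-helper-v2", 10)],
    }

    def rewrite_model(request, rng):
        """Return a copy of the request whose `model` field is rewritten to
        one of the weighted target LoRA adapter names."""
        targets = TARGET_MODELS.get(request["model"])
        if targets is None:
            return request  # unknown model name: forward unchanged
        names, weights = zip(*targets)
        rewritten = dict(request)
        rewritten["model"] = rng.choices(names, weights=weights, k=1)[0]
        return rewritten

    req = {"model": "travelhelper", "messages": [{"role": "user", "content": "Hi"}]}
    print(rewrite_model(req, random.Random(0))["model"])
    ```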
  2. Validate the phased release.

    Repeatedly run the following command and check the model output to validate the phased release of the LoRA model:

    curl -X POST ${GATEWAY_IP}:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
        "model": "travelhelper",
        "temperature": 0,
        "messages": [
          {
            "role": "user",
            "content": "I just arrived in Beijing. Can you recommend a tourist spot?" 
          }
        ]
    }'

    Expected output for most requests:

    {"id":"chatcmpl-2343f2ec-b03f-4882-a601-aca9e88d45ef","object":"chat.completion","created":1744602234,"model":"travel-helper-v1","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Beijing is a city rich in history and culture, with many attractions worth visiting. Here are some recommended attractions:\n\n1. The Forbidden City: This is the largest ancient imperial palace in China and one of the largest ancient timber-frame architectural complexes in the world. You can learn about ancient Chinese imperial life and history here.\n\n2. The Great Wall: The section of the Great Wall near Beijing is among the most famous. You can admire magnificent mountain scenery and the grandeur of the Great Wall here.\n\n3. Tian'anmen Square: This is the largest city square in the world. You can see the Tian'anmen Rostrum and the Monument to the People's Heroes here.\n\n4. The Summer Palace: This is the largest royal garden in China. You can enjoy beautiful lakes and mountain views, along with exquisite architecture and sculptures.\n\n5. Beijing Zoo: If you love animals, this zoo houses many species, including giant pandas and golden snub-nosed monkeys.\n\n6. 798 Art Zone: This is an artistic district filled with galleries, art studios, and cafés, where you can view diverse artworks.\n\n7. 751 D-Park: This is a creative park integrating art, culture, and technology, where you can experience various exhibitions and events.\n\nHere are the Beijing attractions I recommend for you. I hope you enjoy them.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":38,"total_tokens":288,"completion_tokens":250,"prompt_tokens_details":null},"prompt_logprobs":null}

    Expected output for about 10% of requests:

    {"id":"chatcmpl-c6df57e9-ff95-41d6-8b35-19978f40525f","object":"chat.completion","created":1744602223,"model":"travel-helper-v2","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Beijing is a city rich in history and culture, with many attractions worth visiting.\n\nHere are some recommended attractions:\n\n1. The Forbidden City: This is China's largest ancient palace complex and one of the best-preserved ancient imperial palaces in the world. You can learn about ancient Chinese court life and history here.\n\n2. The Great Wall: The sections of the Great Wall in Beijing are among the most famous in the world. You can enjoy the magnificent mountain scenery and the grandeur of the Great Wall here.\n\n3. Tiananmen Square: This is the largest city square in the world. You can see the solemn Monument to the People's Heroes and the Tiananmen Gate Tower here.\n\n4. The Summer Palace: This is China's largest imperial garden. You can admire the exquisite garden architecture and beautiful lake views here.\n\n5. Beijing Zoo: If you like animals, this is a great choice. You can see a wide variety of animals, including giant pandas.\n\n6. 798 Art Zone: This is a place full of artistic atmosphere. You can see various art exhibitions and creative markets here.\n\nI hope these suggestions are helpful!","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":38,"total_tokens":244,"completion_tokens":206,"prompt_tokens_details":null},"prompt_logprobs":null}

    Most inference requests are served by the travel-helper-v1 LoRA model, while a small portion are served by the travel-helper-v2 LoRA model.
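
    When validating by hand, counting the `model` field of each response gives a quick read on the realized split. A small helper, assuming the response bodies have been collected into a list of JSON strings (for example, from a curl loop):

    ```python
    import json
    from collections import Counter

    def tally_served_models(response_bodies):
        """Count which model served each response, using the `model` field
        reported in the chat-completion response body."""
        return Counter(json.loads(body)["model"] for body in response_bodies)

    # Abbreviated sample responses in the shape shown above.
    samples = [
        '{"id": "a", "object": "chat.completion", "model": "travel-helper-v1"}',
        '{"id": "b", "object": "chat.completion", "model": "travel-helper-v1"}',
        '{"id": "c", "object": "chat.completion", "model": "travel-helper-v2"}',
    ]
    print(tally_served_models(samples))
    # Counter({'travel-helper-v1': 2, 'travel-helper-v2': 1})
    ```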