Container Service for Kubernetes: Implement a phased release for an inference service using an NGINX Ingress controller gateway

Last Updated: Mar 26, 2026

This guide walks you through implementing a canary release for an inference service in Raw Deployment mode, using the NGINX Ingress controller as the gateway. Deploy two versions of an inference service side by side, gradually shift traffic from v1 to v2, and complete a full cutover once v2 is validated.

Prerequisites

Before you begin, ensure that you have:

  • An ACK cluster with the NGINX Ingress controller installed. The commands in this guide assume the controller's Service is nginx-ingress-lb in the kube-system namespace.

  • KServe installed in Raw Deployment mode.

  • The Arena client installed and configured with access to the cluster.

How it works

This guide deploys two versions of an inference service that serve the same model (canary): model-v1 and model-v2. The NGINX Ingress controller routes traffic between them using canary annotations. Two traffic-splitting strategies are covered:

  • Header-based routing: Requests with a specific request header (foo: bar) go to v2. All other requests go to v1.

  • Weight-based routing: 20% of requests go to v2. The remaining 80% go to v1.

Once v2 is validated, update the backend Service to point entirely to v2, then clean up the v1 resources.
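
Both strategies are configured as annotations on a second, canary Ingress that points to the v2 Service; the NGINX Ingress controller matches it with the base Ingress by hostname. As a quick reference, these are the annotations used in Step 3:

    # Header-based: requests with the header foo: bar go to the canary backend (v2).
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-by-header: "foo"
    nginx.ingress.kubernetes.io/canary-by-header-value: "bar"

    # Weight-based: 20% of requests go to the canary backend (v2).
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "20"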

For background on canary and blue-green releases with NGINX Ingress, see Use the NGINX Ingress controller to implement canary releases and blue-green releases.

Step 1: Deploy and verify inference services

Deploy both versions of the inference service and create a Kubernetes Service for each.

Deploy v1

  1. Deploy the v1 inference service.

    arena serve kserve \
     --name=model-v1 \
     --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-canary:1.0.0 \
     --cpu=1 \
     --memory=2Gi \
     "python app.py --model_name=canary"
  2. Create a file named model-svc.yaml with the following content.

    apiVersion: v1
    kind: Service
    metadata:
      name: model-svc
    spec:
      ports:
      - port: 80
        protocol: TCP
        targetPort: 8080
      selector:
        serving.kserve.io/inferenceservice: model-v1
      type: ClusterIP
  3. Create the Service.

    kubectl apply -f model-svc.yaml
  4. Verify that model-v1 is running correctly.

    curl -H "Host: $(kubectl get inferenceservice model-v1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)" \
     -H "Content-Type: application/json" \
     http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/models/canary:predict -X POST \
     -d '{"data": "test"}'

Deploy v2

  1. Deploy the v2 inference service.

    arena serve kserve \
     --name=model-v2 \
     --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-canary:1.0.0 \
     --cpu=1 \
     --memory=2Gi \
     "python app-v2.py --model_name=canary"
  2. Create a file named model-v2-svc.yaml with the following content.

    apiVersion: v1
    kind: Service
    metadata:
      name: model-v2-svc
    spec:
      ports:
      - port: 80
        protocol: TCP
        targetPort: 8080
      selector:
        serving.kserve.io/inferenceservice: model-v2
      type: ClusterIP
  3. Create the Service.

    kubectl apply -f model-v2-svc.yaml
  4. Verify that model-v2 is running correctly.

    curl -H "Host: $(kubectl get inferenceservice model-v2 -o jsonpath='{.status.url}' | cut -d "/" -f 3)" \
     -H "Content-Type: application/json" \
     http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/models/canary:predict -X POST \
     -d '{"data": "test"}'

Step 2: Create an Ingress

Create a base Ingress that routes all traffic to the v1 Service by default.

  1. Create a file named model-ingress.yaml with the following content.

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: model-ingress
    spec:
      rules:
      - host: model.example.com # Replace with your hostname.
        http:
          paths:
          - path: /
            backend:
              service:
                name: model-svc  # The Service for v1.
                port:
                  number: 80
            pathType: ImplementationSpecific
  2. Create the Ingress.

    kubectl apply -f model-ingress.yaml
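
No canary policy exists yet, so all traffic through the Ingress should reach v1. You can confirm this with the same request used for verification in the later steps (assuming model.example.com is the hostname in your Ingress):

    curl -H "Host: model.example.com" -H "Content-Type: application/json" \
         http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/models/canary:predict -X POST \
         -d '{"data": "test"}'

Every response should contain "model-v1".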

Step 3: Create and verify canary release policies

Select one of the following strategies depending on how you want to control which traffic reaches v2.

Scenario 1: Header-based traffic splitting

Route requests with a specific request header to v2. All other requests go to v1 by default. Use this strategy for targeted testing with a specific client or team.

  1. Create a file named gray-release-canary.yaml with the following content.

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: gray-release-canary
      annotations:
        nginx.ingress.kubernetes.io/canary: "true"
        nginx.ingress.kubernetes.io/canary-by-header: "foo"           # Route by the "foo" request header.
        nginx.ingress.kubernetes.io/canary-by-header-value: "bar"     # Requests with foo: bar go to v2.
    spec:
      rules:
      - host: model.example.com
        http:
          paths:
          - path: /
            backend:
              service:
                name: model-v2-svc  # The Service for v2.
                port:
                  number: 80
            pathType: ImplementationSpecific
  2. Deploy the canary release policy.

    kubectl apply -f gray-release-canary.yaml
  3. Verify that requests without the canary header go to v1.

    # Replace the hostname with the one specified in your Ingress.
    curl -H "Host: model.example.com" -H "Content-Type: application/json" \
         http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/models/canary:predict -X POST \
         -d '{"data": "test"}'

    Expected output:

    {"id":"4d8c110d-c291-4670-ad0a-1a30bf8e314c","model_name":"canary","model_version":null,"outputs":[{"name":"output-0","shape":[1,1],"datatype":"STR","data":["model-v1"]}]}

    The response from model-v1 confirms that default traffic continues to reach v1.

  4. Verify that requests with foo: bar go to v2.

    curl -H "Host: model.example.com" -H "Content-Type: application/json" \
         -H "foo: bar" \
         http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/models/canary:predict -X POST \
         -d '{"data": "test"}'

    Expected output:

    {"id":"4d3efc12-c8bd-40f8-898f-7983377db7bd","model_name":"canary","model_version":null,"outputs":[{"name":"output-0","shape":[1,1],"datatype":"STR","data":["model-v2"]}]}

    The response from model-v2 confirms that the canary release policy is working.
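
Only an exact match on the configured header value is routed to the canary. As an optional extra check, a request with a non-matching value such as foo: baz should fall through to v1:

    curl -H "Host: model.example.com" -H "Content-Type: application/json" \
         -H "foo: baz" \
         http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/models/canary:predict -X POST \
         -d '{"data": "test"}'

The response should contain "model-v1", because foo: baz does not match the canary-by-header-value of bar.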

Scenario 2: Weight-based traffic splitting

Route a fixed percentage of traffic to v2 regardless of request headers. Use this strategy for broad canary validation across all users.

  1. Create a file named gray-release-canary.yaml with the following content.

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: gray-release-canary
      annotations:
        nginx.ingress.kubernetes.io/canary: "true"
        nginx.ingress.kubernetes.io/canary-weight: "20"  # Route 20% of traffic to v2. The default total weight is 100.
    spec:
      rules:
      - host: model.example.com
        http:
          paths:
          - path: /
            backend:
              service:
                name: model-v2-svc  # The Service for v2.
                port:
                  number: 80
            pathType: ImplementationSpecific
  2. Deploy the canary release policy.

    kubectl apply -f gray-release-canary.yaml
  3. Verify traffic distribution by sending multiple requests.

    curl -H "Host: model.example.com" -H "Content-Type: application/json" \
         http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/models/canary:predict -X POST \
         -d '{"data": "test"}'

    Run the command multiple times. About 20% of responses should show "data":["model-v2"] and the remaining 80% should show "data":["model-v1"], confirming that the canary release policy is working.
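
To measure the split rather than eyeball individual responses, you can count the version strings over a batch of requests; a minimal sketch (50 requests; adjust as needed):

    INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
    for i in $(seq 1 50); do
      curl -s -H "Host: model.example.com" -H "Content-Type: application/json" \
           -X POST -d '{"data": "test"}' \
           "http://${INGRESS_IP}:80/v1/models/canary:predict"
    done | grep -o 'model-v[12]' | sort | uniq -c

With canary-weight set to 20, roughly 10 of the 50 responses should report model-v2.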

Step 4: Switch traffic to the new version

Once v2 has been validated and is handling traffic as expected, redirect all traffic to it and remove the v1 resources.

  1. Update model-svc.yaml to point the model-svc Service to v2.

    apiVersion: v1
    kind: Service
    metadata:
      name: model-svc
    spec:
      ports:
      - port: 80
        protocol: TCP
        targetPort: 8080
      selector:
        serving.kserve.io/inferenceservice: model-v2  # Changed from model-v1 to model-v2.
      type: ClusterIP
  2. Apply the updated Service.

    kubectl apply -f model-svc.yaml
  3. Verify that all traffic now reaches v2.

    curl -H "Host: model.example.com" -H "Content-Type: application/json" \
         http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/models/canary:predict -X POST \
         -d '{"data": "test"}'

    Expected output:

    {"id":"a13f2089-73ce-41e3-989e-e58457d14fed","model_name":"canary","model_version":null,"outputs":[{"name":"output-0","shape":[1,1],"datatype":"STR","data":["model-v2"]}]}

    Run the command multiple times to confirm that all responses come from model-v2.

  4. Delete the canary Ingress, the v1 inference service, and the now-unused v2 Service.

    kubectl delete ingress gray-release-canary
    arena serve delete model-v1
    kubectl delete svc model-v2-svc
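
You can confirm the cleanup by listing what remains:

    arena serve list
    kubectl get ingress,svc

Only the model-v2 inference service, the model-svc Service (now selecting v2), and the model-ingress Ingress should remain, along with any cluster-default resources.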