This guide walks you through implementing a canary release for an inference service in Raw Deployment mode, using the NGINX Ingress controller as the gateway. Deploy two versions of an inference service side by side, gradually shift traffic from v1 to v2, and complete a full cutover once v2 is validated.
Prerequisites
Before you begin, ensure that you have:
How it works
This guide uses two inference services deployed from the same model (canary): one running v1 and one running v2. The NGINX Ingress controller routes traffic between them using canary annotations. Two traffic splitting strategies are covered:
- Header-based routing: Requests that carry a specific request header (`foo: bar`) go to v2. All other requests go to v1.
- Weight-based routing: 20% of requests go to v2. The remaining 80% go to v1.
Once v2 is validated, update the backend Service to point entirely to v2, then clean up the v1 resources.
For background on canary and blue-green releases with NGINX Ingress, see Use the NGINX Ingress controller to implement canary releases and blue-green releases.
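The two strategies boil down to a simple per-request decision in the controller. As a rough illustration (a sketch of the behavior only, not NGINX's actual code path), the decision logic can be written in shell:

```shell
# Illustration only: the per-request routing decision behind the two
# canary strategies. This is a sketch, not the controller's implementation.

# Header-based: requests carrying foo: bar go to v2, everything else to v1.
route_by_header() {
  if [ "$1" = "bar" ]; then echo "model-v2-svc"; else echo "model-svc"; fi
}

# Weight-based: roughly canary-weight percent of requests go to v2
# (the default total weight is 100).
route_by_weight() {
  weight=$1
  roll=$(awk 'BEGIN { srand(); print int(rand() * 100) }')  # 0-99
  if [ "$roll" -lt "$weight" ]; then echo "model-v2-svc"; else echo "model-svc"; fi
}

route_by_header "bar"   # model-v2-svc
route_by_header ""      # model-svc
```

Header-based routing is deterministic per request, while weight-based routing is probabilistic, which is why the weight-based verification later in this guide requires sending many requests.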
Step 1: Deploy and verify inference services
Deploy both versions of the inference service and create a Kubernetes Service for each.
Deploy v1
1. Deploy the v1 inference service.

   ```shell
   arena serve kserve \
     --name=model-v1 \
     --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-canary:1.0.0 \
     --cpu=1 \
     --memory=2Gi \
     "python app.py --model_name=canary"
   ```

2. Create a file named `model-svc.yaml` with the following content.

   ```yaml
   apiVersion: v1
   kind: Service
   metadata:
     name: model-svc
   spec:
     ports:
     - port: 80
       protocol: TCP
       targetPort: 8080
     selector:
       serving.kserve.io/inferenceservice: model-v1
     type: ClusterIP
   ```

3. Create the Service.

   ```shell
   kubectl apply -f model-svc.yaml
   ```

4. Verify that model-v1 is running correctly.

   ```shell
   curl -H "Host: $(kubectl get inferenceservice model-v1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)" \
     -H "Content-Type: application/json" \
     http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/models/canary:predict \
     -X POST \
     -d '{"data": "test"}'
   ```
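The Host header in the verification command is derived from the InferenceService status URL: `cut -d "/" -f 3` keeps only the host portion. The snippet below demonstrates this on a sample URL (the hostname is made up for illustration):

```shell
# cut splits on "/"; for "http://host/path" the fields are
# "http:", "" (empty, between the slashes), "host", so field 3 is the hostname.
url="http://model-v1.default.example.com/v1/models/canary"
echo "$url" | cut -d "/" -f 3   # model-v1.default.example.com
```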
Deploy v2
1. Deploy the v2 inference service.

   ```shell
   arena serve kserve \
     --name=model-v2 \
     --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-canary:1.0.0 \
     --cpu=1 \
     --memory=2Gi \
     "python app-v2.py --model_name=canary"
   ```

2. Create a file named `model-v2-svc.yaml` with the following content.

   ```yaml
   apiVersion: v1
   kind: Service
   metadata:
     name: model-v2-svc
   spec:
     ports:
     - port: 80
       protocol: TCP
       targetPort: 8080
     selector:
       serving.kserve.io/inferenceservice: model-v2
     type: ClusterIP
   ```

3. Create the Service.

   ```shell
   kubectl apply -f model-v2-svc.yaml
   ```

4. Verify that model-v2 is running correctly.

   ```shell
   curl -H "Host: $(kubectl get inferenceservice model-v2 -o jsonpath='{.status.url}' | cut -d "/" -f 3)" \
     -H "Content-Type: application/json" \
     http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/models/canary:predict \
     -X POST \
     -d '{"data": "test"}'
   ```
Step 2: Create an Ingress
Create a base Ingress that routes all traffic to the v1 Service by default.
1. Create a file named `model-ingress.yaml` with the following content.

   ```yaml
   apiVersion: networking.k8s.io/v1
   kind: Ingress
   metadata:
     name: model-ingress
   spec:
     rules:
     - host: model.example.com   # Replace with your hostname.
       http:
         paths:
         - path: /
           backend:
             service:
               name: model-svc   # The Service for v1.
               port:
                 number: 80
           pathType: ImplementationSpecific
   ```

2. Create the Ingress.

   ```shell
   kubectl apply -f model-ingress.yaml
   ```
Step 3: Create and verify canary release policies
Select one of the following strategies depending on how you want to control which traffic reaches v2.
Scenario 1: Header-based traffic splitting
Route requests with a specific request header to v2. All other requests go to v1 by default. Use this strategy for targeted testing with a specific client or team.
1. Create a file named `gray-release-canary.yaml` with the following content.

   ```yaml
   apiVersion: networking.k8s.io/v1
   kind: Ingress
   metadata:
     name: gray-release-canary
     annotations:
       nginx.ingress.kubernetes.io/canary: "true"
       nginx.ingress.kubernetes.io/canary-by-header: "foo"        # Route by the "foo" request header.
       nginx.ingress.kubernetes.io/canary-by-header-value: "bar"  # Requests with foo: bar go to v2.
   spec:
     rules:
     - host: model.example.com
       http:
         paths:
         - path: /
           backend:
             service:
               name: model-v2-svc   # The Service for v2.
               port:
                 number: 80
           pathType: ImplementationSpecific
   ```

2. Deploy the canary release policy.

   ```shell
   kubectl apply -f gray-release-canary.yaml
   ```

3. Verify that requests without the canary header go to v1.

   ```shell
   # Replace the hostname with the one specified in your Ingress.
   curl -H "Host: model.example.com" -H "Content-Type: application/json" \
     http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/models/canary:predict \
     -X POST \
     -d '{"data": "test"}'
   ```

   Expected output:

   ```json
   {"id":"4d8c110d-c291-4670-ad0a-1a30bf8e314c","model_name":"canary","model_version":null,"outputs":[{"name":"output-0","shape":[1,1],"datatype":"STR","data":["model-v1"]}]}
   ```

   The response from model-v1 confirms that default traffic continues to reach v1.

4. Verify that requests with `foo: bar` go to v2.

   ```shell
   curl -H "Host: model.example.com" -H "Content-Type: application/json" \
     -H "foo: bar" \
     http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/models/canary:predict \
     -X POST \
     -d '{"data": "test"}'
   ```

   Expected output:

   ```json
   {"id":"4d3efc12-c8bd-40f8-898f-7983377db7bd","model_name":"canary","model_version":null,"outputs":[{"name":"output-0","shape":[1,1],"datatype":"STR","data":["model-v2"]}]}
   ```

   The response from model-v2 confirms that the canary release policy is working.
Scenario 2: Weight-based traffic splitting
Route a fixed percentage of traffic to v2 regardless of request headers. Use this strategy for broad canary validation across all users.
1. Create a file named `gray-release-canary.yaml` with the following content.

   ```yaml
   apiVersion: networking.k8s.io/v1
   kind: Ingress
   metadata:
     name: gray-release-canary
     annotations:
       nginx.ingress.kubernetes.io/canary: "true"
       nginx.ingress.kubernetes.io/canary-weight: "20"   # Route 20% of traffic to v2. The default total weight is 100.
   spec:
     rules:
     - host: model.example.com
       http:
         paths:
         - path: /
           backend:
             service:
               name: model-v2-svc   # The Service for v2.
               port:
                 number: 80
           pathType: ImplementationSpecific
   ```

2. Deploy the canary release policy.

   ```shell
   kubectl apply -f gray-release-canary.yaml
   ```

3. Verify traffic distribution by sending multiple requests.

   ```shell
   curl -H "Host: model.example.com" -H "Content-Type: application/json" \
     http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/models/canary:predict \
     -X POST \
     -d '{"data": "test"}'
   ```

   Run the command multiple times. About 20% of responses show `"data":["model-v2"]` and the remaining 80% show `"data":["model-v1"]`, confirming that the canary release policy is working.
Step 4: Switch traffic to the new version
Once you have confirmed that v2 runs as expected, redirect all traffic to v2 and remove the v1 resources.
1. Update `model-svc.yaml` to point the `model-svc` Service to v2.

   ```yaml
   apiVersion: v1
   kind: Service
   metadata:
     name: model-svc
   spec:
     ports:
     - port: 80
       protocol: TCP
       targetPort: 8080
     selector:
       serving.kserve.io/inferenceservice: model-v2   # Changed from model-v1 to model-v2.
     type: ClusterIP
   ```

2. Apply the updated Service.

   ```shell
   kubectl apply -f model-svc.yaml
   ```

3. Verify that all traffic now reaches v2.

   ```shell
   curl -H "Host: model.example.com" -H "Content-Type: application/json" \
     http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/models/canary:predict \
     -X POST \
     -d '{"data": "test"}'
   ```

   Expected output:

   ```json
   {"id":"a13f2089-73ce-41e3-989e-e58457d14fed","model_name":"canary","model_version":null,"outputs":[{"name":"output-0","shape":[1,1],"datatype":"STR","data":["model-v2"]}]}
   ```

   Run the command multiple times to confirm that all responses come from model-v2.

4. Delete the canary Ingress, the v1 inference service, and the now-redundant `model-v2-svc` Service.

   ```shell
   kubectl delete ingress gray-release-canary
   arena serve delete model-v1
   kubectl delete svc model-v2-svc
   ```