Deploy a Stable Diffusion service with Knative - Container Service for Kubernetes

Stable Diffusion can generate high-quality images quickly, but running it in production poses two hard problems: a single pod can handle only one request at a time, and leaving GPU nodes idle between requests is expensive. This guide shows how to use Knative on Container Service for Kubernetes (ACK) to solve both: concurrency-based autoscaling prevents pod overload, and scale-to-zero eliminates idle GPU costs.

Prerequisites

Before you begin, ensure that you have:

An ACK cluster running Kubernetes 1.24 or later with GPU-accelerated nodes. Use one of these instance types: ecs.gn5-c4g1.xlarge, ecs.gn5i-c8g1.2xlarge, or ecs.gn5-c8g1.2xlarge. For setup instructions, see Create an ACK managed cluster.
Knative deployed in the cluster. See Deploy and manage Knative.
The hey load testing tool installed, for example by running go install github.com/rakyll/hey@latest. It is used in Step 3 to verify autoscaling.

How it works

The key configuration is containerConcurrency: 1, which tells Knative that each pod handles exactly one request at a time. When 5 concurrent requests arrive, Knative scales to 5 pods (5 concurrent requests ÷ 1 per pod = 5 pods). When traffic drops to zero, Knative terminates all pods to stop GPU billing.
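The arithmetic above can be traced over a changing load with a few lines of Python. This is only an illustrative sketch: the function name and the sample timeline are hypothetical, and the real autoscaler averages observed concurrency over a time window instead of reacting to each instant.

```python
def pods_needed(concurrent_requests: int, container_concurrency: int = 1) -> int:
    """Ceiling division: pods required so that no pod exceeds its request limit."""
    return -(-concurrent_requests // container_concurrency)


# Hypothetical traffic timeline: idle, ramp up to 5 concurrent requests, back to idle.
timeline = [0, 1, 5, 3, 0]
print([pods_needed(n) for n in timeline])  # [0, 1, 5, 3, 0]
```

The zero entries at both ends correspond to scale-to-zero: with no requests in flight, no pods are required.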

Important

Follow the user agreements, usage specifications, and applicable laws and regulations of the third-party model Stable Diffusion. Alibaba Cloud does not guarantee the legitimacy, security, or accuracy of Stable Diffusion and is not liable for any damages from its use.

Step 1: Deploy the Stable Diffusion service

Important

Deploy the Stable Diffusion service on a GPU-accelerated node. The service cannot start without GPU access.

Log on to the ACK console and click Clusters in the left navigation pane.
Click the target cluster name, then choose Applications > Knative in the left navigation pane.

Deploy the service using the application template or a YAML file.

Application template

On the Popular Apps tab, click Quick Deployment on the stable-diffusion card.

After deployment, click Services and confirm the Status is Created.

YAML

On the Services tab, select default from the Namespace drop-down list and click Create from Template. Paste the following YAML into the editor and click Create.

```
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: stable-diffusion
  annotations:
    serving.knative.dev.alibabacloud/affinity: "cookie"
    serving.knative.dev.alibabacloud/cookie-name: "sd"
    serving.knative.dev.alibabacloud/cookie-timeout: "1800"
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/maxScale: "10"
        autoscaling.knative.dev/targetUtilizationPercentage: "100"
        k8s.aliyun.com/eci-use-specs: ecs.gn5-c4g1.xlarge,ecs.gn5i-c8g1.2xlarge,ecs.gn5-c8g1.2xlarge
    spec:
      containerConcurrency: 1
      containers:
      - args:
        - --listen
        - --skip-torch-cuda-test
        - --api
        command:
        - python3
        - launch.py
        image: yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion@sha256:62b3228f4b02d9e89e221abe6f1731498a894b042925ab8d4326a571b3e992bc
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 7860
          name: http1
          protocol: TCP
        name: stable-diffusion
        readinessProbe:
          tcpSocket:
            port: 7860
          initialDelaySeconds: 5
          periodSeconds: 1
          failureThreshold: 3
```

The service is ready when its Status on the Services tab is Created.

Key autoscaling parameters

Parameter	Value in this guide	What it controls
`containerConcurrency`	`1`	Maximum concurrent requests per pod. `1` ensures that each pod runs one inference request at a time, which matches Stable Diffusion's single-threaded inference.
`autoscaling.knative.dev/maxScale`	`10`	Upper bound on pod count. Prevents unbounded GPU spend.
`autoscaling.knative.dev/targetUtilizationPercentage`	`100`	Scale-out threshold as a percentage of `containerConcurrency`. At `100`, a new pod is added only when the current pod is fully utilized.
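The interaction of the three parameters can be sketched in Python. This is a simplified steady-state model, not the autoscaler's actual implementation: it assumes that the per-pod target equals `containerConcurrency` multiplied by the utilization percentage and that the result is capped at `maxScale`. It ignores averaging windows and burst handling, and the function name is hypothetical.

```python
def steady_state_pods(concurrent_requests: int,
                      container_concurrency: int = 1,
                      target_utilization_pct: int = 100,
                      max_scale: int = 10) -> int:
    """Approximate pod count for a sustained number of concurrent requests."""
    if concurrent_requests <= 0:
        return 0  # no traffic: scale to zero
    # Integer ceiling of concurrent_requests / (concurrency * pct / 100),
    # written without floats to avoid rounding surprises.
    numerator = concurrent_requests * 100
    denominator = container_concurrency * target_utilization_pct
    wanted = -(-numerator // denominator)
    return min(max_scale, wanted)


print(steady_state_pods(5))                             # 5: the values used in this guide
print(steady_state_pods(5, target_utilization_pct=70))  # 8: scales out earlier, leaving headroom
print(steady_state_pods(25))                            # 10: capped by maxScale
```

Under this model, lowering the utilization target trades extra GPU pods for headroom, while `maxScale` bounds the spend regardless of load.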

Warning

containerConcurrency: 1 is a hard concurrency limit. While appropriate for Stable Diffusion's single-threaded inference, a low value on other services reduces throughput and increases cold-start frequency. Choose a value that matches your application's actual concurrency capacity.

Step 2: Access the Stable Diffusion service

On the Services tab, record the gateway IP address and default domain name of the service. If you use ALB Ingress, send requests to the ALB address and pass the service's domain name in the Host header, as in this example:
```
curl -H "Host: stable-diffusion.default.example.com" http://alb-XXX.cn-hangzhou.alb.aliyuncsslb.com # Replace with your ALB Ingress address.
```
To access the service directly by its domain name instead, configure a CNAME record that points the domain to your ALB instance.

Add an entry to your local hosts file to map the gateway IP to the domain:

```
47.xx.xxx.xx stable-diffusion.default.example.com # Replace with your actual gateway IP address.
```
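If editing the hosts file is not an option, a script can reach the service by sending the domain name in the Host header straight to the gateway IP. The following Python sketch uses only the standard library. The function names are hypothetical, and the gateway IP must be supplied by you.

```python
import urllib.request


def build_gateway_request(gateway_ip: str, domain: str) -> urllib.request.Request:
    """Address the gateway by IP while routing on the service's domain name."""
    return urllib.request.Request(f"http://{gateway_ip}/", headers={"Host": domain})


def fetch_status(gateway_ip: str, domain: str, timeout: float = 180.0) -> int:
    """Send the request and return the HTTP status code.

    The generous timeout is an assumption: after scale-to-zero, the first
    request has to wait for a pod to start.
    """
    request = build_gateway_request(gateway_ip, domain)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, `fetch_status("<gateway-ip>", "stable-diffusion.default.example.com")` sends one request to the web UI, with `<gateway-ip>` replaced by the address recorded on the Services tab.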

On the Services tab, click the default domain name of the stable-diffusion service. A successful connection opens the Stable Diffusion web UI.
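Because the container starts with the --api flag, the service can also be called programmatically at /sdapi/v1/txt2img, the endpoint that the load test in Step 3 targets. The sketch below assumes that the response is a JSON object whose `images` field holds base64-encoded images, which is how the Stable Diffusion web UI API commonly responds. Treat that response shape and the helper names as assumptions to verify against your image.

```python
import base64
import json
import urllib.request
from typing import List


def build_txt2img_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request that the hey command in Step 3 sends."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/sdapi/v1/txt2img",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_images(response_body: bytes) -> List[bytes]:
    """Decode every base64 string in the assumed `images` field to raw bytes."""
    payload = json.loads(response_body)
    return [base64.b64decode(image) for image in payload.get("images", [])]


def generate(base_url: str, prompt: str, timeout: float = 180.0) -> List[bytes]:
    """Send the request and return the decoded images."""
    request = build_txt2img_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return decode_images(response.read())
```

Calling `generate("http://stable-diffusion.default.example.com", "pretty dog")` returns a list of byte strings that can be written to image files.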

Step 3: Verify autoscaling

Use hey to send 50 requests with 5 concurrent connections, then watch pod count change in real time:

```
hey -n 50 -c 5 -t 180 -m POST -H "Content-Type: application/json" \
  -d '{"prompt": "pretty dog"}' \
  http://stable-diffusion.default.example.com/sdapi/v1/txt2img
```

In a separate terminal, watch the pods:

```
watch -n 1 'kubectl get po'
```

Because containerConcurrency: 1 limits each pod to one request, Knative scales to 5 pods to handle 5 concurrent connections (5 concurrent requests ÷ 1 per pod = 5 pods).
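To record the pod count instead of watching it, a script can read the JSON output of kubectl. The sketch below counts pods in the Running phase. It assumes that kubectl is configured for the cluster and that the pods carry the serving.knative.dev/service label that Knative normally sets. The function names are hypothetical.

```python
import json
import subprocess
from typing import Any, Dict

# Assumed label selector for the pods of the stable-diffusion Knative Service.
SERVICE_LABEL = "serving.knative.dev/service=stable-diffusion"


def count_running(pod_list: Dict[str, Any]) -> int:
    """Count pods whose phase is Running (a Running pod may not be Ready yet)."""
    return sum(
        1
        for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("phase") == "Running"
    )


def fetch_pod_list(namespace: str = "default") -> Dict[str, Any]:
    """Run kubectl and parse its JSON pod list."""
    result = subprocess.run(
        ["kubectl", "get", "po", "-n", namespace, "-l", SERVICE_LABEL, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `count_running(fetch_pod_list())` once per second during the load test gives a series that should rise toward 5 and fall back to 0 after traffic stops.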

The hey output confirms all 50 requests complete successfully:

```
Summary:
  Total:	252.1749 secs
  Slowest:	62.4155 secs
  Fastest:	9.9399 secs
  Average:	23.9748 secs
  Requests/sec:	0.1983

Response time histogram:
  9.940 [1]	|■■
  15.187 [17]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  20.435 [9]	|■■■■■■■■■■■■■■■■■■■■■
  25.683 [11]	|■■■■■■■■■■■■■■■■■■■■■■■■■■
  30.930 [1]	|■■
  36.178 [1]	|■■
  41.425 [3]	|■■■■■■■
  46.673 [1]	|■■
  51.920 [2]	|■■■■■
  57.168 [1]	|■■
  62.415 [3]	|■■■■■■■

Latency distribution:
  10% in 10.4695 secs
  25% in 14.8245 secs
  50% in 20.0772 secs
  75% in 30.5207 secs
  90% in 50.7006 secs
  95% in 61.5010 secs
  0% in 0.0000 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0424 secs, 9.9399 secs, 62.4155 secs
  DNS-lookup:	0.0385 secs, 0.0000 secs, 0.3855 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0004 secs
  resp wait:	23.8850 secs, 9.9089 secs, 62.3562 secs
  resp read:	0.0471 secs, 0.0166 secs, 0.1834 secs

Status code distribution:
  [200]	50 responses
```
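The summary figures can be cross-checked with a little arithmetic. The Python sketch below recomputes the throughput from the totals and applies Little's law (average requests in flight = throughput × average latency) to estimate how many requests were being served at once. The variable names are illustrative, and the inputs are copied from the output above.

```python
total_requests = 50
total_seconds = 252.1749
average_latency = 23.9748  # seconds per request

# Throughput as reported by hey: requests divided by wall-clock time.
requests_per_second = total_requests / total_seconds

# Little's law: mean number of requests in flight during the test.
average_in_flight = requests_per_second * average_latency

print(f"{requests_per_second:.4f} requests/sec")
print(f"{average_in_flight:.2f} requests in flight on average")
```

An estimate just under 5 is consistent with the 5 concurrent connections requested with -c 5, each served by its own pod.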

For more information about hey, see the hey project at github.com/rakyll/hey.

Step 4: View monitoring data

Knative provides built-in observability. View metrics for the Stable Diffusion service on the Monitoring Dashboards page under Knative in the ACK console. To enable dashboards, see View the Knative monitoring dashboard.

What's next

Best practices for deploying AI inference services in Knative