This topic describes how to enable auto scaling of pods based on the number of requests in Container Service for Kubernetes (ACK).

Knative Pod Autoscaler (KPA) is an out-of-the-box feature that can scale pods based on the number of requests. This topic describes how to configure auto scaling based on the number of requests.

Prerequisites

An ACK managed cluster or ACK Serverless cluster is created, and the Kubernetes version of the cluster is 1.20 or later. For more information, see Create an ACK managed cluster and Create an ACK Serverless cluster.

How it works

Knative Serving injects a Queue Proxy container named queue-proxy into each pod. The container automatically reports the request concurrency metrics of application pods to KPA. After KPA receives the metrics, KPA automatically adjusts the number of pods provisioned for a Deployment based on the number of concurrent requests and the related algorithm.

Algorithm

KPA scales pods based on the average number of requests (or concurrent requests) received by each pod. By default, KPA scales pods based on the number of concurrent requests, and each pod handles at most 100 concurrent requests. KPA also provides the target utilization annotation (target-utilization-percentage), which specifies the percentage of that maximum concurrency that KPA aims for when it scales pods.

The following formula is used to calculate the target number of pods based on the number of concurrent requests: Number of pods = Number of concurrent requests/(Pod maximum concurrency × Target utilization).

For example, the pod maximum concurrency of an application is set to 10 and the target utilization is set to 0.7. When 100 concurrent requests are received, the formula gives 100/(0.7 × 10) ≈ 14.3, which KPA rounds up to 15 pods.
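
The calculation above can be sketched in a few lines of Python. This is only an illustration of the formula in this topic, not the KPA implementation. The function name `desired_pods` and its parameters are hypothetical, and the target utilization is passed as a whole percentage (70 for 0.7), which is how it appears in config-autoscaler.

```python
import math


def desired_pods(concurrent_requests: int, max_concurrency: int, target_utilization_pct: int) -> int:
    """Pods = concurrent requests / (pod maximum concurrency x target utilization), rounded up."""
    if max_concurrency <= 0 or target_utilization_pct <= 0:
        raise ValueError("max_concurrency and target_utilization_pct must be positive")
    # Multiply by 100 instead of dividing the percentage so the inputs stay integers.
    return math.ceil(concurrent_requests * 100 / (max_concurrency * target_utilization_pct))


# The example from this topic: 100 concurrent requests, maximum concurrency 10, utilization 70%.
print(desired_pods(100, 10, 70))  # 15
```

Rounding up matters here: 14 pods would each run above the 70% utilization target.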

KPA also supports the stable and panic modes to perform fine-grained auto scaling.

Stable mode
In stable mode, KPA counts the average number of concurrent requests across pods within the stable window. The default stable window is 60 seconds. Then, KPA adjusts the number of pods based on the average concurrency value to maintain the loads at a stable level.
Panic mode
In panic mode, KPA counts the average number of concurrent requests across pods within the panic window. The panic window is calculated based on the following formula: Panic window = Stable window × panic-window-percentage/100. The value of panic-window-percentage is greater than 0 and smaller than 100, and the default is 10. Therefore, the default panic window is 6 seconds (60 × 0.1). When a request burst occurs and the pod utilization measured within the panic window reaches the panic threshold, KPA increases the number of pods to handle the burst.

KPA makes scaling decisions by checking whether the ratio of the number of pods calculated in panic mode to the current number of ready pods reaches the panic threshold. The panic threshold is calculated based on the following formula: Panic threshold = panic-threshold-percentage/100. The default value of panic-threshold-percentage is 200. Therefore, the default panic threshold is 2.

If the number of pods calculated in panic mode is greater than or equal to twice the current number of ready pods, KPA scales the application to the number of pods calculated in panic mode. Otherwise, KPA scales the application to the number of pods calculated in stable mode.
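
The choice between the two modes can be sketched as follows. This is a simplified illustration of the rule described above, with the percentage values taken as they appear in config-autoscaler (10 and 200). The function names are hypothetical, and treating zero ready pods as one pod is an assumption made only to avoid dividing by zero. A real autoscaler tracks more state than this (for example, how long it has been in panic mode), which the sketch leaves out.

```python
def panic_window_seconds(stable_window_seconds: float = 60.0,
                         panic_window_percentage: float = 10.0) -> float:
    """Panic window = stable window x panic-window-percentage / 100."""
    return stable_window_seconds * panic_window_percentage / 100.0


def choose_pod_count(stable_desired: int, panic_desired: int, ready_pods: int,
                     panic_threshold_percentage: float = 200.0) -> int:
    """Pick the panic-mode pod count only when it reaches the panic threshold."""
    panic_threshold = panic_threshold_percentage / 100.0
    ready = max(1, ready_pods)  # assumption: count zero ready pods as one
    if panic_desired / ready >= panic_threshold:
        return panic_desired
    return stable_desired


print(panic_window_seconds())      # 6.0
print(choose_pod_count(4, 10, 5))  # 10, because 10 / 5 reaches the threshold of 2
print(choose_pod_count(4, 9, 5))   # 4, because 9 / 5 is below the threshold
```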

KPA configurations

config-autoscaler

To configure KPA, modify the config-autoscaler ConfigMap. The ConfigMap is created with default values. The following content describes the key parameters.

Run the following command to query config-autoscaler:

kubectl -n knative-serving describe cm config-autoscaler

Expected output (the default config-autoscaler ConfigMap):

apiVersion: v1
kind: ConfigMap
metadata:
 name: config-autoscaler
 namespace: knative-serving
data:
# The default maximum concurrency of pods. The default value is 100. 
 container-concurrency-target-default: "100"
# The target utilization for concurrency. The default value is 70, which represents 0.7.  
 container-concurrency-target-percentage: "70"
# The default requests per second (RPS). The default value is 200. 
 requests-per-second-target-default: "200"
# The target burst capacity parameter is used to handle traffic bursts and prevent pod overloading. The default value is 211. 
# The Activator service is used to receive and buffer requests when the target burst capacity is exceeded. 
# If the target burst capacity parameter is set to 0, the Activator service is placed in the request path only when the number of pods is scaled to zero. 
# If the target burst capacity parameter is set to a value greater than 0 and container-concurrency-target-percentage is set to 100, the Activator service is always used to receive and buffer requests. 
# If the target burst capacity parameter is set to -1, the burst capacity is unlimited. All requests are buffered by the Activator service. If you set the target burst capacity parameter to other values, the parameter does not take effect. 
# If the value of current number of ready pods × maximum concurrency - target burst capacity - concurrency calculated in panic mode is smaller than 0, the traffic burst exceeds the target burst capacity. In this scenario, the Activator service is placed to buffer requests. 
 target-burst-capacity: "211"
# The stable window. The default is 60 seconds. 
 stable-window: "60s"
# The panic window percentage. The default is 10, which indicates that the default panic window is 6 seconds (60 × 0.1). 
 panic-window-percentage: "10.0"
# The panic threshold percentage. The default is 200. 
 panic-threshold-percentage: "200.0"
# The maximum scale up rate, which indicates the maximum ratio of desired pods per scale-out activity. The value is calculated based on the following formula: math.Ceil(MaxScaleUpRate*readyPodsCount). 
 max-scale-up-rate: "1000.0"
# The maximum scale down rate. The default is 2, which indicates that pods are scaled to half of the current number during each scale-in activity. 
 max-scale-down-rate: "2.0"
# Specifies whether to scale the number of pods to zero. By default, this feature is enabled. 
 enable-scale-to-zero: "true"
# The graceful period before the number of pods is scaled to zero. The default is 30 seconds.  
 scale-to-zero-grace-period: "30s"
# The retention period of the last pod before the number of pods is scaled to zero. Specify this parameter if the cost for launching pods is high.
 scale-to-zero-pod-retention-period: "0s"
# The type of the autoscaler. The following autoscalers are supported: KPA, HPA, AHA, and MPA. You can use MPA with Microservices Engine (MSE) in ACK Serverless clusters to scale the number of pods to zero. 
 pod-autoscaler-class: "kpa.autoscaling.knative.dev"
# The request capacity of the Activator service. 
 activator-capacity: "100.0"
# The number of pods to be initialized when a revision is created. The default is 1. 
 initial-scale: "1"
# Specifies whether no pod is initialized when a revision is created. The default is false, which indicates that pods are initialized when a revision is created. 
 allow-zero-initial-scale: "false"
# The minimum number of pods kept for a revision. The default is 0, which means that no pod is kept. 
 min-scale: "0"
# The maximum number of pods to which a revision can be scaled. The default is 0, which means that the number of pods to which a revision can be scaled is unlimited. 
 max-scale: "0"
# The scale down delay. The default is 0, which indicates a scale-in activity is performed immediately. 
 scale-down-delay: "0s"
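
Two of the rules in the comments above involve arithmetic that is easy to misread, so the following sketch restates them in Python. It only mirrors the formulas given in this ConfigMap. The function names are hypothetical, and counting zero ready pods as one is an assumption of the sketch.

```python
import math


def clamp_to_scale_rates(desired: int, ready_pods: int,
                         max_scale_up_rate: float = 1000.0,
                         max_scale_down_rate: float = 2.0) -> int:
    """Limit one scaling step using max-scale-up-rate and max-scale-down-rate."""
    ready = max(1, ready_pods)  # assumption: count zero ready pods as one
    upper = math.ceil(max_scale_up_rate * ready)     # math.Ceil(MaxScaleUpRate*readyPodsCount)
    lower = math.floor(ready / max_scale_down_rate)  # at most halve with the default of 2
    return max(lower, min(desired, upper))


def activator_in_path(ready_pods: int, max_concurrency: int,
                      target_burst_capacity: int, panic_concurrency: float) -> bool:
    """Apply the target-burst-capacity rules listed in the comments above."""
    if target_burst_capacity == -1:
        return True             # unlimited burst capacity: always buffer
    if target_burst_capacity == 0:
        return ready_pods == 0  # only when scaled to zero
    excess = ready_pods * max_concurrency - target_burst_capacity - panic_concurrency
    return excess < 0           # the burst exceeds the spare capacity


print(clamp_to_scale_rates(1, 10))         # 5: a scale-in step cannot go below half of 10
print(activator_in_path(2, 100, 211, 50))  # True: 200 - 211 - 50 < 0
print(activator_in_path(5, 100, 211, 50))  # False: 500 - 211 - 50 >= 0
```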

Metrics

You can use the autoscaling.knative.dev/metric annotation to configure metrics for a revision. Different autoscalers support different metrics.

Supported metrics: "concurrency", "rps", "cpu", "memory", and custom metrics.
Default metric: "concurrency".

Configure the concurrency metric

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"

Configure the RPS metric

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "rps"

Configure the CPU metric

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"
        autoscaling.knative.dev/metric: "cpu"

Configure the memory metric

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"
        autoscaling.knative.dev/metric: "memory"
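
The four manifests above differ only in their annotations. The following Python sketch builds the annotation set for each metric shown in this topic. The helper name `autoscaling_annotations` is hypothetical, and the sketch deliberately rejects custom metrics because this topic does not show their configuration.

```python
HPA_CLASS = "hpa.autoscaling.knative.dev"


def autoscaling_annotations(metric: str) -> dict:
    """Return the revision annotations used in the examples above for a metric."""
    if metric not in ("concurrency", "rps", "cpu", "memory"):
        raise ValueError(f"metric not covered by this sketch: {metric}")
    annotations = {"autoscaling.knative.dev/metric": metric}
    if metric in ("cpu", "memory"):
        # The CPU and memory examples also set the autoscaler class to HPA.
        annotations["autoscaling.knative.dev/class"] = HPA_CLASS
    return annotations


for name in ("concurrency", "rps", "cpu", "memory"):
    print(name, autoscaling_annotations(name))
```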

Configure a target

You can use the autoscaling.knative.dev/target annotation to configure a target for a revision. You can also use the container-concurrency-target-default parameter in the config-autoscaler ConfigMap to configure a global target.

Configure a target for a revision

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "50"

Configure a global target

apiVersion: v1
kind: ConfigMap
metadata:
 name: config-autoscaler
 namespace: knative-serving
data:
 container-concurrency-target-default: "200"
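
How the two settings combine can be sketched as a simple lookup. The sketch assumes that the revision annotation takes precedence over the global value and that 100 applies when neither is set. The function name and the plain dictionaries standing in for the annotations and the ConfigMap data are hypothetical.

```python
def effective_target(revision_annotations: dict, config_autoscaler: dict) -> float:
    """Resolve the concurrency target: revision annotation first, then the global default."""
    value = revision_annotations.get("autoscaling.knative.dev/target")
    if value is None:
        # Fall back to the global setting, then to the documented default of 100.
        value = config_autoscaler.get("container-concurrency-target-default", "100")
    return float(value)


global_config = {"container-concurrency-target-default": "200"}
print(effective_target({"autoscaling.knative.dev/target": "50"}, global_config))  # 50.0
print(effective_target({}, global_config))                                        # 200.0
```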

Configure scale-to-zero

Configure global scale-to-zero

The enable-scale-to-zero parameter specifies whether to scale the number of pods to zero when the specified Knative Service is idle. Valid values are "false" and "true".

apiVersion: v1
kind: ConfigMap
metadata:
 name: config-autoscaler
 namespace: knative-serving
data:
 enable-scale-to-zero: "false" # If the parameter is set to "false", the scale-to-zero feature is disabled. In this case, when the specified Knative Service is idle, pods are not scaled to zero.

Configure the graceful period for scale-to-zero

The scale-to-zero-grace-period parameter specifies the graceful period before the pods of the specified Knative Service are scaled to zero.

apiVersion: v1
kind: ConfigMap
metadata:
 name: config-autoscaler
 namespace: knative-serving
data:
 scale-to-zero-grace-period: "40s"

Configure the retention period for scale-to-zero

Configure the retention period for a revision

The autoscaling.knative.dev/scale-to-zero-pod-retention-period annotation specifies the retention period of the last pod before the pods of a Knative Service are scaled to zero.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/scale-to-zero-pod-retention-period: "1m5s"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56

Configure the global retention period
The scale-to-zero-pod-retention-period parameter in the config-autoscaler ConfigMap specifies the global retention period of the last pod before the pods of a Knative Service are scaled to zero.
```
apiVersion: v1
kind: ConfigMap
metadata:
 name: config-autoscaler
 namespace: knative-serving
data:
 scale-to-zero-pod-retention-period: "42s"
```
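
The period values in these examples ("40s", "1m5s", "42s") are duration strings. When you compare or validate such values in your own tooling, it can help to convert them to seconds. The following is a minimal sketch that handles only the h, m, s, and ms units. The function name is hypothetical and the parser is an illustration, not the one that Knative uses.

```python
import re

_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
# "ms" is listed before "m" so that "5ms" is not read as 5 minutes.
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def duration_seconds(text: str) -> float:
    """Convert a duration string such as "1m5s" to seconds."""
    position = 0
    total = 0.0
    for match in _PART.finditer(text):
        if match.start() != position:
            raise ValueError(f"unexpected characters in duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    if position == 0 or position != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


print(duration_seconds("1m5s"))  # 65.0
print(duration_seconds("42s"))   # 42.0
```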

Configure the concurrency

The concurrency indicates the maximum number of requests that a pod can process concurrently. You can configure the concurrency by setting the soft concurrency limit, hard concurrency limit, target utilization, and RPS.

Configure the soft concurrency limit

The soft concurrency limit is a targeted limit rather than a strictly enforced bound. In some scenarios, particularly if a burst of requests occurs, the value may be exceeded.

Configure the soft concurrency limit for a revision

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "200"

Configure the global soft concurrency limit

apiVersion: v1
kind: ConfigMap
metadata:
 name: config-autoscaler
 namespace: knative-serving
data:
 container-concurrency-target-default: "200" # Specify a concurrency target for a Knative Service.

Configure the hard concurrency limit for a revision

Important

We recommend that you specify the hard concurrency limit only if your application has a specific concurrency upper limit. Setting a low hard concurrency limit adversely affects the throughput and response latency of your application.

The hard concurrency limit is a strictly enforced limit. When the hard concurrency limit is reached, excess requests are buffered until sufficient resources can be used to handle the requests.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    spec:
      containerConcurrency: 50
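
The effect of the hard limit on a single pod can be illustrated with a small sketch: requests up to the limit are processed, and the rest wait. The function name is hypothetical, and the sketch models one pod at one instant only, not the buffering mechanism itself.

```python
def split_by_hard_limit(in_flight_requests: int, container_concurrency: int) -> tuple[int, int]:
    """Return (requests being processed, requests buffered) for one pod."""
    if container_concurrency < 1:
        raise ValueError("this sketch expects a hard limit of at least 1")
    processing = min(in_flight_requests, container_concurrency)
    buffered = in_flight_requests - processing
    return processing, buffered


# With containerConcurrency set to 50, 80 simultaneous requests leave 30 waiting.
print(split_by_hard_limit(80, 50))  # (50, 30)
```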

Target utilization

The target utilization specifies the percentage of the configured target that the autoscaler actually aims for, which lets you adjust the effective concurrency value without changing the target itself. The target utilization is also known as the hotness at which a pod runs. A value lower than 100 causes the autoscaler to scale out before the specified hard concurrency limit is reached.

For example, if containerConcurrency is set to 10 and the target utilization is set to 70 (percentage), the autoscaler creates a pod when the average number of concurrent requests across all existing pods reaches 7. It takes a period of time for a pod to enter the Ready state after the pod is created. You can decrease the target utilization value to create new pods before the hard concurrency limit is reached. This helps reduce the response latency caused by cold starts.

Configure the target utilization for a revision

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target-utilization-percentage: "70" # Configure the target utilization percentage. 
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56

Configure the global target utilization

apiVersion: v1
kind: ConfigMap
metadata:
 name: config-autoscaler
 namespace: knative-serving
data:
 container-concurrency-target-percentage: "70" # KPA attempts to keep the concurrency of each pod at or below 70% of the concurrency target.

Configure the RPS

The RPS specifies the number of requests that can be processed by a pod per second.

Configure the RPS for a revision

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "150"
        autoscaling.knative.dev/metric: "rps" # The number of pods is adjusted based on the RPS value. 
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56

Configure the global RPS

apiVersion: v1
kind: ConfigMap
metadata:
 name: config-autoscaler
 namespace: knative-serving
data:
 requests-per-second-target-default: "150"
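
When the rps metric is used, the per-pod target plays the same role as the concurrency target. The following sketch estimates a pod count from an observed request rate. It is a simplification that ignores the target utilization and the stable and panic modes, and the function name and the observed rates are hypothetical.

```python
import math


def desired_pods_for_rps(observed_rps: float, target_rps_per_pod: float = 200.0) -> int:
    """Estimate pods as the observed request rate divided by the per-pod RPS target, rounded up."""
    if target_rps_per_pod <= 0:
        raise ValueError("target_rps_per_pod must be positive")
    return math.ceil(observed_rps / target_rps_per_pod)


# With the target of 150 used in the examples above:
print(desired_pods_for_rps(600, 150))  # 4
print(desired_pods_for_rps(601, 150))  # 5
```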

Scenario 1: Enable auto scaling by setting a concurrency target

This example shows how to enable KPA to perform auto scaling by setting a concurrency target.

For more information about how to deploy Knative, see Deploy Knative in an ACK cluster and Deploy Knative in an ACK Serverless cluster.

Create a file named autoscale-go.yaml and add the following content to the file:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: autoscale-go
  namespace: default
spec:
  template:
    metadata:
      labels:
        app: autoscale-go
      annotations:
        autoscaling.knative.dev/target: "10" # Set the concurrency target to 10. 
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/autoscale-go:0.1

Run the following command to deploy the autoscale-go.yaml file:
```
kubectl apply -f autoscale-go.yaml
```

Obtain the Ingress gateway.

ALB

Run the following command to obtain the Ingress gateway:

kubectl get albconfig knative-internet

Expected output:

NAME               ALBID                    DNSNAME                                              PORT&PROTOCOL   CERTID   AGE
knative-internet   alb-hvd8nngl0lsdra15g0   alb-hvd8nng******.cn-beijing.alb.aliyuncs.com                            2

MSE

Run the following command to obtain the Ingress gateway:

kubectl -n knative-serving get ing stats-ingress

Expected output:

NAME            CLASS                  HOSTS   ADDRESS                         PORTS   AGE
stats-ingress   knative-ingressclass   *       101.201.XX.XX,192.168.XX.XX   80      15d

ASM

Run the following command to obtain the Ingress gateway:

kubectl get svc istio-ingressgateway --namespace istio-system --output jsonpath="{.status.loadBalancer.ingress[*]['ip']}"

Expected output:

121.XX.XX.XX

Kourier

Run the following command to obtain the Ingress gateway:

kubectl -n knative-serving get svc kourier

Expected output:

NAME      TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)                      AGE
kourier   LoadBalancer   10.0.XX.XX    39.104.XX.XX     80:31133/TCP,443:32515/TCP   49m

Use the load testing tool hey to send requests to the application for 30 seconds with 50 concurrent workers.
Note
For more information about hey, see hey.
```
hey -z 30s -c 50   -host "autoscale-go.default.example.com"   "http://121.199.XXX.XXX?sleep=100&prime=10000&bloat=5" # 121.199.XXX.XXX is the IP address of the Ingress gateway.
```
Expected output:
The output indicates that five pods are added.
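
If you run the load test from a script, building the command as an argument list avoids shell quoting problems with the `&` characters in the URL. The following sketch assembles the same hey command as above. The helper name is hypothetical, and the gateway address is the placeholder from this topic, which you replace with the address obtained in the previous step.

```python
import shlex


def hey_command(gateway_ip: str,
                host: str = "autoscale-go.default.example.com",
                duration: str = "30s",
                concurrency: int = 50) -> list:
    """Build the hey invocation used in this scenario as an argument list."""
    url = f"http://{gateway_ip}?sleep=100&prime=10000&bloat=5"
    return ["hey", "-z", duration, "-c", str(concurrency), "-host", host, url]


command = hey_command("121.199.XXX.XXX")  # placeholder address from this topic
print(shlex.join(command))                # a shell-safe form of the command
# To run the test, pass the list to subprocess.run(command, check=True).
```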

Scenario 2: Enable auto scaling by setting scale bounds

Scale bounds control the minimum and maximum numbers of pods that can be provisioned for an application. This example shows how to enable auto scaling by setting scale bounds.

For more information about how to deploy Knative, see Deploy Knative in an ACK cluster and Deploy Knative in an ACK Serverless cluster.

Create a file named autoscale-go.yaml and add the following content to the file:

Set the concurrency target to 10, min-scale to 1, and max-scale to 3.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: autoscale-go
  namespace: default
spec:
  template:
    metadata:
      labels:
        app: autoscale-go
      annotations:
        autoscaling.knative.dev/target: "10"
        autoscaling.knative.dev/min-scale: "1"
        autoscaling.knative.dev/max-scale: "3"   
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/autoscale-go:0.1

Run the following command to deploy the autoscale-go.yaml file:
```
kubectl apply -f autoscale-go.yaml
```

Obtain the Ingress gateway.

ALB

Run the following command to obtain the Ingress gateway:

kubectl get albconfig knative-internet

Expected output:

NAME               ALBID                    DNSNAME                                              PORT&PROTOCOL   CERTID   AGE
knative-internet   alb-hvd8nngl0lsdra15g0   alb-hvd8nng******.cn-beijing.alb.aliyuncs.com                            2

MSE

Run the following command to obtain the Ingress gateway:

kubectl -n knative-serving get ing stats-ingress

Expected output:

NAME            CLASS                  HOSTS   ADDRESS                         PORTS   AGE
stats-ingress   knative-ingressclass   *       101.201.XX.XX,192.168.XX.XX   80      15d

ASM

Run the following command to obtain the Ingress gateway:

kubectl get svc istio-ingressgateway --namespace istio-system --output jsonpath="{.status.loadBalancer.ingress[*]['ip']}"

Expected output:

121.XX.XX.XX

Kourier

Run the following command to obtain the Ingress gateway:

kubectl -n knative-serving get svc kourier

Expected output:

NAME      TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)                      AGE
kourier   LoadBalancer   10.0.XX.XX    39.104.XX.XX     80:31133/TCP,443:32515/TCP   49m

Use the load testing tool hey to send requests to the application for 30 seconds with 50 concurrent workers.
Note
For more information about hey, see hey.
```
hey -z 30s -c 50   -host "autoscale-go.default.example.com"   "http://121.199.XXX.XXX?sleep=100&prime=10000&bloat=5" # 121.199.XXX.XXX is the IP address of the Ingress gateway.
```
Expected output:
At most three pods are added. One pod is reserved when no traffic flows to the application.
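
The effect of the scale bounds in this scenario can be sketched as a clamp on the pod count that the autoscaler would otherwise choose. The function name is hypothetical. Following the config-autoscaler description earlier in this topic, a max-scale of 0 is treated as unlimited.

```python
def apply_scale_bounds(desired: int, min_scale: int = 0, max_scale: int = 0) -> int:
    """Clamp a desired pod count to min-scale and max-scale (max-scale 0 means unlimited)."""
    bounded = max(desired, min_scale)
    if max_scale > 0:
        bounded = min(bounded, max_scale)
    return bounded


# The bounds used in this scenario: min-scale 1 and max-scale 3.
print(apply_scale_bounds(8, min_scale=1, max_scale=3))  # 3: capped under load
print(apply_scale_bounds(0, min_scale=1, max_scale=3))  # 1: one pod kept when idle
```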