Gateway with Inference Extension supports global throttling for clusters to ensure system stability under high concurrency or abnormal traffic conditions. This topic describes how to configure and use the global throttling capabilities of Gateway with Inference Extension.
How it works
Throttling is a mechanism that controls the number of requests sent to a server. It specifies the maximum number of requests a client can make in a given time period, such as 300 requests per minute or 10 requests per second.
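The simplest form of this mechanism is a fixed-window counter: count each client's requests in the current time window and reject requests once the count reaches the limit. The following Python sketch is illustrative only (it is not the gateway's implementation) and mirrors the 3-requests-per-hour policies used later in this topic:

```python
import time

class FixedWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` per key."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # (key, window index) -> request count

    def allow(self, key: str, now=None) -> bool:
        now = time.time() if now is None else now
        bucket = (key, int(now // self.window))
        count = self.counters.get(bucket, 0)
        if count >= self.limit:
            return False  # over the limit for this window
        self.counters[bucket] = count + 1
        return True

# 3 requests per hour, as in the scenarios below
limiter = FixedWindowLimiter(limit=3, window_seconds=3600)
results = [limiter.allow("user-one", now=100.0) for _ in range(4)]
print(results)  # [True, True, True, False]
```

The fourth request in the same window is rejected, which corresponds to the HTTP 429 responses shown in the test output later in this topic.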
When you enable global throttling, Gateway with Inference Extension automatically deploys a centralized throttling service that manages and enforces global throttling policies in real time. The gateway interacts with this service through its built-in throttling filters (such as the rate limit filter), evaluates each incoming request against the preset thresholds (such as requests per second or concurrent connections), and rejects requests that exceed them.
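The reason the counters live in a centralized service rather than in each gateway replica is that every replica must increment the same counter for the limit to hold globally. The sketch below illustrates this idea only; a plain dict stands in for the Redis backend configured later, and the real throttling service issues the equivalent of Redis INCR (with an expiry) rather than this code:

```python
store = {}  # shared counter store (stand-in for Redis)

def incr(store, key: str) -> int:
    """Roughly what Redis INCR does: create-or-increment, return the new value."""
    store[key] = store.get(key, 0) + 1
    return store[key]

def allowed(store, descriptor: str, limit: int) -> bool:
    # Each request increments the shared counter for its descriptor
    # (for example, a header value) and is allowed while under the limit.
    return incr(store, descriptor) <= limit

# Two "gateway replicas" sharing one store: the limit of 3 holds globally,
# not 3 per replica.
decisions = [allowed(store, "x-user-id=one", 3) for _ in range(2)]   # replica A
decisions += [allowed(store, "x-user-id=one", 3) for _ in range(2)]  # replica B
print(decisions)  # [True, True, True, False]
```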
Prerequisites
You have installed Gateway with Inference Extension version 1.4.0 or later.
You have completed the steps in Preparations.
Procedure
Step 1: Enable global throttling
The automatically deployed global throttling service requires a Redis service as global storage. This topic uses a self-managed Redis service as an example. You can also use Tair (Redis OSS-compatible) to quickly create a Redis instance, and then update the corresponding connection settings in the ack-gateway-config ConfigMap in the envoy-gateway-system namespace. For more information, see Envoy Gateway.
Create a file named `redis-service.yaml` to deploy Redis.

```yaml
kind: Namespace
apiVersion: v1
metadata:
  name: redis-system
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  namespace: redis-system
  labels:
    app: redis
spec:
  serviceName: "redis"
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - image: registry-cn-hangzhou.ack.aliyuncs.com/dev/redis:6.0.6-for-ack-gateway
        name: redis
        ports:
        - containerPort: 6379
        resources:
          limits:
            cpu: 1500m
            memory: 512Mi
          requests:
            cpu: 200m
            memory: 256Mi
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: redis-system
  labels:
    app: redis
spec:
  ports:
  - name: redis
    port: 6379
    protocol: TCP
    targetPort: 6379
  selector:
    app: redis
```

Create a file named `enable-global-rate-limit.yaml`. This ConfigMap configures the gateway to use your Redis instance.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ack-gateway-config
  namespace: envoy-gateway-system
data:
  ack-gateway.yaml: |
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyGateway
    rateLimit:
      backend:
        type: Redis
        redis:
          url: redis.redis-system.svc.cluster.local:6379
```

Deploy the Redis service and enable global throttling.

```shell
kubectl apply -f redis-service.yaml
kubectl apply -f enable-global-rate-limit.yaml
```
Step 2: Deploy a sample HTTPRoute
Create an HTTPRoute resource that will be the target for your throttling policy.
Create a file named `httproute.yaml`.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: http-ratelimit
spec:
  parentRefs:
  - name: eg
  hostnames:
  - ratelimit.example
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: ""
      kind: Service
      name: backend
      port: 3000
```

Deploy the HTTPRoute resource.

```shell
kubectl apply -f httproute.yaml
```

Get the public IP address of the gateway.

```shell
export GATEWAY_HOST=$(kubectl get gateway/eg -o jsonpath='{.status.addresses[0].value}')
```
Step 3: Configure and test throttling scenarios
Scenario 1: Throttling by user ID
In this scenario, we will limit requests from a specific user (identified by the header x-user-id: one) to 3 requests per hour.
Create a `BackendTrafficPolicy` resource in a file named `backendtrafficpolicy.yaml`.

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: policy-httproute
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: http-ratelimit
  rateLimit:
    type: Global
    global:
      rules:
      - clientSelectors:
        - headers:
          - name: x-user-id
            value: one
        limit:
          requests: 3
          unit: Hour
```

Apply the throttling policy.

```shell
kubectl apply -f backendtrafficpolicy.yaml
```

Test the throttling by sending four requests with the `x-user-id: one` header.

```shell
for i in {1..4}; do kubectl exec deployment/sleep -it -- curl -I --header "Host: ratelimit.example" --header "x-user-id: one" http://$GATEWAY_HOST/get ; sleep 1; done
```

Expected output:

```
HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:47:49 GMT
content-length: 504
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 2
x-ratelimit-reset: 731

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:47:50 GMT
content-length: 504
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 1
x-ratelimit-reset: 730

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:47:52 GMT
content-length: 504
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 0
x-ratelimit-reset: 728

HTTP/1.1 429 Too Many Requests
x-envoy-ratelimited: true
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 0
x-ratelimit-reset: 727
date: Tue, 27 May 2025 07:47:52 GMT
transfer-encoding: chunked
```

The first three requests succeeded (HTTP 200), and the fourth request was rejected (HTTP 429).
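The `x-ratelimit-*` headers in the output report the throttling state: the active limit and its window, the remaining quota, and the seconds until the counter resets (the `3, 3;w=3600` form appears to follow the IETF RateLimit header fields draft format). A small Python sketch of how a client could read them; the header shapes are assumed to match the output above:

```python
def parse_ratelimit_headers(raw: str) -> dict:
    """Pull the throttling state out of a raw HTTP response header block.

    Assumes header shapes like 'x-ratelimit-limit: 3, 3;w=3600',
    where 'w=3600' is the window length in seconds.
    """
    info = {}
    for line in raw.splitlines():
        if ":" not in line:
            continue  # skip the status line
        name, _, value = line.partition(":")
        name, value = name.strip().lower(), value.strip()
        if name == "x-ratelimit-limit":
            # "3, 3;w=3600" -> active limit, then the policy with its window
            active, _, policy = value.partition(",")
            info["limit"] = int(active)
            if "w=" in policy:
                info["window_seconds"] = int(policy.split("w=")[1])
        elif name == "x-ratelimit-remaining":
            info["remaining"] = int(value)
        elif name == "x-ratelimit-reset":
            info["reset_seconds"] = int(value)
    return info

sample = """HTTP/1.1 200 OK
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 2
x-ratelimit-reset: 731"""
print(parse_ratelimit_headers(sample))
# {'limit': 3, 'window_seconds': 3600, 'remaining': 2, 'reset_seconds': 731}
```

A client can use `remaining` and `reset_seconds` to back off before hitting the 429 response.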
Test the throttling by sending four requests with the `x-user-id: two` header.

```shell
for i in {1..4}; do kubectl exec deployment/sleep -it -- curl -I --header "Host: ratelimit.example" --header "x-user-id: two" http://$GATEWAY_HOST/get ; sleep 1; done
```

Expected output:

```
HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:50:11 GMT
content-length: 504

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:50:12 GMT
content-length: 504

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:50:14 GMT
content-length: 504

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:50:15 GMT
content-length: 504
```

All four requests returned `200`, indicating that the throttling policy did not limit requests with the `x-user-id: two` header.
Scenario 2: Throttling all users except administrators
In this scenario, we will limit every unique user to 3 requests per hour, except for the user admin.
Edit the throttling policy.

```shell
kubectl edit BackendTrafficPolicy policy-httproute
```

Update the `rateLimit` section as follows, where the `invert: true` flag excludes the admin user from the policy:

```yaml
...
  rateLimit:
    type: Global
    global:
      rules:
      - clientSelectors:
        - headers:
          - type: Distinct
            name: x-user-id
          - name: x-user-id
            value: admin
            invert: true
        limit:
          requests: 3
          unit: Hour
```

After saving and exiting, the throttling policy takes effect immediately.
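The combined selector semantics can be summarized as: `type: Distinct` keeps a separate counter per `x-user-id` value, and the inverted matcher makes the rule apply only when `x-user-id` is not `admin`. A Python sketch of these semantics (illustrative only, not Envoy's implementation):

```python
from collections import defaultdict

counters = defaultdict(int)  # Distinct: one counter per x-user-id value
LIMIT = 3

def allow(headers: dict) -> bool:
    user = headers.get("x-user-id")
    if user == "admin":
        return True  # invert: true -> the rule does not match admin
    counters[user] += 1
    return counters[user] <= LIMIT

one = [allow({"x-user-id": "one"}) for _ in range(4)]
admin = [allow({"x-user-id": "admin"}) for _ in range(4)]
print(one, admin)  # [True, True, True, False] [True, True, True, True]
```

Each non-admin user gets an independent 3-request budget, while admin requests are never counted, matching the test results below.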
Test the throttling by sending four requests with the `x-user-id: one` header.

```shell
for i in {1..4}; do kubectl exec deployment/sleep -it -- curl -I --header "Host: ratelimit.example" --header "x-user-id: one" http://$GATEWAY_HOST/get ; sleep 1; done
```

Expected output:

```
HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:47:49 GMT
content-length: 504
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 2
x-ratelimit-reset: 731

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:47:50 GMT
content-length: 504
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 1
x-ratelimit-reset: 730

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:47:52 GMT
content-length: 504
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 0
x-ratelimit-reset: 728

HTTP/1.1 429 Too Many Requests
x-envoy-ratelimited: true
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 0
x-ratelimit-reset: 727
date: Tue, 27 May 2025 07:47:52 GMT
transfer-encoding: chunked
```

The first three requests succeeded (HTTP 200), and the fourth request was rejected (HTTP 429).
Test the throttling by sending four requests with the `x-user-id: two` header.

```shell
for i in {1..4}; do kubectl exec deployment/sleep -it -- curl -I --header "Host: ratelimit.example" --header "x-user-id: two" http://$GATEWAY_HOST/get ; sleep 1; done
```

Expected output:

```
HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:53:38 GMT
content-length: 504
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 2
x-ratelimit-reset: 382

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:53:39 GMT
content-length: 504
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 1
x-ratelimit-reset: 381

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:53:41 GMT
content-length: 504
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 0
x-ratelimit-reset: 379

HTTP/1.1 429 Too Many Requests
x-envoy-ratelimited: true
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 0
x-ratelimit-reset: 378
date: Tue, 27 May 2025 07:53:41 GMT
transfer-encoding: chunked
```

The first three requests succeeded (HTTP 200), and the fourth request was rejected (HTTP 429).
Test the throttling by sending four requests with the `x-user-id: admin` header.

```shell
for i in {1..4}; do kubectl exec deployment/sleep -it -- curl -I --header "Host: ratelimit.example" --header "x-user-id: admin" http://$GATEWAY_HOST/get ; sleep 1; done
```

Expected output:

```
HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:57:44 GMT
content-length: 506

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:57:45 GMT
content-length: 506

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:57:46 GMT
content-length: 506

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 07:57:47 GMT
content-length: 506
```

All four requests returned `200`, indicating that the throttling policy did not limit requests with the `x-user-id: admin` header.
Scenario 3: Throttling all requests
In this scenario, we will limit all requests to 3 requests per hour.
Edit the throttling policy.

```shell
kubectl edit BackendTrafficPolicy policy-httproute
```

Update the `rateLimit` section with the following:

```yaml
...
  rateLimit:
    type: Global
    global:
      rules:
      - limit:
          requests: 3
          unit: Hour
```

After saving and exiting, the throttling policy takes effect immediately.
Test the throttling by sending four requests.

```shell
for i in {1..4}; do kubectl exec deployment/sleep -it -- curl -I --header "Host: ratelimit.example" http://$GATEWAY_HOST/get ; sleep 1; done
```

Expected output:

```
HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 08:02:53 GMT
content-length: 473
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 2
x-ratelimit-reset: 3427

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 08:02:55 GMT
content-length: 473
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 1
x-ratelimit-reset: 3425

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 08:02:56 GMT
content-length: 473
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 0
x-ratelimit-reset: 3424

HTTP/1.1 429 Too Many Requests
x-envoy-ratelimited: true
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 0
x-ratelimit-reset: 3423
date: Tue, 27 May 2025 08:02:57 GMT
transfer-encoding: chunked
```

The first three requests succeeded (HTTP 200), and the fourth request was rejected (HTTP 429), indicating that the throttling policy has taken effect.
Scenario 4: Throttling by client IP address
In this scenario, we will limit each unique source IP address within an IP range to 3 requests per hour.
The IP range 0.0.0.0/0 is used for demonstration purposes. Modify this setting as needed.
Edit the throttling policy.

```shell
kubectl edit BackendTrafficPolicy policy-httproute
```

Update the `rateLimit` section with the following:

```yaml
...
  rateLimit:
    type: Global
    global:
      rules:
      - clientSelectors:
        - sourceCIDR:
            value: 0.0.0.0/0
            type: Distinct
        limit:
          requests: 3
          unit: Hour
```

After saving and exiting, the throttling rule takes effect immediately.
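Conceptually, the selector first checks whether the client's source address falls inside the CIDR, and `type: Distinct` then gives each matching address its own counter. A Python sketch of these semantics, using example TEST-NET addresses (illustrative only; the gateway performs this matching, not your code):

```python
import ipaddress
from collections import defaultdict

CIDR = ipaddress.ip_network("0.0.0.0/0")
LIMIT = 3
per_ip = defaultdict(int)  # Distinct: one counter per source IP

def allow(client_ip: str) -> bool:
    if ipaddress.ip_address(client_ip) not in CIDR:
        return True  # outside the range: the rule does not apply
    per_ip[client_ip] += 1
    return per_ip[client_ip] <= LIMIT

a = [allow("203.0.113.7") for _ in range(4)]   # first client IP
b = [allow("198.51.100.9") for _ in range(4)]  # second client IP, own counter
print(a, b)  # [True, True, True, False] [True, True, True, False]
```

Each source IP gets an independent 3-request budget; in the test below, all four requests come from the same client and therefore share one counter.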
Test the throttling by sending four requests.

```shell
for i in {1..4}; do kubectl exec deployment/sleep -it -- curl -I --header "Host: ratelimit.example" http://$GATEWAY_HOST/get ; sleep 1; done
```

Expected output:

```
HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 08:02:53 GMT
content-length: 473
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 2
x-ratelimit-reset: 3427

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 08:02:55 GMT
content-length: 473
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 1
x-ratelimit-reset: 3425

HTTP/1.1 200 OK
content-type: application/json
x-content-type-options: nosniff
date: Tue, 27 May 2025 08:02:56 GMT
content-length: 473
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 0
x-ratelimit-reset: 3424

HTTP/1.1 429 Too Many Requests
x-envoy-ratelimited: true
x-ratelimit-limit: 3, 3;w=3600
x-ratelimit-remaining: 0
x-ratelimit-reset: 3423
date: Tue, 27 May 2025 08:02:57 GMT
transfer-encoding: chunked
```

The first three requests succeeded (HTTP 200), and the fourth request was rejected (HTTP 429), indicating that the throttling policy for `0.0.0.0/0` has taken effect.
(Optional) Step 4: Clean up resources
Clean up the throttling policy.

```shell
kubectl delete BackendTrafficPolicy policy-httproute
```

Clean up the other resources created in this topic.

```shell
kubectl delete -f httproute.yaml
kubectl delete -f redis-service.yaml
kubectl delete -f enable-global-rate-limit.yaml
```