
Alibaba Cloud Service Mesh:Autoscale pods based on Mixerless Telemetry metrics

Last Updated: Mar 11, 2026

When application traffic fluctuates, manually adjusting pod replicas is slow and error-prone. Mixerless Telemetry in Alibaba Cloud Service Mesh (ASM) collects telemetry data from Istio sidecar proxies without code changes. By feeding these metrics into a Horizontal Pod Autoscaler (HPA), pods scale automatically based on real-time traffic signals -- request rate, average latency, or P95 latency.

How it works

Autoscaling with Mixerless Telemetry follows this data flow:

Istio sidecar proxy (collects metrics) --> Prometheus (stores & queries) --> Metrics adapter (bridges the Kubernetes External Metrics API) --> HPA (evaluates thresholds) --> Pods scale out/in
  1. Istio sidecar proxies generate telemetry metrics (istio_requests_total, istio_request_duration_milliseconds_*) for every request.

  2. Prometheus scrapes and stores these metrics.

  3. A metrics adapter exposes Prometheus query results through the Kubernetes External Metrics API.

  4. The HPA evaluates the metrics against your defined thresholds and adjusts the pod replica count.
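
The flow above ends in a simple calculation. Conceptually, for an External metric with an AverageValue target, the HPA computes the desired replica count as ceil(totalMetric / targetAverageValue), clamped to the configured bounds. A minimal Python sketch of that decision (illustration only; it ignores the HPA's tolerance and stabilization windows):

```python
import math

def desired_replicas(total_metric: float, target_average: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Sketch of the HPA decision for an External metric with an
    AverageValue target: ceil(totalMetric / targetAverage), clamped
    to [minReplicas, maxReplicas]."""
    desired = math.ceil(total_metric / target_average)
    return max(min_replicas, min(max_replicas, desired))

# 23 req/s observed across the namespace, 10 req/s-per-pod target:
print(desired_replicas(23.0, 10.0, 1, 5))  # 3
```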

Prerequisites

Before you begin, ensure that you have:

- An ASM instance with Mixerless Telemetry enabled, and a Kubernetes cluster added to the instance.
- Prometheus deployed in the cluster and reachable at http://prometheus.istio-system.svc:9090.
- The kubeconfig file of the cluster, with kubectl and Helm installed locally.
- A namespace named test whose workloads have Istio sidecar injection enabled.

Deploy the metrics adapter and load tester

Before creating HPAs, deploy a metrics adapter to bridge Prometheus and the Kubernetes Metrics API, and a load tester to simulate traffic for validation.

Deploy the metrics adapter

Run the following Helm command to install the metrics adapter in the kube-system namespace:

helm --kubeconfig <kubeconfig-path> -n kube-system install asm-custom-metrics \
  $KUBE_METRICS_ADAPTER_SRC/deploy/charts/kube-metrics-adapter \
  --set prometheus.url=http://prometheus.istio-system.svc:9090

Replace <kubeconfig-path> with the path to your kubeconfig file. The KUBE_METRICS_ADAPTER_SRC variable must point to your local copy of the kube-metrics-adapter source repository, which contains the Helm chart used above.

Note

For the complete deployment script, see demo_hpa.sh on GitHub.

Verify the metrics adapter

  1. Check that the metrics adapter pod is running:

       kubectl --kubeconfig <kubeconfig-path> get po -n kube-system | grep metrics-adapter

     Expected output:

       asm-custom-metrics-kube-metrics-adapter-6fb4949988-ht8pv   1/1     Running     0          30s
  2. Confirm that the autoscaling/v2beta2 API is available:

       kubectl --kubeconfig <kubeconfig-path> api-versions | grep "autoscaling/v2beta"

     Expected output:

       autoscaling/v2beta1
       autoscaling/v2beta2
  3. Verify that the External Metrics API endpoint is accessible:

       kubectl --kubeconfig <kubeconfig-path> get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .

     Expected output:

       {
         "kind": "APIResourceList",
         "apiVersion": "v1",
         "groupVersion": "external.metrics.k8s.io/v1beta1",
         "resources": []
       }

     The resources array is empty because no HPA has registered external metrics yet.

Deploy the Flagger load tester

  1. Download the Flagger YAML files from the Flagger GitHub repository.

  2. Deploy the load tester in the test namespace:

       kubectl --kubeconfig <kubeconfig-path> apply -f <flagger-path>/kustomize/tester/deployment.yaml -n test
       kubectl --kubeconfig <kubeconfig-path> apply -f <flagger-path>/kustomize/tester/service.yaml -n test

     Replace <flagger-path> with the local path to the downloaded Flagger repository.

Create HPAs based on Istio metrics

This section shows three HPA configurations, each targeting a different Istio metric. Choose the metric that best matches your scaling needs, or deploy multiple HPAs for the same workload.

Metrics reference

The following table summarizes the Istio metrics available for HPA scaling:

Metric | Type | Typical use
istio_requests_total | COUNTER | Scale based on requests per second
istio_request_duration_milliseconds_sum / _count | COUNTER | Used together to compute average latency; scale based on average response time
istio_request_duration_milliseconds_bucket | HISTOGRAM | Scale based on percentile latency (P95, P99)

Common PromQL labels:

Label | Description
destination_workload_namespace | Namespace of the destination workload
destination_canonical_service | Canonical name of the destination service
reporter | "destination" for server-side metrics, "source" for client-side metrics
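
Combining these labels narrows a query to a single workload. For example, this hypothetical variant of the Option 1 request-rate query below counts only server-side requests to the podinfo service:

```promql
sum(rate(istio_requests_total{destination_workload_namespace="test",destination_canonical_service="podinfo",reporter="destination"}[1m]))
```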

Option 1: Scale based on request count

This HPA scales the podinfo deployment when the per-second request rate exceeds a threshold.

  1. Create a file named requests_total_hpa.yaml:

       apiVersion: autoscaling/v2beta2
       kind: HorizontalPodAutoscaler
       metadata:
         name: podinfo-total
         namespace: test
         annotations:
           metric-config.external.prometheus-query.prometheus/processed-requests-per-second: |
             sum(rate(istio_requests_total{destination_workload_namespace="test",reporter="destination"}[1m]))
       spec:
         maxReplicas: 5
         minReplicas: 1
         scaleTargetRef:
           apiVersion: apps/v1
           kind: Deployment
           name: podinfo
         metrics:
           - type: External
             external:
               metric:
                 name: prometheus-query
                 selector:
                   matchLabels:
                     query-name: processed-requests-per-second
               target:
                 type: AverageValue
                 averageValue: "10"

     Key fields:

     Field | Description
     annotations | Defines the PromQL query. This query computes the per-second request rate for all services in the test namespace, using server-side (reporter="destination") metrics over a 1-minute window.
     averageValue: "10" | Triggers a scale-out when the average request rate reaches or exceeds 10 requests per second per pod.
     maxReplicas / minReplicas | Pod count stays between 1 and 5.
  2. Deploy the HPA:

       kubectl --kubeconfig <kubeconfig-path> apply -f resources_hpa/requests_total_hpa.yaml
  3. Verify that the HPA registered its external metric:

       kubectl --kubeconfig <kubeconfig-path> get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .

     Expected output:

       {
         "kind": "APIResourceList",
         "apiVersion": "v1",
         "groupVersion": "external.metrics.k8s.io/v1beta1",
         "resources": [
           {
             "name": "prometheus-query",
             "singularName": "",
             "namespaced": true,
             "kind": "ExternalMetricValueList",
             "verbs": [
               "get"
             ]
           }
         ]
       }

     The prometheus-query resource now appears in the list, confirming that the HPA is connected to the metrics adapter.

Option 2: Scale based on average latency

This HPA scales the podinfo deployment when the average request latency exceeds a threshold.

Create a file named podinfo-latency-avg.yaml, then deploy it with kubectl apply:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: podinfo-latency-avg
  namespace: test
  annotations:
    metric-config.external.prometheus-query.prometheus/latency-average: |
      sum(rate(istio_request_duration_milliseconds_sum{destination_workload_namespace="test",reporter="destination"}[1m]))
      /sum(rate(istio_request_duration_milliseconds_count{destination_workload_namespace="test",reporter="destination"}[1m]))
spec:
  maxReplicas: 5
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  metrics:
    - type: External
      external:
        metric:
          name: prometheus-query
          selector:
            matchLabels:
              query-name: latency-average
        target:
          type: AverageValue
          averageValue: "0.005"

Key fields:

Field | Description
annotations | The PromQL query divides the total request duration by the request count to compute the average latency. Because the istio_request_duration_milliseconds metrics are recorded in milliseconds, the result is in milliseconds.
averageValue: "0.005" | Triggers a scale-out when the average latency reaches or exceeds 0.005 (compared in the same unit as the query result, milliseconds).
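
As a toy illustration of the arithmetic in the latency-average query (made-up rates, not real telemetry): dividing the per-second rate of the _sum series (milliseconds of request time accumulated per second) by the rate of the _count series (requests per second) yields the mean latency in milliseconds:

```python
# sum(rate(istio_request_duration_milliseconds_sum[1m]))   -> ms of latency per second
# sum(rate(istio_request_duration_milliseconds_count[1m])) -> requests per second
duration_ms_rate = 52.0   # hypothetical value of the _sum rate
request_rate = 10.0       # hypothetical value of the _count rate

avg_latency_ms = duration_ms_rate / request_rate
print(avg_latency_ms)  # 5.2
```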

Option 3: Scale based on P95 latency

This HPA scales the podinfo deployment when the 95th percentile (P95) request latency exceeds a threshold.

Create a file named podinfo-p95.yaml, then deploy it with kubectl apply:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: podinfo-p95
  namespace: test
  annotations:
    metric-config.external.prometheus-query.prometheus/p95-latency: |
      histogram_quantile(0.95,sum(irate(istio_request_duration_milliseconds_bucket{destination_workload_namespace="test",destination_canonical_service="podinfo"}[5m]))by (le))
spec:
  maxReplicas: 5
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  metrics:
    - type: External
      external:
        metric:
          name: prometheus-query
          selector:
            matchLabels:
              query-name: p95-latency
        target:
          type: AverageValue
          averageValue: "4"

Key fields:

Field | Description
annotations | The PromQL query uses histogram_quantile(0.95, ...) to compute the P95 latency from the histogram buckets, applying irate over a 5-minute lookback window. Because irate uses only the two most recent samples in that window, it reacts quickly to latency changes.
averageValue: "4" | Triggers a scale-out when the P95 latency reaches or exceeds 4 ms.
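
To make the bucket math concrete, here is a rough Python sketch of how histogram_quantile estimates a quantile from cumulative le buckets (toy rates and simplified edge-case handling compared with Prometheus):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative ("le") histogram buckets,
    given as (upper_bound_ms, cumulative_rate) pairs sorted by bound and
    ending with float("inf"). Linearly interpolates inside the bucket
    that contains the q-th observation, as Prometheus does."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # fall back to the last finite bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Toy cumulative request rates per latency bucket (ms):
buckets = [(1, 40.0), (5, 90.0), (10, 99.0), (float("inf"), 100.0)]
print(round(histogram_quantile(0.95, buckets), 2))  # 7.78
```

Here 95% of requests fall at or below a point inside the 5-10 ms bucket, so the estimate lands between those bounds.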

Verify autoscaling behavior

These steps validate the request-count-based HPA (podinfo-total). To test other HPAs, adjust the load parameters to trigger the corresponding metric threshold.

  1. Send sustained traffic (about 20 requests per second in total) for 5 minutes:

       alias k="kubectl --kubeconfig $USER_CONFIG"
       loadtester=$(k -n test get pod -l "app=flagger-loadtester" -o jsonpath='{.items..metadata.name}')
       k -n test exec -it ${loadtester} -c loadtester -- hey -z 5m -c 2 -q 10 http://podinfo:9898

     Flag | Description
     -z 5m | Send requests for 5 minutes
     -c 2 | Use 2 concurrent connections
     -q 10 | Limit each connection to 10 requests per second
  2. In a separate terminal, watch the HPA status:

       watch kubectl --kubeconfig $USER_CONFIG -n test get hpa/podinfo-total

     Expected output:

       NAME      REFERENCE            TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
       podinfo   Deployment/podinfo   10056m/10 (avg)   1         5         2          4m45s

     The REPLICAS column shows 2: the HPA scaled the deployment from 1 to 2 pods in response to the traffic load.
  3. Increase the traffic to 15 requests per second per connection (about 30 req/s in total):

       alias k="kubectl --kubeconfig $USER_CONFIG"
       loadtester=$(k -n test get pod -l "app=flagger-loadtester" -o jsonpath='{.items..metadata.name}')
       k -n test exec -it ${loadtester} -c loadtester -- hey -z 5m -c 2 -q 15 http://podinfo:9898
  4. Watch the HPA status again:

       watch kubectl --kubeconfig $USER_CONFIG -n test get hpa/podinfo-total

     Expected output:

       NAME      REFERENCE            TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
       podinfo   Deployment/podinfo   10056m/10 (avg)   1         5         3          4m45s

     The REPLICAS column now shows 3. As traffic increases, the HPA adds pods up to the maxReplicas limit (5). When traffic drops, the HPA scales the deployment back down to minReplicas (1).
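
The TARGETS column uses Kubernetes milli notation: 10056m means 10.056. A small Python sketch of how to read that value, and why a near-target reading does not trigger further scaling (the HPA controller's default tolerance is 10%, a controller-manager setting):

```python
def parse_milli(value: str) -> float:
    """Parse kubectl's milli-suffixed quantities, e.g. "10056m" -> 10.056.
    (Sketch only: real Kubernetes quantities support more suffixes.)"""
    return float(value[:-1]) / 1000.0 if value.endswith("m") else float(value)

current_avg = parse_milli("10056m")   # observed average per pod
ratio = current_avg / 10.0            # vs. the 10 req/s target
print(round(ratio, 4))                # 1.0056
# |ratio - 1| is below the default 10% tolerance, so the HPA
# leaves the replica count unchanged.
```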

What's next