
Container Compute Service: Use Gateway with Inference Extension to monitor Generative AI requests

Last Updated: Mar 26, 2026

Gateway with Inference Extension exports metrics and access logs for generative AI (GenAI) requests following the OpenTelemetry (OTel) GenAI Semantic Conventions. This topic describes how to deploy the observability plugin, configure metrics and log output, and verify the results.

Background information

The OpenTelemetry GenAI Semantic Conventions are a set of standardized guidelines for monitoring and tracing generative AI applications, such as those that use large language models (LLMs), text generation, or image generation. These conventions unify metrics, logs, and traces for GenAI requests, simplifying cross-system analysis and troubleshooting. The core objectives of the specification are:

  • Standardize data collection: Define common attributes for GenAI requests, such as model name, input and output token counts, and configuration parameters.

  • Enable end-to-end tracing: Correlate GenAI requests with traces from other systems, such as databases and API gateways.

  • Unify analysis and monitoring: Enable tools like Prometheus and Grafana to easily aggregate and visualize data through standardized labels.

Metrics reference

The following metric is exported for each GenAI request.

| Metric | Type | Description | Key labels |
| --- | --- | --- | --- |
| gen_ai_client_operation_duration | Histogram | End-to-end duration of a GenAI operation | gen_ai_operation_name, gen_ai_system, gen_ai_request_model, gen_ai_response_model, gen_ai_error_type, server_port, server_address |

Labels map to the following OTel GenAI attributes:

| Label | Description | Example |
| --- | --- | --- |
| gen_ai_operation_name | Type of GenAI operation | chat |
| gen_ai_system | GenAI provider or system | example.com |
| gen_ai_request_model | Model name from the request | mock |
| gen_ai_response_model | Model name from the response | mock |
| gen_ai_error_type | Error type if the request failed; empty if successful | (empty) |
| server_port | Upstream server port | 8000 |
| server_address | Upstream server address | 10.3.0.9:8000 |
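Note that each Prometheus label name is simply the corresponding OTel attribute name with every dot replaced by an underscore, because Prometheus label names cannot contain dots. A quick illustrative sketch of the mapping:

```python
# OTel GenAI attribute names, as defined by the semantic conventions.
otel_attributes = [
    "gen_ai.operation.name", "gen_ai.system",
    "gen_ai.request.model", "gen_ai.response.model",
    "gen_ai.error.type", "server.port", "server.address",
]

# Prometheus label names forbid dots, so each dot becomes an underscore.
prometheus_labels = [attr.replace(".", "_") for attr in otel_attributes]
print(prometheus_labels[0])  # gen_ai_operation_name
```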

Prerequisites

Before you begin, make sure you have:

  • A cluster with Gateway with Inference Extension installed.

  • The mock-vllm sample application deployed, including the mock-route HTTPRoute and the custom-proxy-config EnvoyProxy resource.

Configure observability data output

Deploy the GenAI observability plugin

The gen-ai-telemetry plugin is a WebAssembly (WASM) plugin delivered as a container image. It intercepts GenAI requests at the gateway level to extract token counts, model names, and timing data, then injects this data into the gateway's metrics and access logs.

The plugin must buffer the full request body to parse its contents, so enabling it increases the gateway's memory usage in proportion to request body size. If a request body exceeds the default buffer limit, the gateway returns HTTP 413. See Troubleshooting for how to increase the limit.

Apply the following EnvoyExtensionPolicy to deploy the plugin and attach it to the mock-route HTTPRoute:

kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyExtensionPolicy
metadata:
  name: ack-gateway-llm-telemetry
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: mock-route
  wasm:
  - name: llm-telemetry
    rootID: ack-gateway-extension
    code:
      type: Image
      image:
        url: registry-cn-hangzhou.ack.aliyuncs.com/acs/gen-ai-telemetry-wasmplugin:g76f5a66-aliyun
EOF

If your cluster cannot pull images over the public internet, use the VPC endpoint for your region instead. For example, for a cluster in the China (Beijing) region:

registry-cn-beijing-vpc.ack.aliyuncs.com/acs/gen-ai-telemetry-wasmplugin:<image_tag>

For available image tags, see gen-ai-telemetry plugin release history.

Configure gateway metrics tag rules

Deploying the mock-vllm application creates an EnvoyProxy resource named custom-proxy-config. To expose GenAI attributes as Prometheus labels, add metrics tag rules to this resource.

Apply the following configuration:

kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: default
spec:
  bootstrap:
    type: JSONPatch
    jsonPatches:
    - op: add
      path: /stats_config
      value:
        stats_tags:
          - tag_name: gen_ai.operation.name
            regex: "(\\|gen_ai.operation.name=([^|]*))"
          - tag_name: gen_ai.system
            regex: "(\\|gen_ai.system=([^|]*))"
          - tag_name: gen_ai.token.type
            regex: "(\\|gen_ai.token.type=([^|]*))"
          - tag_name: gen_ai.request.model
            regex: "(\\|gen_ai.request.model=([^|]*))"
          - tag_name: gen_ai.response.model
            regex: "(\\|gen_ai.response.model=([^|]*))"
          - tag_name: gen_ai.error.type
            regex: "(\\|gen_ai.error.type=([^|]*))"
          - tag_name: server.port
            regex: "(\\|server.port=([^|]*))"
          - tag_name: server.address
            regex: "(\\|server.address=([^|]*))"
EOF

The configuration takes effect immediately after the resource is updated.
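Each stats_tags rule works the same way: capture group 1 matches the whole pipe-delimited |key=value segment, which Envoy strips from the metric name, and capture group 2 becomes the label value. A minimal sketch of the gen_ai.request.model pattern, applied to an illustrative stat name:

```python
import re

# The gen_ai.request.model pattern from the stats_tags configuration above.
pattern = re.compile(r"(\|gen_ai.request.model=([^|]*))")

# Illustrative stat name carrying pipe-delimited GenAI attributes.
stat = ("gen_ai_client_operation_duration"
        "|gen_ai.operation.name=chat|gen_ai.request.model=mock|server.port=8000")

match = pattern.search(stat)
print(match.group(2))                  # label value: mock
print(pattern.sub("", stat, count=1))  # stat name with the matched segment removed
```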

Configure log output

To add GenAI-specific fields to the gateway's access logs, apply the following spec.telemetry configuration to the custom-proxy-config resource:

kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: default
spec:
  telemetry:
    accessLog:
      disable: false
      settings:
      - sinks:
        - type: File
          file:
            path: /dev/stdout
        format:
          type: JSON
          json:
            # Default access log fields
            start_time: "%START_TIME%"
            method: "%REQ(:METHOD)%"
            x-envoy-origin-path: "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%"
            protocol: "%PROTOCOL%"
            response_code: "%RESPONSE_CODE%"
            response_flags: "%RESPONSE_FLAGS%"
            response_code_details: "%RESPONSE_CODE_DETAILS%"
            connection_termination_details: "%CONNECTION_TERMINATION_DETAILS%"
            upstream_transport_failure_reason: "%UPSTREAM_TRANSPORT_FAILURE_REASON%"
            bytes_received: "%BYTES_RECEIVED%"
            bytes_sent: "%BYTES_SENT%"
            duration: "%DURATION%"
            x-envoy-upstream-service-time: "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%"
            x-forwarded-for: "%REQ(X-FORWARDED-FOR)%"
            user-agent: "%REQ(USER-AGENT)%"
            x-request-id: "%REQ(X-REQUEST-ID)%"
            :authority: "%REQ(:AUTHORITY)%"
            upstream_host: "%UPSTREAM_HOST%"
            upstream_cluster: "%UPSTREAM_CLUSTER%"
            upstream_local_address: "%UPSTREAM_LOCAL_ADDRESS%"
            downstream_local_address: "%DOWNSTREAM_LOCAL_ADDRESS%"
            downstream_remote_address: "%DOWNSTREAM_REMOTE_ADDRESS%"
            requested_server_name: "%REQUESTED_SERVER_NAME%"
            route_name: "%ROUTE_NAME%"
            # GenAI-specific fields
            gen_ai.operation.name: "%FILTER_STATE(wasm.gen_ai.operation.name:PLAIN)%"
            gen_ai.system: "%FILTER_STATE(wasm.gen_ai.system:PLAIN)%"
            gen_ai.request.model: "%FILTER_STATE(wasm.gen_ai.request.model:PLAIN)%"
            gen_ai.response.model: "%FILTER_STATE(wasm.gen_ai.response.model:PLAIN)%"
            gen_ai.error.type: "%FILTER_STATE(wasm.gen_ai.error.type:PLAIN)%"
            gen_ai.prompt.tokens: "%FILTER_STATE(wasm.gen_ai.prompt.tokens:PLAIN)%"
            gen_ai.completion.tokens: "%FILTER_STATE(wasm.gen_ai.completion.tokens:PLAIN)%"
            gen_ai.server.time_per_output_token: "%FILTER_STATE(wasm.gen_ai.server.time_per_output_token:PLAIN)%"
            gen_ai.server.time_to_first_token: "%FILTER_STATE(wasm.gen_ai.server.time_to_first_token:PLAIN)%"
EOF

The GenAI-specific log fields are populated by the gen-ai-telemetry WASM plugin from gateway filter state:

| Log field | Description |
| --- | --- |
| gen_ai.operation.name | Type of GenAI operation (for example, chat) |
| gen_ai.system | GenAI provider or system |
| gen_ai.request.model | Model name from the request |
| gen_ai.response.model | Model name from the response |
| gen_ai.error.type | Error type if the request failed |
| gen_ai.prompt.tokens | Number of tokens in the prompt |
| gen_ai.completion.tokens | Number of tokens in the completion |
| gen_ai.server.time_to_first_token | Time to first token (TTFT) |
| gen_ai.server.time_per_output_token | Time per output token (TPOT) |

Accurate token counts for streaming requests require plugin version g76f5a66-aliyun or later.

Send a test request

Repeat the steps in Send a test request several times to generate observability data.

Verify the observability data

  1. Get the name of the gateway workload.

    export GATEWAY_DEPLOYMENT=$(kubectl -n envoy-gateway-system get deployment -l gateway.envoyproxy.io/owning-gateway-name=mock-gateway -o jsonpath='{.items[0].metadata.name}')
    echo $GATEWAY_DEPLOYMENT
  2. Forward the gateway's admin port to your local machine.

    kubectl -n envoy-gateway-system port-forward deployments/$GATEWAY_DEPLOYMENT 19000:19000
  3. Open a new terminal window and query the gateway metrics.

    curl -s localhost:19000/stats/prometheus | grep gen_ai

    The output lists histogram buckets for gen_ai_client_operation_duration, similar to:

    # TYPE gen_ai_client_operation_duration histogram
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="0.5"} 0
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="1"} 0
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="5"} 9
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="10"} 9
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="25"} 14
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="50"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="100"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="250"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="500"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="1000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="2500"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="5000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="10000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="30000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="60000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="300000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="600000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="1800000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="3600000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="+Inf"} 16
    gen_ai_client_operation_duration_sum{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000"} 140.9499999999999886313162278384
    gen_ai_client_operation_duration_count{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000"} 16
  4. View the access logs.

    kubectl -n envoy-gateway-system logs deployments/$GATEWAY_DEPLOYMENT | tail -1

    The most recent log entry is a JSON object that includes the GenAI-specific fields:

    {
     ":authority": "example.com",
     "bytes_received": 184,
     "bytes_sent": 355,
     "connection_termination_details": null,
     "downstream_local_address": "10.3.0.38:10080",
     "downstream_remote_address": "10.3.15.252:45492",
     "duration": 2,
     "gen_ai.completion.tokens": "76",
     "gen_ai.error.type": "",
     "gen_ai.operation.name": "chat",
     "gen_ai.prompt.tokens": "18",
     "gen_ai.request.model": "mock",
     "gen_ai.response.model": "mock",
     "gen_ai.server.time_per_output_token": "0",
     "gen_ai.server.time_to_first_token": "2",
     "gen_ai.system": "example.com",
     "method": "POST",
     "protocol": "HTTP/1.1",
     "requested_server_name": null,
     "response_code": 200,
     "response_code_details": "via_upstream",
     "response_flags": "-",
     "route_name": "httproute/default/mock-route/rule/0/match/0/*",
     "start_time": "2024-05-28T06:13:31.190Z",
     "upstream_cluster": "httproute/default/mock-route/rule/0/backend/0",
     "upstream_host": "10.3.0.9:8000",
     "upstream_local_address": "10.3.0.38:33370",
     "upstream_transport_failure_reason": null,
     "user-agent": "curl/8.8.0",
     "x-envoy-origin-path": "/v1/chat/completions",
     "x-envoy-upstream-service-time": null,
     "x-forwarded-for": "10.3.15.252",
     "x-request-id": "0e67d734-aca7-4c80-bda3-79641cd63e2c"
    }
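Both outputs can be post-processed directly. As an illustration (not part of the product), the sketch below derives the average and p95 operation duration from the histogram sample in step 3, then totals the token counts from the log sample, which are emitted as strings. Two assumptions: the bucket bounds suggest millisecond units, and buckets above le="50" are omitted because their cumulative counts stay at 16.

```python
import json

# Cumulative (le, count) pairs and _sum/_count values copied from the
# sample gen_ai_client_operation_duration output in step 3.
buckets = [(0.5, 0), (1, 0), (5, 9), (10, 9), (25, 14), (50, 16)]
duration_sum, duration_count = 140.95, 16

def histogram_quantile(q, buckets):
    """Linear interpolation within the target bucket, mirroring
    Prometheus's histogram_quantile() function."""
    target = q * buckets[-1][1]
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= target:
            return prev_le + (le - prev_le) * (target - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

print(round(duration_sum / duration_count, 2))       # average: 8.81
print(round(histogram_quantile(0.95, buckets), 2))   # p95: 40.0

# Token fields in the access log are strings and need converting.
log_line = '{"gen_ai.prompt.tokens": "18", "gen_ai.completion.tokens": "76"}'
entry = json.loads(log_line)
total = int(entry["gen_ai.prompt.tokens"]) + int(entry["gen_ai.completion.tokens"])
print(total)  # total tokens: 94
```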

Troubleshooting

"413 Request Entity Too Large" error

When the observability plugin is enabled, the gateway buffers the full request body to parse its contents. If a request body exceeds the default buffer limit, the gateway returns HTTP 413.

To increase the buffer limit, create a ClientTrafficPolicy resource. Replace ${GATEWAY_NAME} with the metadata.name of your Gateway resource.

  1. Create a file named client-buffer-limit.yaml with the following content:

    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: ClientTrafficPolicy
    metadata:
      name: client-buffer-limit
      # If your gateway is not in the default namespace, add the namespace field.
      # namespace:
    spec:
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: ${GATEWAY_NAME}
      connection:
        bufferLimit: 20Mi     # Adjust the size as needed.
  2. Apply the configuration.

    kubectl apply -f client-buffer-limit.yaml
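For sizing, the bufferLimit only needs to exceed the largest request body you expect. The sketch below, using a deliberately simplified quantity parser that handles only the Mi suffix used above, compares the limit against the bytes_received value from the earlier sample access log:

```python
def parse_mi(quantity):
    """Parse a Kubernetes-style quantity with the Mi suffix into bytes
    (simplified; real quantities support many more suffixes)."""
    assert quantity.endswith("Mi")
    return int(quantity[:-2]) * 1024 * 1024

buffer_limit = parse_mi("20Mi")
request_body_bytes = 184  # bytes_received from the sample access log above

# The gateway returns HTTP 413 when the body exceeds the buffer limit.
print(request_body_bytes <= buffer_limit)  # True
```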

gen-ai-telemetry plugin release history

| Image tag | Release time | Description |
| --- | --- | --- |
| g2ad0869-aliyun | May 2025 | Supports metric monitoring and log enhancement for generative AI requests |
| g76f5a66-aliyun | August 2025 | Fixed inaccurate token counts for streaming requests |