
Container Compute Service: Use Gateway with Inference Extension to monitor Generative AI requests

Last Updated: Mar 26, 2026

Gateway with Inference Extension exports metrics and access logs for generative AI (GenAI) requests following the OpenTelemetry (OTel) GenAI Semantic Conventions. This topic describes how to deploy the observability plugin, configure metrics and log output, and verify the results.

Background information

The OpenTelemetry GenAI Semantic Conventions are a set of standardized guidelines for monitoring and tracing generative AI applications, such as those that use large language models (LLMs), text generation, or image generation. These conventions unify metrics, logs, and traces for GenAI requests, simplifying cross-system analysis and troubleshooting. The core objectives of the specification are:

  • Standardize data collection: Define common attributes for GenAI requests, such as model name, input and output token counts, and configuration parameters.

  • Enable end-to-end tracing: Correlate GenAI requests with traces from other systems, such as databases and API gateways.

  • Unify analysis and monitoring: Enable tools like Prometheus and Grafana to easily aggregate and visualize data through standardized labels.

Metrics reference

The following metric is exported for each GenAI request.

| Metric | Type | Description | Key labels |
| --- | --- | --- | --- |
| gen_ai_client_operation_duration | Histogram | End-to-end duration of a GenAI operation | gen_ai_operation_name, gen_ai_system, gen_ai_request_model, gen_ai_response_model, gen_ai_error_type, server_port, server_address |

Labels map to the following OTel GenAI attributes:

| Label | Description | Example |
| --- | --- | --- |
| gen_ai_operation_name | Type of GenAI operation | chat |
| gen_ai_system | GenAI provider or system | example.com |
| gen_ai_request_model | Model name from the request | mock |
| gen_ai_response_model | Model name from the response | mock |
| gen_ai_error_type | Error type if the request failed; empty if successful | (empty) |
| server_port | Upstream server port | 8000 |
| server_address | Upstream server address | 10.3.0.9:8000 |
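Note that each Prometheus label name is simply the corresponding OTel attribute name with every dot replaced by an underscore, because Prometheus label names cannot contain dots. A quick illustrative sketch of the mapping:

```python
# OTel GenAI attribute names, as defined by the semantic conventions.
otel_attributes = [
    "gen_ai.operation.name", "gen_ai.system",
    "gen_ai.request.model", "gen_ai.response.model",
    "gen_ai.error.type", "server.port", "server.address",
]

# Prometheus label names forbid dots, so each dot becomes an underscore.
prometheus_labels = [attr.replace(".", "_") for attr in otel_attributes]
print(prometheus_labels[0])  # gen_ai_operation_name
```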

Prerequisites

Before you begin, make sure you have:

  • A cluster with Gateway with Inference Extension installed.

  • The mock-vllm sample application deployed, including the mock-route HTTPRoute and the custom-proxy-config EnvoyProxy resource.

Configure observability data output

Deploy the GenAI observability plugin

The gen-ai-telemetry plugin is a WebAssembly (WASM) plugin delivered as a container image. It intercepts GenAI requests at the gateway level to extract token counts, model names, and timing data, then injects this data into the gateway's metrics and access logs.

The plugin must buffer the full request body to parse its contents, so enabling it increases the gateway's memory usage in proportion to request body size. If a request body exceeds the default buffer limit, the gateway returns HTTP 413. See Troubleshooting for how to increase the limit.

Apply the following EnvoyExtensionPolicy to deploy the plugin and attach it to the mock-route HTTPRoute:

kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyExtensionPolicy
metadata:
  name: ack-gateway-llm-telemetry
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: mock-route
  wasm:
  - name: llm-telemetry
    rootID: ack-gateway-extension
    code:
      type: Image
      image:
        url: registry-cn-hangzhou.ack.aliyuncs.com/acs/gen-ai-telemetry-wasmplugin:g76f5a66-aliyun
EOF

If your cluster cannot pull images over the public internet, use the VPC endpoint for your region instead. For example, for a cluster in the China (Beijing) region:

registry-cn-beijing-vpc.ack.aliyuncs.com/acs/gen-ai-telemetry-wasmplugin:<image_tag>

For available image tags, see gen-ai-telemetry plugin release history.

Configure gateway metrics tag rules

Deploying the mock-vllm application creates an EnvoyProxy resource named custom-proxy-config. To expose GenAI attributes as Prometheus labels, add metrics tag rules to this resource.

Apply the following configuration:

kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: default
spec:
  bootstrap:
    type: JSONPatch
    jsonPatches:
    - op: add
      path: /stats_config
      value:
        stats_tags:
          - tag_name: gen_ai.operation.name
            regex: "(\\|gen_ai.operation.name=([^|]*))"
          - tag_name: gen_ai.system
            regex: "(\\|gen_ai.system=([^|]*))"
          - tag_name: gen_ai.token.type
            regex: "(\\|gen_ai.token.type=([^|]*))"
          - tag_name: gen_ai.request.model
            regex: "(\\|gen_ai.request.model=([^|]*))"
          - tag_name: gen_ai.response.model
            regex: "(\\|gen_ai.response.model=([^|]*))"
          - tag_name: gen_ai.error.type
            regex: "(\\|gen_ai.error.type=([^|]*))"
          - tag_name: server.port
            regex: "(\\|server.port=([^|]*))"
          - tag_name: server.address
            regex: "(\\|server.address=([^|]*))"
EOF

The configuration takes effect immediately after the resource is updated.
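Each stats_tags rule works the same way: capture group 1 matches the whole pipe-delimited |key=value segment, which Envoy strips from the metric name, and capture group 2 becomes the label value. A minimal sketch of the gen_ai.request.model pattern, applied to an illustrative stat name:

```python
import re

# The gen_ai.request.model pattern from the stats_tags configuration above.
pattern = re.compile(r"(\|gen_ai.request.model=([^|]*))")

# Illustrative stat name carrying pipe-delimited GenAI attributes.
stat = ("gen_ai_client_operation_duration"
        "|gen_ai.operation.name=chat|gen_ai.request.model=mock|server.port=8000")

match = pattern.search(stat)
print(match.group(2))                  # label value: mock
print(pattern.sub("", stat, count=1))  # stat name with the matched segment removed
```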

Configure log output

To add GenAI-specific fields to the gateway's access logs, apply the following spec.telemetry configuration to the custom-proxy-config resource:

kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: default
spec:
  telemetry:
    accessLog:
      disable: false
      settings:
      - sinks:
        - type: File
          file:
            path: /dev/stdout
        format:
          type: JSON
          json:
            # Default access log fields
            start_time: "%START_TIME%"
            method: "%REQ(:METHOD)%"
            x-envoy-origin-path: "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%"
            protocol: "%PROTOCOL%"
            response_code: "%RESPONSE_CODE%"
            response_flags: "%RESPONSE_FLAGS%"
            response_code_details: "%RESPONSE_CODE_DETAILS%"
            connection_termination_details: "%CONNECTION_TERMINATION_DETAILS%"
            upstream_transport_failure_reason: "%UPSTREAM_TRANSPORT_FAILURE_REASON%"
            bytes_received: "%BYTES_RECEIVED%"
            bytes_sent: "%BYTES_SENT%"
            duration: "%DURATION%"
            x-envoy-upstream-service-time: "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%"
            x-forwarded-for: "%REQ(X-FORWARDED-FOR)%"
            user-agent: "%REQ(USER-AGENT)%"
            x-request-id: "%REQ(X-REQUEST-ID)%"
            :authority: "%REQ(:AUTHORITY)%"
            upstream_host: "%UPSTREAM_HOST%"
            upstream_cluster: "%UPSTREAM_CLUSTER%"
            upstream_local_address: "%UPSTREAM_LOCAL_ADDRESS%"
            downstream_local_address: "%DOWNSTREAM_LOCAL_ADDRESS%"
            downstream_remote_address: "%DOWNSTREAM_REMOTE_ADDRESS%"
            requested_server_name: "%REQUESTED_SERVER_NAME%"
            route_name: "%ROUTE_NAME%"
            # GenAI-specific fields
            gen_ai.operation.name: "%FILTER_STATE(wasm.gen_ai.operation.name:PLAIN)%"
            gen_ai.system: "%FILTER_STATE(wasm.gen_ai.system:PLAIN)%"
            gen_ai.request.model: "%FILTER_STATE(wasm.gen_ai.request.model:PLAIN)%"
            gen_ai.response.model: "%FILTER_STATE(wasm.gen_ai.response.model:PLAIN)%"
            gen_ai.error.type: "%FILTER_STATE(wasm.gen_ai.error.type:PLAIN)%"
            gen_ai.prompt.tokens: "%FILTER_STATE(wasm.gen_ai.prompt.tokens:PLAIN)%"
            gen_ai.completion.tokens: "%FILTER_STATE(wasm.gen_ai.completion.tokens:PLAIN)%"
            gen_ai.server.time_per_output_token: "%FILTER_STATE(wasm.gen_ai.server.time_per_output_token:PLAIN)%"
            gen_ai.server.time_to_first_token: "%FILTER_STATE(wasm.gen_ai.server.time_to_first_token:PLAIN)%"
EOF

The GenAI-specific log fields are populated by the gen-ai-telemetry WASM plugin from gateway filter state:

| Log field | Description |
| --- | --- |
| gen_ai.operation.name | Type of GenAI operation (for example, chat) |
| gen_ai.system | GenAI provider or system |
| gen_ai.request.model | Model name from the request |
| gen_ai.response.model | Model name from the response |
| gen_ai.error.type | Error type if the request failed |
| gen_ai.prompt.tokens | Number of tokens in the prompt |
| gen_ai.completion.tokens | Number of tokens in the completion |
| gen_ai.server.time_to_first_token | Time to first token (TTFT) |
| gen_ai.server.time_per_output_token | Time per output token (TPOT) |

Accurate token counts for streaming requests require plugin version g76f5a66-aliyun or later.

Send a test request

Repeat the steps in Send a test request several times to generate observability data.

Verify the observability data

  1. Get the name of the gateway workload.

    export GATEWAY_DEPLOYMENT=$(kubectl -n envoy-gateway-system get deployment -l gateway.envoyproxy.io/owning-gateway-name=mock-gateway -o jsonpath='{.items[0].metadata.name}')
    echo $GATEWAY_DEPLOYMENT
  2. Forward the gateway's admin port to your local machine.

    kubectl -n envoy-gateway-system port-forward deployments/$GATEWAY_DEPLOYMENT 19000:19000
  3. Open a new terminal window and query the gateway metrics.

    curl -s localhost:19000/stats/prometheus | grep gen_ai

    The output lists histogram buckets for gen_ai_client_operation_duration, similar to:

    # TYPE gen_ai_client_operation_duration histogram
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="0.5"} 0
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="1"} 0
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="5"} 9
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="10"} 9
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="25"} 14
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="50"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="100"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="250"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="500"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="1000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="2500"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="5000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="10000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="30000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="60000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="300000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="600000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="1800000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="3600000"} 16
    gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="+Inf"} 16
    gen_ai_client_operation_duration_sum{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000"} 140.9499999999999886313162278384
    gen_ai_client_operation_duration_count{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000"} 16
  4. View the access logs.

    kubectl -n envoy-gateway-system logs deployments/$GATEWAY_DEPLOYMENT | tail -1

    The most recent log entry is a JSON object that includes the GenAI-specific fields:

    {
     ":authority": "example.com",
     "bytes_received": 184,
     "bytes_sent": 355,
     "connection_termination_details": null,
     "downstream_local_address": "10.3.0.38:10080",
     "downstream_remote_address": "10.3.15.252:45492",
     "duration": 2,
     "gen_ai.completion.tokens": "76",
     "gen_ai.error.type": "",
     "gen_ai.operation.name": "chat",
     "gen_ai.prompt.tokens": "18",
     "gen_ai.request.model": "mock",
     "gen_ai.response.model": "mock",
     "gen_ai.server.time_per_output_token": "0",
     "gen_ai.server.time_to_first_token": "2",
     "gen_ai.system": "example.com",
     "method": "POST",
     "protocol": "HTTP/1.1",
     "requested_server_name": null,
     "response_code": 200,
     "response_code_details": "via_upstream",
     "response_flags": "-",
     "route_name": "httproute/default/mock-route/rule/0/match/0/*",
     "start_time": "2024-05-28T06:13:31.190Z",
     "upstream_cluster": "httproute/default/mock-route/rule/0/backend/0",
     "upstream_host": "10.3.0.9:8000",
     "upstream_local_address": "10.3.0.38:33370",
     "upstream_transport_failure_reason": null,
     "user-agent": "curl/8.8.0",
     "x-envoy-origin-path": "/v1/chat/completions",
     "x-envoy-upstream-service-time": null,
     "x-forwarded-for": "10.3.15.252",
     "x-request-id": "0e67d734-aca7-4c80-bda3-79641cd63e2c"
    }
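Both outputs can be post-processed directly. As an illustration (not part of the product), the sketch below derives the average and p95 operation duration from the histogram sample in step 3, then totals the token counts from the log sample, which are emitted as strings. Two assumptions: the bucket bounds suggest millisecond units, and buckets above le="50" are omitted because their cumulative counts stay at 16.

```python
import json

# Cumulative (le, count) pairs and _sum/_count values copied from the
# sample gen_ai_client_operation_duration output in step 3.
buckets = [(0.5, 0), (1, 0), (5, 9), (10, 9), (25, 14), (50, 16)]
duration_sum, duration_count = 140.95, 16

def histogram_quantile(q, buckets):
    """Linear interpolation within the target bucket, mirroring
    Prometheus's histogram_quantile() function."""
    target = q * buckets[-1][1]
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= target:
            return prev_le + (le - prev_le) * (target - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

print(round(duration_sum / duration_count, 2))       # average: 8.81
print(round(histogram_quantile(0.95, buckets), 2))   # p95: 40.0

# Token fields in the access log are strings and need converting.
log_line = '{"gen_ai.prompt.tokens": "18", "gen_ai.completion.tokens": "76"}'
entry = json.loads(log_line)
total = int(entry["gen_ai.prompt.tokens"]) + int(entry["gen_ai.completion.tokens"])
print(total)  # total tokens: 94
```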

Troubleshooting

"413 Request Entity Too Large" error

When the observability plugin is enabled, the gateway buffers the full request body to parse its contents. If a request body exceeds the default buffer limit, the gateway returns HTTP 413.

To increase the buffer limit, create a ClientTrafficPolicy resource. Replace ${GATEWAY_NAME} with the metadata.name of your Gateway resource.

  1. Create a file named client-buffer-limit.yaml with the following content:

    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: ClientTrafficPolicy
    metadata:
      name: client-buffer-limit
      # If your gateway is not in the default namespace, add the namespace field.
      # namespace:
    spec:
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: ${GATEWAY_NAME}
      connection:
        bufferLimit: 20Mi     # Adjust the size as needed.
  2. Apply the configuration.

    kubectl apply -f client-buffer-limit.yaml
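For sizing, the bufferLimit only needs to exceed the largest request body you expect. The sketch below, using a deliberately simplified quantity parser that handles only the Mi suffix used above, compares the limit against the bytes_received value from the earlier sample access log:

```python
def parse_mi(quantity):
    """Parse a Kubernetes-style quantity with the Mi suffix into bytes
    (simplified; real quantities support many more suffixes)."""
    assert quantity.endswith("Mi")
    return int(quantity[:-2]) * 1024 * 1024

buffer_limit = parse_mi("20Mi")
request_body_bytes = 184  # bytes_received from the sample access log above

# The gateway returns HTTP 413 when the body exceeds the buffer limit.
print(request_body_bytes <= buffer_limit)  # True
```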

gen-ai-telemetry plugin release history

| Image tag | Release time | Description |
| --- | --- | --- |
| g2ad0869-aliyun | May 2025 | Supports metric monitoring and log enhancement for generative AI requests |
| g76f5a66-aliyun | August 2025 | Fixed inaccurate token counts for streaming requests |