Gateway with Inference Extension exports metrics and access logs for generative AI (GenAI) requests following the OpenTelemetry (OTel) GenAI Semantic Conventions. This topic describes how to deploy the observability plugin, configure metrics and log output, and verify the results.
Background information
The OpenTelemetry GenAI Semantic Conventions are a set of standardized guidelines for monitoring and tracing generative AI applications, such as those using large language models (LLMs), text generation, and image generation. These conventions aim to unify metrics, logs, and traces for GenAI requests, simplifying cross-system analysis and troubleshooting. The core objectives of the specification are:
- Standardize data collection: Define common attributes for GenAI requests, such as model name, input and output token counts, and configuration parameters.
- Enable end-to-end tracing: Correlate GenAI requests with traces from other systems, such as databases and API gateways.
- Unify analysis and monitoring: Enable tools like Prometheus and Grafana to easily aggregate and visualize data through standardized labels.
Metrics reference
The following metric is exported for each GenAI request.
| Metric | Type | Description | Key labels |
|---|---|---|---|
| gen_ai_client_operation_duration | Histogram | End-to-end duration of a GenAI operation | gen_ai_operation_name, gen_ai_system, gen_ai_request_model, gen_ai_response_model, gen_ai_error_type, server_port, server_address |
Labels map to the following OTel GenAI attributes:
| Label | Description | Example |
|---|---|---|
| gen_ai_operation_name | Type of GenAI operation | chat |
| gen_ai_system | GenAI provider or system | example.com |
| gen_ai_request_model | Model name from the request | mock |
| gen_ai_response_model | Model name from the response | mock |
| gen_ai_error_type | Error type if the request failed; empty if successful | (empty) |
| server_port | Upstream server port | 8000 |
| server_address | Upstream server address | 10.3.0.9:8000 |
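Note that the dotted OTel attribute names (for example, gen_ai.operation.name) appear with underscores in the Prometheus output, because Prometheus label names cannot contain dots. A minimal sketch of that name translation:

```python
# Dotted OTel GenAI attribute names are exposed as Prometheus labels
# with dots replaced by underscores.
otel_attributes = [
    "gen_ai.operation.name",
    "gen_ai.system",
    "gen_ai.request.model",
    "server.address",
]

prometheus_labels = [attr.replace(".", "_") for attr in otel_attributes]
print(prometheus_labels)
# -> ['gen_ai_operation_name', 'gen_ai_system', 'gen_ai_request_model', 'server_address']
```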
Prerequisites
Before you begin, make sure you have:
- Gateway with Inference Extension 1.4.0 or later, installed with the Enable Gateway API Inference Extension option selected. For installation instructions, see Install Gateway with Inference Extension.
- The mock-vllm application deployed in your cluster.
Configure observability data output
Deploy the GenAI observability plugin
The gen-ai-telemetry plugin is a WebAssembly (WASM) plugin delivered as a container image. It intercepts GenAI requests at the gateway level to extract token counts, model names, and timing data, then injects this data into the gateway's metrics and access logs.
The plugin must buffer the full request body to parse its contents. Enabling it increases gateway memory usage with request body size. If a request body exceeds the default buffer limit, the gateway returns HTTP 413. See Troubleshooting for how to increase the limit.
Apply the following EnvoyExtensionPolicy to deploy the plugin and attach it to the mock-route HTTPRoute:
kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyExtensionPolicy
metadata:
name: ack-gateway-llm-telemetry
spec:
targetRefs:
- group: gateway.networking.k8s.io
kind: HTTPRoute
name: mock-route
wasm:
- name: llm-telemetry
rootID: ack-gateway-extension
code:
type: Image
image:
url: registry-cn-hangzhou.ack.aliyuncs.com/acs/gen-ai-telemetry-wasmplugin:g76f5a66-aliyun
EOF
If your cluster cannot pull images over the public internet, use the VPC endpoint for your region instead. For example, for a cluster in the China (Beijing) region:
registry-cn-beijing-vpc.ack.aliyuncs.com/acs/gen-ai-telemetry-wasmplugin:<image_tag>
For available image tags, see gen-ai-telemetry plugin release history.
Configure gateway metrics tag rules
Deploying the mock-vllm application creates an EnvoyProxy resource named custom-proxy-config. To expose GenAI attributes as Prometheus labels, add metrics tag rules to this resource.
Apply the following configuration:
kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
name: custom-proxy-config
namespace: default
spec:
bootstrap:
type: JSONPatch
jsonPatches:
- op: add
path: /stats_config
value:
stats_tags:
- tag_name: gen_ai.operation.name
regex: "(\\|gen_ai.operation.name=([^|]*))"
- tag_name: gen_ai.system
regex: "(\\|gen_ai.system=([^|]*))"
- tag_name: gen_ai.token.type
regex: "(\\|gen_ai.token.type=([^|]*))"
- tag_name: gen_ai.request.model
regex: "(\\|gen_ai.request.model=([^|]*))"
- tag_name: gen_ai.response.model
regex: "(\\|gen_ai.response.model=([^|]*))"
- tag_name: gen_ai.error.type
regex: "(\\|gen_ai.error.type=([^|]*))"
- tag_name: server.port
regex: "(\\|server.port=([^|]*))"
- tag_name: server.address
regex: "(\\|server.address=([^|]*))"
EOF
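Each stats_tags rule has two capture groups: Envoy strips the outer group from the stat name and uses the inner group as the tag value. The following sketch demonstrates the extraction with Python's re module, against a hypothetical stat string that embeds labels as "|key=value" segments (the format the rules above are written against):

```python
import re

# Hypothetical stat name with embedded "|key=value" label segments.
stat = (
    "gen_ai_client_operation_duration"
    "|gen_ai.operation.name=chat"
    "|gen_ai.system=example.com"
    "|server.port=8000"
)

# Same pattern as the stats_tags rule for gen_ai.operation.name.
pattern = r"(\|gen_ai.operation.name=([^|]*))"
match = re.search(pattern, stat)

# Group 1 is removed from the stat name; group 2 becomes the tag value.
print(match.group(2))  # -> chat
```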
The configuration takes effect immediately after the resource is updated.
Configure log output
To add GenAI-specific fields to the gateway's access logs, apply the following spec.telemetry configuration to the custom-proxy-config resource:
kubectl apply -f - <<EOF
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
name: custom-proxy-config
namespace: default
spec:
telemetry:
accessLog:
disable: false
settings:
- sinks:
- type: File
file:
path: /dev/stdout
format:
type: JSON
json:
# Default access log fields
start_time: "%START_TIME%"
method: "%REQ(:METHOD)%"
x-envoy-origin-path: "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%"
protocol: "%PROTOCOL%"
response_code: "%RESPONSE_CODE%"
response_flags: "%RESPONSE_FLAGS%"
response_code_details: "%RESPONSE_CODE_DETAILS%"
connection_termination_details: "%CONNECTION_TERMINATION_DETAILS%"
upstream_transport_failure_reason: "%UPSTREAM_TRANSPORT_FAILURE_REASON%"
bytes_received: "%BYTES_RECEIVED%"
bytes_sent: "%BYTES_SENT%"
duration: "%DURATION%"
x-envoy-upstream-service-time: "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%"
x-forwarded-for: "%REQ(X-FORWARDED-FOR)%"
user-agent: "%REQ(USER-AGENT)%"
x-request-id: "%REQ(X-REQUEST-ID)%"
:authority: "%REQ(:AUTHORITY)%"
upstream_host: "%UPSTREAM_HOST%"
upstream_cluster: "%UPSTREAM_CLUSTER%"
upstream_local_address: "%UPSTREAM_LOCAL_ADDRESS%"
downstream_local_address: "%DOWNSTREAM_LOCAL_ADDRESS%"
downstream_remote_address: "%DOWNSTREAM_REMOTE_ADDRESS%"
requested_server_name: "%REQUESTED_SERVER_NAME%"
route_name: "%ROUTE_NAME%"
# GenAI-specific fields
gen_ai.operation.name: "%FILTER_STATE(wasm.gen_ai.operation.name:PLAIN)%"
gen_ai.system: "%FILTER_STATE(wasm.gen_ai.system:PLAIN)%"
gen_ai.request.model: "%FILTER_STATE(wasm.gen_ai.request.model:PLAIN)%"
gen_ai.response.model: "%FILTER_STATE(wasm.gen_ai.response.model:PLAIN)%"
gen_ai.error.type: "%FILTER_STATE(wasm.gen_ai.error.type:PLAIN)%"
gen_ai.prompt.tokens: "%FILTER_STATE(wasm.gen_ai.prompt.tokens:PLAIN)%"
gen_ai.completion.tokens: "%FILTER_STATE(wasm.gen_ai.completion.tokens:PLAIN)%"
gen_ai.server.time_per_output_token: "%FILTER_STATE(wasm.gen_ai.server.time_per_output_token:PLAIN)%"
gen_ai.server.time_to_first_token: "%FILTER_STATE(wasm.gen_ai.server.time_to_first_token:PLAIN)%"
EOF
The GenAI-specific log fields are populated by the gen-ai-telemetry WASM plugin from gateway filter state:
| Log field | Description |
|---|---|
| gen_ai.operation.name | Type of GenAI operation (for example, chat) |
| gen_ai.system | GenAI provider or system |
| gen_ai.request.model | Model name from the request |
| gen_ai.response.model | Model name from the response |
| gen_ai.error.type | Error type if the request failed |
| gen_ai.prompt.tokens | Number of tokens in the prompt |
| gen_ai.completion.tokens | Number of tokens in the completion |
| gen_ai.server.time_to_first_token | Time to first token (TTFT) |
| gen_ai.server.time_per_output_token | Time per output token |
Accurate token counts for streaming requests require plugin version g76f5a66-aliyun or later.
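Because the access log is emitted as JSON, the GenAI fields are straightforward to post-process. The sketch below parses a trimmed log entry (field values taken from the sample output in this topic) and sums the token counts; note that filter-state values are logged as strings:

```python
import json

# A trimmed access-log entry of the shape produced by the
# configuration above.
line = (
    '{"gen_ai.operation.name": "chat", '
    '"gen_ai.prompt.tokens": "18", '
    '"gen_ai.completion.tokens": "76", '
    '"response_code": 200}'
)

entry = json.loads(line)

# Filter-state values are strings, so cast them before computing.
total_tokens = int(entry["gen_ai.prompt.tokens"]) + int(entry["gen_ai.completion.tokens"])
print(total_tokens)  # -> 94
```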
Send a test request
Follow the steps in Send a test request several times to generate observability data.
Verify the observability data
- Get the name of the gateway workload.

  export GATEWAY_DEPLOYMENT=$(kubectl -n envoy-gateway-system get deployment -l gateway.envoyproxy.io/owning-gateway-name=mock-gateway -o jsonpath='{.items[0].metadata.name}')
  echo $GATEWAY_DEPLOYMENT

- Forward the gateway's admin port to your local machine.

  kubectl -n envoy-gateway-system port-forward deployments/$GATEWAY_DEPLOYMENT 19000:19000

- Open a new terminal window and query the gateway metrics.

  curl -s localhost:19000/stats/prometheus | grep gen_ai

  The output lists histogram buckets for gen_ai_client_operation_duration, similar to:

  # TYPE gen_ai_client_operation_duration histogram
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="0.5"} 0
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="1"} 0
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="5"} 9
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="10"} 9
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="25"} 14
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="50"} 16
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="100"} 16
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="250"} 16
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="500"} 16
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="1000"} 16
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="2500"} 16
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="5000"} 16
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="10000"} 16
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="30000"} 16
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="60000"} 16
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="300000"} 16
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="600000"} 16
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="1800000"} 16
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="3600000"} 16
  gen_ai_client_operation_duration_bucket{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000",le="+Inf"} 16
  gen_ai_client_operation_duration_sum{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000"} 140.9499999999999886313162278384
  gen_ai_client_operation_duration_count{gen_ai_operation_name="chat",gen_ai_system="example.com",gen_ai_request_model="mock",gen_ai_response_model="mock",gen_ai_error_type="",server_port="8000",server_address="10.3.0.9:8000"} 16

- View the access logs.

  kubectl -n envoy-gateway-system logs deployments/$GATEWAY_DEPLOYMENT | tail -1

  The most recent log entry is a JSON object that includes the GenAI-specific fields:

  {
    ":authority": "example.com",
    "bytes_received": 184,
    "bytes_sent": 355,
    "connection_termination_details": null,
    "downstream_local_address": "10.3.0.38:10080",
    "downstream_remote_address": "10.3.15.252:45492",
    "duration": 2,
    "gen_ai.completion.tokens": "76",
    "gen_ai.error.type": "",
    "gen_ai.operation.name": "chat",
    "gen_ai.prompt.tokens": "18",
    "gen_ai.request.model": "mock",
    "gen_ai.response.model": "mock",
    "gen_ai.server.time_per_output_token": "0",
    "gen_ai.server.time_to_first_token": "2",
    "gen_ai.system": "example.com",
    "method": "POST",
    "protocol": "HTTP/1.1",
    "requested_server_name": null,
    "response_code": 200,
    "response_code_details": "via_upstream",
    "response_flags": "-",
    "route_name": "httproute/default/mock-route/rule/0/match/0/*",
    "start_time": "2024-05-28T06:13:31.190Z",
    "upstream_cluster": "httproute/default/mock-route/rule/0/backend/0",
    "upstream_host": "10.3.0.9:8000",
    "upstream_local_address": "10.3.0.38:33370",
    "upstream_transport_failure_reason": null,
    "user-agent": "curl/8.8.0",
    "x-envoy-origin-path": "/v1/chat/completions",
    "x-envoy-upstream-service-time": null,
    "x-forwarded-for": "10.3.15.252",
    "x-request-id": "0e67d734-aca7-4c80-bda3-79641cd63e2c"
  }
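As a quick sanity check on the metric output, the mean operation duration can be derived by dividing the histogram's _sum series by its _count series (values taken from the sample metric output above):

```python
# Values from the sample gen_ai_client_operation_duration output.
duration_sum = 140.95   # gen_ai_client_operation_duration_sum
request_count = 16      # gen_ai_client_operation_duration_count

# Mean duration per GenAI operation, in the metric's native unit.
mean_duration = duration_sum / request_count
print(round(mean_duration, 2))  # -> 8.81
```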
Troubleshooting
"413 Request Entity Too Large" error
When the observability plugin is enabled, the gateway buffers the full request body to parse its contents. If a request body exceeds the default buffer limit, the gateway returns HTTP 413.
To increase the buffer limit, create a ClientTrafficPolicy resource. Replace ${GATEWAY_NAME} with the metadata.name of your Gateway resource.
- Create a file named client-buffer-limit.yaml with the following content:

  apiVersion: gateway.envoyproxy.io/v1alpha1
  kind: ClientTrafficPolicy
  metadata:
    name: client-buffer-limit
    # If your gateway is not in the default namespace, add the namespace field.
    # namespace:
  spec:
    targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: ${GATEWAY_NAME}
    connection:
      bufferLimit: 20Mi # Adjust the size as needed.

- Apply the configuration.

  kubectl apply -f client-buffer-limit.yaml
gen-ai-telemetry plugin release history
| Image tag | Release time | Description |
|---|---|---|
| g2ad0869-aliyun | May 2025 | Supports metric monitoring and log enhancement for generative AI requests |
| g76f5a66-aliyun | August 2025 | Fixed inaccurate token counts for streaming requests |