
Alibaba Cloud Service Mesh: LLM traffic management

Last Updated: Nov 26, 2025

Most major large language model (LLM) providers offer their services over HTTP, and Service Mesh (ASM) optimizes HTTP-based LLM requests. ASM supports the protocol standards of multiple major LLM providers and provides a simple, efficient integration experience. This topic describes how to manage LLM traffic in ASM, focusing on traffic routing and observability.

Feature overview

Traffic routing

In a service mesh, to register a regular external HTTP service in a cluster, you first configure a ServiceEntry and then configure routing rules using a VirtualService. You can then call this external service through a gateway or an application pod. If you call the service directly without registration, you cannot use the traffic management and observability features that the service mesh provides.
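
For reference, registering an ordinary external HTTP service uses a standard Istio ServiceEntry, as in the following example (the host name api.example.com is a placeholder):

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-api
spec:
  hosts:
    - api.example.com      # placeholder external host
  location: MESH_EXTERNAL  # the service runs outside the mesh
  ports:
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS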


However, a native ServiceEntry can only handle regular TCP and HTTP traffic. LLM requests extend the HTTP protocol with provider-specific parameters, such as the model name and API key, that a standard ServiceEntry cannot express. To address this, ASM introduces two new resources:

  • LLMProvider: An LLMProvider is to an LLM service what a ServiceEntry is to a regular HTTP service. You can use this resource to register external LLM providers with the cluster and configure each provider's host, API key, and other model parameters (see the sketch after this list).

  • LLMRoute: An LLMRoute is to an LLM service what a VirtualService is to a regular HTTP service. You can use an LLMRoute to configure traffic rules and distribute traffic to specific LLMProviders based on weights or match conditions.
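
As a minimal sketch of what these resources look like, the following LLMProvider registers Alibaba Cloud Model Studio (DashScope) as an external LLM service. The exact CRD schema is defined by ASM; the API version and the field names under spec are assumptions for illustration, so consult the ASM CRD reference for the authoritative schema.

# Illustrative sketch only; API version and field names are assumptions.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMProvider
metadata:
  name: dashscope
spec:
  host: dashscope.aliyuncs.com                 # provider endpoint host
  path: /compatible-mode/v1/chat/completions   # OpenAI-compatible chat API
  configs:
    defaultConfig:
      openAIConfig:
        model: qwen-1.8b-chat                  # default model
        apiKey: <YOUR_DASHSCOPE_API_KEY>       # added to outgoing requests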

Based on the LLMRoute and LLMProvider configurations, ASM dynamically selects a routing destination, adds the preconfigured request parameters, and sends the request to the corresponding provider. This lets you quickly change provider settings, select different models based on request characteristics, and perform operations such as grayscale traffic shifting between providers, which greatly reduces the complexity of integrating large models into the cluster. The following two scenarios describe how to manage LLM traffic using LLMRoute and LLMProvider.

Configure an LLMRoute to use different models for different user types

Alibaba Cloud Model Studio provides two models: qwen-1.8b-chat and qwen-turbo. You can create and configure an LLMRoute to route calls from regular users to the default qwen-1.8b-chat model, and calls from subscribed users to the more powerful qwen-turbo model. Requests from subscribed users contain a special header that identifies their status.
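
A minimal sketch of such a route follows. The schema is assumed for illustration (in particular, the header name user-type and the modelOverride field are not the documented names); consult the ASM CRD reference for the authoritative fields.

# Illustrative sketch only; field names are assumptions.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMRoute
metadata:
  name: dashscope-route
spec:
  host: dashscope.aliyuncs.com
  rules:
    # Requests from subscribed users carry an identifying header
    # (assumed here to be "user-type: subscriber") and use qwen-turbo.
    - name: subscriber
      matches:
        - headers:
            user-type:
              exact: subscriber
      backendRefs:
        - providerHost: dashscope.aliyuncs.com
      modelOverride: qwen-turbo   # assumed field for per-rule model selection
    # All other requests fall through to the default rule and use the
    # provider's default model (qwen-1.8b-chat).
    - name: default
      backendRefs:
        - providerHost: dashscope.aliyuncs.com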


Configure an LLMProvider and an LLMRoute to distribute traffic by weight

This scenario combines the Alibaba Cloud Model Studio and Moonshot LLM services. You can configure two LLMProviders and an LLMRoute to distribute traffic between the providers by weight, as shown in the sketch after the following note.

Note

demo-llm-server is a regular service in the cluster that does not correspond to any endpoints; it serves only as the host that applications call, and the LLMRoute redirects this traffic to the external providers.
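
The following sketch shows what such a weighted route could look like, again with assumed field names. Applications call demo-llm-server, and the route splits the traffic 80/20 between the two providers, both of which must already be registered as LLMProviders.

# Illustrative sketch only; field names are assumptions.
apiVersion: istio.alibabacloud.com/v1beta1
kind: LLMRoute
metadata:
  name: demo-llm-route
spec:
  host: demo-llm-server.default.svc.cluster.local   # in-cluster placeholder host
  rules:
    - backendRefs:
        - providerHost: dashscope.aliyuncs.com   # Alibaba Cloud Model Studio
          weight: 80
        - providerHost: api.moonshot.cn          # Moonshot
          weight: 20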

Traffic observability

In addition to LLM request routing, ASM provides enhanced observability to meet the advanced requirements of LLM scenarios. A robust software system must expose accurate and clear observability data so that operations and maintenance (O&M) staff and developers can check the current operational status of their services at any time and respond appropriately.

The observability features in a service mesh include three main components:

  • Access logs

  • Monitoring metrics

  • Tracing Analysis

Because LLM requests are based on the HTTP protocol, they are directly compatible with the existing Tracing Analysis feature. However, the current access log and monitoring metrics features are insufficient for observing LLM requests. For example, access logs cannot output LLM-specific information, such as the model used in a request, and monitoring metrics can only reflect standard HTTP information. Therefore, ASM enhances its access log and monitoring metrics features. These enhancements include two main aspects:

  • Access logs: You can use the custom access log format feature to include LLM-specific fields, such as the model used in a request, in access logs (see the sketch after this list).

  • Monitoring metrics:

    • ASM adds two new monitoring metrics to show the number of input tokens (prompt tokens) and output tokens (completion tokens) for a request.

    • LLM-specific information is added as a metric dimension that you can reference in standard Istio metrics.
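
As a hypothetical illustration of the access log enhancement, a custom log format could combine standard Envoy command operators (such as %REQ and %RESPONSE_CODE%) with LLM fields. The FILTER_STATE keys below are placeholders for illustration, not the documented ASM variable names.

{
  "method": "%REQ(:METHOD)%",
  "path": "%REQ(:PATH)%",
  "response_code": "%RESPONSE_CODE%",
  "llm_model": "%FILTER_STATE(asm.llm.request_model)%",
  "input_tokens": "%FILTER_STATE(asm.llm.input_token)%",
  "output_tokens": "%FILTER_STATE(asm.llm.output_token)%"
}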

Scenario overview

By integrating LLMs with ASM, you can implement grayscale releases, weighted routing, and various observability features. This further decouples applications from LLM providers and improves the robustness and maintainability of the entire call chain. The following scenarios describe how to configure and implement LLM traffic routing and observability features.