LLM applications offer powerful capabilities but can incur significant computational costs and latency. To improve efficiency and reduce latency, an increasing number of LLM applications adopt cache policies. By storing and reusing results within a period of time, LLM Cache significantly reduces redundant computation and optimizes response times, boosting overall system performance. This topic describes how to seamlessly integrate LLM Cache into Service Mesh (ASM).
Overview
The LLM Cache feature in ASM is built on the Wasm extension mechanism. When the mesh proxy receives an LLM request, the LLM Cache plug-in first queries the custom cache service. If a cached result exists, the mesh proxy returns it directly and does not forward the request to the external LLM service. If no result is cached in your custom cache service, the request is marked for cache updating. When the response from the external LLM service arrives, the LLM Cache plug-in forwards it to the cache service. The following diagram illustrates how the LLM Cache plug-in works, using a gateway as an example:
ASM simplifies cache integration by providing a default cache service that performs exact string matching and uses Redis as its storage backend. The mesh proxy communicates with your custom cache service over standard HTTP, so you can tailor the matching logic and storage format of the LLM Cache as needed. This topic uses the default cache service to demonstrate how to deploy the LLM request cache feature in ASM.
Prerequisites
Add a cluster to an ASM instance of version 1.22.6.72 or later.
Read and complete Step 1 to Step 4 in External clients access LLM services through ASM ingress gateway to create an LLMProvider and its associated resources.
Deploy a Redis service within the cluster or locally (a minimal in-cluster sketch is shown after this list). Alternatively, you can use Tair (Redis OSS-compatible) to quickly set up a Redis instance. For more information, see Overview.
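If you only need a Redis instance for testing this walkthrough, a plain Deployment and Service are sufficient. The following commands are a minimal sketch, assuming the public redis image and the default namespace; adjust the image, namespace, and any authentication settings to match your environment.
kubectl create deployment redis --image=redis:7
kubectl expose deployment redis --port=6379 --target-port=6379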
Step 1: Deploy the default LLM Cache service in ASM
Create a file named cache.yaml that defines the default LLM Cache service.
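The official manifest for the default cache service is maintained in the Code Repository referenced in this step. The sketch below only illustrates its general shape under the assumption that the cache service listens on port 80 and reads its Redis address from an environment variable; the container image, the label names, and the REDIS_ADDR variable are placeholders that you must replace with the actual values from the repository and your own Redis deployment. The Service name and namespace match the address referenced by the plug-in configuration in Step 2.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: asm-wasm-cache-service-example
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: asm-wasm-cache-service-example
  template:
    metadata:
      labels:
        app: asm-wasm-cache-service-example
    spec:
      containers:
      - name: cache-service
        image: <cache-service-image>   # placeholder: use the image from the Code Repository
        ports:
        - containerPort: 80
        env:
        - name: REDIS_ADDR             # placeholder: variable name and format are assumptions
          value: "redis.default.svc.cluster.local:6379"
---
apiVersion: v1
kind: Service
metadata:
  name: asm-wasm-cache-service-example
  namespace: default
spec:
  selector:
    app: asm-wasm-cache-service-example
  ports:
  - port: 80
    targetPort: 80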
Run the following command to deploy the LLM Cache service by using the kubeconfig file of the cluster on the data plane.
kubectl apply -f cache.yaml
ASM allows you to customize the LLM Cache service and provides a Redis-based exact-match implementation by default. For more information, see Code Repository. For further customization, refer to this example to develop your own LLM Cache service.
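Before installing the plug-in, you can confirm that the cache service is running. The commands below assume the resource name asm-wasm-cache-service-example in the default namespace, which matches the Service referenced by the plug-in configuration in Step 2; adjust them if you deploy the cache service under a different name or namespace.
kubectl get deployment asm-wasm-cache-service-example -n default
kubectl get service asm-wasm-cache-service-example -n default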
Step 2: Deploy the LLM Cache plug-in in ASM
Create a file named wasm.yaml with the following content.
apiVersion: extensions.istio.io/v1alpha1
kind: WasmPlugin
metadata:
  name: llm-cache
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  failStrategy: FAIL_OPEN
  imagePullPolicy: IfNotPresent
  match:
  - mode: CLIENT
    ports:
    - number: 80
  phase: STATS
  pluginConfig:
    host_match: "dashscope.aliyuncs.com" # Supports regular expression matching
    path_match: ".*" # Supports regular expression matching
    service: "asm-wasm-cache-service-example.default.svc.cluster.local"
    port: "80"
  priority: -10
  url: registry-cn-hangzhou.ack.aliyuncs.com/acs/asm-wasm-llm-cache:v1.22.6.9-g6fc05c9-aliyun
Run the following command to deploy the LLM Cache plug-in on the gateway.
kubectl apply -f wasm.yaml
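Optionally, you can confirm that the WasmPlugin resource was created by running the command below against the same kubeconfig you used to apply wasm.yaml.
kubectl get wasmplugin llm-cache -n istio-system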
Step 3: Verification
Run the following command twice in succession for verification. Replace ${ASM Gateway IP} with the IP address of your ASM ingress gateway.
time curl --location '${ASM Gateway IP}:80' \
--header 'Content-Type: application/json' \
--header "host: test.com" \
--data '{
"messages": [
{"role": "user", "content": "Please introduce yourself"}
]
}'
Expected output:
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1732068820,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-xxxxxxxxxxxxxx"}%
real 0m 4.09s
user 0m 0.00s
sys 0m 0.00s
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1732068930,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-xxxxxxxxxxxxxx"}%
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
The output shows that the response time of the second request is significantly shorter than that of the first, which indicates that the LLM Cache service has taken effect.
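Because the default cache service stores entries in Redis, you can inspect or clear cached entries directly in Redis while testing. The commands below are a sketch assuming the test Redis Deployment from the prerequisites and that the default cache service keeps its entries in the current Redis database; the exact key format depends on the cache service implementation.
kubectl exec deploy/redis -- redis-cli KEYS '*'    # list cached keys (key format depends on the cache service)
kubectl exec deploy/redis -- redis-cli FLUSHDB     # clear the cache so the next request hits the LLM service again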