LLM applications offer powerful capabilities but can incur significant computational costs and latency. To improve efficiency and reduce latency, an increasing number of LLM applications adopt cache policies. By storing and reusing results within a period of time, LLM Cache significantly reduces redundant computation and optimizes response times, boosting overall system performance. This topic describes how to seamlessly integrate LLM Cache into Service Mesh (ASM).
Overview
The LLM Cache feature in ASM is built on the Wasm extension mechanism. When the mesh proxy receives an LLM request, the LLM Cache plug-in first queries the custom cache service. If a cached result exists, the mesh proxy returns it directly and does not forward the request to the external LLM service. If no result is cached in your custom cache service, the request is marked for cache updating. When the response from the external LLM service arrives, the LLM Cache plug-in forwards it to the cache service. The following diagram illustrates how the LLM Cache plug-in works, using a gateway as an example:
ASM simplifies cache integration by providing a default cache service that performs exact string matching and uses Redis as its storage backend. The mesh proxy communicates with your custom cache service over standard HTTP, so you can tailor the matching logic and storage format of the LLM Cache as needed. This topic uses the default cache service to demonstrate how to deploy the LLM request cache feature in ASM.
Prerequisites
Add a cluster to an ASM instance of version 1.22.6.72 or later.
Read and complete Step 1 to Step 4 in External clients access LLM services through ASM ingress gateway to create an LLMProvider and its associated resources.
Deploy a Redis service within the cluster or locally (a minimal in-cluster sketch is shown after this list). Alternatively, you can use Tair (Redis OSS-compatible) to quickly set up a Redis instance. For more information, see Overview.
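If you only need a Redis instance for testing this walkthrough, a plain Deployment and Service are sufficient. The following commands are a minimal sketch, assuming the public redis image and the default namespace; adjust the image, namespace, and any authentication settings to match your environment.
kubectl create deployment redis --image=redis:7
kubectl expose deployment redis --port=6379 --target-port=6379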
Step 1: Deploy the default LLM Cache service in ASM
Create a file named cache.yaml that defines the default LLM Cache service.
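The official manifest for the default cache service is maintained in the Code Repository referenced in this step. The sketch below only illustrates its general shape under the assumption that the cache service listens on port 80 and reads its Redis address from an environment variable; the container image, the label names, and the REDIS_ADDR variable are placeholders that you must replace with the actual values from the repository and your own Redis deployment. The Service name and namespace match the address referenced by the plug-in configuration in Step 2.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: asm-wasm-cache-service-example
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: asm-wasm-cache-service-example
  template:
    metadata:
      labels:
        app: asm-wasm-cache-service-example
    spec:
      containers:
      - name: cache-service
        image: <cache-service-image>   # placeholder: use the image from the Code Repository
        ports:
        - containerPort: 80
        env:
        - name: REDIS_ADDR             # placeholder: variable name and format are assumptions
          value: "redis.default.svc.cluster.local:6379"
---
apiVersion: v1
kind: Service
metadata:
  name: asm-wasm-cache-service-example
  namespace: default
spec:
  selector:
    app: asm-wasm-cache-service-example
  ports:
  - port: 80
    targetPort: 80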
Run the following command to deploy the LLM Cache service by using the kubeconfig file of the cluster on the data plane.
kubectl apply -f cache.yaml
ASM allows you to customize the LLM Cache service and provides a Redis-based exact-match implementation by default. For more information, see Code Repository. For further customization, refer to this example to develop your own LLM Cache service.
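Before installing the plug-in, you can confirm that the cache service is running. The commands below assume the resource name asm-wasm-cache-service-example in the default namespace, which matches the Service referenced by the plug-in configuration in Step 2; adjust them if you deploy the cache service under a different name or namespace.
kubectl get deployment asm-wasm-cache-service-example -n default
kubectl get service asm-wasm-cache-service-example -n default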
Step 2: Deploy the LLM Cache plug-in in ASM
Create a file named wasm.yaml with the following content.
apiVersion: extensions.istio.io/v1alpha1
kind: WasmPlugin
metadata:
  name: llm-cache
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  failStrategy: FAIL_OPEN
  imagePullPolicy: IfNotPresent
  match:
  - mode: CLIENT
    ports:
    - number: 80
  phase: STATS
  pluginConfig:
    host_match: "dashscope.aliyuncs.com" # Supports regular expression matching
    path_match: ".*" # Supports regular expression matching
    service: "asm-wasm-cache-service-example.default.svc.cluster.local"
    port: "80"
  priority: -10
  url: registry-cn-hangzhou.ack.aliyuncs.com/acs/asm-wasm-llm-cache:v1.22.6.9-g6fc05c9-aliyun
Run the following command to deploy the LLM Cache plug-in on the gateway.
kubectl apply -f wasm.yaml
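Optionally, you can confirm that the WasmPlugin resource was created by running the command below against the same kubeconfig you used to apply wasm.yaml.
kubectl get wasmplugin llm-cache -n istio-system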
Step 3: Verification
Run the following command twice in succession for verification. Replace ${ASM Gateway IP} with the IP address of your ASM ingress gateway.
time curl --location '${ASM Gateway IP}:80' \
--header 'Content-Type: application/json' \
--header "host: test.com" \
--data '{
"messages": [
{"role": "user", "content": "Please introduce yourself"}
]
}'
Expected output:
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1732068820,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-xxxxxxxxxxxxxx"}%
real 0m 4.09s
user 0m 0.00s
sys 0m 0.00s
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1732068930,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-xxxxxxxxxxxxxx"}%
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
The output shows that the response time of the second request is significantly shorter than that of the first, which indicates that the LLM Cache service has taken effect.
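Because the default cache service stores entries in Redis, you can inspect or clear cached entries directly in Redis while testing. The commands below are a sketch assuming the test Redis Deployment from the prerequisites and that the default cache service keeps its entries in the current Redis database; the exact key format depends on the cache service implementation.
kubectl exec deploy/redis -- redis-cli KEYS '*'    # list cached keys (key format depends on the cache service)
kubectl exec deploy/redis -- redis-cli FLUSHDB     # clear the cache so the next request hits the LLM service again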