Alibaba Cloud Service Mesh: Implement the LLM request caching feature in ASM

Last Updated: Jan 13, 2025

LLM applications offer powerful capabilities but can incur significant computational costs and latency. To improve efficiency and reduce latency, an increasing number of LLM applications adopt cache policies. By storing and reusing computation results for a period of time, LLM Cache significantly cuts down on redundant calculations and optimizes response times, boosting overall system performance. This topic describes how to seamlessly integrate LLM Cache into Alibaba Cloud Service Mesh (ASM).

Overview

The LLM Cache feature in ASM is built on the Wasm extension mechanism. When the mesh proxy receives an LLM request, the LLM Cache plug-in first checks the custom cache service. If a cached result exists, the mesh proxy returns the cached result and does not send the request to the external LLM service. If no result is cached in your custom cache service, the request is marked for cache updating. Upon receiving the response from the external LLM service, the LLM Cache plug-in forwards the response to the cache service. The following diagram illustrates how the LLM Cache plug-in works, using a gateway as an example:

(Diagram: LLM Cache plug-in workflow, using a gateway as an example)

ASM simplifies caching integration by providing a default cache service that performs string matching and is backed by Redis storage. The mesh proxy communicates with your custom cache service over standard HTTP, so you can tailor the matching logic and storage format of the LLM Cache to your needs. This topic uses this default cache service to demonstrate how to deploy the LLM request cache feature in ASM.
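
In pseudocode, the flow described above looks roughly like the following sketch. This is a conceptual illustration only, not the actual Wasm plug-in code: the /lookup and /update endpoints and their payloads are assumptions made for readability, and the real HTTP interface between the mesh proxy and the cache service is defined by the ASM example code repository.

# Conceptual sketch of the LLM Cache flow (not the actual plug-in implementation).
# The /lookup and /update endpoints and payloads are illustrative assumptions.
import requests

CACHE_SERVICE = "http://asm-wasm-cache-service-example.default.svc.cluster.local:80"
LLM_SERVICE = "https://dashscope.aliyuncs.com"

def handle_llm_request(path: str, body: bytes, headers: dict) -> bytes:
    # 1. Check the cache service for a previously stored response.
    lookup = requests.post(f"{CACHE_SERVICE}/lookup", data=body, timeout=1)
    if lookup.status_code == 200:
        # 2a. Cache hit: return the cached response; the external LLM service is not called.
        return lookup.content
    # 2b. Cache miss: forward the request to the external LLM service.
    llm_response = requests.post(f"{LLM_SERVICE}{path}", data=body, headers=headers)
    # 3. Forward the fresh response to the cache service so later requests can reuse it.
    requests.post(
        f"{CACHE_SERVICE}/update",
        json={"request": body.decode("utf-8"), "response": llm_response.text},
        timeout=1,
    )
    return llm_response.content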

Prerequisites

Before you begin, make sure that an ASM instance and a Kubernetes cluster on the data plane are available, that an ASM ingress gateway is deployed and configured to proxy requests to the external LLM service (such as dashscope.aliyuncs.com), and that a Redis instance is available for the default cache service.

Step 1: Deploy the default LLM Cache service in ASM

  1. Create a file named cache.yaml with the following content.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: asm-wasm-cache-service-example
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: asm-wasm-cache-service-example
      labels:
        app: asm-wasm-cache-service-example
        service: asm-wasm-cache-service-example
    spec:
      ports:
      - name: http
        port: 80
        targetPort: 8080
      selector:
        app: asm-wasm-cache-service-example
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: asm-wasm-cache-service-example
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: asm-wasm-cache-service-example
          version: v1
      template:
        metadata:
          labels:
            app: asm-wasm-cache-service-example
            version: v1
          annotations:
            sidecar.istio.io/inject: "true"
        spec:
          tolerations:
          - key: "node.kubernetes.io/disk-pressure"
            operator: "Equal"
            value: ""
            effect: "NoSchedule"
          serviceAccountName: asm-wasm-cache-service-example
          containers:
          - image: registry-cn-hangzhou.ack.aliyuncs.com/acs/asm-wasm-cache-service-example:v1.22.6.9-g6fc05c9-aliyun # Example image address of the default Redis implementation provided by ASM.
            imagePullPolicy: IfNotPresent
            name: asm-wasm-cache-service-example
            ports:
            - containerPort: 8080
            env:
            - name: REDIS_ADDRESS
              value: ${Custom Redis address and port}
            - name: REDIS_PASSWORD
              value: ${Custom Redis password}
            - name: REDIS_EXPIRED_SECONDS
              value: ${Validity period of cached entries in Redis, in seconds, such as "600"}
            resources:
              limits:
                memory: 256Mi
                cpu: 200m
              requests:
                memory: 64Mi
                cpu: 50m
  2. Run the following command to deploy the LLM Cache service by using the kubeconfig file of the cluster on the data plane.

    kubectl apply -f cache.yaml

    ASM allows you to customize the LLM Cache service; the default implementation provides exact matching based on Redis. For more information, see Code Repository. To customize the matching logic or storage format, you can develop your own LLM Cache service by referring to this example, as illustrated by the sketch below.
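
The following is a minimal sketch of what a custom cache service could look like, assuming the same illustrative /lookup and /update interface as the sketch in the Overview section. The actual HTTP interface expected by the LLM Cache plug-in is defined in the ASM example code repository, so adapt the endpoints and payload format to that interface and replace the in-memory dictionary with Redis or another store.

# Minimal sketch of a custom LLM Cache service (illustrative interface only).
import hashlib
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

CACHE = {}  # replace with Redis or another persistent store in a real deployment

class CacheHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        if self.path == "/lookup":
            # Exact match: hash the raw request body and look it up.
            key = hashlib.sha256(body).hexdigest()
            cached = CACHE.get(key)
            if cached is None:
                self.send_response(404)
                self.end_headers()
            else:
                self.send_response(200)
                self.end_headers()
                self.wfile.write(cached)
        elif self.path == "/update":
            # Store the LLM response under the hash of the original request.
            record = json.loads(body)
            key = hashlib.sha256(record["request"].encode("utf-8")).hexdigest()
            CACHE[key] = record["response"].encode("utf-8")
            self.send_response(200)
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Listen on the same port as the default service (containerPort 8080).
    HTTPServer(("0.0.0.0", 8080), CacheHandler).serve_forever()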

Step 2: Deploy the LLM Cache plug-in in ASM

  1. Create a file named wasm.yaml with the following content. The host_match and path_match fields in pluginConfig support regular expressions; see the sketch after this procedure.

    apiVersion: extensions.istio.io/v1alpha1
    kind: WasmPlugin
    metadata:
      name: llm-cache
      namespace: istio-system
    spec:
      selector:
        matchLabels:
          istio: ingressgateway
      failStrategy: FAIL_OPEN
      imagePullPolicy: IfNotPresent
      match:
      - mode: CLIENT
        ports:
        - number: 80
      phase: STATS
      pluginConfig:
        host_match: "dashscope.aliyuncs.com" # Supports regular expression matching
        path_match: ".*" # Supports regular expression matching
        service: "asm-wasm-cache-service-example.default.svc.cluster.local"
        port: "80"
      priority: -10
      url: registry-cn-hangzhou.ack.aliyuncs.com/acs/asm-wasm-llm-cache:v1.22.6.9-g6fc05c9-aliyun
  2. Run the following command to deploy the LLM Cache plug-in on the gateway.

    kubectl apply -f wasm.yaml
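
Before applying the WasmPlugin, you can sanity-check the host_match and path_match patterns locally. The snippet below assumes full-string regular-expression matching against the request host and path for illustration; the exact matching semantics are determined by the plug-in itself.

# Quick local sanity check of the regular expressions used in pluginConfig.
# Assumes full-string matching for illustration; the plug-in defines the exact semantics.
import re

host_match = r"dashscope.aliyuncs.com"
path_match = r".*"  # for example, r".*/chat/completions" to cache only chat completion requests

print(bool(re.fullmatch(host_match, "dashscope.aliyuncs.com")))                   # True
print(bool(re.fullmatch(path_match, "/compatible-mode/v1/chat/completions")))     # True
print(bool(re.fullmatch(r".*/chat/completions", "/api/v1/services/embeddings")))  # False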

Step 3: Verification

Run the following command twice in succession for verification.

time curl --location '${ASM Gateway IP}:80' \
--header 'Content-Type: application/json' \
--header "host: test.com" \
--data '{
    "messages": [
        {"role": "user", "content": "Please introduce yourself"}
    ]
}'

Expected output:

{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1732068820,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-xxxxxxxxxxxxxx"}%   
real    0m 4.09s
user    0m 0.00s
sys     0m 0.00s
{"choices":[{"message":{"role":"assistant","content":"Hello! I am Qwen, a pre-trained language model developed by Alibaba Cloud. My purpose is to assist users in generating various types of text, such as articles, stories, poems, and answering questions by leveraging my extensive knowledge and understanding of context. Although I'm an AI, I don't have a physical body or personal experiences like human beings do, but I've been trained on a vast corpus of text data, which allows me to engage in conversations, provide information, or help with various tasks to the best of my abilities. So, feel free to ask me anything, and I'll do my best to provide helpful and informative responses!"},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","usage":{"prompt_tokens":12,"completion_tokens":130,"total_tokens":142},"created":1732068930,"system_fingerprint":null,"model":"qwen1.5-72b-chat","id":"chatcmpl-3608dcd5-e3ad-9ade-bc70-xxxxxxxxxxxxxx"}%   
real    0m 0.00s
user    0m 0.00s
sys     0m 0.00s

The output shows that the response time of the second request is greatly reduced, which indicates that the LLM Cache service has taken effect.
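
If you prefer a scripted check, the following sketch sends the same request twice and prints the elapsed time of each attempt; replace the placeholder gateway address with your ASM gateway IP. With the cache in effect, the second request should return almost immediately.

# Send the same LLM request twice through the ASM gateway and compare latencies.
import time
import requests

GATEWAY = "http://${ASM Gateway IP}:80"  # replace with your ASM gateway address
HEADERS = {"Content-Type": "application/json", "host": "test.com"}
PAYLOAD = {"messages": [{"role": "user", "content": "Please introduce yourself"}]}

for attempt in (1, 2):
    start = time.time()
    response = requests.post(GATEWAY, headers=HEADERS, json=PAYLOAD)
    print(f"request {attempt}: {time.time() - start:.2f}s, status {response.status_code}")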