For scenarios with highly repetitive AI requests, AI Gateway provides upgraded caching capabilities that combine Redis-based precise cache and DashVector-based semantic cache to reduce costs and improve the efficiency of large language model (LLM) calls. This topic describes the features, benefits, and configuration methods for the semantic and precise cache policies.
Key concepts of semantic cache
To help you better understand the semantic caching mechanism, the following section describes its key concepts:
Vector embedding
The core of semantic cache is using vector technology to match user intent. When a user makes a new request, the system first converts the text into a high-dimensional vector through text embedding. This process is called vector embedding.
These vectors accurately capture the semantic features of the text. For example, although the words 'Apple phone' and 'iPhone' are different, their vectors are very close in the vector space. This vectorized representation overcomes the limitation of traditional precise cache, which requires an exact text match. This allows the system to understand that 'tomorrow's weather in Beijing' and 'Beijing's weather forecast for the next 24 hours' are semantically similar.
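As a rough illustration of this step (not the gateway's internal implementation), the following Python sketch uses the open-source sentence-transformers library as a stand-in embedding model; the model name is an assumption chosen for demonstration only.

```python
# Toy illustration of vector embedding; the gateway uses its own embedding
# service, and the model below is only a stand-in for demonstration.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed demo model
vectors = model.encode(["Apple phone", "iPhone"])
print(vectors.shape)  # (2, 384): each text becomes one 384-dimensional vector
```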
Vector comparison
After the vectors are generated, the system uses the cosine similarity algorithm to calculate the angle between the new request's vector and the vectors of cached requests. When the similarity reaches a preset threshold, it triggers a cached response. A threshold value between 0.8 and 0.9 is recommended, but you can adjust it as needed.
This mechanism allows the system to intelligently identify synonymous expressions. For example, in an E-commerce customer service scenario, a user might ask 'When will this package arrive?' or 'What is the estimated delivery time?'. The system can match both questions to the same cached answer: 'According to the logistics information, your package will be delivered before 3:00 PM tomorrow.'
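The decision logic can be sketched as follows. This is a simplified illustration with made-up vectors and an assumed threshold of 0.85, not the gateway's actual code:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(query_vec, cache, threshold=0.85):
    """Return the most similar cached answer if it clears the threshold,
    otherwise signal a miss so the LLM is called instead."""
    best_answer, best_score = None, -1.0
    for cached_vec, answer in cache:
        score = cosine_similarity(query_vec, cached_vec)
        if score > best_score:
            best_answer, best_score = answer, score
    return (best_answer, best_score) if best_score >= threshold else (None, best_score)

# Made-up demo vectors standing in for real embeddings.
rng = np.random.default_rng(0)
v = rng.normal(size=8)
cache = [(v, "Your package will be delivered before 3:00 PM tomorrow.")]
near_duplicate = v + rng.normal(scale=0.05, size=8)  # paraphrase of the same intent
print(lookup(near_duplicate, cache))                 # high similarity -> cache hit
```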
Vector database
To manage vectors efficiently, AI Gateway uses DashVector. This vector database uses the Hierarchical Navigable Small World (HNSW) algorithm to retrieve results from millions of vectors in milliseconds and supports dynamic updates. Compared with traditional precise cache, semantic cache not only matches an unbounded number of semantic variants but also improves resource utilization because semantically similar items share storage space.
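For reference, a nearest-neighbor lookup against DashVector might look like the following hedged sketch using its Python SDK; the collection name, credentials, and vector dimension are placeholders, and the exact SDK signatures should be verified against the DashVector documentation.

```python
# Hedged sketch of a nearest-neighbor lookup in DashVector; verify the SDK
# details against the official documentation before relying on them.
import dashvector

client = dashvector.Client(
    api_key="YOUR_DASHVECTOR_API_KEY",   # placeholder credential
    endpoint="YOUR_CLUSTER_ENDPOINT",    # placeholder endpoint
)
collection = client.get("ai-cache-demo")         # hypothetical collection name

query_vector = [0.1] * 1536                      # must match the collection dimension
result = collection.query(query_vector, topk=1)  # HNSW-backed nearest-neighbor search
print(result)                                    # closest cached entry, if any
```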
Strategy comparison
| Strategy | Precise cache (exact match) | Semantic cache |
| --- | --- | --- |
| Matching method | Exact string match. | Distance in vector space, measured by cosine similarity. |
| Fault tolerance | Does not recognize synonymous or nearly synonymous expressions. | Supports fuzzy matching for synonyms, sentence variations, and more. |
| Response speed | Millisecond-level (local key-value query). | Millisecond-level (nearest neighbor search in a vector database). |
| Typical scenarios | Standard FAQs and API calls with fixed parameters. | Natural language instructions, multi-round conversations, and fuzzy queries. |
| Cost-effectiveness | Ideal for high-frequency, repetitive requests. | Ideal for scenarios with high semantic diversity and similar user intents. |
Features and benefits
Dual-mode cache system: You can choose between two cache types and dynamically adjust the matching threshold as needed.
Exact match: Based on the Redis key-value storage architecture, it provides millisecond-level responses to identical requests.
Semantic cache: Uses the DashVector vector database (Alibaba Cloud Vector Retrieval Service) to intelligently match semantically similar requests. The similarity threshold is adjustable. This overcomes the limitations of traditional string matching.
Reduced redundant computation: For identical AI requests, the cached response data is returned directly, which avoids redundant calls to the large language model.
Performance improvement: By quickly retrieving results from the cache, it significantly reduces the response time and the load on backend servers. Semantic cache enables intent-level responses, which significantly improves user satisfaction and experience.
Expanded scenario coverage: It is suitable for standard scenarios, such as customer service systems and knowledge base queries, and supports processing natural language instruction variants, such as 'tomorrow's weather' and 'weather forecast for the next 24 hours'.
Log monitoring: Provides analysis of cache hit ratio metrics.
Procedure
Log on to the AI Gateway console and choose Instance. In the top menu bar, select a region, then click the target instance ID.
In the navigation pane on the left, choose Model API, then click the target API name to go to the API Details page.
Click Policies and Plug-ins, enable the Cache switch, and configure the parameters.
AI Gateway has upgraded its caching capabilities to support Semantic and Exact Match. Select the appropriate cache mode based on the Strategy comparison.
Semantic cache
Important: If a request contains the `x-higress-skip-ai-cache: on` request header, the request bypasses the cache. It is forwarded directly to the backend service, and the response is not cached.
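For example, a client can bypass the cache for a single call by sending that header. In the sketch below, the gateway URL, model name, and credential are placeholders, not values defined by this topic:

```python
# Minimal sketch: send one request that skips the AI cache.
# The gateway URL, model name, and credential are placeholders.
import requests

resp = requests.post(
    "https://your-gateway-endpoint/v1/chat/completions",  # placeholder URL
    headers={
        "Authorization": "Bearer YOUR_API_KEY",           # placeholder credential
        "x-higress-skip-ai-cache": "on",                  # bypass the cache for this call
    },
    json={"model": "your-model",
          "messages": [{"role": "user", "content": "Hello"}]},
    timeout=30,
)
print(resp.status_code, resp.text)
```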
Cache Key Strategy: Select the default option, Latest Query Only, or select Integrate Historical Queries based on your requirements.
Text Vectorization:
AI Services: Select the AI service that you have created. If you do not have an AI service, click Create Service to create one.
Model Name: Select the name of the model that you want to use.
Timeout Period: Set the timeout period. The default value is 5000 ms.
Vector Database Configuration:
Service Provider: If you have not activated DashVector (Vector Retrieval Service), click Go to console to open the service activation page. After you activate the service, create a collection and record the collection name for later use.
Important: When you create a collection, select Cosine as the distance measure. The vector dimension must match the output dimension of the text embedding model. For more information about the output vector dimensions of the text embedding models on the Alibaba Cloud Model Studio platform, see General-purpose text embeddings. A sketch of creating such a collection follows this configuration list.
Endpoint: Enter your DashVector endpoint.
Collection Name: Enter the name of the collection that you created.
API Key: The access credential. For more information, see Manage API keys.
Vector Similarity Threshold: This value determines how strictly queries are matched to cached content. The value ranges from 0 to 1; a value between 0.8 and 0.9 is recommended. A larger value requires greater semantic similarity for a match. For more information, see Vector similarity threshold configuration guide.
Timeout: Set the timeout period. The default value is 3000 ms.
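As referenced in the Important note above, the following hedged sketch shows one way to create such a collection with the DashVector Python SDK; the collection name and dimension are assumptions, and the parameter names should be checked against the SDK documentation.

```python
# Hedged sketch: create a DashVector collection for semantic cache.
# Verify parameter names and values against the DashVector SDK documentation.
import dashvector

client = dashvector.Client(
    api_key="YOUR_DASHVECTOR_API_KEY",  # placeholder credential
    endpoint="YOUR_CLUSTER_ENDPOINT",   # placeholder endpoint
)

# The dimension must equal the embedding model's output dimension, and the
# distance metric must be Cosine, as required by the note above.
ret = client.create(name="ai-cache-demo", dimension=1536, metric="cosine")
print(ret)  # check for success before configuring the collection name here
```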
Exact match
Important: In the Redis console, add the VPC CIDR block of the gateway instance to the whitelist. If a request contains the `x-higress-skip-ai-cache: on` request header, the request bypasses the cache. It is forwarded directly to the backend service, and the response is not cached.

Cache Key Strategy: Select Latest Query Only (default) or Integrate Historical Queries as needed.
Redis cache configuration:
Redis service URL: Enter your Redis endpoint.
Port: Enter your port number.
Access Method: Select the method for accessing the Redis service. The options are Account+Password, Password-only, and Password-free.
Database Account: If you select Account+Password, enter the database account.
Database Password: If you select Account+Password or Password-only, enter the database password.
Database No.: The number of the Redis database to connect to.
Duration (seconds): The default cache duration is 1800 seconds. During this period, if the API receives an identical AI request, the LLM is not called, and the cached response is returned directly. A sketch of this pattern follows the procedure.
Confirm the configuration and click Save.
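For intuition about what Duration (seconds) controls, here is a minimal sketch of the exact-match pattern, assuming a hypothetical call_llm function and placeholder Redis connection details; the gateway's real key scheme may differ.

```python
# Illustrative exact-match cache pattern; not the gateway's actual key scheme.
import hashlib
import json
import redis

r = redis.Redis(host="your-redis-host", port=6379, password="...", db=0)

def cached_completion(request_body: dict, call_llm, ttl: int = 1800) -> str:
    """Return the cached response for an identical request; otherwise call
    the LLM and cache its answer for `ttl` seconds (default 1800)."""
    key = "ai-cache:" + hashlib.sha256(
        json.dumps(request_body, sort_keys=True).encode()
    ).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return hit.decode()          # cache hit: the LLM is not called
    answer = call_llm(request_body)  # cache miss: invoke the model
    r.setex(key, ttl, answer)        # expire after the configured duration
    return answer
```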
Vector similarity threshold configuration guide
Core concepts
The vector similarity threshold is a key parameter that controls the matching sensitivity of the semantic cache. It determines how strictly queries are matched to cached content.
Value range: A value from 0.0 (completely dissimilar) to 1.0 (completely similar).
Recommended range: 0.8 to 0.9 (adjustable based on business requirements). Do not set the value to less than 0.8.
Lower threshold (such as 0.75): Even if users use different expressions, a cached result is returned as long as the semantics are similar.
Higher threshold (such as 0.99): A cached result is returned only when the user uses an almost identical expression.
Why is a value of 0.8 or higher recommended?
When the threshold is lower than 0.8, the system may misjudge irrelevant queries as matches. This can lead to false positives (the system incorrectly returns irrelevant results), which can negatively affect the user experience or business accuracy.
Example of effect comparison
| Similarity | Query example | Feature description |
| --- | --- | --- |
| 1.0 | "When will my package arrive?" | Use this entry as the benchmark for comparison. |
| 0.89 | "What is the estimated delivery time for my package?" | Matches "When will my package arrive?". A hit is recorded. |
| 0.86 | "Can my package be delivered today?" | Matches "When will my package arrive?". A hit is recorded. |
| 0.83 | "Where can my package be delivered today?" | No hit is recorded. |
Tuning suggestions
Benchmark testing: You can start with the default value of 0.8 and gradually decrease or increase the threshold to observe changes in the cache hit rate.
Scenario adaptation:
For time-sensitive queries (such as real-time logistics tracking), a higher similarity threshold is recommended.
For scenarios that require highly standardized answers (such as FAQ responses), a lower similarity threshold can be used.
Performance balance: Raising the threshold makes matching stricter, which lowers the cache hit rate and increases the number of LLM calls; lowering it increases hits but raises the risk of irrelevant matches. A threshold-sweep sketch follows this list.
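One simple way to run the benchmark test described above is to sweep candidate thresholds over an offline set of query pairs labeled as same-intent or not; the data below is made up for illustration.

```python
# Illustrative threshold sweep over a labeled evaluation set.
# Each tuple is (similarity_score, same_intent) collected offline; values are made up.
pairs = [(0.95, True), (0.89, True), (0.86, True), (0.83, False), (0.78, False)]

for threshold in (0.80, 0.85, 0.90):
    hits = sum(1 for score, same in pairs if score >= threshold and same)
    false_pos = sum(1 for score, same in pairs if score >= threshold and not same)
    total_same = sum(1 for _, same in pairs if same)
    print(f"threshold={threshold:.2f}  "
          f"hit_rate={hits / total_same:.0%}  false_positives={false_pos}")
```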
FAQs
Q: How do I determine the optimal threshold?
A: By comparing the results of A/B testing:
Calculate the cache hit ratio versus the LLM call cost.
Collect user feedback on irrelevant cached answers (for example, "Why did my new question return an old answer?").
Monitor the response time fluctuations of key queries (such as "real-time queries" versus "historical queries").
We recommend that you re-evaluate the threshold settings periodically (for example, monthly) based on the latest business data. During peak hours, you can consider lowering the threshold to handle the surge in query volume.
Cache hit ratio
You must activate Simple Log Service before you can query the cache hit ratio.
The following sample search statement shows how to query the cache hit ratio at the gateway level:
```sql
cluster_id:{your-gatewayId} and inner-ai-cache-{your-gatewayId} | SELECT
  SUM(CASE WHEN content LIKE '%cache hit for key%' OR content LIKE '%key accepted%' THEN 1 ELSE 0 END) AS hit_count,
  SUM(CASE WHEN content LIKE '%cache miss for key%' OR content LIKE '%score not meet the threshold%' THEN 1 ELSE 0 END) AS miss_count,
  SUM(CASE WHEN content LIKE '%cache hit for key%' OR content LIKE '%key accepted%' THEN 1 ELSE 0 END) * 100.0 /
    NULLIF(SUM(CASE WHEN content LIKE '%cache hit for key%' OR content LIKE '%key accepted%' OR content LIKE '%cache miss for key%'
    OR content LIKE '%score not meet the threshold%' THEN 1 ELSE 0 END), 0) AS hit_rate
```
Replace `{your-gatewayId}` with your gateway instance ID. Note that the first placeholder requires the `gw-` prefix, but the second does not. If you open the log query page from the cache settings, the `cluster_id` is automatically included. In this case, you only need to paste the rest of the search statement.
The query result shows the cache hit ratio.

Examples
When Precise Cache is enabled, only identical queries are matched:

When Semantic Cache is enabled, semantically similar queries can also be matched. If the semantic similarity is lower than the threshold, a match is not found:

