Microservices Engine:AI RAG (Retrieval-Augmented Generation)

Last Updated: Mar 11, 2026

The ai-rag plug-in adds Retrieval-Augmented Generation (RAG) to a cloud-native gateway by connecting Alibaba Cloud DashVector with Qwen. Instead of relying solely on a large language model's (LLM) training data, the plug-in retrieves relevant documents from a vector database at request time and injects them into the prompt -- producing responses grounded in your actual data.

By handling RAG at the gateway level, you centralize retrieval logic in your infrastructure rather than embedding it in each application. This simplifies your RAG workflow, limits vector database access to the gateway, and lets platform teams manage configurations without requiring changes to application code.

How it works

The RAG workflow has two phases: data preparation (offline) and retrieval + generation (per request).

Data preparation (offline)

  1. Extract text from your source documents (PDFs, articles, knowledge base entries).

  2. Split the text into semantically meaningful chunks.

  3. Convert each chunk into a vector embedding and store it in an Alibaba Cloud DashVector collection.
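The splitting step above can be sketched in Python. The fixed-size, character-based splitter and the overlap value below are illustrative choices, not the plug-in's actual chunking algorithm; the embedding and DashVector insertion are shown only as comments because they depend on your embedding model and SDK:

```python
def chunk_text(text: str, max_chars: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size chunks with overlap, so content cut at a
    boundary still appears intact at the start of the next chunk."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so chunks overlap
    return chunks

# Each chunk would then be embedded and written to the DashVector
# collection -- pseudocode, the call names here are assumptions:
#   vector = embed(chunk)                                   # any embedding model
#   collection.insert(id=..., vector=vector, fields={"raw": chunk})
```

The overlap is a common trade-off: larger overlap reduces the chance that a relevant sentence is split across chunks, at the cost of storing some text twice.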

Retrieval and generation (per request)

  1. A user sends a query to the cloud-native gateway.

  2. The ai-rag plug-in converts the query into a vector embedding.

  3. It searches Alibaba Cloud DashVector for the top-k most relevant document chunks, filtered by a vector distance threshold.

  4. It injects the retrieved chunks into the prompt and forwards the enriched prompt to Qwen.

  5. Qwen generates a response grounded in both the retrieved context and its own knowledge.
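Steps 3 and 4 above can be sketched as plain Python. The match structure, field names, and prompt template below are illustrative assumptions -- the plug-in's real prompt format is not documented here; only the top-k and distance-threshold semantics mirror the configuration parameters:

```python
def build_enriched_prompt(query: str, matches: list[dict],
                          topk: int, threshold: float) -> str:
    """Keep the topk closest chunks whose vector distance is within the
    threshold, then inject them into the prompt forwarded to the LLM."""
    kept = sorted(
        (m for m in matches if m["distance"] <= threshold),
        key=lambda m: m["distance"],
    )[:topk]
    context = "\n\n".join(m["text"] for m in kept)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

matches = [  # distances as returned by a vector search (smaller = closer)
    {"text": "Chunk about the accident site", "distance": 0.21},
    {"text": "Unrelated weather report", "distance": 0.63},
    {"text": "Chunk about casualties", "distance": 0.35},
]
prompt = build_enriched_prompt("What happened?", matches, topk=2, threshold=0.4)
```

With `topk=2` and `threshold=0.4`, the weather chunk is dropped (distance 0.63 exceeds the threshold) and the two remaining chunks are injected in order of closeness.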

RAG architecture

Runtime attributes

Attribute           Value
Execution stage     default stage
Execution priority  400

Configuration

Configure connections to two backend services: Qwen (the LLM) and Alibaba Cloud DashVector (the vector database). All parameters are required.

Qwen (LLM) parameters

Parameter              Type    Description
dashscope.apiKey       string  API key for authenticating with Qwen.
dashscope.serviceFQDN  string  Fully qualified domain name of the Qwen service within the gateway's service registry.
dashscope.servicePort  int     Port of the Qwen service.
dashscope.serviceHost  string  Domain name for accessing Qwen (used in the HTTP Host header).

DashVector (vector database) parameters

Parameter               Type    Description
dashvector.apiKey       string  API key for authenticating with Alibaba Cloud DashVector.
dashvector.serviceFQDN  string  Fully qualified domain name of the DashVector service within the gateway's service registry.
dashvector.servicePort  int     Port of the DashVector service.
dashvector.serviceHost  string  Domain name for accessing Alibaba Cloud DashVector (used in the HTTP Host header).
dashvector.collection   string  Name of the DashVector collection to search.
dashvector.topk         int     Number of top matching document chunks to retrieve during vector search. A higher value provides more context but increases latency.
dashvector.threshold    float   Maximum vector distance allowed. Documents with a vector distance above this value are filtered out. Lower values return more precise matches; higher values return broader results.
dashvector.field        string  Field name in the DashVector collection that stores the document text.
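Because every parameter is required, it can be worth checking a configuration before applying it to the gateway. A minimal sketch -- the required-key lists mirror the parameter tables above, but this validator is an illustration, not part of the plug-in:

```python
REQUIRED = {
    "dashscope": ["apiKey", "serviceFQDN", "servicePort", "serviceHost"],
    "dashvector": ["apiKey", "serviceFQDN", "servicePort", "serviceHost",
                   "collection", "topk", "threshold", "field"],
}

def missing_keys(config: dict) -> list[str]:
    """Return a dotted path for every required parameter absent from config."""
    missing = []
    for section, keys in REQUIRED.items():
        present = config.get(section, {})
        missing += [f"{section}.{k}" for k in keys if k not in present]
    return missing

config = {  # placeholder values, mirroring the YAML example in this topic
    "dashscope": {"apiKey": "sk-xxx", "serviceFQDN": "dashscope",
                  "servicePort": 443, "serviceHost": "dashscope.aliyuncs.com"},
    "dashvector": {"apiKey": "sk-xxx", "serviceFQDN": "dashvector",
                   "servicePort": 443, "serviceHost": "vrs-cn-xxx.example",
                   "collection": "my_knowledge_base", "topk": 1,
                   "threshold": 0.4},  # "field" intentionally omitted
}
```

Running `missing_keys(config)` on the dictionary above reports the one omitted parameter, `dashvector.field`.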
Note

When the plug-in is enabled and Tracing Analysis is active, the document ID retrieved by the ai-rag plug-in is added to the attribute field in each span. Use this to trace which documents influenced a specific response.

Example

The following YAML shows a complete ai-rag plug-in configuration:

dashscope:
  apiKey: <your-dashscope-api-key>
  serviceFQDN: dashscope
  servicePort: 443
  serviceHost: dashscope.aliyuncs.com
dashvector:
  apiKey: <your-dashvector-api-key>
  serviceFQDN: dashvector
  servicePort: 443
  serviceHost: <your-dashvector-endpoint>
  collection: <your-collection-name>
  topk: 1
  threshold: 0.4
  field: raw

Replace the following placeholders with your actual values:

Placeholder                 Description                                        Example
<your-dashscope-api-key>    DashScope API key from the Alibaba Cloud console   sk-xxxxxxxxxxxxxxxx
<your-dashvector-api-key>   DashVector API key from the Alibaba Cloud console  sk-xxxxxxxxxxxxxxxx
<your-dashvector-endpoint>  Endpoint of your DashVector cluster                vrs-cn-xxxxxxx.dashvector.cn-hangzhou.aliyuncs.com
<your-collection-name>      Name of the target collection in DashVector        my_knowledge_base

RAG in action: before and after

The following example uses the CEC-Corpus dataset -- 332 labeled emergency news reports, vectorized and stored in a DashVector collection.

Query: "Where did the Hainan rear-end collision occur? What is the reason? What are the casualties?"

Without ai-rag -- the LLM lacks external context and cannot answer:

I'm sorry. As an AI model, I cannot obtain and update specific information about news events in real time, including details such as locations, causes, and casualties. For such specific events, you are advised to view the latest news reports or official announcements for accurate information. You can also visit mainstream media websites, use news applications, or follow the announcements of relevant government departments to obtain the latest news.

With ai-rag -- the LLM retrieves relevant documents and returns a grounded response:

The rear-end collision in Hainan occurred at 37 kilometers from Wenchang to Haikou on the Haiwen Expressway. Regarding the specific cause of the accident, the traffic police department was still under investigation at the time. The casualties were 1 person died (the driver died on the spot) and 8 persons injured (including 2 children and 6 adults). All injured persons were rescued and sent to the hospital for treatment.

What's next

  • Explore other AI-related plug-ins available for cloud-native gateways.