
Alibaba Cloud Model Studio: Optimize RAG performance

Last Updated: Dec 04, 2025

If you encounter incomplete knowledge retrieval or inaccurate content while using the retrieval-augmented generation (RAG) feature in Alibaba Cloud Model Studio, this topic offers suggestions and examples to enhance RAG performance.

1. Introduction to the RAG workflow

RAG is a technique that combines information retrieval with text generation. It allows a model to use relevant information from an external knowledge base when generating answers.

Its workflow consists of several key stages: parsing and chunking, vector storage, retrieval and recall, and answer generation.

image.jpeg

This topic discusses how to optimize RAG performance by focusing on the parsing & chunking, retrieval & recall, and answer generation stages.

2. Optimizing RAG performance

2.1 Preparations

First, ensure that the documents you import into the knowledge base meet the following requirements:

  • Include relevant knowledge: If the knowledge base lacks relevant information, the model will likely be unable to answer related questions. To solve this, update the knowledge base and add the necessary information.

  • Use Markdown format (recommended): PDF files have complex layouts, and their parsing results may not be ideal. We recommend that you first convert PDFs to a text-based format, such as Markdown, DOC, or DOCX. For example, you can use the DashScopeParse tool in Model Studio to convert a PDF to Markdown and then use a model to clean up the formatting, which can also achieve good results.


    The knowledge base currently does not support parsing video or audio content in documents.
  • Use clear wording, a reasonable structure, and no special styles: The content layout of the document also significantly affects RAG performance. For more information, see How should documents be formatted to benefit RAG?.

  • Match the prompt language: If user prompts are mostly in a language that differs from the document content, such as English, we recommend that your document content also be in that language. If necessary, for example for professional terms in the document, you can use two or more languages.

  • Eliminate entity ambiguity: Standardize expressions for the same entity in the document. For example, you can standardize "ML", "Machine Learning", and "machine learning" to "machine learning".

    You can input the document into a model to help you standardize it. If the document is long, you can split it into multiple parts and input them one by one.
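If your documents are long or numerous, you can also script this standardization before importing them. The following is a minimal sketch; the alias map and helper function are hypothetical and not part of Model Studio.

import re

# Hypothetical alias map: each list of variants maps to one canonical form.
ALIASES = {
    "machine learning": ["ML", "Machine Learning", "machine-learning"],
}

def standardize_entities(text: str) -> str:
    # Replace every known variant with its canonical form before importing the document.
    for canonical, variants in ALIASES.items():
        for variant in variants:
            text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    return text

print(standardize_entities("Our ML pipeline trains Machine Learning models."))
# Output: Our machine learning pipeline trains machine learning models.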

These five preparations are the basis for ensuring the effectiveness of subsequent RAG optimization. After you complete them, you can start to understand and optimize each part of the RAG application in depth.

2.2 Parsing and chunking stage

This section introduces only the configuration items that Model Studio supports for optimization in the RAG chunking stage.

First, the documents you import into the knowledge base are parsed and chunked. The main purpose of chunking is to minimize the impact of interfering information during the subsequent vectorization process while maintaining semantic integrity. Therefore, the document chunking strategy you choose when you create a knowledge base has a significant impact on RAG performance. If the chunking method is inappropriate, it may lead to the following problems:

  • Text chunks are too short: Chunks that are too short can have missing semantic information, leading to failed matches during retrieval.

  • Text chunks are too long: Chunks that are too long can include irrelevant topics, causing unrelated information to be returned during recall.

  • Obvious semantic truncation: Chunks with forced semantic truncation can cause content to be missing from the recall.

Therefore, in practical applications, you should try to make the information in text chunks complete while avoiding excessive interfering information. Model Studio recommends that you perform the following actions:

  1. When creating a knowledge base, select Intelligent Chunking as the document chunking method.

  2. After successfully importing the document into the knowledge base, manually check and correct the content of the text chunks.

2.2.1 Intelligent chunking

Choosing the right text chunk length for your knowledge base is not easy because you must consider multiple factors, such as:

  • Document type: For example, for professional literature, increasing the length usually helps to retain more contextual information. For social media posts, shortening the length can capture semantics more accurately.

  • Prompt complexity: Generally, if the user's prompt is complex and specific, you may need to increase the length. Otherwise, shortening the length is more appropriate.

These conclusions do not necessarily apply to all situations. You need to choose the right tools and experiment repeatedly to find the right text chunk length. For example, LlamaIndex provides evaluation functions for different chunking methods. However, this process can be difficult.

If you want to achieve good results quickly, we recommend that you select Intelligent Chunking as the document chunking method when you create a knowledge base.

image

When this strategy is applied, the knowledge base performs the following steps:

  1. First, use the system's built-in sentence delimiters to divide the document into several paragraphs.

  2. Based on the divided paragraphs, adaptively select chunking points according to semantic relevance (semantic chunking), rather than splitting at a fixed length.

During this process, the knowledge base strives to ensure the semantic integrity of each part of the document and tries to avoid unnecessary divisions and chunking. This strategy is applied to all documents in this knowledge base, including subsequently imported documents.
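Conceptually, semantic chunking works like the following sketch: keep appending sentences to the current chunk while they remain semantically similar to it, and start a new chunk when the similarity drops. This is only an illustration of the idea with a toy embedding function, not Model Studio's actual implementation.

import re
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Toy stand-in for a real embedding model: a hashed bag-of-words vector.
    vec = np.zeros(dim)
    for word in re.findall(r"\w+", text.lower()):
        vec[hash(word) % dim] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def semantic_chunk(document: str, threshold: float = 0.3) -> list[str]:
    # 1. Split the document into sentences with simple delimiters.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]
    # 2. Group consecutive sentences by semantic relevance instead of a fixed length.
    chunks, current = [], [sentences[0]]
    for sentence in sentences[1:]:
        if cosine(embed(" ".join(current)), embed(sentence)) >= threshold:
            current.append(sentence)          # same topic: extend the current chunk
        else:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = [sentence]
    chunks.append(" ".join(current))
    return chunks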

2.2.2 Correcting text chunk content

In the actual chunking process, unexpected chunking or other problems can still occur. For example, spaces in the text are sometimes parsed as %20 after chunking.

image

Therefore, Model Studio recommends that you perform a manual check after you import the document into the knowledge base to confirm the semantic integrity and correctness of the text chunk content. If you find unexpected chunking or other parsing errors, you can edit the text chunk directly to correct it. After you save your changes, the original content of the text chunk becomes invalid, and the new content is used for knowledge base retrieval.

Note that this action modifies only the text chunks in the knowledge base, not the original document or data table in your data management (temporary storage). Therefore, if you import it into the knowledge base again later, you must perform another manual check and correction.
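If you find that a parsing artifact such as %20 appears systematically, you can also clean the source text before re-importing it, so that you do not have to repeat the manual correction. The following is a minimal sketch using the Python standard library.

from urllib.parse import unquote

def clean_percent_encoding(text: str) -> str:
    # Decode percent-encoded characters, for example %20 back to a space.
    return unquote(text)

print(clean_percent_encoding("Bailian%20Phone%20X1%20specifications"))
# Output: Bailian Phone X1 specifications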

2.3 Retrieval and recall stage

This section introduces only the configuration items that Model Studio supports for optimization in the retrieval and recall stage.

The main problem in the retrieval and recall stage is that it is difficult to find the text chunks that are most relevant to the prompt and contain the correct answer from the numerous text chunks in the knowledge base. This problem can be broken down into the following types:

  • Problem: In multi-round conversation scenarios, the user's prompt may be incomplete or ambiguous.
    Improvement strategy: Enable multi-round conversation rewriting. The knowledge base automatically rewrites the user's prompt into a more complete form to better match the knowledge.

  • Problem: The knowledge base contains documents from multiple categories. When you search in a document from Category A, the recall results include text chunks from other categories, such as Category B.
    Improvement strategy: We recommend that you add tags to the documents. When the knowledge base retrieves information, it first filters relevant documents based on tags. Only document search knowledge bases support adding tags to documents.

  • Problem: The knowledge base contains multiple documents with similar structures. For example, they all contain a "Function Overview" section. You want to search in the "Function Overview" section of Document A, but the recall results include information from other similar documents.
    Improvement strategy: Extract metadata. The knowledge base uses metadata for a structured search before the vector search to accurately find the target document and extract relevant information. Only document search knowledge bases support creating document metadata.

  • Problem: The recall results of the knowledge base are incomplete and do not include all relevant text chunks.
    Improvement strategy: We recommend that you lower the similarity threshold and increase the number of recalled chunks to recall information that should have been retrieved.

  • Problem: The recall results of the knowledge base contain a large number of irrelevant text chunks.
    Improvement strategy: We recommend that you increase the similarity threshold to exclude information with low similarity to the user's prompt.

2.3.1 Multi-round conversation rewriting

In a multi-round conversation, a user might ask a question with a short prompt, such as "Bailian Phone X1". This prompt may cause the RAG system to lack the necessary context during retrieval for the following reasons:

  • A mobile phone product usually has multiple generations available for sale at the same time.

  • For the same generation of a product, the manufacturer usually offers multiple storage capacity options, such as 128 GB and 256 GB.

    ...

The user may have already provided this key information in previous turns of the conversation. Using this information effectively can help RAG retrieve more accurate information.

To address this situation, you can use the Multi-round Conversation Rewriting feature in Model Studio. The system automatically rewrites the user's prompt into a more complete form based on the conversation history.

For example, a user asks:

Bailian Phone X1.

When the multi-round conversation rewriting feature is enabled, the system rewrites the user's prompt based on their conversation history before retrieval, as shown in the following example:

Provide all available versions of Bailian Phone X1 in the product library and their specific parameters.

This rewritten prompt helps RAG better understand the user's intent and provide a more accurate response.
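When the feature is enabled, Model Studio performs this rewriting for you. Conceptually, it is similar to the following sketch, which asks a model to rewrite the latest prompt using the conversation history; the prompt wording and the llm callable are illustrative assumptions, not the product's internal implementation.

REWRITE_TEMPLATE = """Conversation history:
{history}

Latest user prompt: {prompt}

Rewrite the latest user prompt into a complete, self-contained question,
filling in entities (product model, storage capacity, and so on) mentioned earlier.
Return only the rewritten question."""

def rewrite_prompt(history: list[tuple[str, str]], prompt: str, llm) -> str:
    # llm is any callable that sends text to a chat model and returns its reply.
    rendered = "\n".join(f"{role}: {text}" for role, text in history)
    return llm(REWRITE_TEMPLATE.format(history=rendered, prompt=prompt))

# For example, "Bailian Phone X1" could be rewritten to
# "Provide all available versions of Bailian Phone X1 in the product library and their specific parameters."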

The figure below shows how to enable the multi-round conversation rewriting feature. This feature is also enabled when you select Recommended.

image

Note that the multi-round conversation rewriting feature is attached to the knowledge base. Once enabled, it applies only to queries related to the current knowledge base. If you do not enable this configuration when you create the knowledge base, you cannot enable it later unless you recreate the knowledge base.

2.3.2 Tag filtering

The content of this section applies only to document search knowledge bases.

When you use a music app, you might filter songs by the artist's name to quickly find songs from that artist.

Similarly, adding tags to your unstructured documents introduces additional structured information. This allows the application to filter documents based on tags when retrieving from the knowledge base, which improves retrieval accuracy and efficiency.

Model Studio supports two methods for setting tags:

  • Set tags when uploading documents: For the steps in the console, see Import data.

  • Edit tags on the Data Management page: For uploaded documents, you can click Tag to the right of the document to edit its tags.

    image

Model Studio supports two methods for using tags:

  • When you call an application through an API, you can specify tags in the tags request parameter (see the sketch after this list).

  • Set tags when editing an application in the console. However, this method is applicable only to agent applications.

    Note that this setting applies to all subsequent questions and answers for this agent application.

    image
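For the API method, the following sketch passes tags when calling an application through the DashScope Python SDK. The parameter layout, especially rag_options and its nesting, is an assumption based on this description; verify the exact names against the API reference for calling an application.

import os
from dashscope import Application

response = Application.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    app_id="YOUR_APP_ID",                      # placeholder application ID
    prompt="Function overview of Bailian Phone X1",
    rag_options={"tags": ["phone-products"]},  # assumed nesting: retrieve only documents with this tag
)
print(response.output.text)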

2.3.3 Extracting metadata

The content of this section applies only to document search knowledge bases.

Embedding metadata into text chunks can effectively enhance the contextual information of each chunk. In specific scenarios, this method can significantly improve the RAG performance of document search knowledge bases.

Consider the following scenario:

A knowledge base contains many mobile phone product description documents. The document names are the phone models (such as Bailian X1 and Bailian Zephyr Z9), and all documents include a "Function Overview" chapter.

If metadata is not enabled for this knowledge base, a user might enter the following prompt for retrieval:

Function overview of Bailian Phone X1.

A retrieval test shows which chunks were recalled, as shown in the following figure. Because all documents contain a "Function Overview" section, the knowledge base recalls some text chunks that are unrelated to the query entity (Bailian Phone X1) but are similar to the prompt, such as Chunk 1 and Chunk 2 in the figure. Their rankings are even higher than the text chunk that is actually needed. This clearly affects RAG performance.

The recall results of the retrieval test only guarantee the ranking. The absolute value of the similarity score is for reference only. When the difference in absolute values is small (within 5%), the recall probability can be considered the same.

image

Next, set the phone name as metadata. This is equivalent to adding the corresponding phone name information to the text chunks of each document. You can follow the steps in metadata extraction and then run the same test for comparison.

image

At this point, the knowledge base adds a layer of structured search before the vector search. The process is as follows:

  1. Extract metadata {"key": "name", "value": "Bailian Phone X1"} from the prompt.

  2. Based on the extracted metadata, find all text chunks that contain the "Bailian Phone X1" metadata.

  3. Then, perform a vector (semantic) search to find the most relevant text chunks.
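The following sketch shows this general pattern, not Model Studio's internal code: filter chunks by the extracted metadata first, then rank only the remaining candidates by vector similarity.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query_vec, query_metadata, chunks, top_k=5):
    # chunks: list of dicts such as {"text": ..., "vector": np.ndarray, "metadata": {"name": ...}}
    # 1. Structured search: keep only chunks whose metadata matches the metadata extracted from the prompt.
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in query_metadata.items())
    ]
    # 2. Vector (semantic) search over the filtered candidates only.
    candidates.sort(key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return candidates[:top_k]

# Example: query_metadata = {"name": "Bailian Phone X1"} restricts the search to chunks
# from the Bailian Phone X1 document before similarity ranking.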

The actual recall effect of the retrieval test after enabling metadata is shown in the following figure. As you can see, the knowledge base can now accurately find the text chunk that is related to "Bailian Phone X1" and contains "Function Overview".

image

Another common application of metadata is to embed date information in text chunks to filter for recent content. For more information about metadata usage, see metadata extraction.

2.3.4 Adjusting the similarity threshold

When the knowledge base finds text chunks related to the user's prompt, it first sends them to the Rank model for reordering. You can configure this in the Custom parameter settings when you create the knowledge base. The similarity threshold is then used to filter the reordered text chunks. Only text chunks whose similarity score exceeds this threshold can be provided to the model.

image

Lowering this threshold may recall more text chunks, but it may also cause some less relevant text chunks to be recalled. Increasing this threshold may reduce the number of recalled text chunks.

If it is set too high, as shown in the following figure, it may cause the knowledge base to discard all relevant text chunks. This limits the model's ability to obtain sufficient background information to generate an answer.

image

There is no single best threshold, only the one that is most suitable for your scenario. You need to experiment with different similarity thresholds through retrieval tests, observe the recall results, and find the solution that best suits your needs.
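The following sketch shows the effect of the threshold on an already reranked candidate list; the scores and the threshold values are made up for illustration.

def filter_by_threshold(reranked_chunks, threshold=0.4):
    # reranked_chunks: (chunk_text, similarity_score) pairs, already reordered by the rank model.
    # Only chunks whose score exceeds the threshold remain candidates for the model.
    return [(text, score) for text, score in reranked_chunks if score > threshold]

reranked = [
    ("Bailian Phone X1 battery: 6000 mAh, 100 W fast charging", 0.82),
    ("Bailian Zephyr Z9 battery: 5000 mAh", 0.35),
]
print(filter_by_threshold(reranked, threshold=0.4))   # keeps only the first chunk
print(filter_by_threshold(reranked, threshold=0.9))   # too high: discards every chunk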

Recommended steps for retrieval testing

  1. Design test cases that cover common customer questions.

  2. Choose an appropriate similarity threshold based on the specific application scenario of the knowledge base and the quality of the previously imported documents.

  3. Perform a retrieval test to view the knowledge base recall results.

  4. Based on the recall results, readjust the similarity threshold of your knowledge base. For specific operations, see Create and use a knowledge base.

image

2.3.5 Increasing the number of recalled chunks

The number of recalled chunks is the K value in the multi-channel recall strategy. After the similarity threshold filtering, if the number of text chunks exceeds K, the system selects the K text chunks with the highest similarity scores to provide to the model. Because of this, an inappropriate K value may cause RAG to miss the correct text chunks, which affects the model's ability to generate a complete answer.

For example, in the following case, the user retrieves with the following prompt:

What are the advantages of the Bailian X1 phone?

As you can see in the following diagram, there are a total of 7 text chunks in the target knowledge base that are related to the user's prompt and should be returned. These are marked in green on the left. However, because this number exceeds the currently set maximum number of recalled chunks (K), the text chunks containing advantage 5 (ultra-long standby) and advantage 6 (clear photos) are discarded and not provided to the model.

image

Because RAG cannot determine how many text chunks are needed to provide a complete answer, the model will generate an answer based on the provided chunks, even if they are incomplete.

Many experiments show that in scenarios that require lists, summaries, or comparisons, providing more high-quality text chunks (for example, K=20) to the model is more effective than providing only the top 10 or top 5. Although this may introduce noise, if the text chunk quality is high, the model can usually offset the impact of the noise.

You can adjust the Number Of Recalled Chunks when you edit an application.

image

Note that a larger number of recalled chunks is not always better. Sometimes, after the recalled text chunks are assembled, their total length may exceed the input length limit of the model. This causes some text chunks to be truncated, which in turn affects RAG performance.

Therefore, we recommend that you select Intelligent Assembly. This strategy recalls as many relevant text chunks as possible without exceeding the maximum input length of the model.
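Conceptually, such an assembly strategy keeps adding the highest-ranked chunks until the model's input budget would be exceeded. The following is a minimal sketch with a naive token estimate; Model Studio's internal logic and tokenization may differ.

def assemble_chunks(ranked_chunks: list[str], max_input_tokens: int = 6000) -> list[str]:
    # ranked_chunks: chunk texts sorted by similarity score, best first.
    def estimate_tokens(text: str) -> int:
        return len(text) // 2          # rough placeholder; real tokenizers count differently
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_input_tokens:
            break                      # stop before exceeding the model's input length limit
        selected.append(chunk)
        used += cost
    return selected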

2.4 Answer generation stage

This section introduces only the configuration items that Model Studio supports for optimization in the answer generation stage.

At this point, the model can generate the final answer based on the user's prompt and the content retrieved and recalled from the knowledge base. However, the returned result may still not meet your expectations.

  • Problem: The model does not understand the relationship between the knowledge and the user's prompt. The answer seems to be a clumsy combination of text.
    Improvement strategy: We recommend that you choose a suitable model that can effectively understand the relationship between the knowledge and the user's prompt.

  • Problem: The returned result does not follow the instructions or is not comprehensive.
    Improvement strategy: We recommend that you optimize the prompt template.

  • Problem: The returned result is not accurate enough. It contains the model's own general knowledge and is not entirely based on the knowledge base.
    Improvement strategy: We recommend that you enable rejection to restrict answers to the knowledge retrieved from the knowledge base.

  • Problem: For similar prompts, you want the results to be either consistent or varied.
    Improvement strategy: We recommend that you adjust the model parameters.

2.4.1 Choosing a suitable model

Different models have different capabilities in areas such as instruction following, language support, long text, and knowledge understanding. This can lead to the following situation:

Model A fails to effectively understand the relationship between the retrieved knowledge and the prompt, so the generated answer cannot accurately respond to the user's prompt. Switching to Model B, which has more parameters or stronger professional capabilities, may solve this problem.

You can Select A Model when you edit an application based on your actual needs.

image

We recommend that you choose a commercial model from the Qwen series, such as Qwen-Max or Qwen-Plus. Compared with their open-source versions, these commercial models provide the latest capabilities and improvements.

  • For simple information query and summarization, a model with a small number of parameters is sufficient, such as Qwen-Turbo.

  • If you want RAG to perform more complex logical reasoning, we recommend choosing a model with a larger number of parameters and stronger reasoning capabilities, such as Qwen-Max.

  • If your question requires consulting many document chunks, we recommend choosing a model with a longer context length, such as Qwen-Plus.

  • If the RAG application you are building is for a non-general domain, such as the legal field, we recommend using a model trained for that specific domain, such as Qwen-Legal.

2.4.2 Optimizing the prompt template

A model predicts the next token based on the given text. This means that you can influence the model's behavior, such as how it uses the retrieved knowledge, by adjusting the prompt. This can indirectly improve RAG performance.

The following are three common optimization methods:

Method 1: Constrain the output content

You can provide contextual information, instructions, and the expected output format in the prompt template to guide the model. For example, you can add the following output instruction:

If the information provided is not sufficient to answer the question, please state clearly, "Based on the existing information, I cannot answer this question." Do not invent an answer.

This reduces the chance of the model generating hallucinations.

Method 2: Add examples

Use the Few-Shot Prompting method to add question-and-answer examples to the prompt for the model to imitate. This guides the model to correctly use the retrieved knowledge. The following example uses Qwen-Plus.

Prompt template without an example (the user prompt and the result returned by the application are shown in the figure below the template):

# Requirement
Please extract the technical specifications from the text below and display them in JSON format.
${documents}

image

Prompt template with a Few-Shot example (the user prompt and the result returned by the application are shown in the figure below the template):

# Requirement
Please extract the technical specifications from the text below and display them in JSON format. Please strictly follow the fields given in the example.
${documents}

# Example
## Input: Stardust S9 Pro, a groundbreaking 6.9-inch 1440 x 3088 pixel under-screen camera design, brings a boundless visual experience. The top configuration of 512 GB storage and 16 GB RAM, combined with a 6000 mAh battery and 100 W fast charging technology, allows performance and battery life to go hand in hand, leading the technological trend. Reference price: 5999 - 6499.
## Output: { "product":"Stardust S9 Pro", "screen_size":"6.9inch", "ram_size": "16GB", "battery":"6000mAh" }

image

Method 3: Add content delimiters

If the retrieved text chunks are randomly mixed in the prompt template, it is difficult for the model to understand the structure of the entire prompt. Therefore, we recommend that you clearly separate the prompt and the ${documents} variable.

In addition, to ensure the best effect, make sure that the ${documents} variable appears only once in your prompt template. For more information, see the correct example on the left in the following table.

Correct example:

# Role
You are a customer service representative, focusing on analyzing and solving user problems, and providing accurate solutions by retrieving the knowledge base.

# Requirements
**Return the result directly**: Please provide a direct result based on the user's prompt and the content in the knowledge base, without inference.
**Do not include specific contact information in the returned result**: The returned result should only include a summary of the user's prompt, the names of relevant on-duty personnel, and their responsibilities.
**Default contact**: If no relevant on-duty personnel can be found, please return "On-duty representative today: Model Studio Customer Service 01".

# Knowledge Base
Please remember the following materials, they may be helpful in answering the question.
${documents}

Incorrect example (the ${documents} variable appears twice and is mixed into the role description):

# Role
You are a customer service representative, focusing on analyzing and solving user problems, and providing accurate solutions by retrieving the knowledge base.
Please use the information in ${documents} to assist in answering.

# Requirements
**Return the result directly**: Please provide a direct result based on the user's prompt and the content in the knowledge base, without inference.
**Do not include specific contact information in the returned result**: The returned result should only include a summary of the user's prompt, the names of relevant on-duty personnel, and their responsibilities.
**Default contact**: If no relevant on-duty personnel can be found, please return "On-duty representative today: Model Studio Customer Service 01".

# Knowledge Base
Please remember the following materials, they may be helpful in answering the question.
${documents}
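At request time, the application substitutes the recalled chunks for the ${documents} variable. The following sketch shows that substitution with an explicit delimiter between chunks; the template fragment and delimiter are illustrative, not Model Studio's exact behavior.

TEMPLATE = """# Knowledge Base
Please remember the following materials, they may be helpful in answering the question.
${documents}"""

def fill_documents(template: str, chunks: list[str]) -> str:
    # Join the recalled chunks with a clear delimiter so the model can tell them apart,
    # and substitute the ${documents} variable exactly once.
    return template.replace("${documents}", "\n---\n".join(chunks))

print(fill_documents(TEMPLATE, ["Chunk 1: on-duty schedule ...", "Chunk 2: escalation contacts ..."]))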

To learn more about prompt optimization methods, see Prompt engineering.

2.4.3 Enabling rejection

If you want the results returned by your application to be based strictly on the knowledge retrieved from the knowledge base, and to exclude the influence of the model's own general knowledge, you can set the answer scope to Knowledge base only when you edit the application.

For cases where no relevant knowledge is found in the knowledge base, you can also set a fixed, automatic reply.

  • Answer scope: Knowledge Base + Model Knowledge: The result returned by the application will be a combination of knowledge retrieved from the knowledge base and the model's own general knowledge.

  • Answer scope: Knowledge Base Only: The result returned by the application will be strictly based on the knowledge retrieved from the knowledge base.

To determine the knowledge scope, we recommend that you choose the Search Threshold + LLM Judgment method. This strategy first filters potential text chunks using a similarity threshold. Then, a model acts as a referee, using the Judgment Prompt that you set to conduct an in-depth analysis of the relevance. This further improves the accuracy of the judgment.

image

The following is an example of a judgment prompt for your reference. In this example, when no relevant knowledge is found in the knowledge base, a fixed reply is set: Sorry, no relevant phone model was found.

# Judgment rules:
- The premise for a match between the question and the document is that the entity involved in the question is exactly the same as the entity described in the document.
- The question is not mentioned at all in the document.

User prompt and the result returned by the application (when knowledge is hit):

image

User prompt and the result returned by the application (when knowledge is not hit):

image

2.4.4 Adjusting model parameters

For similar prompts, if you want the results to be consistent or varied each time, you can modify the Parameters to adjust the model parameters when you edit the application.

image

The temperature parameter in the preceding figure controls the randomness of the content generated by the model. The higher the temperature, the more diverse the generated text; the lower the temperature, the more deterministic the generated text.

  • Diverse text is suitable for creative writing (such as novels, advertising copy), brainstorming, chat applications, and other scenarios.

  • Deterministic text is suitable for scenarios with clear answers (such as problem analysis, multiple-choice questions, fact-finding) or requiring precise wording (such as technical documents, legal texts, news reports, academic papers).

The other two parameters are:

  • Maximum response length: This parameter controls the maximum number of tokens generated by the model. You can increase this value to generate detailed descriptions or decrease it to generate short answers.

  • Number of context turns: This parameter controls the number of historical conversation turns that the model refers to. When set to 1, the model does not refer to historical conversation information when answering.
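If you call the model directly through the DashScope Python SDK instead of the console, the same parameters can be set in code. The following is a minimal sketch; parameter names such as temperature and max_tokens are assumed to match the current SDK, so verify them against the API reference.

import os
import dashscope

response = dashscope.Generation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen-plus",
    messages=[{"role": "user", "content": "Summarize the advantages of Bailian Phone X1."}],
    result_format="message",
    temperature=0.3,   # lower values make the wording more deterministic
    max_tokens=512,    # caps the length of the generated answer
)
print(response.output.choices[0].message.content)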

3. FAQ

How should documents be formatted to benefit RAG?

Document content layout recommendations

  • The headings at all levels of the document are distinct, and the content under each heading is clearly expressed.

  • Try not to have watermarks in the document.

  • Try not to nest further levels under an item in the middle of a list.

  • Try not to have tables or pictures in the document. Complex tables will affect the overall document parsing result.

Document heading levels are not clear enough - Example

Original document

The first-level heading is "IV. Prize usage rules:", and the content includes "Prize 1:..." and "Prize 2:...".

image.png

Problems that will occur after processing

"Prize 2:..." is parsed as a subheading of "Prize 1:...". We recommend setting "Prize 1:..." and "Prize 2:..." as numbered, second-level headings in the document.

Document has watermarks - Example

Original document

The document has a watermark, and there are three items in total.

image.png

Problems that will occur after processing

The third item is split into a separate chunk. In addition, because the watermark is recognized as text, the order of the items may be disrupted.

Further levels under an item in the middle of a list - Example

Original document

Under the first-level heading "Activity rules" is an ordered list, in which the third item "Activity introduction" is another list (divided into a and b).

image.png

Problems that will occur after processing

Because the third item, "Activity introduction", is itself a list, it is treated as a second-level heading, and all subsequent content is mistakenly treated as content under this new heading. We recommend that you do not nest lists. If you must, try to place the nested list at the end of the parent list.

A good example

  • The content under each heading is relatively independent and clear.

  • No watermarks.

  • Under the heading is a list, but there are no further levels under the list.

  • No tables or pictures.