When using the retrieval-augmented generation (RAG) feature of applications, you may encounter issues such as incomplete knowledge retrieval or inaccurate content. This topic describes how to optimize the RAG performance.
1. How RAG works
RAG is a technology that combines information retrieval and text generation. It provides relevant information from external knowledge bases to large language models (LLMs) when they generate answers.
The workflow of RAG includes several key stages such as parsing and chunking, vector storage, retrieval and recall, and response generation.
To optimize the RAG performance, you can try improving the following stages: parsing and chunking, retrieval and recall, and response generation.
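To make these stages concrete, here is a minimal, self-contained Python sketch of the workflow. It uses a toy word-overlap score in place of a real embedding model and vector store; all function names and sample data are illustrative and not part of Model Studio.

```python
import re

# Toy end-to-end RAG sketch: chunking, retrieval, and prompt assembly.
# A real system would use an embedding model and a vector store; here a
# simple word-overlap score stands in for vector similarity.

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def chunk(document: str) -> list[str]:
    """Parsing and chunking: split a document into paragraph-sized chunks."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def score(query: str, text: str) -> float:
    """Toy 'similarity': fraction of query words that also appear in the chunk."""
    q = words(query)
    return len(q & words(text)) / max(len(q), 1)

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Retrieval and recall: return the top_k chunks most relevant to the query."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]

def build_prompt(query: str, recalled: list[str]) -> str:
    """Response generation: assemble the recalled knowledge and the question
    into the prompt that is finally sent to the LLM."""
    knowledge = "\n".join(f"- {c}" for c in recalled)
    return f"Answer using only the knowledge below.\n{knowledge}\n\nQuestion: {query}"

doc = "Bailian Phone X1 has a 6.7-inch screen.\n\nBailian Phone X1 ships with 128GB or 256GB storage options."
query = "What storage options does Bailian Phone X1 have?"
print(build_prompt(query, retrieve(query, chunk(doc), top_k=1)))
```

Each of the optimization techniques below improves one of these steps: how the document is chunked, how the right chunks are recalled, and how the final prompt is assembled for the LLM.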
2. Optimize RAG performance
2.1 Preparations
First, ensure that the documents imported into your knowledge base meet the following requirements:
Contain relevant knowledge: If the knowledge base contains no relevant information, the LLM may not be able to answer your questions. The solution is to add the relevant knowledge to the knowledge base.
Use Markdown format (recommended): PDF content has complex formatting and may not parse well. We recommend that you first convert it into a text document in Markdown, DOC, or DOCX format, and then use an LLM to organize the format.
If you want your application to return images in your document, convert the images into links.
Currently, video and audio content in documents is not supported.
Clear expression and structure without special styles: The layout of the document will significantly affect the RAG performance. For more information, see Content layout suggestions.
Use the language of the prompt: If the user's prompt is in English, your document should also be in English. You can also provide necessary content, such as terminology, in two or more languages.
Eliminate ambiguity: Use consistent expressions for concepts.
You can input the document into an LLM and ask it to unify the wording. If the document is long, split it into multiple parts and input them one by one.
These five preparations are the basis for further optimization. Once they are in place, you can move on to optimizing the individual stages of the RAG workflow.
2.2 Parsing and chunking
This section only introduces the configuration items that Model Studio supports.
In the first stage of RAG, the documents you import into the knowledge base are parsed and chunked. The main purpose of chunking is to minimize interference during embedding while maintaining semantic integrity. Therefore, the document splitting strategy you choose when creating the knowledge base has a significant impact on RAG performance. An inappropriate strategy may lead to the following problems:
Chunk too short | Chunk too long | Semantic integrity lost |
Chunks that are too short may lose meaning and fail to match during retrieval. | Chunks that are too long may contain unrelated information, resulting in irrelevant information being recalled. | Chunks whose semantic integrity is broken may cause content to be missed during recall. |
Therefore, in practice, you should make the chunks contain complete but not excessive information. We recommend:
When creating a knowledge base, use the intelligent splitting strategy.
After importing your document, manually check and correct the content of the chunks.
2.2.1 Intelligent splitting
Choosing the appropriate chunk length for your knowledge base is not easy, because you must consider multiple factors:
Type of document: For example, for professional literature, increasing the length usually retains more contextual information. But for social media posts, shortening the length can more accurately capture the meaning.
Complexity of prompt: Generally, if the user's prompt is complex and specific, you should increase the chunk length. Otherwise, shortening the length is better.
...
Of course, the above conclusions may not apply to all situations. You need to choose an appropriate tool and experiment repeatedly. For example, LlamaIndex provides evaluation for different chunking methods.
If you want to achieve good results in a short time, we recommend that you choose Intelligent Splitting when creating a knowledge base. This strategy was summarized by the Model Studio team after extensive evaluation.
When applying this strategy, the knowledge base will:
First, use the system's built-in sentence identifier to divide the document into several paragraphs.
Based on the divided paragraphs, adaptively select chunking points according to semantic relevance rather than fixed length.
In the above process, the knowledge base will ensure as much of the semantic integrity as possible, avoiding unnecessary chunking. This strategy will be applied to all documents in the knowledge base, including documents imported later.
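The sketch below imitates that behavior with a toy relevance measure: adjacent paragraphs are merged while they remain related, and a new chunk starts when relatedness drops. The similarity function and threshold are illustrative assumptions, not the actual Intelligent Splitting implementation.

```python
def related(a: str, b: str) -> float:
    """Toy stand-in for semantic relevance: word overlap between paragraphs."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def adaptive_chunks(paragraphs: list[str], threshold: float = 0.2) -> list[str]:
    """Merge adjacent paragraphs while they stay related; otherwise close the chunk."""
    if not paragraphs:
        return []
    chunks, current = [], paragraphs[0]
    for prev, para in zip(paragraphs, paragraphs[1:]):
        if related(prev, para) >= threshold:
            current += "\n" + para      # same topic: keep the chunk growing
        else:
            chunks.append(current)      # relevance dropped: this is a chunking point
            current = para
    chunks.append(current)
    return chunks

paragraphs = [
    "Bailian Phone X1 overview and positioning.",
    "Bailian Phone X1 overview of camera features.",
    "Warranty policy for all phone models.",
]
print(adaptive_chunks(paragraphs))  # the first two paragraphs end up in one chunk
```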
2.2.2 Correct chunk content
During the actual chunking process, unexpected chunking or other issues may still occur. For example, spaces in the text are sometimes parsed as %20 after chunking.
As a result, we recommend that after importing your document into the knowledge base, you manually check the semantic integrity and correctness of the chunk content. In case of unexpected chunking or other issues, you can directly edit and correct the chunk. After saving, the original content of the chunk will become invalid, and the new content will be used for knowledge base retrieval.
Note: Here, you are only modifying the chunks in the knowledge base, not the original document or data table in the temporary storage of Data Management. Therefore, if you import the document again, you need to check the chunks manually again.
2.3 Retrieval and recall
This section only introduces the configuration items that Model Studio supports.
The main challenge in the retrieval and recall stage is finding, among the many chunks in the knowledge base, those that are most relevant to the prompt and contain the correct information. Common problems and the corresponding strategies:
Problem | Strategy |
In multi-round conversations, the user's prompt may be incomplete or ambiguous. | Enable multi-round conversation rewriting. The knowledge base will automatically rewrite the user's prompt into a more complete form to better match the knowledge. |
The knowledge base contains documents of multiple categories. When retrieving in category A, the recall results contain chunks from category B. | Add tags to the documents. During retrieval, relevant documents will be filtered based on tags first. Only unstructured knowledge bases support document tags. |
The knowledge base contains multiple documents with similar structured content, such as all containing "function overview". When retrieving in the "function overview" of document A, the recall results contain the "function overview" of document B. | Extract metadata. The knowledge base will use metadata for structured search before vector retrieval to accurately find the target document and extract relevant information. Only unstructured knowledge bases support document metadata. |
The recall results from the knowledge base are incomplete and do not contain all relevant chunks. | Lower the similarity threshold and increase the number of recalled chunks. |
The recall results of the knowledge base contain many irrelevant chunks. | Increase the similarity threshold to exclude information with low similarity to the user's prompt. |
2.3.1 Multi-round conversation rewriting
In multi-round conversation scenarios, the user may input a prompt such as "Bailian Phone X1". Such a simple prompt lacks the contextual information needed for retrieval, because:
A phone product usually has multiple versions on sale at the same time.
For the same generation of products, manufacturers usually provide multiple storage capacity options, such as 128GB, 256GB, among others.
...
These key pieces of information may have been given by the user in previous conversations. If effectively utilized, they can help perform more accurate retrieval.
In this case, you can use the multi-round conversation rewriting feature. The system will automatically rewrite the user's prompt into a more complete form based on previous conversations.
For example, the user asks:
Bailian Phone X1.
When the feature is enabled, the system will rewrite the user's prompt based on the conversation history:
Please provide all available versions of Bailian Phone X1 in the product library and their specific parameter information.
The rewritten prompt better reflects the user's intent, making the answer more accurate.
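As a rough sketch of how such a rewriting step can be framed (the instruction wording below is illustrative, not the exact prompt Model Studio uses):

```python
def build_rewrite_prompt(history: list[tuple[str, str]], latest: str) -> str:
    """Pack the conversation history and the latest user input into a rewriting
    instruction, so an LLM can expand the input into a self-contained query
    before it is used for retrieval."""
    turns = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in history)
    return (
        "Rewrite the latest user input into a complete, self-contained question, "
        "filling in entities and constraints mentioned earlier in the conversation.\n\n"
        f"Conversation so far:\n{turns}\n\n"
        f"Latest user input: {latest}\n"
        "Rewritten question:"
    )

history = [("Which Bailian phones are currently on sale?",
            "The Bailian Phone X1 series, in 128GB and 256GB versions.")]
print(build_rewrite_prompt(history, "Bailian Phone X1."))
```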
When creating a knowledge base, you can select Recommend for Configuration Mode to automatically enable the feature. Or select Custom and manually turn on the feature.
Note that the feature is bound to the knowledge base and only takes effect for queries against the current knowledge base. Also, if you did not enable the feature when creating a knowledge base, you cannot enable it later unless you create a new knowledge base.
2.3.2 Tag filtering
This section only applies to unstructured knowledge bases.
Imagine using a music app: you can filter songs by the singer's name to quickly find songs of the same category (the same singer).
Similarly, adding tags to your unstructured documents can introduce additional structured information. This way, when an application retrieves the knowledge base, it will first filter documents based on tags, thereby improving retrieval accuracy and efficiency.
Currently, Model Studio supports setting tags in the following two ways:
Set tags when uploading documents: For steps in the console, see Import data. For the related API, see AddFile.
Edit tags on the Data Management page: For uploaded documents, you can click Tag to the right of the document to edit its tags.
Currently, Model Studio supports using tags in the following two ways:
When calling applications through API, you can specify tags in the tags request parameter.
Set tags when editing the application in the console. This only applies to agent applications.
The settings here will apply to all subsequent conversations of the agent application.
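Conceptually, tag filtering is a structured pre-filter that runs before similarity search. The sketch below illustrates that idea locally; it is not the Model Studio API, and the document structure and field names are assumptions.

```python
documents = [
    {"name": "X1 manual",     "tags": ["phone", "X1"], "text": "Bailian Phone X1 function overview ..."},
    {"name": "Z9 manual",     "tags": ["phone", "Z9"], "text": "Bailian Zephyr Z9 function overview ..."},
    {"name": "Return policy", "tags": ["after-sales"], "text": "Returns are accepted within 7 days ..."},
]

def filter_by_tags(docs: list[dict], required: set[str]) -> list[dict]:
    """Keep only documents that carry every requested tag; only their chunks
    move on to vector retrieval."""
    return [d for d in docs if required.issubset(d["tags"])]

print([d["name"] for d in filter_by_tags(documents, {"phone", "X1"})])  # ['X1 manual']
```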
2.3.3 Extract metadata
This section only applies to unstructured knowledge bases.
Embedding metadata into chunks can effectively enhance the contextual information of each segment. In specific scenarios, this method can significantly improve the RAG performance of unstructured knowledge bases.
Assume the following scenario:
A knowledge base contains many documents about mobile phones. The document names are the models of the phones (such as Bailian X1, Bailian Zephyr Z9, and others). All documents contain a "function overview" section.
The user inputs the following prompt:
Function overview of Bailian Phone X1.
Use hit test to see which chunks are actually recalled during retrieval.
When metadata is not used, since all documents contain "function overview", the knowledge base will recall chunks unrelated to Bailian Phone X1, such as Chunks 1 and 2 in the figure below. Their ranking is even higher than that of the chunk actually needed. This obviously affects the RAG performance.
The recall result of hit test only guarantees ranking, and the similarity value is for reference only. When the value difference is not large (within 5%), the recall probability is basically the same.
Next, set the phone name as metadata, meaning the chunks related to each document carry their respective phone name. Then, test again for comparison.
This time, the knowledge base will add a layer of structured search before vector retrieval, and the complete process is:
Extract the metadata {"key": "name", "value": "Bailian Phone X1"} from the prompt.
Find all chunks containing the "Bailian Phone X1" metadata based on the extracted metadata.
Perform vector retrieval to find the most relevant chunks.
The actual recall performance of hit test after enabling metadata is shown in the figure below. The knowledge base can now accurately find chunks related to "Bailian Phone X1" and containing "function overview".
In addition, a common application of metadata is to embed date information in chunks to filter recent content. For more usage, see Metadata extraction.
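The process above can be sketched as follows. The list of known models, the extraction rule, and the chunk layout are illustrative assumptions, not the actual implementation.

```python
KNOWN_MODELS = ["Bailian Phone X1", "Bailian Zephyr Z9"]

chunks = [
    {"text": "Function overview: the X1 supports ...", "metadata": {"name": "Bailian Phone X1"}},
    {"text": "Function overview: the Z9 supports ...", "metadata": {"name": "Bailian Zephyr Z9"}},
]

def extract_metadata(prompt: str):
    """Step 1: pull a known entity out of the prompt as structured metadata."""
    for model in KNOWN_MODELS:
        if model.lower() in prompt.lower():
            return {"key": "name", "value": model}
    return None

def structured_filter(all_chunks: list[dict], meta: dict) -> list[dict]:
    """Step 2: keep only chunks whose metadata matches, before vector retrieval."""
    return [c for c in all_chunks if c["metadata"].get(meta["key"]) == meta["value"]]

meta = extract_metadata("Function overview of Bailian Phone X1.")
print([c["text"] for c in structured_filter(chunks, meta)])
# Step 3 (vector retrieval) then ranks only these filtered chunks.
```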
2.3.4 Adjust similarity threshold
When the knowledge base finds chunks related to the user's prompt, it first sends them to the ranking model for reranking. The similarity threshold is used to filter the chunks returned after reranking. Only chunks with a similarity score higher than this threshold are provided to the LLM.
Lowering this threshold may recall more chunks, including less relevant ones. Increasing it reduces the number of recalled chunks, and may discard relevant ones.
If set too high, all chunks are discarded and the LLM cannot get the information.
The best threshold depends on your scenario. You need to experiment with different similarity thresholds through hit testing, observe the recall results, and find the most suitable one for your needs.
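In other words, the threshold is a cut-off applied to the reranked scores, as in the sketch below; the chunk names and scores are made up for illustration.

```python
# Reranked chunks with illustrative similarity scores.
reranked = [("chunk A", 0.82), ("chunk B", 0.55), ("chunk C", 0.31)]

def apply_threshold(scored: list[tuple[str, float]], threshold: float) -> list[tuple[str, float]]:
    """Only chunks scoring higher than the threshold are passed on to the LLM."""
    return [(text, s) for text, s in scored if s > threshold]

print(apply_threshold(reranked, 0.20))  # all three chunks pass
print(apply_threshold(reranked, 0.60))  # only chunk A passes; B and C are discarded
print(apply_threshold(reranked, 0.90))  # nothing passes: the LLM gets no knowledge
```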
2.3.5 Increase recalled chunks
The number of recalled chunks is the K value in the multi-channel recall strategy. After filtering through the similarity threshold, if the number of chunks still exceeds K, the system selects the top K chunks with the highest similarity scores and provides them to the LLM. As a result, an inappropriate K value may also cause RAG to miss the correct chunks.
For example, in the example below, the user inputs the following prompt:
What are the advantages of Bailian X1 phone?
As shown in the figure below, the knowledge base contains 7 chunks related to the prompt (marked in green on the left side). But because the number of relevant chunks exceeds the maximum recall number K, chunks 5 and 6 are discarded.
The RAG system itself cannot determine how many chunks are needed to provide a "complete" answer. If the provided chunks are not enough, the LLM will generate an incomplete response.
Extensive experiments show that in scenarios such as "list...", "summarize...", "compare X, Y...", providing more high-quality chunks (such as K=20) is more effective than providing only the top 10 or top 5 chunks. Although this may introduce noise information, if the quality of the chunks is high, the LLM can usually effectively offset the impact of noise information.
You can adjust the Number Of Recalled Chunks when editing your application. For instructions, see Retrieve Configuration (Optional).
Note that a larger number of recalled chunks is not always better. Sometimes the total length of the recalled chunks after assembly exceeds the input length limit of the LLM, and some chunks are truncated. This in turn hurts the RAG performance.
Therefore, we recommend Intelligent Assembly as the assembly strategy. This strategy will recall as many relevant chunks as possible without exceeding the maximum input length of the LLM.
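The idea can be pictured as a greedy fill-up under a length budget, as in the sketch below; the whitespace token estimate and the budget number are simplified assumptions, not the actual Intelligent Assembly logic.

```python
def assemble(recalled_chunks: list[str], max_input_tokens: int) -> list[str]:
    """Greedily add recalled chunks (already sorted by relevance) until the next
    chunk would exceed the model's input budget, instead of letting it be truncated."""
    selected, used = [], 0
    for text in recalled_chunks:
        cost = len(text.split())            # crude token estimate
        if used + cost > max_input_tokens:
            break
        selected.append(text)
        used += cost
    return selected

recalled = ["chunk one " * 100, "chunk two " * 100, "chunk three " * 100]
print(len(assemble(recalled, max_input_tokens=500)))  # only 2 of the 3 chunks fit
```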
2.4 Response generation
This section only introduces the configuration items that Model Studio supports.
At this point, the LLM can already generate the final response based on the user's prompt and the content recalled from the knowledge base. However, the returned result may still not meet your expectations.
Problem | Strategy |
The LLM does not understand the relationship between the knowledge and the user's prompt, and the response lacks coherence. | Choose an appropriate LLM that can effectively understand the relationship between the knowledge and the user's prompt. |
The response does not meet your requirements or is not comprehensive enough. | Optimize the prompt template: add output instructions, examples, and clear separators. |
The response is not accurate and contains the LLM's general knowledge, not entirely based on the knowledge base. | Restrict the response range to the knowledge retrieved from the knowledge base. |
For similar prompts, you want the returned result to be the same or to be distinct each time. | Adjust LLM parameters, such as temperature. |
2.4.1 Choose appropriate LLM
Different LLMs have different capabilities in terms of instruction following, language support, long text, knowledge understanding, among others. This may lead to:
Model A fails to understand the relationship between the retrieved knowledge and the prompt, resulting in an answer that cannot accurately respond to the prompt. After switching to Model B, which has a larger parameter size or stronger professional capabilities, this problem may be solved.
You can Select Model according to your requirements when editing your application.
We recommend commercial Qwen models, such as Qwen-Max and Qwen-Plus. They have the latest capabilities and improvements compared to the open-source models.
If you only need the LLM to perform simple information queries and summaries, a smaller model, such as Qwen-Turbo, will be enough.
If you need more complex logical reasoning, we recommend models with more parameters and stronger reasoning capabilities, such as Qwen-Plus or Qwen-Max.
If your question requires consulting many document chunks, we recommend models with a larger context length, such as Qwen-Plus.
2.4.2 Optimize prompt template
You can influence the behavior of the LLM, such as how to use the retrieved knowledge, by adjusting your prompt. This can indirectly improve the RAG performance.
Here are some common methods:
Method 1: Limit the output
Provide context information, instructions, and expected output formats in the prompt template. For example, add the following output instructions:
If the provided information is not enough to answer the question, clearly state "Based on the existing information, I cannot answer this question". Do not fabricate answers.
This can reduce the chance of LLM hallucination.
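For example, a template that applies this instruction might look like the sketch below, where ${documents} is the variable replaced with the recalled chunks; the rest of the wording is illustrative.

```python
prompt_template = """You are a customer-service assistant.

Knowledge:
${documents}

Answer the user's question using only the knowledge above.
If the provided information is not enough to answer the question, clearly state
"Based on the existing information, I cannot answer this question". Do not fabricate answers."""
```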
Method 2: Add examples
Use the few-shot prompting technique by adding question-and-answer examples that you want the LLM to imitate into the prompt template. This guides the LLM to correctly use the retrieved knowledge.
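As a rough sketch (the question-and-answer pair below is invented for illustration, not an official Model Studio example), a few-shot section of a template might look like:

```python
few_shot_template = """Here is an example of the expected answer style:

Question: What storage options does Bailian Phone X1 offer?
Answer: According to the product library, Bailian Phone X1 is available in 128GB and 256GB versions.

Now answer the user's question in the same style, based only on the knowledge below:
${documents}"""
```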
Method 3: Add separators
If the recalled chunks are randomly mixed into the prompt template, the LLM may fail to understand the structure of the entire prompt. Therefore, you should clearly separate the rest of the prompt from the variable ${documents}.
In addition, to ensure the effect, make sure that the variable ${documents} appears only once in your prompt template.
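As a hedged illustration of the difference (both templates below are invented examples), compare a clearly separated template with one where the recalled content is mixed into the instructions:

```python
# Correct: the recalled content is clearly delimited, and ${documents} appears only once.
correct_template = """# Knowledge
---------------------
${documents}
---------------------

# Task
Answer the user's question based only on the knowledge between the dashed lines."""

# Incorrect: ${documents} is buried in the middle of an instruction sentence, so the
# LLM cannot tell the instructions apart from the retrieved content.
incorrect_template = """Answer the question, and ${documents} should be consulted while keeping the tone polite."""
```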
For more methods, see Prompt engineering.
2.4.3 Restrict response range
If you need the application responses to be strictly based on the knowledge base, excluding the influence of the LLM's own general knowledge, you can set the response range to Knowledge Base Only when editing the application.
If no relevant knowledge is found in the knowledge base, you can set a fixed response.
Knowledge Base + LLM Knowledge | Knowledge Base Only |
The response integrates the knowledge retrieved from the knowledge base and the LLM's own general knowledge. | The response is strictly based on the knowledge retrieved from the knowledge base. |
For Knowledge Range Judgement, we recommend Search Threshold + LLM Judgment. This strategy first filters potential chunks through the similarity threshold, and then an LLM referee conducts an in-depth analysis based on the Judgment Prompt you set. This can further improve the accuracy of the determination.
The following is a judgment prompt for your reference. In addition, set a Fixed Reply in case no knowledge is found: Sorry, no relevant phone model was found.
# Judgment rules:
- If the entity involved in the query and the entity described in the document are exactly the same, the query matches the document.
- If the query is not mentioned in the document at all, the query does not match the document.
For more information, see Retrieve Configuration (Optional).
2.4.4 Adjust LLM parameters
For similar prompts, if you want the returned result to be the same or to be distinct each time, you can adjust the LLM parameters when editing your application.
temperature controls the randomness of the generated content. A higher temperature results in more diversified text, while a lower temperature results in more predictable text.
Diversified text is suitable in creative writing (such as fiction and advertising copy), brainstorming, chatbot, and other scenarios.
Predictable text is suitable in scenarios with clear answers (such as question analysis, multiple-choice question, and factual query) or requiring accurate wording (such as technical documentation, legal text, news report, and academic paper).
Maximum Reply Length controls the maximum number of tokens generated by the LLM. Increase the value if you want to generate detailed descriptions. Decrease the value if you want to generate short answers.
Number of Rounds with Context controls the number of historical conversation rounds referenced by the LLM. If you set the value to 1, the model will not reference conversation history.
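To see why temperature behaves this way, consider the standard temperature-scaled softmax used when sampling the next token. This is a general property of LLM decoding rather than a Model Studio-specific formula, and the logits below are made up.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Lower temperature sharpens the distribution (more predictable text);
    higher temperature flattens it (more diversified text)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print([round(p, 2) for p in softmax_with_temperature(logits, 0.2)])  # almost always the same token
print([round(p, 2) for p in softmax_with_temperature(logits, 1.5)])  # probabilities spread out
```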