In a retrieval-augmented generation (RAG) architecture, a knowledge base serves as an external private data source that provides context for a large language model (LLM). The quality of this data directly affects the accuracy of the generated results. You can create a knowledge base once and reuse it in multiple application flows, which greatly improves development efficiency. Knowledge bases also support manual and scheduled updates to ensure they are current and accurate.
How it works
When you create a knowledge base and update its index, the system starts a workflow that uses a PAI-DLC task to process files in the OSS data source. This workflow consists of three core steps:
Preprocessing and chunking: The system automatically reads the specified source files in OSS. It preprocesses and splits them into text chunks suitable for retrieval. Structured data is chunked by row. Images are not chunked.
Vectorization (Embedding): The system calls an embedding model to convert each text chunk or image into a numerical vector that represents its semantic meaning.
Storage and indexing: The system stores the generated vector data in a vector database and creates an index for efficient retrieval.
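For illustration, this workflow corresponds roughly to the minimal Python sketch below. The `embed` function and `vector_store` object are hypothetical stand-ins and the chunking logic is simplified; this is not the actual PAI-DLC implementation.

```python
# Simplified sketch of the indexing workflow. `embed` and `vector_store` are
# hypothetical stand-ins, not the actual PAI-DLC implementation.

def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters; adjacent chunks share `overlap` characters."""
    step = chunk_size - overlap  # assumes 0 <= overlap < chunk_size
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def build_knowledge_base(documents: list[str], embed, vector_store) -> None:
    for doc in documents:
        chunks = chunk_text(doc)                      # 1. Preprocessing and chunking
        vectors = [embed(chunk) for chunk in chunks]  # 2. Vectorization (embedding)
        vector_store.add(chunks, vectors)             # 3. Storage and indexing
```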
Core flow: From creation to use
Step 1: Create a knowledge base
Go to LangStudio and select a workspace. On the Knowledge Base tab, click Create Knowledge Base. Configure the parameters as described below.
Basic configuration
| Parameter | Description |
| --- | --- |
| Knowledge base name | Enter a custom name for the knowledge base. |
| Data source OSS path | The OSS directory where the knowledge base data is located. |
| Output OSS path | Stores the intermediate results and index information generated from document parsing. The final output depends on the selected vector database type. Important: If the Instance RAM Role set for the runtime is the PAI default role, we recommend that you set this parameter to the same OSS Bucket as the current workspace default storage path and specify any directory within that bucket. |
| Knowledge base type | Select a type based on your files. Each type supports a different set of file formats. |
Configuration for specific types
Document parsing and chunking configuration (for document knowledge bases)
Text chunk size: The maximum number of characters for each text chunk. The default is 1024 characters.
Text chunk overlap: The number of overlapping characters between adjacent text chunks. This ensures retrieval coherence. The default is 200 characters. We recommend setting the overlap size to 10% to 20% of the chunk size.
Note: For more information, see Tune chunk parameters.
Field Configuration (for structured data knowledge bases): Upload a file, such as animal.csv, or add fields manually. You can specify which data fields are used for indexing and retrieval.
Embedding model and database
Select an embedding model service and a vector database for the knowledge base. If the desired option is not in the drop-down list, you can create one. For more information, see Connection Configuration.
| Knowledge base type | Supported embedding types | Supported vector database types |
| --- | --- | --- |
| Documents | | |
| Structured data | | |
| Images | | |
Vector database recommendations:
Production environment: We recommend using Milvus and Elasticsearch. They support processing large-scale vector data.
Test environment: We recommend using FAISS because it does not require a separate database (knowledge base files and generated index files are stored in the Output OSS Path). It is suitable for functional testing or handling small-scale files. Large file volumes will significantly impact retrieval and processing performance.
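As a quick sanity check of the FAISS option, the following is a minimal local sketch. It assumes the `faiss` and `numpy` packages and uses random vectors as stand-ins for real chunk embeddings; it does not reproduce the index layout that LangStudio writes to the Output OSS Path.

```python
# Minimal local FAISS sketch for small-scale testing; vectors are random stand-ins.
import faiss
import numpy as np

dim = 768                                                  # embedding dimension (model dependent)
embeddings = np.random.rand(1000, dim).astype("float32")   # stand-in for real chunk embeddings
faiss.normalize_L2(embeddings)                             # normalize so inner product equals cosine similarity

index = faiss.IndexFlatIP(dim)                     # exact inner-product search, no separate database server
index.add(embeddings)
faiss.write_index(index, "knowledge_base.index")   # persisted as a plain file

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)               # top-5 most similar chunks
```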
Runtime
Select a runtime to perform operations such as document chunk preview and retrieval testing. These operations require access to the vector database and embedding service.
Note the following runtime settings:
If you access the vector database or embedding service through an internal endpoint, ensure that the runtime's VPC is the same as the VPCs of the database and service, or that the VPCs are connected.
If you select a custom role for the instance RAM role, you must grant the role permissions to access OSS. We recommend that you grant the AliyunOSSFullAccess permission. For more information, see Granting permissions to a RAM role.
If your runtime version is earlier than 2.1.4, it may not appear in the drop-down list. To resolve this issue, create a new runtime.
Step 2: Upload files
Method 1: Upload files directly to the OSS path that is configured as the data source for the knowledge base.
Method 2: On the Knowledge Base tab, click the name of the target knowledge base to open its details page. On the Documents or Images tab, upload your files or images.
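Method 1 can also be scripted. Below is a minimal sketch using the OSS Python SDK (`oss2`); the endpoint, bucket name, and object path are placeholders, and credentials are assumed to be set as environment variables.

```python
# Upload a local document into the OSS directory configured as the data source (placeholders throughout).
import os
import oss2

auth = oss2.Auth(os.environ["OSS_ACCESS_KEY_ID"], os.environ["OSS_ACCESS_KEY_SECRET"])
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-kb-bucket")

bucket.put_object_from_file("knowledge-base/data/manual.pdf", "local/manual.pdf")
```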

Step 3: Build the index
Update the index. After uploading the files, click Update Index in the upper-right corner. This submits a PAI workflow task that executes a DLC job to preprocess, chunk, and vectorize the files in the OSS data source and then build an index. The task parameters are described below:
| Parameter | Description |
| --- | --- |
| Computing resources | The computing resources required to execute the task. You can use public resources, or use Lingjun resources and general computing resources through resource quotas.<br>When you update the index for an image knowledge base, the number of nodes must be greater than 2.<br>For high-quality extraction of charts from complex PDFs, we recommend using GPU resources with a driver version of 550 or higher when updating the index. For tasks that meet the resource type and driver version requirements, the system automatically calls a model for chart recognition or OCR. The extracted images are stored in the `chunk_images` directory of the output path. When used in an application flow, images in the text are replaced with an HTML `<img>` tag, such as `<img src="temporary_signed_URL">`. |
| VPC configuration | If you access the vector database or embedding service through an internal network, ensure that the selected VPC is the same as, or connected to, the VPCs of these services. |
| Embedding configuration | Maximum concurrency (for image knowledge bases): The number of concurrent requests to the embedding service. Because the Model Studio multimodal model service limits requests to 120 per minute, increasing this concurrency may trigger throttling.<br>Batch size (for document and structured data knowledge bases): The number of text chunks processed in each batch during vectorization. Setting an appropriate value based on the QPS limit of the model service can improve processing speed. |
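To illustrate how the batch size and request rate interact, the sketch below batches chunks and throttles calls to a hypothetical `call_embedding_service` function; the 16-chunk batch and 120 requests per minute are example values only.

```python
# Batched, rate-limited embedding calls; `call_embedding_service` is a hypothetical stand-in.
import time

def embed_all(chunks, call_embedding_service, batch_size=16, max_requests_per_minute=120):
    vectors = []
    window_start, requests_in_window = time.monotonic(), 0
    for i in range(0, len(chunks), batch_size):
        if requests_in_window >= max_requests_per_minute:
            # Wait for the current one-minute window to end to avoid throttling.
            time.sleep(max(0.0, 60 - (time.monotonic() - window_start)))
            window_start, requests_in_window = time.monotonic(), 0
        batch = chunks[i:i + batch_size]
        vectors.extend(call_embedding_service(batch))  # one request per batch
        requests_in_window += 1
    return vectors
```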
Preview file chunks or images. After the index update task is successful, you can preview the document chunks or images.
Note: For document chunks already stored in Milvus, you can individually set their status to Enabled or Disabled. Disabled chunks will not be retrieved.

Step 4: Test the retrieval effect
After updating the index, switch to the Retrieval Test tab. Enter a query and adjust the retrieval parameters to test the retrieval results. For an image knowledge base, the test returns a list of images.

Retrieval parameters:
Top K: The maximum number of relevant text chunks to retrieve from the knowledge base. The value can range from 1 to 100.
Score threshold: The similarity score threshold, which ranges from 0 to 1. Only chunks with scores above this threshold are returned. A higher value indicates a stricter similarity requirement between the text and the query.
Retrieval Pattern: The default pattern is Dense (vector search). If you want to use Hybrid search (vector and keyword), your vector database must be Milvus 2.4.x or later, or Elasticsearch. For more information about how to choose, see Select a retrieval mode.
Metadata filter condition: Filters the retrieval scope using metadata to improve accuracy. For more information, see Using metadata.
Query rewrite: Uses an LLM to optimize user queries that are vague, colloquial, or contextual. This process clarifies the user's intent by making the queries clearer and more complete, which improves retrieval accuracy. For more information about usage scenarios, see Query rewrite.
Result Reranking: Reranks the initial retrieval results using a reranking model to move the most relevant results to the top. For more information about usage scenarios, see Result Reranking.
Note: Result reranking requires a reranking model. Supported model service connection types include Model Studio LLM service, OpenSearch model service, and general Reranker model service.
Step 5: Use in an application flow
After you complete the test, you can use a knowledge base for information retrieval in an application flow. In the knowledge base node, you can enable the query rewrite and result reranking features and view the rewritten query in the trace.

The result is a List[Dict], where each Dict contains the keys content and score, which represent the content of a document chunk and its similarity score with the input query, respectively.
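As an illustration of how this output can be consumed, the sketch below filters the result by a score threshold, keeps the top K chunks, and joins their content into a context string; the threshold and top_k values are placeholders, not defaults of the knowledge base node.

```python
# Consume the knowledge base node output: a list of dicts with `content` and `score`.
def build_context(results: list[dict], score_threshold: float = 0.5, top_k: int = 5) -> str:
    kept = [r for r in results if r["score"] >= score_threshold]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return "\n\n".join(r["content"] for r in kept[:top_k])

# Example:
# build_context([{"content": "Chunk A ...", "score": 0.82},
#                {"content": "Chunk B ...", "score": 0.41}])
# -> "Chunk A ..."
```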
Retrieval optimization
Tune chunk parameters: Build a foundation for retrieval
Guiding principles
Model context limit: Ensure that the chunk size does not exceed the token limit of the embedding model to avoid information truncation.
Information integrity: Chunks should be large enough to contain complete semantics but small enough to avoid including too much information, which can reduce the precision of similarity calculations. If the text is organized by paragraphs, you can configure the chunking to align with paragraphs to avoid arbitrary splits.
Maintain continuity: Set an appropriate overlap size (we recommend 10% to 20% of the chunk size) to effectively prevent context loss caused by splitting key information at chunk boundaries.
Avoid repetitive interference: Excessive overlap can introduce duplicate information and reduce retrieval efficiency. You need to find a balance between information integrity and redundancy.
Debugging suggestions
Iterative optimization: Start with initial values, such as a chunk size of 300 and an overlap of 50. Then, continuously adjust and experiment with these values based on the actual retrieval and question-answering (Q&A) results to find the optimal settings for your data.
Natural language boundaries: If the text has a clear structure, such as chapters or paragraphs, prioritize splitting based on these natural language boundaries to maximize semantic integrity.
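The sketch below illustrates splitting on paragraph boundaries while respecting a maximum chunk size; a single oversized paragraph falls back to fixed-size splitting. The chunk size of 600 characters is only an example.

```python
# Paragraph-aligned chunking sketch: pack whole paragraphs into chunks of up to chunk_size characters.
def chunk_by_paragraph(text: str, chunk_size: int = 600) -> list[str]:
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= chunk_size:
            current = candidate              # the paragraph still fits in the current chunk
            continue
        if current:
            chunks.append(current)           # close the current chunk at a paragraph boundary
        while len(para) > chunk_size:        # oversized paragraph: fall back to fixed-size splits
            chunks.append(para[:chunk_size])
            para = para[chunk_size:]
        current = para
    if current:
        chunks.append(current)
    return chunks
```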
Quick optimization guide
| Issue | Optimization suggestion |
| --- | --- |
| Irrelevant retrieval results | Increase the chunk size and decrease the chunk overlap. |
| Incoherent context in results | Increase the chunk overlap. |
| No suitable matches found (low recall rate) | Slightly increase the chunk size. |
| High computing or storage overhead | Decrease the chunk size and reduce the chunk overlap. |
The following table lists the recommended chunk and overlap sizes for different types of text.
| Text type | Recommended chunk size (characters) | Recommended chunk overlap (characters) |
| --- | --- | --- |
| Short text (FAQs, summaries) | 100 to 300 | 20 to 50 |
| Regular text (news, blogs) | 300 to 600 | 50 to 100 |
| Technical documents (APIs, papers) | 600 to 1024 | 100 to 200 |
| Long documents (legal, books) | 1024 to 2048 | 200 to 400 |
Select a retrieval mode: Balance semantics and keywords
The retrieval pattern determines how the system matches your query with the content in the knowledge base. Different patterns have unique advantages and disadvantages and are suitable for different scenarios.
Dense (vector) retrieval: Excels at understanding semantics. It converts both the query and documents into vectors and determines semantic relevance by calculating the similarity between these vectors.
Sparse (keyword) retrieval: Excels at exact matching. It is based on traditional term frequency models, such as BM25, and calculates relevance based on the frequency and position of keywords in a document.
Hybrid retrieval: Combines both. It merges the results of vector and keyword retrieval and reranks them using algorithms such as Reciprocal Rank Fusion (RRF) or weighted fusion, such as linear weighting or model ensemble.
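The following is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs (one from dense retrieval, one from keyword retrieval); the constant k=60 is the value commonly used with RRF, not a LangStudio setting.

```python
# Merge two ranked result lists with Reciprocal Rank Fusion (RRF).
def rrf_fuse(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: rrf_fuse(["c3", "c1", "c7"], ["c1", "c9", "c3"]) -> ["c1", "c3", "c9", "c7"]
```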
| Retrieval mode | Pros and cons | Scenarios |
| --- | --- | --- |
| Dense (vector) retrieval | | |
| Sparse (keyword) retrieval | | |
| Hybrid retrieval | | |
Use metadata: Filter retrieval
Value of metadata filtering
Precise retrieval, less noise: Metadata can be used as a filter condition or sorting basis during retrieval. By filtering with metadata, you can exclude irrelevant documents and prevent the generation model from receiving unrelated content. For example, if a user asks to "find sci-fi novels written by Liu Cixin," the system can use the metadata conditions `author=Liu Cixin` and `category=sci-fi` to directly locate the most relevant documents (see the sketch after this list).
Improved user experience:
Supports personalized recommendation: You can use metadata to provide personalized recommendations based on a user's historical preferences, such as a preference for "sci-fi" documents.
Enhances result interpretability: Including document metadata, such as author, source, and date, in the results helps users judge the credibility and relevance of the content.
Supports multilingual or multimodal expansion: Metadata such as "language" and "media type" makes it easy to manage complex knowledge bases that contain multiple languages and mixed text and images.
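In LangStudio the filter condition is evaluated by the vector database (Milvus or Elasticsearch). Purely for illustration, the sketch below shows the effect of such a filter as a Python post-filter; `retrieve` is a hypothetical stand-in for the knowledge base search call, and the `author` and `category` fields come from the example above.

```python
# Illustrative metadata filter applied to retrieval results (the real filter is pushed down to the database).
def retrieve_with_metadata(query, retrieve, filters, top_k=5):
    results = retrieve(query, top_k=top_k * 4)          # over-fetch, then keep only matching metadata
    matched = [
        r for r in results
        if all(r.get("metadata", {}).get(key) == value for key, value in filters.items())
    ]
    return matched[:top_k]

# Example:
# retrieve_with_metadata("find sci-fi novels written by Liu Cixin", retrieve,
#                        {"author": "Liu Cixin", "category": "sci-fi"})
```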
How to use
Feature limitations:
Runtime image version: Must be 2.1.8 or later.
Vector database: Only Milvus and Elasticsearch are supported.
Knowledge base type: Supports documents or structured data. Images are not supported.
1. Configure metadata variables. For knowledge bases that use only Milvus, go to the Metadata section on the Overview tab and click Edit to configure variables, such as `author`. Do not use system-reserved fields.
2. Tag documents. On the document chunk details page, click Edit Metadata to add a metadata variable and value, such as `author=Alex`. You can then return to the overview page to view the metadata reference status and the number of values.
3. Test the filter. On the Retrieval Test tab, add a metadata filter condition and run a test.

Note: The documents retrieved in the image are the documents that were tagged in Step 2.
4. Use in an application flow. Configure the metadata filter condition in the knowledge base node.

Query rewrite and result reranking: Optimize the retrieval chain
Query rewrite
Uses an LLM to rewrite a user's vague, colloquial, or context-dependent query into a clearer, more complete, and independent question. This improves the accuracy of the subsequent retrieval.
Recommended scenarios:
The user's query is vague or incomplete, such as "When was he born?" without context.
In a multi-turn conversation, the query depends on context, such as "What did he do after that?".
The retriever or LLM is not powerful enough to accurately understand the original query.
You are using traditional inverted index retrieval, such as BM25, instead of semantic retrieval.
Scenarios where it is not recommended:
The user's query is already very clear and specific.
The LLM is powerful and can accurately understand the original query.
The system requires low latency and cannot tolerate the additional delay caused by rewriting.
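A minimal sketch of the rewriting step is shown below; `call_llm` is a hypothetical stand-in for the LLM service, and the prompt wording is illustrative only.

```python
# Rewrite a vague or context-dependent query into a self-contained question.
REWRITE_PROMPT = (
    "Rewrite the user's latest question as a single, self-contained question.\n"
    "Resolve pronouns and implicit references using the conversation history.\n\n"
    "History:\n{history}\n\nLatest question: {question}\n\nRewritten question:"
)

def rewrite_query(question: str, history: list[str], call_llm) -> str:
    prompt = REWRITE_PROMPT.format(history="\n".join(history), question=question)
    return call_llm(prompt).strip()

# Example:
# rewrite_query("What did he do after that?",
#               ["User: Tell me about Liu Cixin's early career."], call_llm)
```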
Result reranking
Reranks the initial results returned by the retriever to display the most relevant documents first, which improves the ranking quality.
Recommended scenarios:
The quality of results from the initial retriever, such as BM25 or DPR, is unstable.
The ranking of retrieval results is critical, for example, when Top-1 accuracy is required in search or Q&A systems.
Scenarios where it is not recommended:
System resources are limited and cannot handle the additional inference overhead.
The initial retriever is already powerful enough, and reranking provides limited improvement.
Response time is extremely critical, such as in real-time search scenarios.
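A minimal sketch of the reranking step is shown below; `score_pair` is a hypothetical stand-in for a reranking model that scores a (query, chunk content) pair, and the chunk dicts follow the `content`/`score` format described earlier.

```python
# Rerank retrieved chunks so the most relevant ones come first.
def rerank(query: str, chunks: list[dict], score_pair, top_n: int = 5) -> list[dict]:
    scored = [
        {**chunk, "rerank_score": score_pair(query, chunk["content"])}
        for chunk in chunks
    ]
    return sorted(scored, key=lambda c: c["rerank_score"], reverse=True)[:top_n]
```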
Operations and management
Configure timed scheduling: Automatic updates
The scheduled update feature depends on DataWorks. Make sure that you have activated the service. If the service is not activated, see Activate DataWorks Service.
On the knowledge base details page, open the scheduling configuration in the upper-right corner, complete the configuration, and click Submit.

View scheduling configurations and recurring tasks
After you submit the form, the system automatically creates a scheduled workflow for the knowledge base in the DataWorks Data Development center. The system then publishes the workflow as a recurring task in the DataWorks Operation Center. Recurring tasks take effect on the next day (T+1). The DataWorks recurring task updates the knowledge base at your configured time. You can view the scheduling configuration and recurring tasks on the scheduling configuration page of the knowledge base.

Scheduled update parameters:
Scheduling Cycle: Defines how often the node runs in the production environment. This setting determines the number of recurring instances and their run times.
Scheduling Time: Defines the specific time when the node runs.
Timeout Definition: Defines the maximum duration that the node can run before it times out and exits.
Effective Date: Defines the time range during which the node is automatically scheduled to run. The node is not automatically scheduled outside this range.
Scheduling Resource Group: Specifies the resource group for the DataWorks timed scheduling feature. If you have not created a DataWorks resource group, click Create Now in the drop-down list to open the creation page. After creating the resource group, attach it to the current workspace.

For more information about scheduling parameters, see Time property configuration description.
Multi-version management: Isolate development and production
The version cloning feature lets you publish a tested knowledge base, for example, v1, as a new official version. This isolates the development environment from the production environment.
After you successfully clone a version, you can switch between and manage different versions on the knowledge base details page. You can also select the desired version in the knowledge base retrieval node of an application flow.

Cloning a version is similar to updating an index. This operation submits a workflow task. You can view the task in the operation records.

Troubleshooting: View workflow tasks
After you update an index or clone a version, click Operation Records in the upper-right corner. Select the target task and click View Task in the Actions column to view the execution details of each node in the task, including run information, task logs, and output results.


For example, the workflow task for updating the index of a document knowledge base includes the following three nodes. Except for the read-oss-file node, each node creates a PAI-DLC task. You can also view the DLC task details using the Job URL in the logs.
read-oss-file: Reads OSS files.
rag-parse-chunk: Handles document preprocessing and chunking.
rag-sync-index: Handles text chunk embedding and synchronization to the vector database.
Asset management: View datasets
After the index update task is successful, the system automatically registers the Output OSS Path as a dataset. You can view the dataset in AI Asset Management - Datasets. This dataset has the same name as the knowledge base and records the output information from the index building process.
