Platform For AI: Knowledge base management

Last Updated: Nov 03, 2025

In a retrieval-augmented generation (RAG) architecture, a knowledge base serves as an external private data source that provides context for a large language model (LLM). The quality of this data directly affects the accuracy of the generated results. You can create a knowledge base once and reuse it in multiple application flows, which greatly improves development efficiency. Knowledge bases also support manual and scheduled updates to ensure they are current and accurate.

How it works

When you create a knowledge base and update its index, the system starts a workflow that uses a PAI-DLC task to process files in the OSS data source. This workflow consists of three core steps:

  1. Preprocessing and chunking: The system automatically reads the specified source files in OSS. It preprocesses and splits them into text chunks suitable for retrieval. Structured data is chunked by row. Images are not chunked.

  2. Vectorization (Embedding): The system calls an embedding model to convert each text chunk or image into a numerical vector that represents its semantic meaning.

  3. Storage and indexing: The system stores the generated vector data in a vector database and creates an index for efficient retrieval.
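
For intuition, the following is a minimal Python sketch of these three steps for document files (structured data would be chunked by row and images embedded directly). The helper names (chunk, embed, vector_store) are placeholders for illustration only, not the actual PAI-DLC implementation.

```python
from typing import Callable, Iterable, List, Tuple

def build_index(
    docs: Iterable[Tuple[str, str]],        # (OSS path, file text) pairs read from the data source
    chunk: Callable[[str], List[str]],      # step 1: preprocessing and chunking (placeholder)
    embed: Callable[[str], List[float]],    # step 2: embedding model client (placeholder)
    vector_store,                           # step 3: vector database client (placeholder)
) -> None:
    for path, text in docs:
        for piece in chunk(text):
            vector_store.add(
                vector=embed(piece),                         # semantic vector for this chunk
                payload={"content": piece, "source": path},  # stored for later retrieval
            )
```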

Core flow: From creation to use

Step 1: Create a knowledge base

Go to LangStudio and select a workspace. On the Knowledge Base tab, click Create Knowledge Base. Configure the parameters as described below.

Basic configuration

  • Knowledge base name: Enter a custom name for the knowledge base.

  • Data source OSS path: The OSS directory where the knowledge base data is located.

  • Output OSS path: Stores the intermediate results and index information generated from document parsing. The final output depends on the selected vector database type.

    Important

    If the Instance RAM Role set for the runtime is the PAI default role, we recommend that you use the same OSS bucket as the current workspace default storage path and specify any directory within that bucket.

  • Knowledge base type: Select a type based on your files. The supported file formats for each type are as follows:

    • Documents: Supports .html, .htm, .pdf, .txt, .docx, .md, and .pptx.

    • Structured data: Supports .jsonl, .csv, .xlsx, and .xls.

    • Images: Supports .jpg, .jpeg, .png, and .bmp.

Configuration for specific types

  • Document parsing and chunking configuration (for document knowledge bases)

    • Text chunk size: The maximum number of characters for each text chunk. The default is 1024 characters.

    • Text chunk overlap: The number of overlapping characters between adjacent text chunks. This ensures retrieval coherence. The default is 200 characters. We recommend setting the overlap size to 10% to 20% of the chunk size.

    Note

    For more information, see Tuning chunking parameters.

  • Field Configuration (for structured data knowledge bases): Upload a file, such as animal.csv, or add fields manually. You can specify which data fields are used for indexing and retrieval.

Embedding model and database

Select an embedding model service and a vector database for the knowledge base. If the desired option is not in the drop-down list, you can create one. For more information, see Connection Configuration.

  • Documents and structured data

    • Supported embedding types: Model Studio LLM service, general embedding model, and AI Search: Open Platform Model Service.

    • Supported vector database types: Elasticsearch vector database, Milvus vector database, and FAISS.

  • Images

    • Supported embedding types: Model Studio LLM service and general multimodal embedding model service.

    • Supported vector database types: Elasticsearch vector database and Milvus vector database.

Vector database recommendations:

  • Production environment: We recommend using Milvus and Elasticsearch. They support processing large-scale vector data.

  • Test environment: We recommend using FAISS because it does not require a separate database (knowledge base files and generated index files are stored in the Output OSS Path). It is suitable for functional testing or handling small-scale files. Large file volumes will significantly impact retrieval and processing performance.
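
For reference, a minimal FAISS sketch looks like the following. It only illustrates why FAISS needs no separate database service for small-scale tests; the index files that PAI actually writes to the Output OSS Path may be organized differently.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 768                                                     # must match the embedding model's output dimension
index = faiss.IndexFlatL2(dim)                                # exact, in-memory index; no database service required

chunk_vectors = np.random.rand(1000, dim).astype("float32")   # stand-in for chunk embeddings
index.add(chunk_vectors)

query_vector = np.random.rand(1, dim).astype("float32")       # stand-in for the query embedding
distances, ids = index.search(query_vector, 5)                # retrieve the 5 nearest chunks
print(ids[0], distances[0])
```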

Runtime

Select a runtime to perform operations such as document chunk preview and retrieval testing. These operations require access to the vector database and embedding service.

Note the following runtime settings:

  • If you access the vector database or embedding service through an internal endpoint, ensure that the runtime's VPC is the same as the VPCs of the database and service, or that the VPCs are connected.

  • If you select a custom role for the instance RAM role, you must grant the role permissions to access OSS. We recommend that you grant the AliyunOSSFullAccess permission. For more information, see Granting permissions to a RAM role.

Important

If your runtime version is earlier than 2.1.4, it may not appear in the drop-down list. To resolve this issue, create a new runtime.

Step 2: Upload files

  • Method 1: Upload files directly to the OSS path that is configured as the data source for the knowledge base.

  • Method 2: On the Knowledge Base tab, click the name of the target knowledge base to open its details page. On the Documents or Images tab, upload your files or images.


Step 3: Build the index

  1. Update the index. After uploading the files, click Update Index in the upper-right corner. This submits a PAI workflow task that executes a DLC job to preprocess, chunk, and vectorize the files in the OSS data source and then build an index. The task parameters are described below:

    • Computing resources: The computing resources required to execute the task. You can use public resources, or use Lingjun resources and general computing resources through resource quotas.

      • When updating the index for an image knowledge base, the number of nodes must be greater than 2.

      • For high-quality extraction of charts from complex PDFs, we recommend using GPU resources with a driver version of 550 or higher when updating the index. For tasks that meet the resource type and driver version requirements, the system automatically calls a model for chart recognition or OCR. The extracted images are stored in the chunk_images directory of the output path. When the knowledge base is used in an application flow, images in the text are replaced with an HTML <img> tag, such as <img src="temporary_signed_URL">.

    • VPC configuration: If you access the vector database or embedding service through an internal network, ensure that the selected VPC is the same as, or connected to, the VPCs of these services.

    • Embedding configuration:

      • Maximum concurrency (for image knowledge bases): The number of concurrent requests to the embedding service. Because the Model Studio multimodal model service limits requests to 120 per minute, increasing this concurrency may trigger throttling.

      • Batch size (for document and structured data knowledge bases): The number of text chunks processed in each batch during vectorization. Setting an appropriate value based on the QPS limit of the model service can improve processing speed (see the batching sketch at the end of this step).

  2. Preview file chunks or images. After the index update task is successful, you can preview the document chunks or images.

    Note

    For document chunks already stored in Milvus, you can individually set their status to Enabled or Disabled. Disabled chunks will not be retrieved.

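
The Batch size parameter can be pictured with the following sketch, which sends chunks to the embedding service in batches and throttles requests to stay under a rate limit. The embed_batch client and the max_qps value are illustrative assumptions, not PAI parameters.

```python
import time
from typing import Callable, List

def embed_in_batches(
    chunks: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],  # embedding service client (placeholder)
    batch_size: int = 16,
    max_qps: float = 2.0,                                    # stay under the service's rate limit
) -> List[List[float]]:
    vectors: List[List[float]] = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed_batch(chunks[i:i + batch_size]))  # one request per batch of chunks
        time.sleep(1.0 / max_qps)                              # crude throttling to avoid being rate limited
    return vectors
```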

Step 4: Test the retrieval effect

After updating the index, switch to the Retrieval Test tab. Enter a query and adjust the retrieval parameters to test the retrieval results. For an image knowledge base, the test returns a list of images.


Retrieval parameters:

  • Top K: The maximum number of relevant text chunks to retrieve from the knowledge base. The value can range from 1 to 100.

  • Score threshold: The similarity score threshold, which ranges from 0 to 1. Only chunks with scores above this threshold are returned. A higher value indicates a stricter similarity requirement between the text and the query.

  • Retrieval Pattern: The default pattern is Dense (vector search). If you want to use Hybrid search (vector and keyword), your vector database must be Milvus 2.4.x or later, or Elasticsearch. For more information about how to choose, see Select a retrieval mode.

  • Metadata filter condition: Filters the retrieval scope using metadata to improve accuracy. For more information, see Using metadata.

  • Query rewrite: Uses an LLM to optimize user queries that are vague, colloquial, or contextual. This process clarifies the user's intent by making the queries clearer and more complete, which improves retrieval accuracy. For more information about usage scenarios, see Query rewrite.

  • Result Reranking: Reranks the initial retrieval results using a reranking model to move the most relevant results to the top. For more information about usage scenarios, see Result Reranking.

    Note

    Result reranking requires a reranking model. Supported model service connection types include Model Studio LLM service, OpenSearch model service, and general Reranker model service.

Step 5: Use in an application flow

After you complete the test, you can use a knowledge base for information retrieval in an application flow. In the knowledge base node, you can enable the query rewrite and result reranking features and view the rewritten query in the trace.


The result is a List[Dict], where each Dict contains the keys content and score, which represent the content of a document chunk and its similarity score with the input query, respectively.
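
For example, a retrieval result with Top K = 3 might look like the following. The shape matches the List[Dict] format described above; the contents and scores are made up.

```python
results = [
    {"content": "Text of the chunk most similar to the query ...", "score": 0.87},
    {"content": "Text of the second most similar chunk ...", "score": 0.74},
    {"content": "Text of the third most similar chunk ...", "score": 0.61},
]

# Downstream nodes can post-filter the results, for example with a stricter threshold:
relevant = [r for r in results if r["score"] >= 0.7]
```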

Retrieval optimization

Tune chunk parameters: Build a foundation for retrieval

Guiding principles

  1. Model context limit: Ensure that the chunk size does not exceed the token limit of the embedding model to avoid information truncation.

  2. Information integrity: Chunks should be large enough to contain complete semantics but small enough to avoid including too much information, which can reduce the precision of similarity calculations. If the text is organized by paragraphs, you can configure the chunking to align with paragraphs to avoid arbitrary splits.

  3. Maintain continuity: Set an appropriate overlap size (we recommend 10% to 20% of the chunk size) to effectively prevent context loss caused by splitting key information at chunk boundaries.

  4. Avoid repetitive interference: Excessive overlap can introduce duplicate information and reduce retrieval efficiency. You need to find a balance between information integrity and redundancy.
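
These principles can be made concrete with a simplified character-based chunker. The actual chunker may also split on natural boundaries such as paragraphs; this sketch only shows how chunk size and overlap interact.

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list:
    """Fixed-size character chunking with an overlap of roughly 10% to 20% of the chunk size."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# With chunk_size=300 and overlap=50, the last 50 characters of each chunk are repeated
# at the start of the next chunk, which reduces the chance that a key sentence near a
# boundary is split across chunks and lost during retrieval.
chunks = chunk_text("some long document text " * 100)
```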

Debugging suggestions

  • Iterative optimization: Start with initial values, such as a chunk size of 300 and an overlap of 50. Then, continuously adjust and experiment with these values based on the actual retrieval and question-answering (Q&A) results to find the optimal settings for your data.

  • Natural language boundaries: If the text has a clear structure, such as chapters or paragraphs, prioritize splitting based on these natural language boundaries to maximize semantic integrity.

Quick optimization guide

  • Irrelevant retrieval results: Increase the chunk size and decrease the chunk overlap.

  • Incoherent context in results: Increase the chunk overlap.

  • No suitable matches found (low recall rate): Slightly increase the chunk size.

  • High computing or storage overhead: Decrease the chunk size and reduce the chunk overlap.

The following are the recommended chunk and overlap sizes for different types of text.

  • Short text (FAQs, summaries): chunk size 100 to 300, chunk overlap 20 to 50.

  • Regular text (news, blogs): chunk size 300 to 600, chunk overlap 50 to 100.

  • Technical documents (APIs, papers): chunk size 600 to 1024, chunk overlap 100 to 200.

  • Long documents (legal, books): chunk size 1024 to 2048, chunk overlap 200 to 400.

Select a retrieval mode: Balance semantics and keywords

The retrieval pattern determines how the system matches your query with the content in the knowledge base. Different patterns have unique advantages and disadvantages and are suitable for different scenarios.

  • Dense (vector) retrieval: Excels at understanding semantics. It converts both the query and documents into vectors and determines semantic relevance by calculating the similarity between these vectors.

  • Sparse (keyword) retrieval: Excels at exact matching. It is based on traditional term frequency models, such as BM25, and calculates relevance based on the frequency and position of keywords in a document.

  • Hybrid retrieval: Combines both. It merges the results of vector and keyword retrieval and reranks them using fusion algorithms such as Reciprocal Rank Fusion (RRF) or weighted fusion (for example, linear weighting or a model ensemble).
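
As an illustration of how hybrid retrieval fuses the two result lists, the following is a minimal Reciprocal Rank Fusion sketch. The constant k=60 is a commonly used default; the actual fusion performed by your vector database may differ.

```python
from collections import defaultdict

def rrf_fuse(dense_ids: list, sparse_ids: list, k: int = 60) -> list:
    """Merge two ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)   # documents ranked high in either list win
    return sorted(scores, key=scores.get, reverse=True)

# Example: "doc_b" ranks near the top of both lists, so it is fused to rank 1.
print(rrf_fuse(["doc_a", "doc_b", "doc_c"], ["doc_b", "doc_d", "doc_a"]))
```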

  • Dense (vector) retrieval

    • Pros: Strong semantic understanding. Captures complex semantic relationships such as synonyms and contextual associations. Friendly to complex queries and suitable for long texts and open-domain Q&A.

    • Cons: Not sensitive to keywords and may miss exact term matches. Effectiveness depends on the quality of the embedding model.

    • Scenarios:

      • Open-domain Q&A: Scenarios that require deep semantic understanding, such as academic paper retrieval and general knowledge Q&A.

      • Polysemy and synonym scenarios: The query and document use different words but are semantically related, such as "heart disease" and "myocardial infarction".

      • Long text matching: For example, retrieving paragraphs from technical documents or long reports.

  • Sparse (keyword) retrieval

    • Pros: Precise keyword matching. Results are highly interpretable and easy to debug and optimize.

    • Cons: Cannot understand semantics and handles synonyms or inconsistent wording poorly. Depends on tokenization quality and is sensitive to spelling and tokenization errors.

    • Scenarios:

      • Structured data retrieval: For example, database field queries or table data matching.

      • Scenarios with clear keywords: The user inputs precise terms, such as "IPv6 address format".

      • Low-resource languages: Does not require a high-quality pre-trained vector model, so it is suitable for languages with scarce resources.

  • Hybrid retrieval

    • Pros: Balances semantic understanding and keyword matching. More robust (remains effective even if a single retrieval mode fails) and usually provides the best results.

    • Cons: Requires running two retrieval systems simultaneously, which increases computational cost. Tuning the fusion weights and parameters adds complexity.

    • Scenarios:

      • Complex mixed requirements: Both semantic matching and precise keyword matching are needed, such as medical Q&A, which must understand symptom descriptions and match professional terms.

      • High demand for result diversity: Avoids result homogenization caused by a single mode, such as e-commerce search, which needs to cover both price-sensitive and semantically related user needs.

      • Cold-start phase: When a high-quality vector model is unavailable, mixing in keyword results can improve initial performance.

Use metadata: Filter retrieval

Value of metadata filtering

  1. Precise retrieval, less noise: Metadata can be used as a filter condition or sorting basis during retrieval. By filtering with metadata, you can exclude irrelevant documents and prevent the generation model from receiving unrelated content. For example, if a user asks to "find sci-fi novels written by Liu Cixin," the system can use the metadata conditions author=Liu Cixin and category=sci-fi to directly locate the most relevant documents.

  2. Improved user experience

    • Supports personalized recommendation: You can use metadata to provide personalized recommendations based on a user's historical preferences, such as a preference for "sci-fi" documents.

    • Enhances result interpretability: Including document metadata, such as author, source, and date, in the results helps users judge the credibility and relevance of the content.

    • Supports multilingual or multimodal expansion: Metadata such as "language" and "media type" makes it easy to manage complex knowledge bases that contain multiple languages and mixed text and images.
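
Conceptually, metadata filtering narrows the candidate set before similarity search. The sketch below is illustrative only; in LangStudio you configure the condition in the console rather than writing code, and the field names are examples.

```python
documents = [
    {"content": "The Three-Body Problem ...", "metadata": {"author": "Liu Cixin", "category": "sci-fi"}},
    {"content": "A beginner's cookbook ...", "metadata": {"author": "Someone Else", "category": "cooking"}},
]

condition = {"author": "Liu Cixin", "category": "sci-fi"}

# Keep only documents whose metadata satisfies every key-value pair in the condition.
candidates = [
    doc for doc in documents
    if all(doc["metadata"].get(key) == value for key, value in condition.items())
]
# Only the matching candidates go through vector similarity search, so unrelated
# content never reaches the generation model.
```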

How to use

Important

Feature limitations:

  • Runtime image version: Must be 2.1.8 or later.

  • Vector database: Only Milvus and Elasticsearch are supported.

  • Knowledge base type: Supports documents or structured data. Images are not supported.

  1. Configure metadata variables. This step applies only to knowledge bases that use Milvus. Go to the Metadata section on the Overview tab and click Edit to configure variables, such as author. Do not use system-reserved fields.


  2. Tag documents. On the document chunk details page, click Edit Metadata to add a metadata variable and value, such as author=Alex. You can then return to the overview page to view the metadata reference status and the number of values.


  3. Test the filter. On the Retrieval Test tab, add a metadata filter condition and run a test.

    Note: The retrieved documents are the documents that were tagged in Step 2.

  4. Use in an application flow. Configure the metadata filter condition in the knowledge base node.


Query rewrite and result reranking: Optimize the retrieval chain

Query rewrite

Uses an LLM to rewrite a user's vague, colloquial, or context-dependent query into a clearer, more complete, and independent question. This improves the accuracy of the subsequent retrieval.

  • Recommended scenarios:

    • The user's query is vague or incomplete, such as "When was he born?" without context.

    • In a multi-turn conversation, the query depends on context, such as "What did he do after that?".

    • The retriever or LLM is not powerful enough to accurately understand the original query.

    • You are using traditional inverted index retrieval, such as BM25, instead of semantic retrieval.

  • Scenarios where it is not recommended:

    • The user's query is already very clear and specific.

    • The LLM is powerful and can accurately understand the original query.

    • The system requires low latency and cannot tolerate the additional delay caused by rewriting.
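
Conceptually, query rewrite sends the conversation history and the raw question to an LLM and asks for a standalone question. The following is a minimal sketch in which the prompt wording and the call_llm client are assumptions, not what PAI uses internally.

```python
REWRITE_PROMPT = """Given the conversation history and the latest user question, rewrite the
question so that it is clear, complete, and understandable without the history.

History:
{history}

Latest question: {question}

Standalone question:"""

def rewrite_query(history: str, question: str, call_llm) -> str:
    """call_llm is any text-in, text-out LLM client (placeholder)."""
    return call_llm(REWRITE_PROMPT.format(history=history, question=question)).strip()

# Example: with history about Liu Cixin, "What did he do after that?" might be rewritten
# to "What did Liu Cixin do after publishing The Three-Body Problem?".
```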

Result reranking

Reranks the initial results returned by the retriever to display the most relevant documents first, which improves the ranking quality.

  • Recommended scenarios:

    • The quality of results from the initial retriever, such as BM25 or DPR, is unstable.

    • The ranking of retrieval results is critical, for example, when Top-1 accuracy is required in search or Q&A systems.

  • Scenarios where it is not recommended:

    • System resources are limited and cannot handle the additional inference overhead.

    • The initial retriever is already powerful enough, and reranking provides limited improvement.

    • Response time is extremely critical, such as in real-time search scenarios.
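
In essence, reranking re-scores each (query, chunk) pair with a dedicated reranking model and reorders the initial candidates. The following is a minimal sketch, where rerank_score stands in for a call to whichever reranking model service you connect.

```python
from typing import Callable, Dict, List

def rerank(
    query: str,
    candidates: List[Dict],                      # initial results: {"content": ..., "score": ...}
    rerank_score: Callable[[str, str], float],   # reranking model call (placeholder)
    top_n: int = 5,
) -> List[Dict]:
    rescored = [
        {**cand, "rerank_score": rerank_score(query, cand["content"])}
        for cand in candidates
    ]
    rescored.sort(key=lambda cand: cand["rerank_score"], reverse=True)  # most relevant first
    return rescored[:top_n]
```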

Operations and management

Configure timed scheduling: Automatic updates

Important

The scheduled update feature depends on DataWorks. Make sure that you have activated the service. If the service is not activated, see Activate DataWorks Service.

On the knowledge base details page, click More > Configure Timed Scheduling in the upper-right corner, complete the configuration, and click Submit.


  • View scheduling configurations and recurring tasks

    After you submit the form, the system automatically creates a scheduled workflow for the knowledge base in the DataWorks Data Development center. The system then publishes the workflow as a recurring task in the DataWorks Operation Center. Recurring tasks take effect on the next day (T+1). The DataWorks recurring task updates the knowledge base at your configured time. You can view the scheduling configuration and recurring tasks on the scheduling configuration page of the knowledge base.


  • Scheduled update parameters:

    • Scheduling Cycle: Defines how often the node runs in the production environment. This setting determines the number of recurring instances and their runtimes.

    • Scheduling Time: Defines the specific time when the node runs.

    • Timeout Definition: Defines the maximum duration that the node can run before it times out and exits.

    • Effective Date: Defines the time range during which the node is automatically scheduled to run. The node is not automatically scheduled outside this range.

    • Scheduling Resource Group: Specifies the resource group for the DataWorks timed scheduling feature. If you have not created a DataWorks resource group, click Create Now in the drop-down list to open the creation page. After creating the resource group, attach it to the current workspace.


    For more information about scheduling parameters, see Time property configuration description.

Multi-version management: Isolate development and production

The version cloning feature lets you publish a tested knowledge base, for example, v1, as a new official version. This isolates the development environment from the production environment.

After you successfully clone a version, you can switch between and manage different versions on the knowledge base details page. You can also select the desired version in the knowledge base retrieval node of an application flow.


Cloning a version is similar to updating an index. This operation submits a workflow task. You can view the task in the operation records.


Troubleshooting: View workflow tasks

After you update an index or clone a version, click Operation Records in the upper-right corner. Select the target task and click View Task in the Actions column to view the execution details of each node in the task, including run information, task logs, and output results.


For example, the workflow task for updating the index of a document knowledge base includes the following three nodes. Except for the read-oss-file node, each node creates a PAI-DLC task. You can also view the DLC task details using the Job URL in the logs.

  • read-oss-file: Reads OSS files.

  • rag-parse-chunk: Handles document preprocessing and chunking.

  • rag-sync-index: Handles text chunk embedding and synchronization to the vector database.

Asset management: View datasets

After the index update task is successful, the system automatically registers the Output OSS Path as a dataset. You can view the dataset in AI Asset Management - Datasets. This dataset has the same name as the knowledge base and records the output information from the index building process.
