
Platform For AI: Knowledge base management

Last Updated: Dec 23, 2025

To use a knowledge base node in a LangStudio application flow, you must first create a knowledge base. You only need to create the knowledge base once, and you can reuse it in multiple application flows. A knowledge base acts as an external private data source in a retrieval-augmented generation (RAG) architecture. It reads source documents from Object Storage Service (OSS), preprocesses and chunks the data, and converts the chunks into vectors. The resulting index is then stored in a vector database. This topic describes how to create, configure, and use a knowledge base.

How it works

A LangStudio Knowledge Base transforms files from an OSS data source into a format retrievable by large language models (LLMs). The workflow consists of three core steps, illustrated by the sketch after the following list:

  1. Data reading and chunking: The system reads source files from the OSS location you specify.

    • Unstructured documents are parsed and split into smaller, semantically complete text blocks (chunks).

    • Structured data is chunked by row.

    • Images are processed as a whole without being chunked.

  2. Vectorization (Embedding): The system calls an embedding model to convert each data chunk or image into a numerical vector that represents its semantic meaning.

  3. Storage and indexing: The system stores the vector data in a vector database and creates an index for efficient retrieval.
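The following minimal sketch illustrates these three steps outside of LangStudio. It is a simplified, self-contained Python example: the embed() function is a placeholder for whatever embedding model you call (for example, a Model Studio embedding service), and FAISS stands in for the vector database, matching the quick-test setup described below.

  import numpy as np
  import faiss  # pip install faiss-cpu

  # Step 1 output: assume the source document has already been split into chunks.
  chunks = [
      "In 2020, the net profit reached 28.928 billion CNY, up 2.6% year on year.",
      "Operating income was 153,542 million CNY, an increase of 11.3%.",
      "The cost-to-income ratio fell by 0.50 percentage points to 29.11%.",
  ]

  def embed(texts):
      """Step 2: placeholder for the embedding model call.
      Random vectors are used here, so the scores below are arbitrary."""
      rng = np.random.default_rng(0)
      return rng.random((len(texts), 768), dtype=np.float32)

  # Step 3: store the vectors in a vector index (FAISS here) for retrieval.
  vectors = embed(chunks)
  index = faiss.IndexFlatIP(vectors.shape[1])  # inner-product similarity
  index.add(vectors)

  # Retrieval: embed the query and return the most similar chunks.
  scores, ids = index.search(embed(["What was the 2020 net profit?"]), 2)
  for score, i in zip(scores[0], ids[0]):
      print(round(float(score), 4), chunks[i])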

Getting Started: Create and use a knowledge base

This section shows you how to quickly create a Document-type knowledge base and use it in an Application Flow.

  1. Create a knowledge base. Navigate to LangStudio and select a workspace. On the Knowledge Base tab, click Create Knowledge Base. Configure the following key parameters, leave the rest as default, and then click OK.

    Basic Configuration

      • Name: Enter a custom name for the knowledge base, such as test_kg.

      • Data Source OSS Path: The location where the knowledge base source files are stored. For example: oss://examplebucket.oss-cn-hangzhou-internal.aliyuncs.com/test/original/.

      • Output OSS Path: Stores the intermediate results and index information generated from document parsing. The final output depends on the selected vector database type. For example: oss://examplebucket.oss-cn-hangzhou-internal.aliyuncs.com/test/output/.

        Important

        If the Instance RAM Role set for the runtime is the PAI default role, we recommend that you set this parameter to a directory in the OSS bucket of the current workspace's default storage path.

      • Type: Select Document.

    Embedding Model and Database

      • Embedding Type: Select Alibaba Cloud Model Studio Service (you must create a connection in advance; see Connection Configuration), and then select the created connection and model.

      • Vector Database Type: Select FAISS for quick testing.

  2. Upload files.

    1. On the Knowledge Base tab, click the knowledge base you just created. On the Overview page, switch to the Documents tab. This tab displays the documents from the OSS path you configured as the data source.

    2. You can add or update files using the Upload button on the page, or you can upload files directly to the data source OSS path. For example, upload rag_test_doc.txt through the page. For more information on supported file formats, see Knowledge Base Types.


  3. Update the index. After uploading files, click Update Index in the upper-right corner. In the dialog box that appears, configure the computing resources and network. After the index update task succeeds, the file status changes to Indexed. You can click a file to preview its chunks. For an image knowledge base, the system returns a list of images.

    Note

    For document chunks stored in Milvus, you can set their status to Enabled or Disabled individually. Disabled chunks will not be retrieved during a search.


  4. Run a retrieval test. After the index is updated, switch to the Recall Test tab. Enter a query and tune the retrieval parameters to test the retrieval performance.


  5. Use the knowledge base in an Application Flow. After testing, you can retrieve information from the knowledge base in an application flow. In the knowledge base node, you can enable the query rewriting and result reranking features and view the rewritten query in the execution trace.


    The result is a List[Dict]. Each Dict has content and score keys, which represent the retrieved document chunk and its similarity score with the input query. A short sketch that consumes this structure follows the example below.

    [
      {
        "score": 0.8057173490524292,
        "content": "Due to the uncertainty caused by the pandemic, XX Bank proactively increased provisions for impairment losses on loans, advances, and non-credit assets based on economic trends and forecasts for China or the Chinese mainland. The bank also increased the write-off and disposal of non-performing assets to improve the provision coverage ratio. In 2020, the net profit reached 28.928 billion CNY, a year-on-year increase of 2.6%, and profitability gradually improved.\n(CNY in millions) 2020 2019 Change (%)\nOperating Results and Profitability\nOperating Income 153,542 137,958 11.3\nOperating Profit Before Impairment Losses 107,327 95,816 12.0\nNet Profit 28,928 28,195 2.6\nCost-to-Income Ratio(1)(%) 29.11 29.61 down 0.50 percentage points\nAverage Return on Total Assets (%) 0.69 0.77 down 0.08 percentage points\nWeighted Average Return on Equity (%) 9.58 11.30 down 1.72 percentage points\nNet Interest Margin(2)(%) 2.53 2.62 down 0.09 percentage points\nNote: (1) Cost-to-Income Ratio = Business and management fees / Operating income.",
        "id": "49f04c4cb1d48cbad130647bd0d75f***1cf07c4aeb7a5d9a1f3bda950a6b86e",
        "metadata": {
          "page_label": "40",
          "file_name": "2021-02-04_XX_Insurance_Group_Co_Ltd_XX_China_XX_2020_Annual_Report.pdf",
          "file_path": "oss://my-bucket-name/datasets/chatglm-fintech/2021-02-04__XX_Insurance_Group_Co_Ltd__601318__China_XX__2020_Annual_Report.pdf",
          "file_type": "application/pdf",
          "file_size": 7982999,
          "creation_date": "2024-10-10",
          "last_modified_date": "2024-10-10"
        }
      },
      {
        "score": 0.7708036303520203,
        "content": "7.2 billion CNY, a year-on-year increase of 5.2%.\n2020\n(CNY in millions) Life and Health Insurance Business, Property and Casualty Insurance Business, Banking Business, Trust Business, Securities Business, Other Asset Management Business, Technology Business, Other Business and Consolidation Elimination, Group Consolidated\nNet profit attributable to shareholders of the parent company 95,018 16,083 16,766 2,476 2,959 5,737 7,936 (3,876) 143,099\nMinority interest 1,054 76 12,162 3 143 974 1,567 281 16,260\nNet profit (A) 96,072 16,159 28,928 2,479 3,102 6,711 9,503 (3,595) 159,359\nItems excluded:\n Short-term investment fluctuation(1)(B) 10,308 – – – – – – – 10,308\n Impact of discount rate change (C) (7,902) – – – – – – – (7,902)\n One-time significant items and others excluded by management as not part of daily operating income and expenditure (D) – – – – – – 1,282 – 1,282\nOperating profit (E=A-B-C-D) 93,666 16,159 28,928 2,479 3,102 6,711 8,221 (3,595) 155,670\nOperating profit attributable to shareholders of the parent company 92,672 16,",
        "id": "8066c16048bd722d030a85ee8b1***36d5f31624b28f1c0c15943855c5ae5c9f",
        "metadata": {
          "page_label": "19",
          "file_name": "2021-02-04_XX_Insurance_Group_Co_Ltd_XXX_China_XX_2020_Annual_Report.pdf",
          "file_path": "oss://my-bucket-name/datasets/chatglm-fintech/2021-02-04__XX_Insurance_Group_Co_Ltd__601318__China_XX__2020_Annual_Report.pdf",
          "file_type": "application/pdf",
          "file_size": 7982999,
          "creation_date": "2024-10-10",
          "last_modified_date": "2024-10-10"
        }
      }
    ]
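    If you post-process this output in a downstream Python node, a minimal sketch like the following can filter and assemble the chunks. The results list here is a shortened stand-in for the knowledge base node's actual output, and the 0.7 threshold is an arbitrary example value.

      # `results` stands in for the List[Dict] returned by the knowledge base node.
      results = [
          {"score": 0.81, "content": "Net profit reached 28.928 billion CNY in 2020.",
           "metadata": {"page_label": "40"}},
          {"score": 0.52, "content": "Unrelated paragraph about board meetings.",
           "metadata": {"page_label": "7"}},
      ]

      SCORE_THRESHOLD = 0.7
      relevant = [r for r in results if r["score"] >= SCORE_THRESHOLD]

      # Build a context string for a downstream LLM node, keeping provenance.
      context = "\n\n".join(
          f'[page {r["metadata"].get("page_label", "?")}] {r["content"]}'
          for r in relevant
      )
      print(context)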

Detailed features

Knowledge base type

Knowledge bases are categorized into three types: Document, Structured data, and Image. Choose the knowledge base type that matches your files.

  • Documents: Supports .html, .htm, .pdf, .txt, .docx, .md, and .pptx.

  • Structured data: Supports .jsonl, .csv, .xlsx, and .xls.

  • Images: Supports .jpg, .jpeg, .png, and .bmp.

Special configurations:

  • For Document-type knowledge bases, you must configure Chunk Configuration. The following fields are required. For guidance on setting chunking parameters, see Chunking parameter tuning. A sliding-window sketch showing how the two parameters interact follows this list.

    • Chunk Size: Sets the maximum number of characters for each text chunk. The default is 1024 characters.

    • Chunk Overlap: Sets the number of overlapping characters between adjacent text chunks to ensure contextual continuity. The default is 200 characters.

  • For Structured data-type knowledge bases, you must configure Field Settings. You can upload a file, such as animal.csv, or add fields manually to specify which data fields participate in indexing and retrieval.
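The following sketch shows how Chunk Size and Chunk Overlap interact in a plain sliding-window splitter. It only illustrates the two parameters; it is not the parser that LangStudio actually uses.

  def split_into_chunks(text, chunk_size=1024, chunk_overlap=200):
      """Slide a window of chunk_size characters; adjacent chunks share
      chunk_overlap characters so context is not lost at boundaries."""
      if chunk_overlap >= chunk_size:
          raise ValueError("chunk_overlap must be smaller than chunk_size")
      step = chunk_size - chunk_overlap
      return [text[i:i + chunk_size] for i in range(0, len(text), step)]

  sample = "A" * 2500  # a 2,500-character document
  chunks = split_into_chunks(sample, chunk_size=1024, chunk_overlap=200)
  print([len(c) for c in chunks])  # [1024, 1024, 852, 28]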

Choose a vector database

  • Production environments: We recommend using Milvus or Elasticsearch, which support large-scale vector data processing.

  • Test environments: We recommend using FAISS, which requires no additional database creation. The knowledge base files and generated index files are stored in the Output OSS Path. This is suitable for functional testing or handling small-scale datasets, but performance may degrade significantly with a large volume of data.

    Note

    Image-type knowledge bases do not support FAISS.

Index update strategy

The knowledge base supports the following update methods:

  • Manual update: Manually click Update Index in the console. Suitable for scenarios where files do not change frequently. Each update processes the files in the data source, either fully or incrementally.

  • Automatic update: After you enable automatic updates in the console, the system automatically creates an event rule in EventBridge that forwards OSS file change messages and triggers an indexing task whenever files change.

    Important

    Message fees are incurred by the automatic update service.

    • The rule takes a few minutes to become effective. Wait at least 3 minutes before modifying OSS files.

    • Currently supported regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Ulanqab), and China (Shenzhen).

  • Timed update: Configure a recurring task in DataWorks to update the index at a specified frequency (for example, daily). This feature depends on the DataWorks service. DataWorks recurring tasks typically take effect on a T+1 basis, meaning a configuration made today runs for the first time tomorrow.

The following are the configuration methods:

Manual update

After uploading the file, click Update Index in the upper-right corner. The system submits a PAI workflow task to preprocess, chunk, and vectorize the files in the OSS data source and build the index. The task parameters are as follows:

  • Compute resource: The computing resources used to run the workflow node tasks. You can use public resources, or use Lingjun resources and general computing resources through resource quotas.

    • When updating the index for an Image-type knowledge base, the number of nodes must be greater than 2.

    • To achieve high-quality extraction of charts from complex PDFs, we recommend using GPU resources with a driver version of 550 or later when updating the index. For tasks that meet the resource type and driver version requirements, the system automatically uses models for chart recognition or OCR. The extracted images are stored in the chunk_images directory of the output path. When used in an Application Flow, images in the text are replaced with an HTML <img> tag, such as <img src="temporary-signed-url">.

  • VPC configuration: If you access the vector database or Embedding service over an internal network, ensure that the selected Virtual Private Cloud (VPC) is the same as, or can communicate with, the VPCs of those services.

  • Embedding configuration:

    • Maximum concurrency (required for Image-type knowledge bases): The number of concurrent requests to the Embedding service. Because the Model Studio multi-modal model service limits requests to 120 per minute, increasing this concurrency may trigger rate limiting.

    • Batch size (required for Document and Structured data knowledge bases): The number of text chunks processed in each batch during vectorization. Setting an appropriate value based on the QPS limit of the model service can improve processing speed. A throttled batching sketch follows this list.
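As an illustration of how Batch size bounds each request, the following sketch sends chunks to the embedding service in fixed-size batches and throttles between requests. The embed_batch function is a placeholder for the real service call, and the rate-limit handling is deliberately naive; tune the values to the QPS limit of your model service.

  import time

  def embed_batch(texts):
      """Placeholder for one request to the embedding service."""
      return [[0.0] * 768 for _ in texts]

  def embed_all(chunks, batch_size=16, requests_per_minute=120):
      """Send chunks in batches and sleep between requests to stay under the limit."""
      min_interval = 60.0 / requests_per_minute
      vectors = []
      for start in range(0, len(chunks), batch_size):
          vectors.extend(embed_batch(chunks[start:start + batch_size]))
          time.sleep(min_interval)  # naive throttling between requests
      return vectors

  vectors = embed_all([f"chunk {i}" for i in range(40)], batch_size=16)
  print(len(vectors))  # 40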

Automatic update

  1. Go to the EventBridge console and activate EventBridge.

  2. Configure automatic index updates. Go to the knowledge base details page. On the Overview tab, in the Automatic File Indexing section in the lower-right corner, click Modify.

  3. Configure the computing resources and VPC, and then click OK. After this, file changes automatically trigger indexing tasks without manual intervention.

    Important

    The computing resources configured here are only used when files are updated. No resource fees are incurred if files do not change.

  4. Make changes to OSS files.

    • After you configure automatic file updates, there is a delay of a few minutes for the rule to take effect. We recommend waiting at least 3 minutes before operating on files.

    • To delete a file by using the OSS API, you must specify a version to trigger the change event (see the sketch after this list).

    • To delete a file in the console, select the file and click Permanently Delete at the bottom.

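    The following sketch shows the versioned deletion described above, assuming the Python oss2 SDK and a versioning-enabled bucket; the credentials, endpoint, bucket name, object key, and version ID are placeholders.

      import oss2  # pip install oss2

      # Placeholders: use your own credentials, endpoint, and bucket.
      auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
      bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "examplebucket")

      key = "test/original/rag_test_doc.txt"
      # Obtain the version ID from the OSS console or bucket.list_object_versions().
      version_id = "<object-version-id>"

      # Deleting a specific version triggers the change event that starts
      # the automatic index update.
      bucket.delete_object(key, params={"versionId": version_id})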

  5. View indexing tasks. After a file is changed, wait about 3 minutes. You can then see the automatically triggered index-building task in the operation records list.

Timed update

Important

The scheduled update feature relies on DataWorks. Ensure you have activated this service. If the service is not activated, see Purchasing guide.

On the knowledge base details page, click More > Configure Scheduling in the upper-right corner, complete the configurations, and submit.


  • View scheduling configurations and recurring tasks

    After you submit the form, the system automatically creates a workflow for the scheduled knowledge base update in DataWorks DataStudio and publishes it as a recurring task in DataWorks Operation Center. Recurring tasks currently take effect on a T+1 basis. The DataWorks recurring task updates the knowledge base at the time you configured. You can view the scheduling configuration and recurring tasks on the scheduling configuration page of the knowledge base.


  • Scheduled configuration parameter descriptions:

    • Scheduling cycle: Defines how often the node runs in the production environment (the number of cycle instances generated and the time they run).

    • Scheduled time: Defines the specific time the node runs.

    • Timeout definition: Defines the duration after which a running node will fail and exit.

    • Effective date: Defines the time range during which the node will run on its automatic schedule. Outside this range, the node will no longer be scheduled automatically.

    • Scheduling resource group: Used for the DataWorks scheduled update feature. If you have not yet created a DataWorks resource group, you can click Create Now in the dropdown list to go to the creation page. After creation, you must bind the resource group to the current workspace.


    For more information about scheduling parameters, see Time property configuration description.

View the dataset

After an index update task succeeds, the system automatically registers the Output OSS Path as a dataset. You can view it in AI Asset Management - Datasets. This dataset, which has the same name as the knowledge base, records the output information from the index-building process.


Configure the runtime

Select a runtime to perform operations such as previewing document chunks and running retrieval tests. These operations require access to the vector database and the Embedding service.

Note the following runtime settings:

  • If you access the vector database or Embedding service via an internal network address, ensure the runtime's VPC is the same as or can communicate with theirs.

  • If you select a custom role for the Instance RAM Role, you must grant that role OSS access permissions (we recommend granting the AliyunOSSFullAccess permission). For details, see Granting permissions to a RAM role.

Important

If your runtime version is earlier than 2.1.4, it may not be available for selection in the dropdown list. Create a new runtime instead.

Manage multiple versions

The version cloning feature lets you publish a tested and validated knowledge base (for example, v1) as a new official version and isolate development and production environments.

After cloning a version, you can switch between and manage different versions using the dropdown next to the type on the knowledge base details page. You can also select the desired version in the knowledge base node of an Application Flow.


Cloning a version is similar to updating an index; it submits a workflow task. You can view the task in the operation records.


Configure retrieval parameters

  • Top K: The maximum number of relevant text chunks to retrieve from the knowledge base. The value range is 1 to 100.

  • Score threshold: Sets a similarity score threshold (0 to 1). Only chunks with a score above this threshold are returned. A higher value means a higher required similarity between the text and the query.

  • Retrieval Pattern: The default is Dense (vector) retrieval. If you need to use Hybrid retrieval (vector + keyword), the vector database must be Milvus 2.4.x or later, or Elasticsearch. For guidance on choosing a retrieval mode, see Select a retrieval mode.

  • Metadata filter condition: Filters the search scope using metadata to improve accuracy. For more information, see Use metadata.

  • Query rewrite: Uses a large language model (LLM) to optimize a user's vague, colloquial, or context-dependent original query, making it clearer, more complete, and better aligned with the user's intent to improve retrieval accuracy. For more information about usage scenarios, see Query rewrite.

  • Result Reranking: Uses a reranking model to reorder the initial retrieval results, placing the most relevant results at the top. For more information about usage scenarios, see Result Reranking.

    Note

    Result reranking requires a reranking model. Supported model service connection types include: Model Studio, AI Search Open Platform Model Service, and General Reranker Model Service.

Troubleshooting

When an index update or version cloning task fails, follow these steps to troubleshoot:

  1. View operation records: On the knowledge base details page, find the failed task in the Operation Records and click View Task.

  2. Check task logs: The system redirects you to the PAI workflow page. Check the logs of the failed node. For example, the workflow task for updating the index of a Document-type knowledge base includes the following three nodes. Except for the read-oss-file node, each node creates a PAI-DLC task. You can also view DLC task details through the Job URL in the logs.

    • read-oss-file: Reads OSS files.

    • rag-parse-chunk: Responsible for document preprocessing and chunking.

    • rag-sync-index: Responsible for embedding text chunks and synchronizing them to the vector database.

Optimize retrieval performance

Tune chunking parameters

Guiding principles

  1. Model context limit: Ensure the chunk size does not exceed the token limit of the Embedding model to avoid information truncation.

  2. Information integrity: Chunks should be large enough to contain complete semantic meaning but small enough to avoid including noise that could reduce the precision of similarity calculations. If the text is organized into paragraphs, consider chunking by paragraph instead of splitting text arbitrarily.

  3. Maintain continuity: Set an appropriate overlap size (we recommend 10%-20% of the chunk size) to prevent context loss when key information is split across chunk boundaries.

  4. Avoid repetitive interference: Too much overlap can introduce redundant information and affect retrieval efficiency. Find a balance between information integrity and redundancy.

Debugging suggestions

  • Iterative optimization: Start with an initial value (such as a chunk size of 300 and an overlap of 50) and iterate based on retrieval and Q&A results to find the optimal settings for your data.

  • Natural language boundaries: If your text has a clear structure (for example, divided by chapters or paragraphs), consider splitting it along its natural boundaries to preserve semantic integrity.

Quick optimization guide

  • Retrieval results are irrelevant: Increase the chunk size and decrease the chunk overlap.

  • Result context is not coherent: Increase the chunk overlap.

  • Cannot find a suitable match (low recall): Moderately increase the chunk size.

  • High compute or storage costs: Decrease the chunk size and the chunk overlap.

The following recommended chunk and overlap sizes for different types of text are based on past experience:

  • Short text (FAQ, summary): chunk_size 100 to 300; chunk_overlap 20 to 50.

  • Regular text (news, blog): chunk_size 300 to 600; chunk_overlap 50 to 100.

  • Technical documents (API, paper): chunk_size 600 to 1024; chunk_overlap 100 to 200.

  • Long documents (legal, book): chunk_size 1024 to 2048; chunk_overlap 200 to 400.

Choose a retrieval mode: Balance semantics and keywords

The retrieval mode determines how the system matches a query to content in the knowledge base. Each mode has strengths suited to different scenarios.

  • Dense (vector) retrieval: Excels at understanding semantics. It converts both the query and documents into vectors and determines semantic relevance by calculating the similarity between these vectors.

  • Sparse (keyword) retrieval: Excels at exact matching. Based on traditional term frequency models (like BM25), it calculates relevance based on the frequency and position of keywords in a document.

  • Hybrid retrieval: Combines the best of both. It merges the results of vector search and keyword search and re-ranks them using algorithms such as Reciprocal Rank Fusion (RRF) or weighted fusion, as illustrated in the sketch after this list.
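As an illustration of how the two result lists can be merged, the following sketch implements Reciprocal Rank Fusion over two ranked lists of chunk IDs. It is a generic example rather than the fusion logic of any particular vector database.

  def reciprocal_rank_fusion(ranked_lists, k=60):
      """Score each document by summing 1 / (k + rank) over all result lists."""
      scores = {}
      for ranking in ranked_lists:
          for rank, doc_id in enumerate(ranking, start=1):
              scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
      return sorted(scores, key=scores.get, reverse=True)

  dense_hits = ["chunk_12", "chunk_07", "chunk_33"]   # from vector search
  sparse_hits = ["chunk_07", "chunk_45", "chunk_12"]  # from keyword (BM25) search
  print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
  # ['chunk_07', 'chunk_12', 'chunk_45', 'chunk_33'] -- chunks found by both modes rise to the top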

The following summarizes the pros, cons, and typical scenarios of each retrieval mode:

  • Dense (vector) retrieval

    • Pros: Strong semantic understanding; captures complex relationships such as synonyms and context. Handles complex queries well and is suitable for long text and open-domain Q&A.

    • Cons: Insensitive to keywords and may miss exact term matches; performance depends on the quality of the Embedding model.

    • Typical scenarios:

      • Open-domain Q&A: Scenarios requiring deep semantic understanding (such as academic paper retrieval or general knowledge Q&A).

      • Polysemy/synonym scenarios: When the query and document use different words with related meanings (for example, "heart disease" vs. "myocardial infarction").

      • Long text matching: Retrieving paragraphs from technical documents or long reports.

  • Sparse (keyword) retrieval

    • Pros: Precise keyword matching; results are highly interpretable and easy to debug.

    • Cons: Cannot understand semantics and performs poorly with synonyms or inconsistent terminology. Relies on tokenization quality and is sensitive to spelling and tokenization errors.

    • Typical scenarios:

      • Structured data retrieval: Such as database field queries or table data matching.

      • Keyword-specific scenarios: When users input precise terms (for example, "IPv6 address format").

      • Low-resource languages: Does not require high-quality pre-trained vector models; suitable for languages with scarce resources.

  • Hybrid retrieval

    • Pros: Balances semantic understanding and keyword matching, is more robust (maintains performance even if one mode fails), and generally yields the best results.

    • Cons: Requires running two retrieval systems simultaneously, leading to higher compute costs. Requires tuning fusion weights and parameters, which can be complex.

    • Typical scenarios:

      • Complex mixed requirements: When both semantic matching and exact keyword matching are needed (for example, medical Q&A that must understand symptom descriptions and also match professional terms).

      • High demand for result diversity: To avoid the homogeneity of a single retrieval mode (for example, e-commerce search that must cover both "price-sensitive" and "semantically related" user needs).

      • Cold-start phase: When high-quality vector models are unavailable, mixing in keyword results can improve initial performance.

Use metadata: Filter retrieval

Value of metadata filtering

  1. Precise retrieval, less noise: Metadata can serve as a filter during retrieval, letting you exclude irrelevant documents and prevent the generation model from receiving unrelated content. For example, when a user asks, "Find science fiction novels written by Liu Cixin," the system can use the metadata conditions author=Liu Cixin and category=science fiction to directly locate the most relevant documents (see the sketch after this list).

  2. Improved user experience

    • Personalized recommendations: Use metadata to provide personalized recommendations based on a user's historical preferences (such as a preference for "sci-fi" documents).

    • Improved interpretability: Including a document's metadata (such as author, source, date) in the results helps users judge its credibility and relevance.

    • Supports multilingual or multimodal expansion: Metadata like "language" or "media type" simplifies managing knowledge bases with multiple languages or mixed media such as text and images.
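As a schematic example of how such a filter narrows the candidate set, the following sketch applies the author and category conditions from the example above to chunk metadata in plain Python. In LangStudio you configure the equivalent condition in the console or in the knowledge base node rather than writing code; the chunk contents here are invented for illustration.

  chunks = [
      {"content": "The Three-Body Problem, a science fiction novel ...",
       "metadata": {"author": "Liu Cixin", "category": "science fiction"}},
      {"content": "Essays on rural life ...",
       "metadata": {"author": "Someone Else", "category": "essay"}},
  ]

  filter_condition = {"author": "Liu Cixin", "category": "science fiction"}

  def matches(chunk, condition):
      """Keep a chunk only if every metadata field equals the required value."""
      return all(chunk["metadata"].get(k) == v for k, v in condition.items())

  candidates = [c for c in chunks if matches(c, filter_condition)]
  # Only the matching chunks go on to vector similarity search.
  print([c["metadata"]["author"] for c in candidates])  # ['Liu Cixin']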

How to use

Important

Feature limitations:

  • Runtime image version: Must be 2.1.8 or later.

  • Vector database: Only Milvus and Elasticsearch are supported.

  • Knowledge base type: Supports documents or structured data. Images are not supported.

  1. Configure metadata variables. For knowledge bases using Milvus, find the Metadata section on the Overview tab. Click Edit to configure variables (for example, a variable named author). Do not use system-reserved fields.


  2. Tag documents. Go to the document chunk details page, click Edit Metadata, and add the metadata variable and its value (for example, author=Alex). On the Overview page, you can see the metadata usage and value count.


  3. Test the filtering effect. On the Recall Test tab, add a metadata filter condition and run a test.


    Note: The documents retrieved in this test are the ones you tagged in step 2.

  4. Use in an application flow. Configure the metadata filter condition in the knowledge base node.


Query rewriting and result reranking: optimize the retrieval chain

Query rewrite

Uses an LLM to rewrite a user's vague, colloquial, or context-dependent query into a clearer, more complete, standalone question, improving subsequent retrieval accuracy. A prompt sketch follows the lists below.

  • Recommended scenarios:

    • The user's query is vague or incomplete (e.g., "When was he born?" without context).

    • In a multi-turn conversation, the query depends on context (e.g., "What did he do after that?").

    • The retriever or LLM performs poorly and does not accurately understand the original query.

    • You are using a traditional inverted index retrieval method (like BM25) instead of semantic retrieval.

  • Not recommended for:

    • The user's query is already very clear and specific.

    • The LLM performs very well and has a strong understanding of the original query.

    • The system requires low latency and cannot afford the additional delay from the rewrite.
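A minimal sketch of the idea: an LLM is prompted to merge the conversation history into a standalone query before retrieval. The call_llm function is a placeholder for whichever chat model connection you use; LangStudio's built-in query rewrite is configured on the knowledge base node rather than hand-coded like this.

  def call_llm(prompt):
      """Placeholder for a chat-model call (for example, via Model Studio)."""
      return "When was Liu Cixin born?"

  def rewrite_query(history, query):
      """Ask the LLM to turn a context-dependent query into a standalone one."""
      prompt = (
          "Rewrite the final user question as a complete, standalone question, "
          "using the conversation history to resolve pronouns and omissions.\n\n"
          f"History:\n{history}\n\nQuestion: {query}\n\nRewritten question:"
      )
      return call_llm(prompt)

  history = "User: Who wrote The Three-Body Problem?\nAssistant: Liu Cixin."
  print(rewrite_query(history, "When was he born?"))
  # The rewritten, standalone question is what gets embedded and retrieved.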

Result reranking

Reorders the initial results returned by the retriever to prioritize the most relevant documents, improving ranking quality. A schematic sketch follows the lists below.

  • Recommended scenarios:

    • The initial retriever provides unstable results (such as from BM25 or DPR).

    • The ranking of retrieval results is critical (such as requiring high Top-1 accuracy in search or Q&A systems).

  • Not recommended for:

    • The system has limited resources and cannot afford the additional inference overhead.

    • The performance of the initial retriever is already strong enough, and reranking provides limited improvement.

    • Response time is critical, such as in real-time search scenarios.
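A schematic sketch of the reranking step: a reranking model scores each query/chunk pair, and the initial results are reordered by that score before being passed to the LLM. The rerank_score function below is a trivial keyword-overlap stand-in for a real reranking model service.

  def rerank_score(query, text):
      """Placeholder reranker: count query words that appear in the text."""
      return sum(word in text.lower() for word in query.lower().split())

  def rerank(query, results, top_k=3):
      """Reorder the retriever's initial results by the reranker's score."""
      scored = [(rerank_score(query, r["content"]), r) for r in results]
      scored.sort(key=lambda pair: pair[0], reverse=True)
      return [r for _, r in scored[:top_k]]

  initial_results = [
      {"content": "Operating income rose 11.3% in 2020."},
      {"content": "Net profit reached 28.928 billion CNY in 2020."},
      {"content": "The board held four meetings."},
  ]
  for r in rerank("2020 net profit", initial_results):
      print(r["content"])  # the net-profit chunk is promoted to the top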