
OpenSearch: Manage evaluation tasks

Last Updated: Nov 04, 2025

You can use the performance evaluation module to assess the Retrieval-Augmented Generation (RAG) development pipeline provided by the AI Search Open Platform. The evaluation covers the entire process, from a user's question to content retrieval by the RAG system and answer generation by the large language model (LLM).

Prerequisites

Activate the AI Search Open Platform service. For more information, see Activate the service.

Procedure

  1. Log on to the AI Search Open Platform console.

  2. Select the Shanghai region, switch to AI Search Open Platform, and then switch to the target workspace.

    Note
    • Currently, the AI Search Open Platform feature is available only in the Shanghai and Germany (Frankfurt) regions.

    • Users in the Hangzhou, Shenzhen, Beijing, Zhangjiakou, and Qingdao regions can call the AI Search Open Platform service across regions using a VPC address.

    • Workspaces are used to isolate and manage data. After you activate the AI Search Open Platform service for the first time, the system automatically creates a Default workspace. You can also create additional workspaces.

  3. In the navigation pane on the left, choose Effect Evaluation and then click Create Evaluation Task.

  4. On the Create Evaluation Task page, enter a task name and upload an evaluation dataset that follows the format of the provided Sample data.

    Important
    • An evaluation dataset can contain a maximum of 200 valid data entries. If you exceed this limit, the system reports an error.

    • You must strictly follow the sample template when you upload the evaluation dataset. The reference answer is optional. However, within a single dataset, either all questions must include a reference answer or none may.


    The following parameters make up the evaluation template, and their descriptions explain the key evaluation metrics. A hedged example of assembling such a dataset, along with a sketch of the scoring rules described below, follows these descriptions.

    question: Your question.

    standard_answer: The reference answer. This parameter can be empty, and whether it is provided determines which evaluation metrics are returned in the report.

    • If reference answers are provided, the evaluation metrics are as follows:

      • Faithfulness: Whether the model-generated answer is grounded in the retrieved documents. The value is 1 if the answer contains no hallucinations and 0 if it does.

      • Context Precision: How accurately the retrieved documents match the reference answer. The value is 1 for accurate and 0 for inaccurate.

      • Context Recall: How completely the retrieved documents cover the reference answer. The value is 1 for complete retrieval and 0 for incomplete retrieval.

      • Satisfaction: A comparison between the model-generated answer and the reference answer:

        • If the model-generated answer has no hallucinations and is accurate and complete, the satisfaction score is 1.

        • If the model-generated answer has no hallucinations but the information is inaccurate or incomplete, the satisfaction score is 0.5.

        • If the model-generated answer has hallucinations, the satisfaction score is 0.

      • Comprehensive Score: A combined score of faithfulness, context precision, context recall, and satisfaction.

    • If no reference answers are provided, the evaluation metrics are as follows:

      • Context Relevance: The relevance between the question and the retrieved documents. The value is 1 for relevant and 0 for irrelevant.

      • Credibility: The credibility of the model-generated answer in relation to the question.

        • If the model-generated answer has no hallucinations and is generated based on relevant retrieved results (or if the answer is "cannot answer" when no relevant results are retrieved), the credibility score is 1.

        • If the model-generated answer has no hallucinations but is based on irrelevant retrieved results, or if the answer is "cannot answer" despite relevant results being retrieved, the credibility score is 0.5.

        • If the model-generated answer has hallucinations, the credibility score is 0.

      • Faithfulness: Whether the model-generated answer is grounded in the retrieved documents. The value is 1 if the answer contains no hallucinations and 0 if it does.

      • Comprehensive Score: A combined score of context relevance, faithfulness, and credibility.

    recall_docs: The retrieved documents.

    model_answer: The answer generated by the model.
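
    The following Python sketch illustrates, under assumptions, how the documented constraints fit together: the four parameters described above, the 200-entry limit, and the rule that reference answers must be provided for all questions or for none. The in-memory representation, helper names, and sample values are assumptions made only for this example; the actual upload must follow the sample template provided in the console.

      # Illustrative sketch only: the real upload must follow the console's
      # sample template. The field names come from the parameter descriptions
      # above; everything else here is an assumption.
      from typing import Any, Dict, List

      MAX_ENTRIES = 200  # documented limit on valid data entries per dataset


      def validate_dataset(entries: List[Dict[str, Any]]) -> None:
          """Check the documented constraints before uploading."""
          if len(entries) > MAX_ENTRIES:
              raise ValueError(f"{len(entries)} entries exceed the limit of {MAX_ENTRIES}.")

          # Reference answers are optional, but a single dataset must be
          # consistent: either every question has a standard_answer or none does.
          has_answer = [bool(e.get("standard_answer")) for e in entries]
          if any(has_answer) and not all(has_answer):
              raise ValueError("Provide standard_answer for all questions or for none.")

          for i, e in enumerate(entries):
              for field in ("question", "recall_docs", "model_answer"):
                  if not e.get(field):
                      raise ValueError(f"Entry {i} is missing the {field} field.")


      # Example entry that includes a reference answer.
      sample_entry = {
          "question": "In which regions is AI Search Open Platform available?",
          "standard_answer": "Shanghai and Germany (Frankfurt).",
          "recall_docs": ["AI Search Open Platform is available in the Shanghai and Germany (Frankfurt) regions."],
          "model_answer": "It is available in the Shanghai and Germany (Frankfurt) regions.",
      }

      validate_dataset([sample_entry])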

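    The scoring rules for Satisfaction and Credibility can be summarized as in the following minimal sketch. The boolean parameter names are assumptions introduced for illustration; the platform derives these judgments itself during evaluation.

      # Illustrative mapping of the documented scoring rules. The parameter
      # names are assumptions; the evaluation service decides whether an answer
      # hallucinates, is accurate, is complete, and so on.

      def satisfaction_score(has_hallucination: bool, is_accurate: bool, is_complete: bool) -> float:
          """Satisfaction: used when reference answers are provided."""
          if has_hallucination:
              return 0.0  # hallucinated answers score 0
          if is_accurate and is_complete:
              return 1.0  # no hallucinations, accurate, and complete
          return 0.5      # no hallucinations, but inaccurate or incomplete


      def credibility_score(has_hallucination: bool, relevant_docs_retrieved: bool, answered: bool) -> float:
          """Credibility: used when no reference answers are provided."""
          if has_hallucination:
              return 0.0  # hallucinated answers score 0
          if answered == relevant_docs_retrieved:
              # Answered when relevant results were retrieved, or replied
              # "cannot answer" when no relevant results were retrieved.
              return 1.0
          return 0.5      # answered from irrelevant results, or refused despite relevant results
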
  5. After you configure the parameters, click OK to create the evaluation task.

    The following are the evaluation task statuses:

    • Evaluating or Failed: You can delete the evaluation task.

    • Successful: You can download the evaluation report as an Excel file (a sketch of reading the downloaded file follows this list). The report has two parts:

      • Sheet1 - Evaluation Task: Provides an overview of the evaluation task. This sheet shows the average metric values calculated from all successfully evaluated questions.

      • Sheet2 - Task Details: Provides detailed evaluation data for each question.

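    The following Python sketch shows one way to load a downloaded report and recompute the per-metric averages that appear on the overview sheet. It assumes that the pandas library is installed, uses a hypothetical local file name, and assumes that the Task Details sheet contains one numeric column per metric; check the exported file for the exact sheet and column names.

      # Illustrative sketch only: sheet and column names in the exported Excel
      # file are assumptions based on the sheet titles described above.
      import pandas as pd

      REPORT_PATH = "evaluation_report.xlsx"  # hypothetical path of the downloaded report

      # Sheet2 ("Task Details") holds per-question evaluation data.
      details = pd.read_excel(REPORT_PATH, sheet_name="Task Details")

      # Average every numeric metric column (for example Faithfulness, Context
      # Precision, Context Recall, Satisfaction, and Comprehensive Score when
      # reference answers were provided) across all evaluated questions.
      averages = details.select_dtypes(include="number").mean()
      print(averages)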