Platform for AI: Evaluate an application flow

Last Updated: Mar 12, 2026

Evaluate application flow performance using evaluation flows that score outputs across multiple dimensions with built-in templates.

How evaluation works

To submit an evaluation task, configure an evaluation dataset, map the application flow inputs, and select the evaluation templates you need. Evaluation then proceeds in three steps: the application flow batch-processes each line in the dataset to generate outputs, each output is scored using auxiliary fields from the dataset, and the scores are aggregated to measure accuracy.
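
The three steps described above (batch run, per-output scoring, aggregation) can be sketched in a few lines of Python. This is a minimal illustration, not LangStudio's implementation: `run_evaluation`, the `flow` callable, and the `evaluators` dictionary are hypothetical names, and a plain average is assumed as the aggregation step.

```python
from statistics import mean


def run_evaluation(rows, flow, evaluators):
    """Run a flow over dataset rows, score each output, and average the scores.

    rows:       list of dicts, one per dataset line (assumed non-empty)
    flow:       hypothetical callable that maps a row to an output
    evaluators: dict of metric name -> callable(row, output) returning a float
    """
    # Step 1: batch run. The flow produces one output per dataset line.
    outputs = [flow(row) for row in rows]

    # Step 2: metric evaluation. Each evaluator scores every output,
    # using auxiliary fields from the same dataset row.
    per_metric = {
        name: [evaluate(row, output) for row, output in zip(rows, outputs)]
        for name, evaluate in evaluators.items()
    }

    # Step 3: aggregation. Here each metric is reduced to its mean.
    return {name: mean(scores) for name, scores in per_metric.items()}


# Toy usage: an "uppercase" flow scored by exact match against a reference.
demo_rows = [
    {"query": "a", "reference": "A"},
    {"query": "b", "reference": "x"},
]
demo_result = run_evaluation(
    demo_rows,
    flow=lambda row: row["query"].upper(),
    evaluators={"exact": lambda row, out: 1.0 if out == row["reference"] else 0.0},
)
```

In this toy run, one of the two outputs matches its reference, so the aggregated score for the `exact` metric is 0.5.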

Prerequisites

  • An application flow is created and debugged. For more information, see Develop an application flow.

  • An evaluation dataset is uploaded to OSS in JSON Lines (JSONL) format:

    {"history":[],"query": "Describe the steepness and majesty of Mount Hua", "reference": "Mount Hua stands alone, soaring to the clouds; \nSheer cliffs cut the sky, with rugged, handsome crags. \nGreen pines and bamboo vie for beauty on the cliffs; \nMonkeys cry and eagles fly, lit by frosty swords of light. \n\nPerilous peaks like scissors, jagged swords pointing to the sky; \nNarrow paths on steep slopes, where vines are the only way. \nWind and mist intertwine, as clouds emerge from caves; \nA deep fairyland, with a heavenly ladder hard to climb. \n\nJagged ridges cross, like a surging dragon's spine; \nDangerous paths lead onward, twisting toward the heavens. \nFrom lonely pine tops, eagles strike the vast sky; \nAt the summit of Mount Hua, a majestic and heroic sight.", "contexts": ["Mount Hua is one of the Five Great Mountains", "Mount Hua is famous for its steepness"]}
    {"history":[],"query": "Can you list 5 rare metals? Please rank them by global demand.", "reference": "Rare metals are metallic elements that are scarce in the Earth's crust, unevenly distributed, or difficult to mine. They play a crucial role in high-tech fields and emerging industries. The ranking of global demand can change with time and technological progress, but the following are some rare metals that are typically in high demand. This list is not necessarily ranked by absolute demand, as that can vary at different times.\n\n1. **Cobalt (Co)** - Cobalt is a key component of lithium-ion batteries, especially in electric vehicles and portable electronics. It is also used to manufacture heat-resistant alloys, hard alloys, and catalysts.\n\n2. **Neodymium (Nd)** - Neodymium is a rare-earth metal mainly used to produce strong magnets, such as high-performance permanent magnets. These magnets are widely used in computer hard drives, wind turbines, and the drive motors of electric vehicles.\n\n3. **Lithium (Li)** - Lithium is primarily used to manufacture lithium batteries. As the demand for electric vehicles and portable electronic devices increases, the demand for lithium is rising rapidly.\n\n4. **Silver (Ag)** - Although silver is not as rare as the metals listed above, its industrial demand is huge. It is mainly used in electronics, solar panels, jewelry, and currency manufacturing.\n\n5. **Ruthenium (Ru)** - Ruthenium is a rare precious metal widely used for data storage in hard disk drives and large-capacity servers. It is also used in catalysts and electrochemical cells.\n\nThe demand for these metals is influenced by many factors, such as the global economy, technological development, and policy support. Moreover, as time passes and markets change, other rare metals such as tantalum, indium, rhenium, and other rare-earth metals may also appear on the list of most in-demand rare metals.", "contexts": ["Rare metals are metals with low abundance in the Earth's crust that are complex to mine and extract.", "Lithium (Li): Used in battery manufacturing.", "Cobalt (Co): Used in high-performance alloys and battery manufacturing."]}

    Sample file: langstudio_eval_demo.jsonl

  • LLM and embedding connections required for evaluation are created. For more information, see Configure connections.

    Note: Some evaluation templates require a judge model or an embedding model. Configure the LLM and embedding connections accordingly.
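
Because the dataset must be valid JSONL (one complete JSON object per line), it can help to check the file locally before uploading it to OSS. The following is a minimal sketch using only the Python standard library. The function name `validate_jsonl` is hypothetical, and the list of required fields is an assumption that you supply yourself, based on the templates you plan to use.

```python
import json


def validate_jsonl(lines, required_fields):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            # A record split across two physical lines ends up here.
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [field for field in required_fields if field not in record]
        if missing:
            problems.append((number, "missing fields: " + ", ".join(missing)))
    return problems


# Example: field names taken from the sample dataset above (an assumption;
# adjust them to the fields your flow and templates actually need).
sample_lines = [
    '{"history": [], "query": "q", "reference": "r", "contexts": ["c"]}',
    '{"query": "q"',
    '{"query": "q"}',
]
sample_problems = validate_jsonl(sample_lines, ("query", "reference"))
```

To check a real file, pass an open file object as `lines`, for example `validate_jsonl(open("langstudio_eval_demo.jsonl", encoding="utf-8"), ("query", "reference"))`.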

Billing

Application flow evaluation uses Object Storage Service (OSS) to store evaluation datasets and PAI-DLC to run offline evaluation tasks. Resource usage fees apply. For more information, see OSS Billing overview and Billing of Deep Learning Containers (DLC).

Create an evaluation task

After debugging an application flow, click Evaluation in the upper-right corner to create an evaluation task.

Key parameters:

  • Evaluation dataset > OSS file: Select an evaluation dataset file in JSONL format from OSS. The dataset must contain a 'question' field and any other fields required for evaluation. The 'question' field is used as input for the application flow. The other fields are used to calculate metric scores. For more information, see the 'Input fields' section in Evaluation templates.

  • Application flow input mapping > chat_history/question: Select the input fields for the application flow run.

    Note: Evaluation tasks run on inference results, so the application flow first runs inference on the dataset. Select the input fields required for that run.

  • Evaluation configuration > Preset template evaluation: Multiple preset evaluation templates are available. Select templates as needed. If you select multiple templates, the evaluation results are aggregated on the task details page. This example uses the Answer Relevancy template. When you select this template, complete the following configurations:

      • Connection configuration > Judge LLM configuration: Select an LLM as the judge model. The judge model evaluates whether the query and response match. A powerful model such as qwen-max is recommended.

      • Evaluation template input mapping > query/response: Configure input fields for the evaluation template. Inputs can come from the evaluation dataset or the application flow output. In this example, the Answer Relevancy template accepts a question (query) and an answer (response), evaluates whether the response accurately answers the query, and returns a score. The template's query must come from the 'question' field in the dataset, and the template's response must come from the application flow output.

    For more information about templates, see Evaluation templates.

  • Resource configuration: These resources are used only to schedule evaluation tasks. Select appropriate CPU resources based on task complexity.
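
The two mappings described above (dataset fields to application flow inputs, and dataset or flow output fields to template inputs) follow the same pattern: each target input names a source and a field. The sketch below illustrates that pattern only. It does not reflect how LangStudio stores mappings; `resolve_inputs`, the `("dataset", ...)` / `("output", ...)` tuples, and the output field name `answer` are all hypothetical.

```python
def resolve_inputs(mapping, dataset_row, flow_output):
    """Build an input dict by pulling each target from its mapped source field.

    mapping: dict of target name -> (source, field), where source is
             "dataset" or "output" (a hypothetical convention for this sketch)
    """
    sources = {"dataset": dataset_row, "output": flow_output}
    resolved = {}
    for target, (source, field) in mapping.items():
        if source not in sources:
            raise ValueError(f"unknown source {source!r} for input {target!r}")
        if field not in sources[source]:
            raise KeyError(f"field {field!r} not found in {source} for input {target!r}")
        resolved[target] = sources[source][field]
    return resolved


# Template input mapping for the Answer Relevancy example:
# query comes from the dataset, response comes from the flow output.
template_mapping = {
    "query": ("dataset", "question"),
    "response": ("output", "answer"),  # "answer" is an assumed output field name
}
row = {"question": "What is OSS?", "chat_history": []}
output = {"answer": "OSS is Object Storage Service."}
template_inputs = resolve_inputs(template_mapping, row, output)
```

Failing fast on an unknown source or a missing field mirrors what a misconfigured mapping would cause in practice: the template cannot be scored for that row.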

View evaluation results

After you submit an evaluation task, the task Overview page opens. Each evaluation run contains one batch run task and N metric evaluation tasks, where N is the number of selected templates. The batch run task uses the application flow to batch-process each line in the dataset and generate outputs. The metric evaluation tasks use auxiliary fields in the evaluation dataset to score each output. You can view the details of each subtask at the bottom of the page. After the run completes, you can view the trace, metrics, and output details for each subtask.

On the Metrics page, view all evaluation metric results. For more information about metric names, see Evaluation templates.

Evaluation templates

LangStudio provides multiple built-in evaluation templates. Use these templates to evaluate application flow performance across multiple dimensions:

Exact Match Evaluation

  • Description: Evaluates whether application flow output (response) exactly matches the reference answer (reference). Score: 0 (no match) to 1 (exact match).

  • Metric name: exact_match_score

  • Required model service type: Not required

  • Input fields: reference (String), response (String)

Answer Relevancy Evaluation

  • Description: Evaluates the relevancy of application flow output to input. This method depends on an LLM. The LLM scores the input (query) against the answer (response). Score: 0 (irrelevant) to 1 (perfectly relevant).

  • Metric name: answer_relevancy

  • Required model service type: LLM

  • Input fields: query (String), response (String)

Answer Correctness Evaluation

  • Description: Evaluates whether application flow output is correct. This method depends on an LLM. The model scores the question (query) against the answer (response). Score: 1 (worst) to 5 (best).

  • Metric name: answer_correctness

  • Required model service type: LLM

  • Input fields: query (String), response (String)

BLEU Score Evaluation

  • Description: Evaluates the relevancy of application flow output to the reference answer using the BLEU score. Calculates the relevancy between the reference answer (reference) and output (response). Score: 0 (irrelevant) to 1 (perfectly relevant).

  • Metric name: bleu-1, bleu-2, bleu-3, bleu-4

  • Required model service type: Not required

  • Input fields: reference (String), response (String)

ROUGE Score Evaluation

  • Description: Evaluates the relevancy of application flow output (response) to the reference answer (reference) using the ROUGE score. Score: 0 (irrelevant) to 1 (perfectly relevant).

  • Metric name: rouge-1-p, rouge-1-r, rouge-1-f, rouge-l-p, rouge-l-r, rouge-l-f

  • Required model service type: Not required

  • Input fields: reference (String), response (String)

Context Relevancy Evaluation

  • Description: Evaluates the relevancy of retrieved context to input. This method depends on an LLM. The LLM scores the input (query) against the context. Score: 0 (irrelevant) to 1 (perfectly relevant).

  • Metric name: context_relevancy

  • Required model service type: LLM

  • Input fields: query (String), contexts (List[String])

Answer Faithfulness Evaluation

  • Description: Evaluates whether the answer is derived from the given context. This method depends on an LLM. The model scores the answer (response) against the context (contexts). Score: 0 (completely fabricated) to 1 (fully conforms to context).

  • Metric name: answer_faithfulness

  • Required model service type: LLM

  • Input fields: response (String), contexts (List[String])

Embedding Similarity Evaluation

  • Description: Evaluates the embedding similarity between output (response) and reference answer (reference). This method depends on an embedding model. It converts both into embedding vectors and calculates cosine similarity. Score: 0 to 1 (higher = more similar).

  • Metric name: embedding_similarity

  • Required model service type: Embedding

  • Input fields: reference (String), response (String)
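
To make the templates that do not need an LLM more concrete, the sketch below shows simplified versions of the underlying calculations: exact match, the clipped n-gram precision at the core of BLEU, ROUGE-L, and cosine similarity between two vectors. These are textbook formulations, not LangStudio's implementations. Whitespace tokenization is an assumption (it does not suit languages written without spaces), the BLEU part omits the brevity penalty, and all function names are hypothetical.

```python
import math
from collections import Counter


def exact_match(reference, response):
    """1.0 if the two strings are identical, otherwise 0.0."""
    return 1.0 if reference == response else 0.0


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(reference, response, n):
    """Modified n-gram precision used by BLEU-n (brevity penalty omitted)."""
    ref_counts = ngram_counts(reference.split(), n)
    resp_counts = ngram_counts(response.split(), n)
    total = sum(resp_counts.values())
    if total == 0:
        return 0.0
    # Each response n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in resp_counts.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, response):
    """ROUGE-L precision, recall, and F1 based on the longest common subsequence."""
    ref, resp = reference.split(), response.split()
    lcs = lcs_length(ref, resp)
    precision = lcs / len(resp) if resp else 0.0
    recall = lcs / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"rouge-l-p": precision, "rouge-l-r": recall, "rouge-l-f": f1}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

For example, `rouge_l("the cat sat on", "the cat on mat")` finds the common subsequence "the cat on" (length 3), which gives precision, recall, and F1 of 0.75 each. In the Embedding Similarity template, the vectors passed to a function like `cosine_similarity` would come from the configured embedding model.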