Evaluate an application flow

Before you deploy an application flow, evaluate its performance in your business scenario. LangStudio provides a comprehensive evaluation feature for application flows. This feature uses a dedicated type of application flow, called an Evaluation Flow, to score an application flow along specified dimensions using evaluation templates.

Introduction

To submit an evaluation task, you configure an evaluation dataset, map the application flow inputs, and select the evaluation templates. The evaluation then runs in three stages. First, the application flow batch-processes each line in the evaluation dataset to generate outputs. Next, each output is scored using auxiliary fields from the dataset. Finally, the scores are aggregated to measure how well the application flow performs on the specified dataset.
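The process described above (batch run, per-row scoring, aggregation) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not LangStudio's implementation: `toy_flow`, `exact_match`, and `evaluate` are hypothetical names, the flow is a stand-in function, and a plain mean is assumed as the aggregation.

```python
from statistics import mean
from typing import Callable

Row = dict  # one parsed line of the JSONL evaluation dataset


def exact_match(response: str, reference: str) -> float:
    """Hypothetical scorer: 1.0 if the strings are identical, else 0.0."""
    return 1.0 if response == reference else 0.0


def evaluate(flow: Callable[[str], str], rows: list[Row]) -> dict:
    # Stage 1: batch run. The flow produces one output per dataset row.
    outputs = [flow(row["query"]) for row in rows]
    # Stage 2: score each output against the row's auxiliary field.
    scores = [exact_match(out, row["reference"]) for out, row in zip(outputs, rows)]
    # Stage 3: aggregate the per-row scores (a plain mean is assumed here).
    return {"outputs": outputs, "scores": scores, "mean_score": mean(scores)}


def toy_flow(query: str) -> str:
    """Stand-in for an application flow; it knows exactly one answer."""
    return "Paris" if query == "Capital of France?" else "unknown"


rows = [
    {"query": "Capital of France?", "reference": "Paris"},
    {"query": "Capital of Japan?", "reference": "Tokyo"},
]
result = evaluate(toy_flow, rows)
print(result["scores"], result["mean_score"])
```

Keeping the per-row scores next to the aggregate makes it possible to inspect which dataset lines pulled the overall score down.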

Preparations

An application flow is created and debugged. For more information, see Develop an application flow.

An evaluation dataset is uploaded to OSS in JSON Lines (JSONL) format. The following code provides an example:

{"history":[],"query": "Describe the steepness and majesty of Mount Hua", "reference": "Mount Hua stands alone, soaring to the clouds; \nSheer cliffs cut the sky, with rugged, handsome crags. \nGreen pines and bamboo vie for beauty on the cliffs; \nMonkeys cry and eagles fly, lit by frosty swords of light. \n\nPerilous peaks like scissors, jagged swords pointing to the sky; \nNarrow paths on steep slopes, where vines are the only way. \nWind and mist intertwine, as clouds emerge from caves; \nA deep fairyland, with a heavenly ladder hard to climb. \n\nJagged ridges cross, like a surging dragon's spine; \nDangerous paths lead onward, twisting toward the heavens. \nFrom lonely pine tops, eagles strike the vast sky; \nAt the summit of Mount Hua, a majestic and heroic sight.", "contexts": ["Mount Hua is one of the Five Great Mountains", "Mount Hua is famous for its steepness"]}
{"history":[],"query": "Can you list 5 rare metals? Please rank them by global demand.", "reference": "Rare metals are metallic elements that are scarce in the Earth's crust, unevenly distributed, or difficult to mine. They play a crucial role in high-tech fields and emerging industries. The ranking of global demand can change with time and technological progress, but the following are some rare metals that are typically in high demand. This list is not necessarily ranked by absolute demand, as that can vary at different times.\n\n1. **Cobalt (Co)** - Cobalt is a key component of lithium-ion batteries, especially in electric vehicles and portable electronics. It is also used to manufacture heat-resistant alloys, hard alloys, and catalysts.\n\n2. **Neodymium (Nd)** - Neodymium is a rare-earth metal mainly used to produce strong magnets, such as high-performance permanent magnets. These magnets are widely used in computer hard drives, wind turbines, and the drive motors of electric vehicles.\n\n3. **Lithium (Li)** - Lithium is primarily used to manufacture lithium batteries. As the demand for electric vehicles and portable electronic devices increases, the demand for lithium is rising rapidly.\n\n4. **Silver (Ag)** - Although silver is not as rare as the metals listed above, its industrial demand is huge. It is mainly used in electronics, solar panels, jewelry, and currency manufacturing.\n\n5. **Ruthenium (Ru)** - Ruthenium is a rare precious metal widely used for data storage in hard disk drives and large-capacity servers. It is also used in catalysts and electrochemical cells.\n\nThe demand for these metals is influenced by many factors, such as the global economy, technological development, and policy support. \nMoreover, as time passes and markets change, other rare metals such as tantalum, indium, rhenium, and other rare-earth metals may also appear on the list of most in-demand rare metals.", "contexts": ["Rare metals are metals with low abundance in the Earth's crust that are complex to mine and extract.", "Lithium (Li): Used in battery manufacturing.", "Cobalt (Co): Used in high-performance alloys and battery manufacturing."]}

Sample file: langstudio_eval_demo.jsonl
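Because JSONL requires exactly one complete JSON object per line, it can help to check a dataset locally before you upload it to OSS. The following sketch is one way to do that with the Python standard library. The function name `validate_jsonl` and the `REQUIRED_FIELDS` tuple are assumptions for illustration; the fields you actually need depend on the evaluation templates you select.

```python
import json

# Assumed field set, taken from the sample above; adjust to your templates.
REQUIRED_FIELDS = ("query", "reference", "contexts")


def validate_jsonl(text: str, required=REQUIRED_FIELDS) -> list[str]:
    """Return a list of problems; an empty list means every line is usable."""
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines, including a trailing one
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {number}: invalid JSON ({err.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = [name for name in required if name not in record]
        if missing:
            problems.append(f"line {number}: missing {', '.join(missing)}")
    return problems


sample = (
    '{"history": [], "query": "q1", "reference": "r1", "contexts": ["c1"]}\n'
    '{"history": [], "query": "q2"}\n'
    '{"history": [], "query": "broken\n'  # string value cut off by a line break
)
problems = validate_jsonl(sample)
for problem in problems:
    print(problem)
```

A raw line break inside a string value, as in the third sample line, splits one record across two lines and makes both unparseable, so newlines inside values must be written as the `\n` escape.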

The large language model (LLM) and embedding connections required for the evaluation are created. For more information, see Configure connections.
Note: Some evaluation templates depend on a judge model or an embedding model. Therefore, you must configure the related LLM and embedding connections.

Billing

The application flow evaluation feature uses Object Storage Service (OSS) to store evaluation datasets and PAI-Deep Learning Containers (PAI-DLC) to run offline evaluation tasks. As a result, fees for resource usage are incurred. For more information, see OSS Billing overview and Billing of Deep Learning Containers (DLC).

Create an application flow evaluation task

After you debug an application flow on the orchestration page, click Evaluation in the upper-right corner to create an application flow evaluation task.

The following table describes the key parameters.

Parameter	Description
Evaluation dataset
OSS file	Select an evaluation dataset file in JSONL format from OSS. The dataset must contain a 'question' field (the sample dataset above stores the question in its query field) and the other fields required for the evaluation. The 'question' field is used as the input for the application flow. The other required fields are used to calculate metric scores. For more information, see the Input fields column in Appendix: Preset evaluation templates.
Application flow input mapping
chat_history/question	Select the input fields for the application flow run. Note: Before you evaluate an application flow, you must first run it for inference. The evaluation tasks then run based on the inference results. Therefore, you must first select the input fields required for the application flow run.
Evaluation configuration
Preset template evaluation	The system provides multiple preset evaluation templates. You can select templates as needed. If you select multiple templates, the evaluation results are aggregated and displayed on the task details page.
Resource configuration	These resources are used only for scheduling the evaluation task. We recommend that you select appropriate CPU resources based on the task complexity.

This topic uses the Answer Relevancy template as an example. When you select this template, configure the following key parameters:

Connection configuration > Judge LLM configuration: Select an LLM to act as the judge model. The judge model evaluates whether the query and response of the application flow match. We recommend that you select a powerful model, such as qwen-max.

Evaluation template input mapping > query/response: Configure the input fields for the evaluation template. The inputs can come from the evaluation dataset or from the output of the current application flow. In this example, the Answer Relevancy template accepts a question (query) and an application flow answer (response). The template then calculates whether the response accurately answers the query and provides a score. Therefore, the template's query must come from the 'question' field in the dataset, and the template's response must come from the application flow output.

For more information about templates, see Appendix: Preset evaluation templates.
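The input mapping described above, where each template input is taken either from a dataset field or from the application flow output, can be pictured as a small lookup. The sketch below is purely illustrative: the `source.field` reference notation, the `resolve_inputs` function, and the flow output field `answer` are all hypothetical and do not reflect how LangStudio stores or names mappings.

```python
def resolve_inputs(mapping: dict, dataset_row: dict, flow_output: dict) -> dict:
    """Build a template's inputs from hypothetical 'source.field' references."""
    sources = {"dataset": dataset_row, "flow": flow_output}
    resolved = {}
    for template_input, reference in mapping.items():
        source, field = reference.split(".", 1)
        resolved[template_input] = sources[source][field]
    return resolved


# Hypothetical mapping for the Answer Relevancy template:
# query comes from the dataset, response comes from the flow output.
mapping = {"query": "dataset.question", "response": "flow.answer"}
row = {"question": "What is OSS?", "chat_history": []}
output = {"answer": "OSS is the Object Storage Service."}

inputs = resolve_inputs(mapping, row, output)
print(inputs)
```

The point of the mapping is that the evaluation template never needs to know the field names of your dataset or flow; it only sees its own inputs, `query` and `response`.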

View evaluation results

After you submit an evaluation task, you are redirected to the task Overview page. Each evaluation run contains one batch run task and N metric evaluation tasks, where N is the number of selected templates. The batch run task uses the application flow to batch-process each line in the dataset and generate outputs. The metric evaluation tasks use auxiliary fields in the evaluation dataset to score each output from the batch run task. At the bottom of the page, you can view the details of each subtask. After the run is complete, you can view the trace, metrics, and output details for each subtask.

On the Metrics page, you can view all evaluation metric results. For more information about the metric names, see Appendix: Preset evaluation templates.

Appendix: Preset evaluation templates

LangStudio provides multiple built-in evaluation templates. You can use these templates to evaluate the performance of an application flow from multiple dimensions based on metric scores (metric values):

Template name	Description	Metric name	Required model service type	Input fields
Exact Match Evaluation	Evaluates whether the application flow output (response) is an exact match to the reference answer (reference). The score is between 0 and 1. A score of 0 indicates that the output does not match the reference. A score of 1 indicates an exact match.	exact_match_score	Not required	reference: String; response: String
Answer Relevancy Evaluation	Evaluates the relevancy of the application flow output to the input. This method depends on an LLM. The LLM provides a score based on the input (query) and the application flow's answer (response). The score is between 0 and 1. A score of 0 indicates that the output is completely irrelevant to the input. A score of 1 indicates perfect relevance.	answer_relevancy	LLM	query: String; response: String
Answer Correctness Evaluation	Evaluates whether the application flow's output is correct. This method depends on an LLM. The LLM provides a score based on the question (query) and the application flow's answer (response). The score is between 1 and 5. A score of 1 is the worst, and 5 is the best.	answer_correctness	LLM	query: String; response: String
BLEU Score Evaluation	Evaluates the relevancy of the application flow output (response) to the reference answer (reference). This method uses the BLEU score as the evaluation metric. The score is between 0 and 1. A score of 0 indicates complete irrelevance. A score of 1 indicates perfect relevance.	bleu-1/bleu-2/bleu-3/bleu-4	Not required	reference: String; response: String
ROUGE Score Evaluation	Evaluates the relevancy of the application flow output (response) to the reference answer (reference). This method uses the ROUGE score as the evaluation metric. The score is between 0 and 1. A score of 0 indicates complete irrelevance. A score of 1 indicates perfect relevance.	rouge-1-p/rouge-1-r/rouge-1-f/rouge-l-p/rouge-l-r/rouge-l-f	Not required	reference: String; response: String
Context Relevancy Evaluation	Evaluates the relevancy of the context retrieved by the application flow to the input. This method depends on an LLM. The LLM provides a score based on the input (query) and the context (contexts). The score is between 0 and 1. A score of 0 indicates complete irrelevance. A score of 1 indicates perfect relevance.	context_relevancy	LLM	query: String; contexts: List[String]
Answer Faithfulness Evaluation	Evaluates whether the application flow's answer is derived from the given context. This method depends on an LLM. The LLM provides a score based on the answer (response) and the context (contexts). The score is between 0 and 1. A score of 0 indicates a completely fabricated answer. A score of 1 indicates that the answer fully conforms to the context.	answer_faithfulness	LLM	response: String; contexts: List[String]
Embedding Similarity Evaluation	Evaluates the embedding similarity between the application flow output (response) and the reference answer (reference). This method depends on an embedding model. It converts the reference answer and the application flow output into embedding vectors and then calculates their cosine similarity. The score is between 0 and 1. A higher value indicates greater similarity.	embedding_similarity	Embedding	reference: String; response: String
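
To make the model-free metrics in the table more concrete, the sketch below computes textbook versions of three of them: exact match, ROUGE-1 precision/recall/F-score, and cosine similarity. These are generic definitions, not LangStudio's code. In particular, whitespace tokenization is an assumption, the templates may normalize text differently, and the vectors passed to `cosine_similarity` would in practice come from an embedding model rather than being written by hand.

```python
import math
from collections import Counter


def exact_match_score(response: str, reference: str) -> float:
    """1.0 when the two strings are identical, otherwise 0.0."""
    return 1.0 if response == reference else 0.0


def rouge_1(response: str, reference: str) -> dict:
    """Unigram overlap, using whitespace tokenization (an assumption)."""
    resp, ref = Counter(response.split()), Counter(reference.split())
    overlap = sum((resp & ref).values())  # clipped count of shared unigrams
    p = overlap / sum(resp.values()) if resp else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"rouge-1-p": p, "rouge-1-r": r, "rouge-1-f": f}


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


em = exact_match_score("Paris", "Paris")
rouge = rouge_1("the cat sat", "the cat sat down")
cos = cosine_similarity([1.0, 0.0], [1.0, 1.0])
print(em, rouge, cos)
```

In the ROUGE-1 example, every word of the response appears in the reference, so precision is 1, while recall is lower because the reference contains one word that the response lacks.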