Use evaluation tasks to measure the end-to-end quality of conversational search and compare different configurations — model, prompt, and retrieval settings — to identify which setup produces the best results for your use case. The evaluated pipeline covers three stages: a user submits a question, the system retrieves relevant documents, and a large language model (LLM) generates an answer.
Evaluation tasks are billed based on the computing resources consumed during the evaluation.
Prerequisites
Before you begin, make sure you have:
An OpenSearch LLM-Based Conversational Search Edition instance
A prompt template configured. For details, see Manage prompts
An evaluation dataset ready to use
Create an evaluation task
Log on to the OpenSearch console.
In the top navigation bar, select the region where your instance resides. In the upper-left corner, select OpenSearch LLM-Based Conversational Search Edition.
On the Instance Management page, find your instance and click Manage in the Actions column. On the instance details page, click Effect Comparison in the left-side pane.
On the Evaluation Task tab, click Create Evaluation Task.
Enter a task name, select an evaluation dataset, and click Configure Parameters.
In the Configure Parameters panel, set the parameters described in the following sections, then click OK.
After the evaluation completes, the system generates an overall score. Click Evaluation Report to view per-Q&A-pair scores and details. If any result appears inaccurate, click Manual Evaluation to revise the score manually.
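
When you run several evaluation tasks with different settings, it can help to record each overall score alongside the settings that produced it. The following is a minimal sketch of that bookkeeping, not part of the product: `EvalResult` and `best_result` are hypothetical names, the scores are placeholder values to be replaced with the numbers shown in the console, and it assumes that a higher overall score is better.

```python
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    """One evaluation task, recorded by hand from the console (hypothetical structure)."""
    task_name: str
    overall_score: float
    settings: dict = field(default_factory=dict)


def best_result(results):
    """Return the recorded task with the highest overall score.

    Assumption: a higher overall score means a better configuration.
    """
    if not results:
        raise ValueError("no evaluation results recorded")
    return max(results, key=lambda r: r.overall_score)


# Placeholder scores for illustration only; they are not real evaluation results.
results = [
    EvalResult("baseline", 1.0, {"top_n": 5, "dense_weight": 0.7}),
    EvalResult("more-documents", 2.0, {"top_n": 10, "dense_weight": 0.7}),
]

winner = best_result(results)
print(winner.task_name, winner.settings)
```

Changing one setting per task, as in the two entries above, makes it easier to attribute a score difference to a specific parameter.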
Parameter reference
Select model and prompt
| Parameter | Description |
|---|---|
| Select Model | The model used for conversational search. Only models that are available for testing conversational search can be selected. For the list of available models, see Model management. |
| Prompt | The prompt template used for conversational search. Configure a prompt template before creating the task. For details, see Manage prompts. |
Prompt parameters
These parameters control how the LLM generates answers.
| Parameter | Type | Required | Default | Valid values | Description |
|---|---|---|---|---|---|
| attitude | String | No | normal | normal, polite, patience | The tone of the conversation. |
| rule | String | No | simple | detailed, stepbystep | The level of detail in the answer. |
| noanswer | String | No | sorry | sorry, uncertain | The response returned when the system cannot find an answer. |
| language | String | No | Chinese | Chinese, English, Thai, Korean | The language of the generated answer. |
| role | Boolean | No | true | — | Specifies whether to use a custom role to answer questions. |
| role_name | String | No | AI Assistant | — | The name of the custom role. Example: AI Assistant. |
| out_format | String | No | text | text, table, list, markdown | The format of the generated answer. |
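
The defaults and listed values in the table above can be captured in a small local checker, which is handy for planning a configuration before entering it in the console. This is a sketch under stated assumptions: `build_prompt_params` is a hypothetical helper, not an OpenSearch API, and it assumes a documented default is always acceptable even when it does not appear among the listed values (as with `simple` for `rule`).

```python
# Defaults copied from the prompt parameter table.
PROMPT_DEFAULTS = {
    "attitude": "normal",
    "rule": "simple",
    "noanswer": "sorry",
    "language": "Chinese",
    "role": True,
    "role_name": "AI Assistant",
    "out_format": "text",
}

# Enumerated values copied from the "Valid values" column.
LISTED_VALUES = {
    "attitude": {"normal", "polite", "patience"},
    "rule": {"detailed", "stepbystep"},
    "noanswer": {"sorry", "uncertain"},
    "language": {"Chinese", "English", "Thai", "Korean"},
    "out_format": {"text", "table", "list", "markdown"},
}


def build_prompt_params(overrides=None):
    """Merge overrides into the defaults and check enumerated values (hypothetical helper)."""
    params = dict(PROMPT_DEFAULTS)
    for name, value in (overrides or {}).items():
        if name not in PROMPT_DEFAULTS:
            raise KeyError(f"unknown prompt parameter: {name}")
        allowed = LISTED_VALUES.get(name)
        # Assumption: the documented default is accepted in addition to the listed values.
        if allowed is not None and value not in allowed | {PROMPT_DEFAULTS[name]}:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
        params[name] = value
    return params


config = build_prompt_params({"attitude": "polite", "out_format": "markdown"})
print(config)
```

Parameters without an enumerated list, such as `role` and `role_name`, are passed through unchecked in this sketch.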
Document retrieval parameters
These parameters control how the system retrieves documents from your data source.
| Parameter | Type | Required | Default | Valid values | Description |
|---|---|---|---|---|---|
| filter | String | No | — | — | The field expression used to filter documents. Example: filter = field = value. |
| top_n | INT | No | 5 | (0, 50] | The number of documents to retrieve. |
| sf | Float | No | 1.3 | [0, +∞) | The vector similarity threshold for retrieved documents. A higher value means lower similarity is required. |
| dense_weight | Float | No | 0.7 | (0, 1) | The weight of the dense vector. Available when a sparse vector model is selected. The sparse vector weight equals 1 - dense_weight. |
| formula | String | No | Vector similarity | — | The formula used to rank retrieved documents. |
| operator | String | No | AND | — | The operator applied between text tokens during text retrieval. |
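
The numeric ranges above, and the relationship between the dense and sparse weights, can be checked locally before you enter values in the console. The following is a minimal sketch: `check_retrieval_params` is a hypothetical helper, and `sparse_weight` is a derived value shown for illustration, not a parameter you set.

```python
def check_retrieval_params(top_n=5, sf=1.3, dense_weight=0.7):
    """Validate numeric retrieval settings against the documented ranges (hypothetical helper)."""
    # top_n must be an integer in (0, 50]; bool is rejected because it subclasses int.
    if isinstance(top_n, bool) or not isinstance(top_n, int) or not 0 < top_n <= 50:
        raise ValueError("top_n must be an integer in (0, 50]")
    # sf must be in [0, +inf).
    if sf < 0:
        raise ValueError("sf must be greater than or equal to 0")
    # dense_weight must be strictly between 0 and 1.
    if not 0 < dense_weight < 1:
        raise ValueError("dense_weight must be in (0, 1)")
    return {
        "top_n": top_n,
        "sf": sf,
        "dense_weight": dense_weight,
        # Derived from the table: sparse weight = 1 - dense_weight (rounded for display).
        "sparse_weight": round(1 - dense_weight, 6),
    }


print(check_retrieval_params())
print(check_retrieval_params(top_n=10, dense_weight=0.5))
```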
Reference image parameters
These parameters apply when your retrieval pipeline includes image data.
| Parameter | Type | Required | Default | Valid values | Description |
|---|---|---|---|---|---|
| sf | Float | No | 1 | [0, +∞) | The vector similarity threshold for reference images. For sparse vector models, a higher value means greater similarity. For dense vector models, a higher value means lower similarity. |
| dense_weight | Float | No | 0.7 | (0, 1) | The weight of the dense vector. The sparse vector weight equals 1 - dense_weight. |
Query understanding parameters
These parameters control how the system interprets and expands the user's query before retrieval.
| Parameter | Type | Required | Default | Valid values | Description |
|---|---|---|---|---|---|
| query_extend | Boolean | No | false | — | Specifies whether to enable query expansion. Enabling this can improve retrieval performance. |
| query_exten_num | INT | No | 5 | (0, +∞) | The number of expanded queries to generate. |
Manual intervention parameters
| Parameter | Type | Required | Default | Valid values | Description |
|---|---|---|---|---|---|
| sf | Float | No | 0.3 | [0, 2] | The similarity threshold for matching manual intervention entries. A higher value makes it easier to trigger a match. |
Other parameters
| Parameter | Type | Required | Default | Valid values | Description |
|---|---|---|---|---|---|
| return_hits | Boolean | No | false | — | Specifies whether to include document retrieval results in the response. |
| csi_level | String | No | strict | none, loose, strict | Content moderation level. none: no moderation. loose: blocks restricted content. strict: blocks restricted and suspicious content. |
| history_max | INT | No | 20 | (0, 20] | The maximum number of conversation rounds the system uses to generate a response. |
| link | Boolean | No | false | — | Specifies whether to return the source of each retrieved document. |
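
To illustrate what a cap such as `history_max` implies, the sketch below trims a conversation to a maximum number of rounds. It is an illustration only: `trim_history` is a hypothetical function, and it assumes that the most recent rounds are the ones kept, which the table above does not state.

```python
def trim_history(rounds, history_max=20):
    """Keep at most history_max conversation rounds (hypothetical illustration).

    Assumption: the most recent rounds are kept and older ones are dropped.
    """
    # Documented range for history_max is (0, 20].
    if isinstance(history_max, bool) or not isinstance(history_max, int) or not 0 < history_max <= 20:
        raise ValueError("history_max must be an integer in (0, 20]")
    return rounds[-history_max:]


# Each round is a (question, answer) pair; the contents are made up for the example.
conversation = [(f"question {i}", f"answer {i}") for i in range(1, 26)]
recent = trim_history(conversation, history_max=3)
print(recent)
```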
Related topics
Model management — review available models to use in your evaluation task
Manage prompts — configure prompt templates before creating an evaluation task