If the system's built-in evaluators, such as relevance, security, and duplication, do not meet your business needs, you can create a custom evaluator. A custom evaluator uses a prompt that you define to instruct a large language model (LLM) to act as a judge. The LLM then scores the output of your AI application based on the dimensions and standards that you specify.
Prerequisites
An AI application has been created and is reporting observability data.
Procedure
Step 1: Go to the Create Assessment Task page
Log on to the CloudMonitor 2.0 console and select the target workspace.
In the navigation pane on the left, under All Features, select AI Application Observability Assessment.
Click Assessment. On the assessment list page, click Create Assessment Task.
Step 2: Configure basic information
In the Basic Information section, configure the following parameters:
| Parameter | Description |
| --- | --- |
| Task Name | Enter a name for the assessment task. |
| Data Source | Select the source type for the evaluation data. Only Pipeline is currently supported. |
| AI Application | From the drop-down list, select the AI application to assess. |
| Time Range | Select the time range for the assessment data. |
Step 3: Create a custom evaluator
In the Select Evaluator section, expand the LLM as Judge tab.
Click the Create Custom Evaluator card. In the configuration window that appears, configure the following parameters:
| Parameter | Required | Description |
| --- | --- | --- |
| Evaluator Name | Yes | Enter a name that identifies the custom evaluator in the assessment task. For example: Technical Term Accuracy Assessment. |
| Metric Name | Yes | Define the metric ID under which the assessment result is displayed in the report. Use English characters or underscores. For example: pro_term_accuracy. |
| Evaluation Prompt | No | Write the judge prompt. This is the core configuration of the custom evaluator. Include the assessment dimensions (clearly tell the model what to check), the scoring criteria (define the scoring range, such as 0.0 to 1.0, and the specific meaning of each score), and the output requirements (require the model to output JSON that includes a score and an explanation, which states the reason for the score). See the example prompt after this table. |
| Variable Mapping | No | Map runtime variables from the application to placeholders in the prompt. This lets the evaluator access actual business data to make judgments. |
| Filter Assessment Data | No | Use filter statements to define which data enters the assessment flow. Scope: select the data layer where the assessment logic applies. Span (default) assesses a single operation node in the call chain, Trace assesses the entire call chain, and Session assesses the entire session. Filter statement: use tags such as the service name and properties to precisely target the assessment object. For example: serviceName = "your-service-name". |
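The console does not prescribe a fixed prompt format, but a minimal judge prompt for the Technical Term Accuracy Assessment example might look like the following. The {{question}} and {{answer}} placeholders are illustrative; their names and syntax must match the variable mapping that you configure in the next step.

```
You are a technical reviewer. Evaluate whether the answer uses technical terms accurately.

Question: {{question}}
Answer: {{answer}}

Assessment dimensions:
- Are domain-specific terms used correctly and consistently?
- Are any technical terms misused or confused with similar terms?

Scoring criteria (0.0 to 1.0):
- 1.0: all technical terms are used accurately
- 0.5: minor terminology issues that do not change the meaning
- 0.0: key technical terms are misused

Output requirements:
Return only a JSON object with two fields:
{"score": <number between 0.0 and 1.0>, "explanation": "<reason for the score>"}
```

Restricting the output to a single JSON object keeps the result machine-readable, so the score can be parsed and reported under the metric name you defined.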
Configure variable mapping
Add mappings that bind fields from the Span data to the placeholder variables in the prompt. The following fields are available for mapping:
| Field | Description |
| --- | --- |
| attributes.gen_ai.input.messages | Input messages |
| attributes.gen_ai.output.messages | Output messages |
| attributes.input.value | Input value |
| attributes.output.value | Output value |
| attributes.gen_ai.response.reasoning_content | Reasoning content |
| attributes.retrieval.query | Retrieval query |
| attributes.retrieval.document | Retrieval document |
| attributes.reranker.input_document | Reranking input document |
| attributes.reranker.output_document | Reranking output document |
| attributes.gen_ai.tool.call.arguments | Tool call arguments |
| attributes.gen_ai.tool.call.result | Tool call result |
| attributes.gen_ai.tool.definitions | Tool definitions |
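For example, to supply the {{question}} and {{answer}} placeholders used in the sample prompt above, you could map the input and output fields as follows (the placeholder names are illustrative, not required by the console):

```
attributes.input.value   ->  question   # referenced in the prompt as {{question}}
attributes.output.value  ->  answer     # referenced in the prompt as {{answer}}
```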
After you complete the configuration, review the filtered data in the Preview and Test area on the right to verify that the configuration is correct.
Click OK to finish creating the custom evaluator.
Step 4: Save and run the assessment task
Once created, the custom evaluator appears in the evaluator list.
Select other built-in evaluators as needed.
Click Save and Run to start the assessment task.
Preview and Test area description
When you configure a custom evaluator, the Preview and Test area on the right provides the following features:
| Feature | Description |
| --- | --- |
| Number of data entries | Displays the total amount of data that matches the filter criteria. |
| Data navigation | Browse different data records using the Previous/Next buttons. |
| Current span information | View the detailed Span properties of the currently selected data. |
| Run test | After entering the evaluation prompt, run a test to validate the assessment logic. |
| Assessment result | View test results in list or JSON format. |
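If the prompt follows the JSON output requirement described earlier, a test result viewed in JSON format might look like the following; the field values shown here are illustrative.

```json
{
  "score": 0.5,
  "explanation": "The answer uses the term 'throughput' where 'latency' is meant; other technical terms are accurate."
}
```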