Cloud Monitor 2.0 supports the evaluation of text content, such as the inputs and outputs of large language models (LLMs) and the tool calls of agents. This evaluation involves a multi-faceted analysis of the outputs, behaviors, and performance of LLMs. You can create evaluation tasks, view the task list, and review the results. The results include detailed scores, input semantic analysis, topic distribution analysis, and a scoring dashboard.
The evaluation uses an LLM as the evaluator to provide a conclusion for each task.
Evaluation task descriptions
Evaluation tasks are divided into two types based on their result format:
Scored evaluation: The result is a score with an accompanying explanation.
Semantic evaluation: The result enriches the original content with semantic information, such as topics and summaries.
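For illustration, the two result shapes might look like the following Python data. The field names are placeholders and do not represent the actual output schema of Cloud Monitor 2.0.

```python
# Illustrative result shapes only; the field names are placeholders,
# not the actual output schema of Cloud Monitor 2.0.
scored_result = {
    "task": "Hallucination",
    "score": 0.2,  # 0 = attention required, 1 = no attention required
    "explanation": "The answer cites a product version that does not exist.",
}

semantic_result = {
    "topics": ["cloud monitoring", "LLM evaluation"],
    "summary": "The user asks how to create an evaluation task for model outputs.",
    "sentiment": "neutral",
}
```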
Evaluation tasks are categorized into the following scenarios:
General scenario evaluation
Semantic evaluation
RAG evaluation
Agent evaluation
Tool use evaluation
1. General scenario evaluation
A score of 0 indicates that attention is required. A score of 1 indicates that no attention is required. A score between 0 and 1 indicates that partial attention is required.
No. | Evaluation task | Score of 0 | Score of 1 |
1 | Accuracy | Completely inaccurate | Completely accurate |
2 | Calculator correctness | Completely incorrect | Completely correct |
3 | Conciseness | Not concise | Completely concise |
4 | Contains code | Contains code | Does not contain code |
5 | Contains personally identifiable information | Contains personally identifiable information | Does not contain personally identifiable information |
6 | Contextual relevance | Completely irrelevant | Completely relevant |
7 | Taboo words | Contains taboo words | Does not contain taboo words |
8 | Hallucination | Hallucination is present | No hallucination |
9 | Hate speech | Contains hate speech | Does not contain hate speech |
10 | Usefulness | Completely useless | Very useful |
11 | Language detector | Cannot detect language | Accurately detects language |
12 | Open source | Is open source | Is not open source |
13 | Question is related to Python | Related to Python | Not related to Python |
14 | Toxicity | Is toxic | Is not toxic |
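The general scenario checks are performed by an evaluator LLM. The following is a minimal sketch of how such an LLM-as-judge check can be implemented for the accuracy task, assuming an OpenAI-compatible chat endpoint. The prompt, model name, and judge_accuracy helper are illustrative and are not the internal implementation of Cloud Monitor 2.0.

```python
# Minimal LLM-as-judge sketch (illustrative only; not Cloud Monitor 2.0's internal logic).
# Assumes an OpenAI-compatible endpoint; the API key is read from the environment.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are an evaluator. Given a question and an answer, rate the answer's "
    "accuracy on a scale from 0 (completely inaccurate) to 1 (completely accurate). "
    'Respond with JSON: {"score": <float between 0 and 1>, "explanation": "<one sentence>"}'
)

def judge_accuracy(question: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# A score near 0 flags the answer for attention.
print(judge_accuracy("What is 2 + 2?", "5"))
```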
2. Semantic evaluation
Semantic evaluation involves understanding and processing the semantics of data. It includes the following features.
Named Entity Recognition (NER)
Extracts entities from text, such as names of people, places, organizations, and companies; time expressions; monetary amounts; percentages; legal documents; countries, regions, and political entities; natural phenomena; works of art; events; languages; titles; images; and links.
Format information extraction
Extracts content such as titles, lists, emphasized fonts (bold or italic), link names and URLs, image addresses, code blocks, and tables from Markdown or other text formats.
Tables receive special handling: each table is converted to JSON, where each column header becomes a key and each cell becomes the corresponding value.
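As an illustration of the table handling described above, the following sketch converts a simple pipe-delimited Markdown table into JSON, using each column header as a key. The markdown_table_to_json helper is hypothetical and is not the converter used by the product.

```python
# Hypothetical helper illustrating the table-to-JSON idea (not the product's converter).
# Assumes a simple pipe-delimited Markdown table with a header separator row.
import json

def markdown_table_to_json(md: str) -> list[dict]:
    lines = [line.strip().strip("|") for line in md.strip().splitlines()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---|---| separator row
        cells = [cell.strip() for cell in line.split("|")]
        rows.append(dict(zip(header, cells)))  # each column header becomes a key
    return rows

table = """
| City     | Temperature |
|----------|-------------|
| Beijing  | 21 C        |
| Shanghai | 25 C        |
"""
print(json.dumps(markdown_table_to_json(table), indent=2))
```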
Key phrase extraction
Extracts key phrases that represent the core semantics of long texts.
Numerical information extraction
Extracts numerical values from text, including associated information such as temperature and price.
Abstract information extraction
User intent recognition: Identifies user intents, such as query and retrieval, text polishing, decision-making, and operational guidance.
Text summarization: Summarizes a text into a few sentences, with each sentence covering a single topic.
Sentiment classification: Determines whether the sentiment of the text is positive, negative, or neutral.
Topic classification: Identifies the topics in a text, such as sports, politics, and technology.
Role classification: Identifies the roles involved in the text, such as system, user, and doctor.
Language classification: Identifies the language of a text, such as Chinese and English.
Question generation
Generates several questions from different perspectives based on the provided text.
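To show how these semantic signals can fit together, the following sketches a combined output for a single text. The structure and field names are assumptions for illustration only, not the actual schema of Cloud Monitor 2.0.

```python
# Illustrative combined semantic-evaluation output for one piece of text.
# The structure and field names are assumptions, not Cloud Monitor 2.0's schema.
semantic_analysis = {
    "entities": [
        {"type": "organization", "text": "Alibaba Cloud"},
        {"type": "time", "text": "2024-06-01"},
    ],
    "key_phrases": ["evaluation task", "model output quality"],
    "numerical_values": [{"value": 21, "unit": "C", "context": "temperature"}],
    "intent": "operational guidance",
    "summary": "The user asks how to create an evaluation task for model outputs.",
    "sentiment": "neutral",
    "topics": ["technology"],
    "roles": ["user", "assistant"],
    "language": "English",
    "generated_questions": [
        "How do I create an evaluation task?",
        "Which scenarios does text evaluation support?",
    ],
}
```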
3. RAG evaluation
No. | Evaluation task | Score of 0 | Score of 1 |
1 | Relevance of retrieved RAG corpus to the question | Completely irrelevant | Completely relevant |
2 | Relevance of retrieved RAG corpus to the answer | Completely irrelevant | Completely relevant |
3 | Duplication in the RAG corpus | Highly duplicated | No duplication |
4 | Diversity of the RAG corpus | Lowest diversity | Highest diversity |
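For intuition about the duplication check (No. 3), the following sketch approximates duplication across retrieved chunks with an average pairwise string-similarity ratio, mapped so that 1 means no duplication. This is an illustration only, not the metric used by Cloud Monitor 2.0.

```python
# Rough illustration of a corpus-duplication signal (not Cloud Monitor 2.0's metric):
# average pairwise string similarity of retrieved chunks, mapped to a 0-1 score
# where 1 means no duplication.
from difflib import SequenceMatcher
from itertools import combinations

def duplication_score(chunks: list[str]) -> float:
    if len(chunks) < 2:
        return 1.0  # a single chunk cannot be duplicated
    sims = [SequenceMatcher(None, a, b).ratio() for a, b in combinations(chunks, 2)]
    return 1.0 - sum(sims) / len(sims)

retrieved = [
    "Cloud Monitor 2.0 supports text evaluation tasks.",
    "Cloud Monitor 2.0 supports text evaluation tasks for LLMs.",
    "Agents can call external tools during execution.",
]
print(round(duplication_score(retrieved), 2))
```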
4. Agent evaluation
No. | Evaluation task | Score of 0 | Score of 1 |
1 | Clarity of agent instructions | Unclear | Clear |
2 | Errors in agent planning | Contains errors | Correct |
3 | Complexity of the agent task | Complex | Not complex |
4 | Errors in the agent execution path | Has errors | No errors |
5 | Whether the agent achieved the goal | Goal not achieved | Goal achieved |
6 | Conciseness of the agent execution path | Not concise | Concise |
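The agent checks above are applied to an agent execution trace. A simplified, hypothetical trace record might look like the following; the structure is illustrative only and is not a required input format.

```python
# Hypothetical agent execution trace; the structure is illustrative only and
# not an input format required by Cloud Monitor 2.0.
agent_trace = {
    "instructions": "Find tomorrow's weather in Hangzhou and suggest what to wear.",
    "plan": [
        "Call the weather tool for Hangzhou, tomorrow.",
        "Summarize the forecast and recommend clothing.",
    ],
    "execution_path": [
        {"step": 1, "action": "tool_call", "tool": "get_weather",
         "arguments": {"city": "Hangzhou", "date": "tomorrow"}},
        {"step": 2, "action": "respond",
         "output": "Around 21 C and rainy; bring a light jacket and an umbrella."},
    ],
    "goal_achieved": True,
}
```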
5. Tool use evaluation
No. | Evaluation task | Score of 0 | Score of 1 |
1 | Whether the plan calls a tool | No | Yes |
2 | Whether incorrect parameters are corrected when an error is encountered | Errors not corrected | Errors corrected |
3 | Correctness of the tool call | Incorrect | Correct |
4 | Errors in tool parameters | Has errors | No errors |
5 | Efficiency of the tool call | Low efficiency | High efficiency |
6 | Appropriateness of the tool | Inappropriate | Appropriate |
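For the error-correction check (No. 2), the following sketches a pair of hypothetical tool-call records in which a failed call is retried with corrected parameters, together with a simple heuristic for detecting the correction. The record structure and the heuristic are illustrative only.

```python
# Hypothetical tool-call records illustrating the error-correction check:
# the first call fails because of a bad parameter, and the retry corrects it.
# The structure is illustrative only, not Cloud Monitor 2.0's input format.
tool_calls = [
    {
        "tool": "get_weather",
        "parameters": {"city": "Hangzou", "date": "tomorrow"},  # misspelled city
        "error": "city not found",
    },
    {
        "tool": "get_weather",
        "parameters": {"city": "Hangzhou", "date": "tomorrow"},  # corrected retry
        "result": {"temperature_c": 21, "condition": "rainy"},
    },
]

# Simple heuristic: was a failed call followed by a successful retry of the
# same tool with different parameters?
corrected = any(
    a["tool"] == b["tool"] and "error" in a and "result" in b
    and a["parameters"] != b["parameters"]
    for a, b in zip(tool_calls, tool_calls[1:])
)
print(corrected)  # True
```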