Cloud Monitor 2.0 supports the evaluation of text content, such as the inputs and outputs of large language models (LLMs) and the tool calls of agents. This evaluation involves a multi-faceted analysis of the outputs, behaviors, and performance of LLMs. You can create evaluation tasks, view the task list, and review the results. The results include detailed scores, input semantic analysis, topic distribution analysis, and a scoring dashboard.
The evaluation uses an LLM as the evaluator to provide a conclusion for each task.
Evaluation task descriptions
Evaluation tasks are divided into two types based on their result format:
Scored evaluation: The result is a score with an accompanying explanation.
Semantic evaluation: The result enriches the original content with semantic information, such as topics and summaries.
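For illustration, the two result shapes might look like the following Python data. The field names are placeholders and do not represent the actual output schema of Cloud Monitor 2.0.

```python
# Illustrative result shapes only; the field names are placeholders,
# not the actual output schema of Cloud Monitor 2.0.
scored_result = {
    "task": "Hallucination",
    "score": 0.2,  # 0 = attention required, 1 = no attention required
    "explanation": "The answer cites a product version that does not exist.",
}

semantic_result = {
    "topics": ["cloud monitoring", "LLM evaluation"],
    "summary": "The user asks how to create an evaluation task for model outputs.",
    "sentiment": "neutral",
}
```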
Evaluation tasks are categorized into the following scenarios:
General scenario evaluation
Semantic evaluation
RAG evaluation
Agent evaluation
Tool use evaluation
1. General scenario evaluation
A score of 0 indicates that attention is required. A score of 1 indicates that no attention is required. A score between 0 and 1 indicates that partial attention is required.
No. | Evaluation task | Score of 0 | Score of 1 |
1 | Accuracy | Completely inaccurate | Completely accurate |
2 | Calculator correctness | Completely incorrect | Completely correct |
3 | Conciseness | Not concise | Completely concise |
4 | Contains code | Contains code | Does not contain code |
5 | Contains personally identifiable information | Contains personally identifiable information | Does not contain personally identifiable information |
6 | Contextual relevance | Completely irrelevant | Completely relevant |
7 | Taboo words | Contains taboo words | Does not contain taboo words |
8 | Hallucination | Hallucination is present | No hallucination |
9 | Hate speech | Contains hate speech | Does not contain hate speech |
10 | Usefulness | Completely useless | Very useful |
11 | Language detector | Cannot detect language | Accurately detects language |
12 | Open source | Is open source | Is not open source |
13 | Question is related to Python | Related to Python | Not related to Python |
14 | Toxicity | Is toxic | Is not toxic |
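The general scenario checks are performed by an evaluator LLM. The following is a minimal sketch of how such an LLM-as-judge check can be implemented for the accuracy task, assuming an OpenAI-compatible chat endpoint. The prompt, model name, and judge_accuracy helper are illustrative and are not the internal implementation of Cloud Monitor 2.0.

```python
# Minimal LLM-as-judge sketch (illustrative only; not Cloud Monitor 2.0's internal logic).
# Assumes an OpenAI-compatible endpoint; the API key is read from the environment.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are an evaluator. Given a question and an answer, rate the answer's "
    "accuracy on a scale from 0 (completely inaccurate) to 1 (completely accurate). "
    'Respond with JSON: {"score": <float between 0 and 1>, "explanation": "<one sentence>"}'
)

def judge_accuracy(question: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# A score near 0 flags the answer for attention.
print(judge_accuracy("What is 2 + 2?", "5"))
```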
2. Semantic evaluation
Semantic evaluation involves understanding and processing the semantics of data. It includes the following features.
Named Entity Recognition (NER)
Extracts entities from text, such as names of people, places, organizations, and companies; time expressions; monetary amounts; percentages; legal documents; countries, regions, and political entities; natural phenomena; works of art; events; languages; titles; images; and links.
Format information extraction
Extracts content such as titles, lists, emphasized fonts (bold or italic), link names and URLs, image addresses, code blocks, and tables from Markdown or other text formats.
Tables receive special handling: each table is converted to JSON, where each column header becomes a key and each cell becomes the corresponding value.
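As an illustration of the table handling described above, the following sketch converts a simple pipe-delimited Markdown table into JSON, using each column header as a key. The markdown_table_to_json helper is hypothetical and is not the converter used by the product.

```python
# Hypothetical helper illustrating the table-to-JSON idea (not the product's converter).
# Assumes a simple pipe-delimited Markdown table with a header separator row.
import json

def markdown_table_to_json(md: str) -> list[dict]:
    lines = [line.strip().strip("|") for line in md.strip().splitlines()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---|---| separator row
        cells = [cell.strip() for cell in line.split("|")]
        rows.append(dict(zip(header, cells)))  # each column header becomes a key
    return rows

table = """
| City     | Temperature |
|----------|-------------|
| Beijing  | 21 C        |
| Shanghai | 25 C        |
"""
print(json.dumps(markdown_table_to_json(table), indent=2))
```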
Key phrase extraction
Extracts key phrases that represent the core semantics of long texts.
Numerical information extraction
Extracts numerical values from text, including associated information such as temperature and price.
Abstract information extraction
User intent recognition: Identifies user intents, such as query and retrieval, text polishing, decision-making, and operational guidance.
Text summarization: Summarizes a text into a few sentences, with each sentence covering a single topic.
Sentiment classification: Determines whether the sentiment of the text is positive, negative, or neutral.
Topic classification: Identifies the topics in a text, such as sports, politics, and technology.
Role classification: Identifies the roles involved in the text, such as system, user, and doctor.
Language classification: Identifies the language of a text, such as Chinese and English.
Question generation
Generates several questions from different perspectives based on the provided text.
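To show how these semantic signals can fit together, the following sketches a combined output for a single text. The structure and field names are assumptions for illustration only, not the actual schema of Cloud Monitor 2.0.

```python
# Illustrative combined semantic-evaluation output for one piece of text.
# The structure and field names are assumptions, not Cloud Monitor 2.0's schema.
semantic_analysis = {
    "entities": [
        {"type": "organization", "text": "Alibaba Cloud"},
        {"type": "time", "text": "2024-06-01"},
    ],
    "key_phrases": ["evaluation task", "model output quality"],
    "numerical_values": [{"value": 21, "unit": "C", "context": "temperature"}],
    "intent": "operational guidance",
    "summary": "The user asks how to create an evaluation task for model outputs.",
    "sentiment": "neutral",
    "topics": ["technology"],
    "roles": ["user", "assistant"],
    "language": "English",
    "generated_questions": [
        "How do I create an evaluation task?",
        "Which scenarios does text evaluation support?",
    ],
}
```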
3. RAG evaluation
No. | Evaluation task | Score of 0 | Score of 1 |
1 | Relevance of retrieved RAG corpus to the question | Completely irrelevant | Completely relevant |
2 | Relevance of retrieved RAG corpus to the answer | Completely irrelevant | Completely relevant |
3 | Duplication in the RAG corpus | Highly duplicated | No duplication |
4 | Diversity of the RAG corpus | Lowest diversity | Highest diversity |
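For intuition about the duplication check (No. 3), the following sketch approximates duplication across retrieved chunks with an average pairwise string-similarity ratio, mapped so that 1 means no duplication. This is an illustration only, not the metric used by Cloud Monitor 2.0.

```python
# Rough illustration of a corpus-duplication signal (not Cloud Monitor 2.0's metric):
# average pairwise string similarity of retrieved chunks, mapped to a 0-1 score
# where 1 means no duplication.
from difflib import SequenceMatcher
from itertools import combinations

def duplication_score(chunks: list[str]) -> float:
    if len(chunks) < 2:
        return 1.0  # a single chunk cannot be duplicated
    sims = [SequenceMatcher(None, a, b).ratio() for a, b in combinations(chunks, 2)]
    return 1.0 - sum(sims) / len(sims)

retrieved = [
    "Cloud Monitor 2.0 supports text evaluation tasks.",
    "Cloud Monitor 2.0 supports text evaluation tasks for LLMs.",
    "Agents can call external tools during execution.",
]
print(round(duplication_score(retrieved), 2))
```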
4. Agent evaluation
No. | Evaluation task | Score of 0 | Score of 1 |
1 | Clarity of agent instructions | Unclear | Clear |
2 | Errors in agent planning | Contains errors | Correct |
3 | Complexity of the agent task | Complex | Not complex |
4 | Errors in the agent execution path | Has errors | No errors |
5 | Whether the agent achieved the goal | Goal not achieved | Goal achieved |
6 | Conciseness of the agent execution path | Not concise | Concise |
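The agent checks above are applied to an agent execution trace. A simplified, hypothetical trace record might look like the following; the structure is illustrative only and is not a required input format.

```python
# Hypothetical agent execution trace; the structure is illustrative only and
# not an input format required by Cloud Monitor 2.0.
agent_trace = {
    "instructions": "Find tomorrow's weather in Hangzhou and suggest what to wear.",
    "plan": [
        "Call the weather tool for Hangzhou, tomorrow.",
        "Summarize the forecast and recommend clothing.",
    ],
    "execution_path": [
        {"step": 1, "action": "tool_call", "tool": "get_weather",
         "arguments": {"city": "Hangzhou", "date": "tomorrow"}},
        {"step": 2, "action": "respond",
         "output": "Around 21 C and rainy; bring a light jacket and an umbrella."},
    ],
    "goal_achieved": True,
}
```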
5. Tool use evaluation
No. | Evaluation task | Score of 0 | Score of 1 |
1 | Whether the plan calls a tool | No | Yes |
2 | Whether incorrect parameters are corrected when an error is encountered | Errors not corrected | Errors corrected |
3 | Correctness of the tool call | Incorrect | Correct |
4 | Errors in tool parameters | Has errors | No errors |
5 | Efficiency of the tool call | Low efficiency | High efficiency |
6 | Appropriateness of the tool | Inappropriate | Appropriate |
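For the error-correction check (No. 2), the following sketches a pair of hypothetical tool-call records in which a failed call is retried with corrected parameters, together with a simple heuristic for detecting the correction. The record structure and the heuristic are illustrative only.

```python
# Hypothetical tool-call records illustrating the error-correction check:
# the first call fails because of a bad parameter, and the retry corrects it.
# The structure is illustrative only, not Cloud Monitor 2.0's input format.
tool_calls = [
    {
        "tool": "get_weather",
        "parameters": {"city": "Hangzou", "date": "tomorrow"},  # misspelled city
        "error": "city not found",
    },
    {
        "tool": "get_weather",
        "parameters": {"city": "Hangzhou", "date": "tomorrow"},  # corrected retry
        "result": {"temperature_c": 21, "condition": "rainy"},
    },
]

# Simple heuristic: was a failed call followed by a successful retry of the
# same tool with different parameters?
corrected = any(
    a["tool"] == b["tool"] and "error" in a and "result" in b
    and a["parameters"] != b["parameters"]
    for a, b in zip(tool_calls, tool_calls[1:])
)
print(corrected)  # True
```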