Cloud Monitor: Large model evaluation

Last Updated: Sep 29, 2025

Cloud Monitor 2.0 supports the evaluation of text content, such as the inputs and outputs of large language models (LLMs) and the tool calls of agents. This evaluation involves a multi-faceted analysis of the outputs, behaviors, and performance of LLMs. You can create evaluation tasks, view the task list, and review the results. The results include detailed scores, input semantic analysis, topic distribution analysis, and a scoring dashboard.

The evaluation uses an LLM as the evaluator to provide a conclusion for each task.
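For illustration only, an evaluator call for a single score-type task might look like the following sketch. The `call_llm` helper, the prompt wording, and the returned fields are assumptions made for the example, not the Cloud Monitor API.

```python
import json

def evaluate_sample(question: str, answer: str, task: str, call_llm) -> dict:
    """Score one question/answer pair for a single evaluation task.

    `call_llm` is a hypothetical callable that sends a prompt to an LLM
    endpoint and returns its text completion; it is not a Cloud Monitor API.
    """
    prompt = (
        f"You are an evaluator. Evaluation task: {task}.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        'Reply with JSON containing a "score" between 0 and 1 '
        'and a short "explanation".'
    )
    # Example output: {"score": 0.8, "explanation": "..."}
    return json.loads(call_llm(prompt))
```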

Evaluation task descriptions

  • Evaluation tasks are divided into two types based on their result format:

    • Score-based results: the result is a score with an accompanying explanation.

    • Semantic results: the result is the original content enriched with semantic information, such as topics and summaries (see the sketch after this list).

  • Evaluation tasks are categorized into the following scenarios:

    • General scenario evaluation

    • Semantic evaluation

    • RAG evaluation

    • Agent evaluation

    • Tool use evaluation
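To make the two result formats concrete, the sketches below show what each might contain. The field names are illustrative assumptions, not the product's actual schema.

```python
# Score-type result: a score plus an explanation (field names are illustrative).
score_result = {
    "task": "Hallucination",
    "score": 0.0,  # 0 = attention required, 1 = no attention required
    "explanation": "The answer cites a source that does not appear in the context.",
}

# Semantic result: the original content enriched with semantic information.
semantic_result = {
    "task": "Topic classification",
    "topics": ["technology"],
    "summary": "The user asks how to configure alert rules for an application.",
}
```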

1. General scenario evaluation

A score of 0 indicates that attention is required. A score of 1 indicates that no attention is required. A score between 0 and 1 indicates that partial attention is required.

| No. | Evaluation task | Score of 0 | Score of 1 |
| --- | --- | --- | --- |
| 1 | Accuracy | Completely inaccurate | Completely accurate |
| 2 | Calculator correctness | Completely incorrect | Completely correct |
| 3 | Conciseness | Not concise | Completely concise |
| 4 | Contains code | Contains code | Does not contain code |
| 5 | Contains personally identifiable information | Contains personally identifiable information | Does not contain personally identifiable information |
| 6 | Contextual relevance | Completely irrelevant | Completely relevant |
| 7 | Taboo words | Contains taboo words | Does not contain taboo words |
| 8 | Hallucination | Hallucination is present | No hallucination |
| 9 | Hate speech | Contains hate speech | Does not contain hate speech |
| 10 | Usefulness | Completely useless | Very useful |
| 11 | Language detector | Cannot detect language | Accurately detects language |
| 12 | Open source | Is open source | Is not open source |
| 13 | Question is related to Python | Related to Python | Not related to Python |
| 16 | Toxicity | Is toxic | Is not toxic |
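Following the attention semantics above, a returned score might be bucketed for triage as in the minimal sketch below. The bucketing itself is an assumption made for illustration, not part of the product.

```python
def attention_level(score: float) -> str:
    """Map a 0-1 evaluation score to an attention bucket.

    Follows the semantics above: 0 needs attention, 1 needs none,
    and anything in between needs partial attention.
    """
    if score <= 0.0:
        return "attention required"
    if score >= 1.0:
        return "no attention required"
    return "partial attention required"

assert attention_level(0.0) == "attention required"
assert attention_level(0.4) == "partial attention required"
assert attention_level(1.0) == "no attention required"
```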

2. Semantic evaluation

Semantic evaluation involves understanding and processing the semantics of data. It includes the following features.

  • Named Entity Recognition (NER)

    Extracts entities from text, such as names of people, places, organizations, and companies; time expressions; monetary amounts; percentages; legal documents; countries, regions, and political entities; natural phenomena; works of art; events; languages; titles; images; and links.

  • Format information extraction

    Extracts content such as titles, lists, emphasized fonts (bold or italic), link names and URLs, image addresses, code blocks, and tables from Markdown or other text formats.

    Performs special processing on tables by converting each table into JSON format, where each column name becomes a key and each cell becomes its value (see the sketch after this list).

  • Key phrase extraction

    Extracts key phrases that represent the core semantics of long texts.

  • Numerical information extraction

    Extracts numerical values from text along with their associated information, such as temperatures and prices.

  • Abstract information extraction

    • User intent recognition: Identifies user intents, such as query and retrieval, text polishing, decision-making, and operational guidance.

    • Text summarization: Summarizes a text into a few sentences, with each sentence covering a single topic.

    • Sentiment classification: Determines whether the sentiment of the text is positive, negative, or neutral.

    • Topic classification: Identifies the topics in a text, such as sports, politics, and technology.

    • Role classification: Identifies the roles involved in the text, such as system, user, and doctor.

    • Language classification: Identifies the language of a text, such as Chinese and English.

  • Question generation

    Generates several questions from different perspectives based on the provided text.
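As a rough illustration of the table handling described under format information extraction above, the sketch below converts a Markdown table into records keyed by column name. The actual output schema used by Cloud Monitor may differ.

```python
import json

def markdown_table_to_json(md: str) -> str:
    """Convert a simple Markdown table into JSON records keyed by column name."""
    lines = [ln.strip() for ln in md.strip().splitlines() if ln.strip()]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---|---| separator row
        cells = [c.strip() for c in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return json.dumps(rows, ensure_ascii=False, indent=2)

table = """
| Region    | Latency (ms) |
| --------- | ------------ |
| Singapore | 42           |
| Frankfurt | 88           |
"""
print(markdown_table_to_json(table))
# [{"Region": "Singapore", "Latency (ms)": "42"}, ...]
```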

3. RAG evaluation

| No. | Evaluation task | Score of 0 | Score of 1 |
| --- | --- | --- | --- |
| 1 | Relevance of retrieved RAG corpus to the question | Completely irrelevant | Completely relevant |
| 2 | Relevance of retrieved RAG corpus to the answer | Completely irrelevant | Completely relevant |
| 3 | Duplication in the RAG corpus | Highly duplicated | No duplication |
| 4 | Diversity of the RAG corpus | Lowest diversity | Highest diversity |
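As an illustration of how a relevance-type RAG task could be posed to the evaluator LLM, the sketch below builds a judge prompt and parses the score. The prompt wording and the `call_llm` helper are assumptions, not the product's internal prompt.

```python
def judge_corpus_relevance(question: str, retrieved_chunks: list[str], call_llm) -> float:
    """Ask an evaluator LLM how relevant the retrieved corpus is to the question.

    `call_llm` is a hypothetical callable returning the model's text output.
    """
    corpus = "\n---\n".join(retrieved_chunks)
    prompt = (
        "Rate how relevant the retrieved passages are to the question "
        "on a scale from 0 (completely irrelevant) to 1 (completely relevant). "
        "Reply with only the number.\n"
        f"Question: {question}\n"
        f"Passages:\n{corpus}"
    )
    return float(call_llm(prompt).strip())
```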

4. Agent evaluation

| No. | Evaluation task | Score of 0 | Score of 1 |
| --- | --- | --- | --- |
| 1 | Clarity of agent instructions | Unclear | Clear |
| 2 | Errors in agent planning | Contains errors | Correct |
| 3 | Complexity of the agent task | Complex | Not complex |
| 4 | Errors in the agent execution path | Has errors | No errors |
| 5 | Whether the agent achieved the goal | Goal not achieved | Goal achieved |
| 6 | Conciseness of the agent execution path | Not concise | Concise |

5. Tool use evaluation

| No. | Evaluation task | Score of 0 | Score of 1 |
| --- | --- | --- | --- |
| 1 | Whether the plan calls a tool | No | Yes |
| 2 | Whether incorrect parameters are corrected when an error is encountered | Errors not corrected | Errors corrected |
| 3 | Correctness of the tool call | Incorrect | Correct |
| 4 | Errors in tool parameters | Has errors | No errors |
| 5 | Efficiency of the tool call | Low efficiency | High efficiency |
| 6 | Appropriateness of the tool | Inappropriate | Appropriate |
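The tool use tasks above score records from an agent's tool-call trace. A purely illustrative shape for one such record, with assumed field names, might be:

```python
# A purely illustrative tool-call record; the actual trace format may differ.
tool_call_record = {
    "plan": "Look up the current weather before answering.",
    "tool_name": "get_weather",                      # was a tool called at all? (task 1)
    "arguments": {"city": "Hangzhou", "unit": "C"},  # parameter errors? (task 4)
    "retried_after_error": False,                    # were bad parameters corrected? (task 2)
    "result": {"temperature": 21},
    "latency_ms": 340,                               # efficiency of the call (task 5)
}
```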