Platform For AI: Model evaluation (ModelEval)

Last Updated: Dec 18, 2025

Model evaluation (ModelEval) is a PAI tool that evaluates the performance of large language models (LLMs) in specific or general scenarios. You can use authoritative public datasets or your custom datasets to quantitatively analyze the capabilities of a model. This analysis provides data to support model selection, fine-tuning, and version iteration.

Getting started: Complete your first model evaluation in 5 minutes

This section describes how to complete a simple evaluation task with minimal configuration. You will evaluate the Qwen3-4B model using the public CMMLU dataset.

  1. Log in to the PAI console. In the left navigation pane, choose Model Application > Model Evaluation (ModelEval).

  2. On the Model Evaluation page, click Create Task.

  3. Basic Configuration: The default Task Name and Result Output Path are system-generated.

    Note

    If a default OSS storage path is not set for the workspace, you must manually specify a result output path.

  4. Configure Evaluation Object:

    • Set Evaluation Object Type to Public Model.

    • In the Public Model drop-down list, search for and select Qwen3-4B.

  5. Configure Evaluation Method:

    • Select Public Dataset Evaluation.

    • In the dataset list, select CMMLU.

  6. Configure Resources:

    • Set Resource Group Type to Public Resource Group (Pay-As-You-Go) and Resource Configuration Method to General Resource.

    • From the Resource Specification drop-down list, select a GPU specification, such as ecs.gn7i-c8g1.2xlarge (24 GB).

  7. Submit the task: Click OK at the bottom of the page.

After the task is submitted, the page automatically redirects to the task details. Once the task status changes to Succeeded, you can view the performance of the Qwen3-4B model on the CMMLU dataset on the Evaluation Report tab.

Features

Configure the evaluation object

Model evaluation supports four types of evaluation objects. Select a type based on where your model or service is deployed.

  • Public model: Models from the Model Gallery in PAI. Use this type to quickly evaluate the performance of mainstream open source large language models.

  • Custom model: Custom models registered in AI Asset Management > Models. Use this type to evaluate fine-tuned or customized models.

    Important

    Ensure that the model is compatible with the vLLM framework.

  • PAI-EAS service: Deployed PAI-EAS online inference services. Use this type to evaluate model services in a production environment.

  • Custom service: Any model service that complies with the OpenAI API specifications. Use this type to evaluate third-party or self-built model services.
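As a quick compatibility check before registering a custom service, the following sketch sends an OpenAI-style chat completions request with the openai Python SDK. The base URL, API key, and model name are placeholders for illustration only; replace them with your service's actual values.

    # Sketch: verify that a self-built service accepts OpenAI-style requests.
    # base_url, api_key, and model are placeholders, not values defined by ModelEval.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://your-service.example.com/v1",  # hypothetical endpoint
        api_key="YOUR_API_KEY",                          # hypothetical credential
    )

    response = client.chat.completions.create(
        model="your-model-name",  # placeholder model identifier
        messages=[{"role": "user", "content": "What is the capital of China?"}],
    )
    print(response.choices[0].message.content)

If this request succeeds against your service, it follows the request and response conventions that an OpenAI-compatible evaluation client expects.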

Configure the evaluation method

You can use a custom dataset, a public dataset, or a combination of both for the evaluation.

Custom dataset evaluation

Use your own dataset to evaluate your model in a way that best reflects your actual business scenarios.

  • Dataset format: The dataset must be in the JSONL format and use UTF-8 encoding. Each line must be a single JSON object.

  • Dataset upload: You must upload the prepared dataset file to OSS and enter its OSS path on the configuration page.
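The following is a minimal sketch that writes a dataset in the required format (UTF-8, one JSON object per line) and uploads it to OSS with the oss2 Python SDK. The bucket name, endpoint, credentials, and object path are placeholders; the resulting OSS path is what you enter on the configuration page.

    # Sketch: build a JSONL dataset and upload it to OSS with the oss2 SDK.
    # Bucket name, endpoint, credentials, and object path are placeholders.
    import json
    import oss2

    samples = [
        {"question": "What is the capital of China?", "answer": "Beijing"},
        {"question": "Please describe the history of artificial intelligence"},
    ]

    # Write one JSON object per line, UTF-8 encoded.
    with open("eval_dataset.jsonl", "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")

    auth = oss2.Auth("YOUR_ACCESS_KEY_ID", "YOUR_ACCESS_KEY_SECRET")
    bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "your-bucket")
    bucket.put_object_from_file("datasets/eval_dataset.jsonl", "eval_dataset.jsonl")
    # Enter oss://your-bucket/datasets/eval_dataset.jsonl on the configuration page.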

General metric evaluation

  • Use cases: Use this method when you have clear, standard answers. It calculates the text similarity between the model's output and the standard answer. This is suitable for tasks such as translation, summarization, and knowledge base Q&A.

  • Dataset format: Each JSON object must contain the question and answer (standard answer) fields. For example:

    {"question": "What is the capital of China?", "answer": "Beijing"}

  • Core metrics (an illustrative computation sketch follows this comparison):

    • ROUGE (ROUGE-1, ROUGE-2, ROUGE-L): Based on recall, this metric measures how many information points from the standard answer are covered by the model's output.

    • BLEU (BLEU-1, BLEU-2, BLEU-3, BLEU-4): Based on precision, this metric measures how much of the model's output matches the standard answer.

Judge model evaluation

  • Use cases: Use this method when there is no single standard answer, such as in open-ended conversations or content creation. A powerful "judge model" scores the quality of the evaluated model's responses.

  • Dataset format: Each JSON object must contain the question field and can optionally include an answer (standard answer) field. For example:

    {"question": "Please describe the history of artificial intelligence"}

  • Scoring: The system sends the question and the output from the model being evaluated to the judge model. The judge model then provides a comprehensive score based on multiple dimensions, such as relevance, accuracy, and fluency.
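To build intuition for the general metrics above, the following sketch computes ROUGE and BLEU for a single prediction against a standard answer using the open source rouge-score and nltk packages. This is for illustration only; ModelEval computes these metrics itself, and its exact scoring configuration may differ.

    # Sketch: compute ROUGE and BLEU for one prediction against a reference,
    # using the open source rouge-score and nltk packages (illustrative only).
    from rouge_score import rouge_scorer
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "Beijing is the capital of China"
    prediction = "The capital of China is Beijing"

    # ROUGE: recall-oriented overlap between the prediction and the reference.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, prediction)
    print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

    # BLEU: precision-oriented n-gram overlap (BLEU-1 through BLEU-4).
    smooth = SmoothingFunction().method1
    ref_tokens = [reference.split()]
    pred_tokens = prediction.split()
    for n in range(1, 5):
        weights = tuple(1.0 / n for _ in range(n))
        score = sentence_bleu(ref_tokens, pred_tokens, weights=weights,
                              smoothing_function=smooth)
        print(f"BLEU-{n}: {score:.3f}")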

Public dataset evaluation

Use industry-recognized and authoritative datasets to compare the capabilities of your model against industry benchmarks.

  • Use cases: Comparing models for selection, performing pre-release benchmark testing, and evaluating the general capabilities of a model.

  • Configuration: Select Public Dataset Evaluation and choose one or more datasets from the list.

  • Supported datasets:

    • LiveCodeBench: Evaluates code processing capabilities.

    • Math500: Evaluates mathematical reasoning capabilities (500 difficult math competition problems).

    • AIME25: Evaluates mathematical reasoning capabilities (based on problems from the 2025 American Invitational Mathematics Examination).

    • AIME24: Evaluates mathematical reasoning capabilities (based on problems from the 2024 American Invitational Mathematics Examination).

    • CMMLU: Evaluates Chinese multi-disciplinary language understanding.

    • MMLU: Evaluates English multi-disciplinary language understanding.

    • C-Eval: Evaluates comprehensive Chinese language capabilities.

    • GSM8K: Evaluates mathematical reasoning capabilities.

    • HellaSwag: Evaluates commonsense reasoning capabilities.

    • TruthfulQA: Evaluates truthfulness.

Task management

On the Model Evaluation page, you can manage the lifecycle of your evaluation tasks.

  • View Report: For tasks with a status of Succeeded, click this button to view the detailed evaluation report.

  • Compare: You can select 2 to 5 successful tasks and click the Compare button to perform a side-by-side comparison of their performance on various metrics.

  • Stop: You can manually stop tasks that are Running. This operation is irreversible. The task cannot be resumed, and the consumed compute resources will not be refunded.

  • Delete: Deletes the task record. This operation cannot be undone.