Model evaluation (ModelEval) is a tool on the PAI platform that enables comprehensive and efficient evaluation of large language models (LLMs) in specific or general scenarios. It uses authoritative public datasets or custom business datasets to quantitatively analyze model capabilities, providing data to support model selection, fine-tuning, and version iteration.
Getting started: Complete your first model evaluation in 5 minutes
This section guides you through a simple evaluation task with minimal configuration. You will evaluate the Chinese language understanding and reasoning capabilities of the Qwen3-4B model using the public CMMLU dataset.
Log on to the PAI console. In the left navigation pane, choose Model Application > Model Evaluation (ModelEval).
On the Model Evaluation page, click Create Task.
Basic Configurations: Use the system-generated Task Name and Result Output Path.
Note: If a default OSS storage path is not set for the workspace, manually select a result output path.
Configure Evaluation Mode: Select Single Model Evaluation.
Configure Evaluation Object:
Set Evaluation Object Type to Public Model.
In the Public Model drop-down list, search for and select Qwen3-4B.
Configure Evaluation Method:
Select Evaluate with Public Dataset.
In the dataset list, select CMMLU.
Configure Resources:
Resource Group Type: Select Public Resource Group (Pay-As-You-Go).
Resource Configuration Method: Select Standard Resources.
Resource Specification: Select a GPU specification, for example, ecs.gn7i-c8g1.2xlarge (24 GB). If instances of this specification are out of stock, select another GPU-accelerated instance type.
Submit the task: Click OK at the bottom of the page.
After you submit the task, the page automatically redirects to the task details page. Wait for the task status to change to Succeeded. You can then view the performance of the Qwen3-4B model on the CMMLU dataset on the Evaluation Report tab.
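In addition to the Evaluation Report tab, the evaluation outputs are written to the Result Output Path in OSS. The following is a minimal sketch that lists those files with the open source oss2 SDK; the credentials, endpoint, bucket name, and prefix are placeholders that you must replace with your own values.

```python
# Minimal sketch: list evaluation result files under the task's Result Output Path.
# Assumptions: oss2 is installed (pip install oss2); the endpoint, bucket,
# credentials, and prefix below are placeholders, not values from this guide.
import oss2

auth = oss2.Auth("<ACCESS_KEY_ID>", "<ACCESS_KEY_SECRET>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<your-bucket>")

# Iterate over every object stored under the (assumed) result output prefix.
for obj in oss2.ObjectIterator(bucket, prefix="model-eval/results/"):
    print(obj.key, obj.size)
```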
Feature details
Configure evaluation objects
ModelEval supports four sources for evaluation objects. Choose one based on where your model or service is deployed.
| Evaluation object type | Description | Scenarios |
| --- | --- | --- |
| Public Model | Models from the PAI Model Gallery | Quickly evaluate the performance of mainstream open source LLMs |
| Custom Model | Custom models registered in PAI. Important: Make sure the model is compatible with the vLLM framework. | Evaluate fine-tuned or customized models |
| PAI-EAS Service | Deployed PAI-EAS online inference services | Evaluate model services in a production environment |
| Custom Service | Any model service that complies with the OpenAI API specification | Evaluate third-party or self-built model services |
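For the Custom Service type, "complies with the OpenAI API specification" means the service exposes the standard chat completions endpoint. The following sketch, which assumes the openai Python SDK (v1+), is one way to verify this before creating an evaluation task; the base URL, token, and model name are placeholders for your own service.

```python
# Minimal sketch: verify that a self-built service speaks the OpenAI API.
# Assumptions: openai>=1.0 is installed; base_url, api_key, and model name
# are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-service-endpoint/v1",
    api_key="<YOUR_SERVICE_TOKEN>",
)

# If this chat completion call succeeds, the service exposes the endpoint
# shape that the Custom Service evaluation object type expects.
response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```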
Configure evaluation methods
You can use a custom dataset, a public dataset, or a combination of both for the evaluation.
Evaluate using a custom dataset
Evaluate with your own dataset for results that closely match your business scenarios.
Dataset format: Must be in JSONL format with UTF-8 encoding. Each line must be a single JSON object.
Dataset upload: Upload the prepared dataset file to OSS and enter its OSS path on the configuration page.
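The sketch below shows one way to build a JSONL file and upload it to OSS with the open source oss2 SDK. The field names ("question", "answer") are illustrative assumptions; check the exact schema required by your chosen evaluation method in the console. The credentials, endpoint, bucket, and object key are placeholders.

```python
# Minimal sketch: prepare a UTF-8 JSONL dataset and upload it to OSS.
# Assumptions: oss2 is installed; field names, credentials, endpoint,
# bucket, and object key are placeholders for your own environment.
import json
import oss2

records = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Translate 'good morning' into French.", "answer": "Bonjour"},
]

# JSONL with UTF-8 encoding: exactly one JSON object per line.
with open("eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

auth = oss2.Auth("<ACCESS_KEY_ID>", "<ACCESS_KEY_SECRET>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<your-bucket>")
bucket.put_object_from_file("model-eval/datasets/eval_dataset.jsonl", "eval_dataset.jsonl")
```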
| Evaluation method | General metric evaluation | LLM-as-a-Judge evaluation |
| --- | --- | --- |
| Purpose | Use this method when you have a clear ground truth. It calculates the text similarity between the model's output and the ground truth. This is suitable for tasks such as translation, summarization, and knowledge base Q&A. | Use this method when there is no single correct answer to a question, such as in open-ended conversations or content creation. A powerful judge LLM scores the quality of the model's response. |
| Dataset format | Each JSON object must contain both the question and the ground-truth answer. | Each JSON object can contain only the question. |
| Core metrics | Text similarity scores between the model's output and the ground truth, such as ROUGE and BLEU. | The system sends the question and the model's response to the judge model, which returns a quality score. |
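As a rough, local illustration of what general metric evaluation measures, the following sketch computes ROUGE similarity between a model output and a ground-truth answer. It assumes the open source rouge-score package and is not the platform's exact metric implementation, which may use a different metric set and tokenization.

```python
# Local, approximate illustration of text-similarity scoring between a model
# output and a ground truth. Assumes the open source rouge-score package
# (pip install rouge-score); PAI's built-in metrics may differ.
from rouge_score import rouge_scorer

ground_truth = "Paris is the capital of France."
model_output = "The capital of France is Paris."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(ground_truth, model_output)

for name, score in scores.items():
    # Each score exposes precision, recall, and F-measure components.
    print(f"{name}: f={score.fmeasure:.3f}")
```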
Evaluate using a public dataset
Use industry-recognized, authoritative datasets to compare your model's capabilities against industry benchmarks.
Purpose: Compare models for selection, perform pre-release benchmark testing, and evaluate a model's general capabilities.
Configuration: Select Evaluate with Public Dataset and choose one or more datasets from the list.
Supported datasets:
LiveCodeBench: Evaluates coding capabilities.
Math500: Evaluates mathematical reasoning capabilities (500 difficult math competition problems).
AIME25: Evaluates mathematical reasoning capabilities (based on problems from the 2025 American Invitational Mathematics Examination).
AIME24: Evaluates mathematical reasoning capabilities (based on problems from the 2024 American Invitational Mathematics Examination).
CMMLU: Evaluates Chinese multi-disciplinary language understanding.
MMLU: Evaluates English multi-disciplinary language understanding.
C-Eval: Evaluates comprehensive Chinese language capabilities.
GSM8K: Evaluates mathematical reasoning capabilities.
HellaSwag: Evaluates commonsense reasoning capabilities.
TruthfulQA: Evaluates truthfulness.
Task management
On the Model Evaluation page, you can manage the lifecycle of evaluation tasks.
View Report: For tasks with a status of Succeeded, click this button to view the detailed evaluation report.
Compare: Select two to five successful tasks and click the Compare button to compare their performance on various metrics side-by-side.
Stop: You can manually stop tasks that are Running. This operation is irreversible. The task cannot be resumed, and the consumed compute resources will not be refunded.
Delete: Deletes the task record. This operation cannot be undone.
Billing
The billable items for ModelEval are as follows:
Compute resources
| Resource type | Billing method | Billable entity | Billing rule |
| --- | --- | --- | --- |
| Public resources | Pay-as-you-go | Actual runtime | You are charged based on the instance specification and the actual runtime of the task. For specific instance unit prices, see the instance prices on the console page. |
| Resource quota | Subscription | The quantity and subscription duration of the purchased node specifications | Purchase dedicated resources with a subscription. You are charged based on the quantity and subscription duration of the node specifications. For more information, see AI Compute Resource Billing. |
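For pay-as-you-go tasks, charges scale with the instance unit price and the actual runtime. A back-of-the-envelope sketch, using a purely hypothetical unit price:

```python
# Back-of-the-envelope pay-as-you-go estimate. The unit price below is a purely
# hypothetical placeholder; look up the real price of your instance type in the console.
hourly_unit_price = 12.0   # hypothetical price per instance-hour
runtime_hours = 0.75       # actual task runtime, in hours

estimated_cost = hourly_unit_price * runtime_hours
print(f"Estimated cost: {estimated_cost:.2f} (in the currency shown on the console pricing page)")
```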
LLM-as-a-Judge
When you select LLM-as-a-Judge evaluation as the evaluation method, additional fees apply.