Model evaluation (ModelEval) is a PAI tool that evaluates the performance of large language models (LLMs) in specific or general scenarios. You can use authoritative public datasets or your custom datasets to quantitatively analyze the capabilities of a model. This analysis provides data to support model selection, fine-tuning, and version iteration.
Getting started: Complete your first model evaluation in 5 minutes
This section describes how to complete a simple evaluation task with minimal configuration. You will evaluate the Qwen3-4B model using the public CMMLU dataset.
Log in to the PAI console. In the left navigation pane, choose Model Application > Model Evaluation (ModelEval).
On the Model Evaluation page, click Create Task.
Basic Configuration: The default Task Name and Result Output Path are system-generated.
Note: If a default OSS storage path is not set for the workspace, you must manually specify a result output path.
Configure Evaluation Object:
Set Evaluation Object Type to Public Model.
In the Public Model drop-down list, search for and select Qwen3-4B.
Configure Evaluation Method:
Select Public Dataset Evaluation.
In the dataset list, select CMMLU.
Configure Resources:
Set Resource Group Type to Public Resource Group (Pay-As-You-Go) and Resource Configuration Method to General Resource.
From the Resource Specification drop-down list, select a GPU specification, such as ecs.gn7i-c8g1.2xlarge (24 GB).
Submit the task: Click OK at the bottom of the page.
After you submit the task, the page automatically redirects to the task details page. When the task status changes to Succeeded, open the Evaluation Report tab to view the performance of the Qwen3-4B model on the CMMLU dataset.
Features
Configure the evaluation object
Model evaluation supports four types of evaluation object sources. Select a source based on the deployment location of your model or service.
Evaluation object type | Description | Scenarios |
Public model | Models from the Model Gallery in PAI | Quickly evaluate the performance of mainstream open source large language models |
Custom model | Custom models registered in PAI. Important: Ensure that the model is compatible with the vLLM framework. | Evaluate fine-tuned or customized models |
PAI-EAS service | Deployed PAI-EAS online inference services | Evaluate model services in a production environment |
Custom service | Any model service that complies with the OpenAI API specifications | Evaluate third-party or self-built model services |
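For the Custom service type, "complies with the OpenAI API specifications" means the service exposes an OpenAI-style chat completions endpoint. The following is a minimal sketch of such a request; the base URL, API key, and model name are placeholders, not values from this document.

```python
import requests

# Placeholder values: replace with your own service endpoint, token, and model name.
BASE_URL = "https://your-service-endpoint.example.com/v1"
API_KEY = "your-api-key"

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "your-model-name",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
    },
    timeout=60,
)
resp.raise_for_status()

# An OpenAI-compatible service returns the reply under choices[0].message.content.
print(resp.json()["choices"][0]["message"]["content"])
```

If your service answers requests of this shape, it can be registered as a custom service evaluation object.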
Configure the evaluation method
You can use a custom dataset, a public dataset, or a combination of both for the evaluation.
Custom dataset evaluation
Use your own dataset to evaluate your model in a way that best reflects your actual business scenarios.
Dataset format: The dataset must be in the JSONL format and use UTF-8 encoding. Each line must be a single JSON object.
Dataset upload: You must upload the prepared dataset file to OSS and enter its OSS path on the configuration page.
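As a rough sketch of the upload step, the following uses the oss2 SDK to push a prepared JSONL file to OSS. The AccessKey pair, endpoint, bucket name, and object path are placeholders; replace them with your own values.

```python
import oss2

# Placeholder credentials and locations: replace with your own values.
auth = oss2.Auth("<AccessKey ID>", "<AccessKey Secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "your-bucket-name")

# Upload the prepared JSONL dataset. The resulting OSS path
# (oss://your-bucket-name/eval/custom_dataset.jsonl) is what you enter
# on the evaluation configuration page.
bucket.put_object_from_file("eval/custom_dataset.jsonl", "custom_dataset.jsonl")
```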
Evaluation method | General metric evaluation | Judge model evaluation |
Use cases | Use this method when you have clear, standard answers. It calculates the text similarity between the model's output and the standard answer. This is suitable for tasks such as translation, summarization, and knowledge base Q&A. | Use this method when there is no single standard answer to a question, such as in open-ended conversations or content creation. A powerful "judge model" is used to score the quality of the model's response. |
Dataset format | Each JSON object must contain both the question and its standard answer. | Each JSON object can contain only the question; no standard answer is required. |
Core metrics | Text-similarity scores between the model's output and the standard answer. | The system sends the question and the model's response to the judge model, which returns a quality score. |
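The snippet below writes one example record for each evaluation method to a UTF-8 encoded JSONL file. The field names "question" and "answer" are illustrative assumptions; check the dataset format description on the configuration page for the exact keys your task expects.

```python
import json

# Illustrative records; the field names "question" and "answer" are assumptions.
# In practice a dataset uses one format consistently; both shapes are shown for comparison.
general_metric_record = {
    "question": "Translate to English: 今天天气很好。",
    "answer": "The weather is nice today.",  # standard answer required for general metric evaluation
}
judge_model_record = {
    "question": "Write a short product description for a travel mug.",  # no standard answer needed
}

# Each line of the JSONL file is a single JSON object, UTF-8 encoded.
with open("custom_dataset.jsonl", "w", encoding="utf-8") as f:
    for record in (general_metric_record, judge_model_record):
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```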
Public dataset evaluation
Use industry-recognized and authoritative datasets to compare the capabilities of your model against industry benchmarks.
Use cases: Comparing models for selection, performing pre-release benchmark testing, and evaluating the general capabilities of a model.
Configuration: Select Public Dataset Evaluation and choose one or more datasets from the list.
Supported datasets:
LiveCodeBench: Evaluates coding capabilities.
Math500: Evaluates mathematical reasoning capabilities (500 difficult math competition problems).
AIME25: Evaluates mathematical reasoning capabilities (based on problems from the 2025 American Invitational Mathematics Examination).
AIME24: Evaluates mathematical reasoning capabilities (based on problems from the 2024 American Invitational Mathematics Examination).
CMMLU: Evaluates Chinese multi-disciplinary language understanding.
MMLU: Evaluates English multi-disciplinary language understanding.
C-Eval: Evaluates comprehensive Chinese language capabilities.
GSM8K: Evaluates mathematical reasoning capabilities.
HellaSwag: Evaluates commonsense reasoning capabilities.
TruthfulQA: Evaluates truthfulness.
Task management
On the Model Evaluation page, you can manage the lifecycle of your evaluation tasks.
View Report: For tasks with a status of Succeeded, click this button to view the detailed evaluation report.
Compare: You can select 2 to 5 successful tasks and click the Compare button to perform a side-by-side comparison of their performance on various metrics.
Stop: You can manually stop tasks that are Running. This operation is irreversible. The task cannot be resumed, and the consumed compute resources will not be refunded.
Delete: Deletes the task record. This operation cannot be undone.