ModelEval evaluates large language models (LLMs) against public or custom datasets, providing quantitative metrics for model selection, fine-tuning, and iteration.
Getting started: Complete your first model evaluation in 5 minutes
This walkthrough evaluates Qwen3-4B for Chinese comprehension and inference using the public CMMLU dataset with minimal configuration.
-
Log on to the PAI console. In the left navigation pane, choose Model Application > Model Evaluation (ModelEval).
-
On the Model Evaluation page, click Create Task.
-
Basic Configurations: Use the system-generated Task Name and Result Output Path.
NoteIf a default OSS storage path is not set for the workspace, manually select a result output path.
-
Configure Evaluation Mode: Select Single Model Evaluation.
-
Configure Evaluation Object:
-
Set Evaluation Object Type to Public Model.
-
In the Public Model drop-down list, search for and select
Qwen3-4B.
-
-
Configure Evaluation Method:
-
Select Evaluate with Public Dataset.
-
In the dataset list, select CMMLU.
-
-
Configure Resources:
-
Resource Group Type: Select Public Resource Group (Pay-As-You-Go).
-
Resource Configuration Method: Select Standard Resources.
-
Resource Specification: Select a GPU specification, for example
ecs.gn7i-c8g1.2xlarge(24 GB).If instances of this specification are out of stock, select another GPU-accelerated instance.
-
-
Submit the task: Click OK at the bottom of the page.
After submission, the page redirects to task details. When the status changes to Succeeded, view Qwen3-4B CMMLU performance on the Evaluation Report tab.
Evaluation configuration
Configure evaluation objects
ModelEval supports four evaluation object sources. Choose based on where your model or service is deployed.
|
Evaluation object type |
Description |
Scenarios |
|
Public Model |
Models from the PAI Model Gallery |
Quickly evaluate the performance of mainstream open source LLMs |
|
Custom Model |
Custom models registered in Important
Make sure the model is compatible with the vLLM framework. |
Evaluate fine-tuned or customized models |
|
PAI-EAS Service |
Deployed PAI-EAS online inference services |
Evaluate model services in a production environment |
|
Custom Service |
Any model service that complies with the OpenAI API specification |
Evaluate third-party or self-built model services |
Configure evaluation methods
Use a custom dataset, a public dataset, or both.
Evaluate using a custom dataset
Use your own dataset for results that match your business scenarios.
-
Dataset format: Must be in JSONL format with UTF-8 encoding. Each line must be a single JSON object.
-
Dataset upload: Upload the prepared dataset file to OSS and enter its OSS path on the configuration page.
|
Evaluation method |
General metric evaluation |
LLM-as-a-Judge evaluation |
|
Purpose |
Best when you have clear ground truth. Calculates text similarity between model output and ground truth. Suitable for translation, summarization, and knowledge base Q&A. |
Best when no single correct answer exists, such as open-ended conversations or content creation. A powerful LLM judge scores the quality of model responses. |
|
Dataset format |
The JSON object must contain the
|
The JSON object can contain only the
|
|
Core metrics |
|
Sends the |
Evaluate using a public dataset
Use authoritative public datasets to benchmark your model against industry standards.
-
Purpose: Compare models for selection, perform pre-release benchmark testing, and evaluate a model's general capabilities.
-
Configuration: Select Evaluate with Public Dataset and choose one or more datasets from the list.
-
Supported datasets:
-
LiveCodeBench: Evaluates code processing capabilities.
-
Math500: Evaluates mathematical reasoning capabilities (500 difficult math competition problems).
-
AIME25: Evaluates mathematical reasoning capabilities (based on problems from the 2025 American Invitational Mathematics Examination).
-
AIME24: Evaluates mathematical reasoning capabilities (based on problems from the 2024 American Invitational Mathematics Examination).
-
CMMLU: Evaluates Chinese multi-disciplinary language understanding.
-
MMLU: Evaluates English multi-disciplinary language understanding.
-
C-Eval: Evaluates comprehensive Chinese language capabilities.
-
GSM8K: Evaluates mathematical reasoning capabilities.
-
HellaSwag: Evaluates commonsense reasoning capabilities.
-
TruthfulQA: Evaluates truthfulness.
-
Task management
Manage evaluation task lifecycle on the Model Evaluation page.
-
View Report: View the detailed evaluation report for Succeeded tasks.
-
Compare: Select two to five successful tasks to compare performance across metrics side-by-side.
-
Stop: Manually stop Running tasks. This is irreversible — the task cannot be resumed and consumed compute resources are not refunded.
-
Delete: Permanently deletes the task record.
Billing
ModelEval incurs the following costs:
Compute resources
|
Resource type |
Billing method |
Billable entity |
Billing rule |
|
Public resources |
Pay-as-you-go |
Actual runtime. |
Check instance unit prices on the console page. |
|
Resource quota |
Subscription |
Quantity and duration of purchased node specifications. |
Purchase dedicated resources with a subscription. Charges are based on the quantity and duration of node specifications. AI Compute Resource Billing. |
LLM-as-a-Judge
LLM-as-a-Judge evaluation incurs additional fees.