PAI ModelEval - evaluate and benchmark LLMs - Platform For AI

Getting started: Complete your first model evaluation in 5 minutes

This walkthrough evaluates Qwen3-4B for Chinese comprehension and inference using the public CMMLU dataset with minimal configuration.

Log on to the PAI console. In the left navigation pane, choose Model Application > Model Evaluation (ModelEval).
On the Model Evaluation page, click Create Task.
Basic Configurations: Use the system-generated Task Name and Result Output Path.

Note
If a default OSS storage path is not set for the workspace, manually select a result output path.
Configure Evaluation Mode: Select Single Model Evaluation.
Configure Evaluation Object:
- Set Evaluation Object Type to Public Model.
- In the Public Model drop-down list, search for and select Qwen3-4B.
Configure Evaluation Method:
- Select Evaluate with Public Dataset.
- In the dataset list, select CMMLU.
Configure Resources:
- Resource Group Type: Select Public Resource Group (Pay-As-You-Go).
- Resource Configuration Method: Select Standard Resources.
- Resource Specification: Select a GPU specification, for example ecs.gn7i-c8g1.2xlarge (24 GB).
  
  If instances of this specification are out of stock, select another GPU-accelerated instance.
Submit the task: Click OK at the bottom of the page.

After submission, the page redirects to task details. When the status changes to Succeeded, view Qwen3-4B CMMLU performance on the Evaluation Report tab.

Evaluation configuration

Configure evaluation objects

ModelEval supports four evaluation object sources. Choose based on where your model or service is deployed.

Evaluation object type	Description	Scenarios
Public Model	Models from the PAI Model Gallery	Quickly evaluate the performance of mainstream open source LLMs
Custom Model	Custom models registered in AI Asset Management > Models Important Make sure the model is compatible with the vLLM framework.	Evaluate fine-tuned or customized models
PAI-EAS Service	Deployed PAI-EAS online inference services	Evaluate model services in a production environment
Custom Service	Any model service that complies with the OpenAI API specification	Evaluate third-party or self-built model services

Configure evaluation methods

Use a custom dataset, a public dataset, or both.

Evaluate using a custom dataset

Use your own dataset for results that match your business scenarios.

Dataset format: Must be in JSONL format with UTF-8 encoding. Each line must be a single JSON object.
Dataset upload: Upload the prepared dataset file to OSS and enter its OSS path on the configuration page.

Evaluation method	General metric evaluation	LLM-as-a-Judge evaluation
Purpose	Best when you have clear ground truth. Calculates text similarity between model output and ground truth. Suitable for translation, summarization, and knowledge base Q&A.	Best when no single correct answer exists, such as open-ended conversations or content creation. A powerful LLM judge scores the quality of model responses.
Dataset format	The JSON object must contain the `question` and `answer` (ground truth) fields. `{"question": "What is the capital of China?", "answer": "Beijing"}`	The JSON object can contain only the `question` field, or it can also provide an `answer` (ground truth) field. `{"question": "Please describe the history of artificial intelligence"}`
Core metrics	ROUGE (ROUGE-1, ROUGE-2, ROUGE-L): Recall-based metric measuring how many ground truth information points are covered by model output. BLEU (BLEU-1, BLEU-2, BLEU-3, BLEU-4): Precision-based metric measuring how much of the model output content is accurate.	Sends the `question` and model output to the LLM judge, which scores responses on relevance, accuracy, and fluency.

Evaluate using a public dataset

Use authoritative public datasets to benchmark your model against industry standards.

Purpose: Compare models for selection, perform pre-release benchmark testing, and evaluate a model's general capabilities.
Configuration: Select Evaluate with Public Dataset and choose one or more datasets from the list.
Supported datasets:
- LiveCodeBench: Evaluates code processing capabilities.
- Math500: Evaluates mathematical reasoning capabilities (500 difficult math competition problems).
- AIME25: Evaluates mathematical reasoning capabilities (based on problems from the 2025 American Invitational Mathematics Examination).
- AIME24: Evaluates mathematical reasoning capabilities (based on problems from the 2024 American Invitational Mathematics Examination).
- CMMLU: Evaluates Chinese multi-disciplinary language understanding.
- MMLU: Evaluates English multi-disciplinary language understanding.
- C-Eval: Evaluates comprehensive Chinese language capabilities.
- GSM8K: Evaluates mathematical reasoning capabilities.
- HellaSwag: Evaluates commonsense reasoning capabilities.
- TruthfulQA: Evaluates truthfulness.

Task management

Manage evaluation task lifecycle on the Model Evaluation page.

View Report: View the detailed evaluation report for Succeeded tasks.
Compare: Select two to five successful tasks to compare performance across metrics side-by-side.
Stop: Manually stop Running tasks. This is irreversible — the task cannot be resumed and consumed compute resources are not refunded.
Delete: Permanently deletes the task record.

Billing

ModelEval incurs the following costs:

Compute resources

Resource type

Billing method

Billable entity

Billing rule

Public resources

Pay-as-you-go

Actual runtime.

Bill amount = (Unit price / 60) × Service duration (in minutes)

Check instance unit prices on the console page.

Resource quota

Subscription

Quantity and duration of purchased node specifications.

Purchase dedicated resources with a subscription. Charges are based on the quantity and duration of node specifications. AI Compute Resource Billing.

LLM-as-a-Judge

LLM-as-a-Judge evaluation incurs additional fees.