Evaluate LLM performance by comparing foundation models, fine-tuned versions, and quantized versions using custom or public datasets with automated metrics.
Background
Introduction
As LLMs become more capable, scientific and efficient evaluation becomes increasingly important for measuring and comparing model performance. Evaluation guides model selection and optimization, accelerating AI innovation and application.
PAI provides best practices for LLM evaluation to help AI developers build evaluation processes that reflect true model performance and meet specific industry needs. Topics covered:
- Prepare and select evaluation datasets
- Select open source or fine-tuned models
- Create evaluation tasks and select evaluation methods
- Interpret task results in single-task or multi-task scenarios
Platform features
PAI LLM evaluation compares model performance across scenarios:
- Compare foundation models, for example Qwen2-7B-Instruct vs. Baichuan2-7B-Chat
- Compare fine-tuned versions of the same model, for example different epoch checkpoints of Qwen2-7B-Instruct trained on private data
- Compare quantized versions, for example Qwen2-7B-Instruct-GPTQ-Int4 vs. Qwen2-7B-Instruct-GPTQ-Int8
PAI addresses needs of enterprise developers and algorithm researchers by combining custom datasets with public datasets (MMLU, C-Eval) for comprehensive, accurate, and targeted model evaluation. Features:
- End-to-end evaluation pipeline with no code development. Supports mainstream open source LLMs and one-click evaluation after fine-tuning.
- Upload custom datasets. More than 10 built-in NLP evaluation metrics with a consolidated results display.
- Evaluation on public datasets from multiple domains. Fully reproduces official evaluation methods with a panoramic radar chart view.
- Simultaneous evaluation of multiple models and tasks, with comparative charts and detailed per-item results.
- Transparent, reproducible evaluation. The evaluation code is open source in the eval-scope repository, co-built with ModelScope.
Billing
- LLM evaluation relies on PAI QuickStart, which is free. Evaluation tasks may incur DLC fees. For more information, see Billing of Deep Learning Containers (DLC).
- Custom dataset evaluation incurs OSS fees. For more information, see Billing overview of OSS.
Scenario 1: Custom dataset evaluation for enterprise developers
Enterprises often accumulate rich private data. A key part of using LLMs for algorithm optimization is leveraging this data. Enterprise developers evaluate open source or fine-tuned LLMs using custom datasets from private data to better understand model performance in specific domains.
For custom dataset evaluation, PAI uses standard text-matching methods from NLP to calculate similarity between model output and ground truth. Higher values indicate better models.
Key process steps (for details, see Model evaluation):
- Prepare a custom evaluation set.
  - Format: Prepare the evaluation set file in JSONL format, with one JSON object per line. Example: llmuses_general_qa_test.jsonl (76 KB):
    {"question": "Is it true that China invented papermaking?", "answer": "True"}
    {"question": "Is it true that China invented gunpowder?", "answer": "True"}
    Use question to identify the question column and answer to identify the answer column.
  - Upload the evaluation set file to OSS. For more information, see Upload files to OSS.
  - Create a dataset from the OSS file. For more information, see Create a dataset from an Alibaba Cloud product.
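An evaluation set in this format can be generated and sanity-checked with a short script. This is a minimal sketch: the records are placeholder examples, and the file name simply mirrors the sample above.

```python
import json

# Placeholder records; replace with your own private Q&A data.
records = [
    {"question": "Is it true that China invented papermaking?", "answer": "True"},
    {"question": "Is it true that China invented gunpowder?", "answer": "True"},
]

path = "llmuses_general_qa_test.jsonl"

# JSONL: one JSON object per line, UTF-8, no enclosing array.
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Sanity check before uploading to OSS: every line must parse as a
# JSON object containing the question and answer columns.
with open(path, encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        assert {"question", "answer"} <= rec.keys()
```

A check like this catches malformed lines locally instead of at evaluation-task time.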
- Select a model.
  Use an open source model
  In the PAI console, go to QuickStart > Model Gallery. Hover over a model card to display the Evaluate button for supported models.
  Use a fine-tuned model
  In the PAI console, go to QuickStart > Model Gallery. Hover over a model card to display the Evaluate button. After fine-tuning, go to QuickStart > Model Gallery > Job Management > Training Jobs. Click a completed training job to display the Evaluate button.
  Model evaluation currently supports all AutoModelForCausalLM type models from Hugging Face.
- Create and run an evaluation task.
  Click Evaluate on the model details page to create an evaluation task.

Key parameters:
Base configuration
- Result Output Path: The OSS path where evaluation results are saved.
Custom Dataset Configuration
- Evaluation Method. Options:
  - General Metric Evaluation: Calculates text similarity between model predictions and reference answers (ROUGE, BLEU). Suitable for questions with definitive answers.
  - Judge Model Evaluation: Uses an LLM-as-a-Judge model to automatically score answers. Reference answers are not required. Suitable for complex or non-unique answers. Results include an overall score and 5 specific metrics.
- LLM-as-a-Judge Service Token: Required when Evaluation Method is Judge Model Evaluation. Obtain the token from the LLM-as-a-Judge page.
- Dataset Source: Select existing dataset.
- Select an existing dataset: Select the custom dataset created earlier.
Resource Configuration
- Resource Group Type: Select public resource group, general computing resources, or Lingjun resources.
- Job Resource: If Resource Group Type is public resource group, the system recommends resources based on model specifications.
  Click Submit to start the task.
- View evaluation results.
  Single task result
  When the Status of an evaluation task on the QuickStart > Model Gallery > Job Management > Evaluation Jobs page is Succeeded, click Actions > View Report to view ROUGE and BLEU scores on the Custom Dataset Evaluation Result page.
  The page also displays detailed evaluation results for each data entry.
  Multi-task comparison result
  On the QuickStart > Model Gallery > Job Management > Evaluation Jobs page, select the model evaluation tasks to compare, then click Compare to view comparison results on the Custom Dataset Evaluation Result page.

Result analysis:
The default evaluation metrics for custom datasets are rouge-1-f, rouge-1-p, rouge-1-r, rouge-2-f, rouge-2-p, rouge-2-r, rouge-l-f, rouge-l-p, rouge-l-r, bleu-1, bleu-2, bleu-3, and bleu-4.
- ROUGE-N metrics calculate the overlap of N-grams (N consecutive words). ROUGE-1 and ROUGE-2 are the most common, corresponding to unigrams and bigrams. ROUGE-L is based on the longest common subsequence (LCS).
- BLEU (Bilingual Evaluation Understudy) evaluates machine translation quality by measuring N-gram overlap between the model output and reference translations. BLEU-n calculates the N-gram match rate.
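The N-gram overlap behind these metrics can be illustrated with a small, dependency-free sketch. The prediction and reference strings below are invented examples; PAI's actual implementation lives in the open source eval-scope repository.

```python
from collections import Counter

def ngrams(tokens, n):
    """All length-n windows over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(pred, ref, n=1):
    """ROUGE-N precision/recall/F1 from clipped N-gram overlap."""
    p = Counter(ngrams(pred.split(), n))
    r = Counter(ngrams(ref.split(), n))
    overlap = sum((p & r).values())          # per-gram minimum counts
    precision = overlap / max(sum(p.values()), 1)
    recall = overlap / max(sum(r.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"p": precision, "r": recall, "f": f1}

def lcs_len(a, b):
    """Longest common subsequence length, the basis of ROUGE-L."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

pred = "china invented papermaking and gunpowder"
ref = "china invented papermaking"
scores = rouge_n(pred, ref, 1)   # precision 0.6, recall 1.0, f1 0.75
lcs = lcs_len(pred.split(), ref.split())  # 3 shared tokens in order
print(scores, lcs)
```

The same overlap counting with n=2, 3, 4 underlies the rouge-2 and bleu-n columns in the report; production BLEU additionally applies a brevity penalty and averages over the n-gram orders.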
Final evaluation results are saved to the Output Path set earlier.
Scenario 2: Public dataset evaluation for algorithm researchers
Algorithm research often relies on public datasets. When researchers select open source models or fine-tune models, they refer to evaluation performance on authoritative public datasets. PAI provides access to public datasets from multiple domains and fully reproduces official evaluation metrics to obtain accurate performance feedback, facilitating efficient LLM research.
Public dataset evaluation assesses comprehensive LLM capabilities (mathematical, knowledge, reasoning) by classifying open source evaluation datasets by domain. Higher values indicate better models.
Key process steps (for details, see Model evaluation):
- Supported public datasets:
PAI currently maintains public datasets including MMLU, TriviaQA, HellaSwag, GSM8K, C-Eval, CMMLU, and TruthfulQA. More datasets are being added.
  Dataset    | Size     | Data volume | Domain
  MMLU       | 166 MB   | 14042       | Knowledge
  TriviaQA   | 14.3 MB  | 17944       | Knowledge
  C-Eval     | 1.55 MB  | 12342       | Chinese
  CMMLU      | 1.08 MB  | 11582       | Chinese
  GSM8K      | 4.17 MB  | 1319        | Math
  HellaSwag  | 47.5 MB  | 10042       | Reasoning
  TruthfulQA | 0.284 MB | 816         | Security
- Select a model.
  Use an open source model
  In the PAI console, go to QuickStart > Model Gallery. Hover over a model card to display the Evaluate button for supported models.
  Use a fine-tuned model
  In the PAI console, go to QuickStart > Model Gallery. Hover over a model card to display the Evaluate button. After fine-tuning an evaluable model, go to QuickStart > Model Gallery > Job Management > Training Jobs. Click the successfully trained job to display the Evaluate button.
  Model evaluation currently supports all AutoModelForCausalLM type models from Hugging Face.
- Create and run an evaluation task.
  Click Evaluate on the model details page to create an evaluation task.

Key parameters:
Base configuration
- Result Output Path: The OSS path where evaluation results are saved.
Public Dataset Configuration
- Public Dataset: Select a public dataset.
Resource Configuration
- Resource Group Type: Select public resource group, general computing resources, or Lingjun resources.
- Job Resource: If Resource Group Type is public resource group, the system recommends resources based on model specifications.
  Click Submit to start the task.
- View evaluation results.
  Single task result
  When the Status of an evaluation task on the QuickStart > Model Gallery > Job Management > Evaluation Jobs page changes to Succeeded, click View Report in the Actions column to view model scores for the various domains and datasets.
  Multi-task comparison result
  On the QuickStart > Model Gallery > Job Management > Evaluation Jobs page, select the model evaluation tasks to compare and click Compare to view comparison results on the Evaluation Results of Public Datasets page.

Result analysis:
- The left chart shows model scores in different domains. Each domain may have multiple related datasets. For datasets in the same domain, PAI calculates the mean of the model scores as the domain score.
- The right chart shows model scores on each public dataset. For more information about the evaluation scope of each dataset, see Supported public datasets.
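The domain-score aggregation described above is a plain arithmetic mean. A small sketch, using made-up per-dataset scores (the real scores come from your evaluation report):

```python
from statistics import mean

# Hypothetical per-dataset scores for one model. MMLU and TriviaQA
# share the Knowledge domain, so their scores are averaged.
dataset_scores = {"MMLU": 0.62, "TriviaQA": 0.58, "GSM8K": 0.41}
dataset_domain = {"MMLU": "Knowledge", "TriviaQA": "Knowledge", "GSM8K": "Math"}

# Group scores by domain, then take the mean per domain.
by_domain = {}
for name, score in dataset_scores.items():
    by_domain.setdefault(dataset_domain[name], []).append(score)
domain_scores = {d: mean(s) for d, s in by_domain.items()}

print(domain_scores)  # Knowledge = mean(0.62, 0.58) = 0.60, Math = 0.41
```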
Final evaluation results are saved to the Result Output Path set earlier.