Evaluating Large Language Models (LLMs) is critical for measuring performance, selecting the right model, and optimizing it to accelerate AI innovation and deployment. The PAI model evaluation platform supports a variety of evaluation scenarios, such as comparing different foundation models, fine-tuned versions, and quantized versions. This document guides you on how to perform comprehensive and targeted model evaluation for different user groups and dataset types to achieve better results in the AI field.
Background information
Introduction
As LLMs become more powerful, the need for rigorous model evaluation is greater than ever. A scientific and efficient evaluation process not only helps developers measure and compare the performance of different models but also guides them in model selection and optimization, which accelerates the adoption of AI innovations. This makes a platform-based set of best practices for LLM evaluation essential.
This document provides best practices for using the Platform for AI (PAI) model evaluation service. This guide helps you build a comprehensive evaluation process that reflects a model's true performance and meets your specific industry needs, helping you excel in artificial intelligence. These best practices cover the following topics:
How to prepare and select an evaluation dataset.
How to select an open-source or fine-tuned model that fits your business needs.
How to create an evaluation job and choose appropriate evaluation metrics.
How to interpret evaluation results for single-job and multi-job scenarios.
Platform features
The PAI model evaluation platform helps you compare model performance across different evaluation scenarios. For example:
Comparing different foundation models, such as Qwen2-7B-Instruct vs. Baichuan2-7B-Chat.
Comparing different fine-tuned versions of the same model, such as the performance of different epoch versions of Qwen2-7B-Instruct trained on your private data.
Comparing different quantized versions of the same model, such as Qwen2-7B-Instruct-GPTQ-Int4 vs. Qwen2-7B-Instruct-GPTQ-Int8.
This guide uses enterprise developers and algorithm researchers as examples to meet the needs of different user groups. It explains how to combine your own custom dataset with common public datasets (such as MMLU or C-Eval) to achieve a more comprehensive, accurate, and targeted model evaluation. This approach helps you find the best LLM for your business. The key features of this practice are:
Provides an end-to-end, no-code evaluation workflow. It supports mainstream open-source LLMs and one-click evaluation for fine-tuned models.
Allows you to upload a custom dataset. It includes over 10 built-in general Natural Language Processing (NLP) evaluation metrics and displays results in a dashboard-style view, eliminating the need to develop evaluation scripts.
Supports evaluation on popular public datasets across multiple domains. It fully replicates official evaluation methods and presents a holistic view with radar charts, eliminating the need to download datasets and learn separate evaluation procedures.
Supports simultaneous evaluation of multiple models and jobs. It displays comparison results in charts and provides detailed results for each sample, enabling comprehensive analysis.
Ensures a transparent and reproducible evaluation process. The evaluation code is open-source and available at the eval-scope repository, co-developed with ModelScope, allowing you to review details and reproduce the results.
Billing
The PAI model evaluation service is built on PAI-QuickStart. QuickStart is free of charge, but running model evaluations may incur fees for Deep Learning Containers (DLC) jobs. For more information about billing, see Billing of Deep Learning Containers (DLC).
If you evaluate a model with a custom dataset stored in Object Storage Service (OSS), OSS usage incurs additional charges. For more information about billing, see Billing overview of OSS.
Use case 1: Evaluate models with a custom dataset for enterprise developers
Enterprises often have extensive private, domain-specific data. Leveraging this data is key to optimizing algorithms with LLMs. Therefore, when enterprise developers evaluate an open-source or fine-tuned LLM, they often use a custom dataset from their private data to better understand the model's performance in that specific context.
For evaluation with a custom dataset, the PAI model evaluation platform uses standard NLP text-matching methods to calculate the similarity between the model's output and the ground-truth answers. A higher score indicates a better model. This method lets you use your unique, scenario-specific data to determine if a model suits your needs.
The following steps highlight the key points of the process. For detailed instructions, see Model evaluation.
Prepare a custom dataset.
Custom dataset format:
To run a custom dataset evaluation, prepare your data in JSONL format. For an example file, see llmuses_general_qa_test.jsonl (76 KB). The format is as follows:
[{"question": "Is it correct that China invented papermaking?", "answer": "Correct"}] [{"question": "Is it correct that China invented gunpowder?", "answer": "Correct"}]Use
questionto identify the question column andanswerto identify the answer column.Upload the formatted dataset file to OSS. For more information, see Upload files to OSS.
Create a dataset from the file in OSS. For more information, see Create a dataset: From an Alibaba Cloud service.
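If you generate the JSONL file programmatically, a quick validation pass before uploading it to OSS can catch formatting problems early. The following Python sketch checks the question and answer fields line by line; the file name is a placeholder, and the script accepts either a bare object or a one-element list per line to match the sample shown above.

```python
# Minimal sketch: validate a custom evaluation dataset in JSONL format
# before uploading it to OSS. Each line is expected to contain a record
# with "question" and "answer" fields, following the sample above.
import json

def get_record(line: str) -> dict:
    """Parse one JSONL line; accept either a bare object or a one-element list."""
    parsed = json.loads(line)
    return parsed[0] if isinstance(parsed, list) else parsed

with open("custom_eval.jsonl", encoding="utf-8") as f:  # placeholder file name
    for line_no, line in enumerate(f, 1):
        if not line.strip():
            continue
        record = get_record(line)
        missing = {"question", "answer"} - record.keys()
        assert not missing, f"line {line_no} is missing {missing}"

print("dataset looks valid")
```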
Select a model for your use case.
Use an open source model
In the navigation pane on the left of the PAI console, choose QuickStart > Model Gallery. Hover over a model card. If the model supports evaluation, an Evaluate button appears.

Use a fine-tuned model
In the navigation pane on the left of the PAI console, choose QuickStart > Model Gallery. Hover over a model card; the Evaluate button appears on models that support evaluation. Fine-tune such a model. Then, on the QuickStart > Model Gallery > Job Management > Training Jobs page, click the successfully trained job. The Evaluate button appears in the upper-right corner.

Model evaluation currently supports all AutoModelForCausalLM models from Hugging Face.
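Before you submit an evaluation job, you can optionally verify on your own machine that a Hugging Face checkpoint declares a causal-LM architecture. The following sketch only inspects the model configuration with the transformers library; the model ID is an example from this guide, and this check is an approximation rather than PAI's own compatibility test.

```python
# Lightweight local check (optional): confirm a Hugging Face checkpoint
# declares a *ForCausalLM architecture before creating an evaluation job.
from transformers import AutoConfig

MODEL_ID = "Qwen/Qwen2-7B-Instruct"  # example model used in this guide

config = AutoConfig.from_pretrained(MODEL_ID)
architectures = config.architectures or []
print("Declared architectures:", architectures)
print("Causal LM:", any(name.endswith("ForCausalLM") for name in architectures))
```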
Create and run an evaluation job.
On the model details page, click Evaluate in the upper-right corner to create an evaluation job.

Configure the key parameters as follows:
Basic Configuration
Result Output Path: Specify the OSS path where the final evaluation results are saved.
Custom Dataset Configuration
Evaluation Method: Select one of the following options:
General Metric Evaluation: Calculates text similarity between the model's output and the reference answer using metrics such as ROUGE and BLEU. Suitable for scenarios with definite answers.
Judge Model Evaluation: Uses a PAI-provided judge model to score answers automatically. This method does not require reference answers and is suitable for scenarios with complex or non-unique answers. The result includes an overall score and five sub-scores.
Judge Model Service Token: Required when you select Judge Model Evaluation as the evaluation method. You can obtain the token from the Judge Model page.
Dataset Source: Choose Select an existing dataset or Create a dataset that is stored in Alibaba Cloud storage, and then select the custom dataset that you created earlier.
Resource Configuration
Resource Group Type: Select a public resource group, general-purpose computing resources, or Lingjun resources based on your needs.
Job Resource: If you select a public resource group, a suitable resource specification is recommended by default based on your model size.
Click OK to start the job.
View the evaluation results.
Single-job results
On the QuickStart > Model Gallery > Job Management > Evaluation Jobs page, when the Status of the evaluation job is Succeeded, click View Report in the Operation column. On the Custom Dataset Evaluation Results page, you can view the model's scores for ROUGE and BLEU metrics.

The report also provides detailed evaluation results for each data entry in the evaluation file.
Multi-job comparison results
On the QuickStart > Model Gallery > Job Management > Evaluation Jobs page, select the model evaluation jobs that you want to compare and click Compare in the upper-right corner. On the Custom Dataset Evaluation Results page, you can view the comparison results.

Interpreting the evaluation results:
The default evaluation metrics for a custom dataset include: rouge-1-f, rouge-1-p, rouge-1-r, rouge-2-f, rouge-2-p, rouge-2-r, rouge-l-f, rouge-l-p, rouge-l-r, bleu-1, bleu-2, bleu-3, and bleu-4.
rouge-n metrics calculate the overlap of N-grams (sequences of N consecutive words). rouge-1 and rouge-2 are the most commonly used, corresponding to unigrams and bigrams, respectively. The rouge-l metric is based on the Longest Common Subsequence (LCS). The -f, -p, and -r suffixes denote the F1 score, precision, and recall, respectively.
BLEU (Bilingual Evaluation Understudy) is another popular metric, originally designed for machine translation. It scores the model's output by measuring its N-gram overlap with a set of reference texts; the bleu-n metric takes N-gram matches up to order n into account.
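To see how these metrics behave on your own data, you can approximate them locally with the open-source rouge and nltk packages. The sketch below uses hypothetical strings and is an illustrative approximation, not the exact implementation that PAI uses.

```python
# Approximate the default custom-dataset metrics (ROUGE and BLEU) locally.
# Requires: pip install rouge nltk
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

prediction = "China invented papermaking and gunpowder."  # model output (hypothetical)
reference = "Correct, China invented papermaking."        # ground-truth answer (hypothetical)

# ROUGE: returns rouge-1 / rouge-2 / rouge-l, each with f (F1), p (precision), r (recall).
rouge_scores = Rouge().get_scores(prediction, reference)[0]
print("rouge-1:", rouge_scores["rouge-1"])
print("rouge-l:", rouge_scores["rouge-l"])

# BLEU: bleu-n weights N-gram orders 1 through n uniformly.
hyp, refs = prediction.split(), [reference.split()]
smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n if i < n else 0.0 for i in range(4))
    score = sentence_bleu(refs, hyp, weights=weights, smoothing_function=smooth)
    print(f"bleu-{n}:", round(score, 4))
```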
The final evaluation results are saved to the Result Output Path you specified.
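If you want to pull the result files to a local machine for further analysis, a sketch along the following lines with the oss2 SDK works. The endpoint, bucket name, credentials, and prefix are placeholders that you must replace with your own Result Output Path.

```python
# Download evaluation output files from the configured OSS Result Output Path.
# All identifiers below are placeholders for your own bucket and path.
import os
import oss2

auth = oss2.Auth(os.environ["OSS_ACCESS_KEY_ID"], os.environ["OSS_ACCESS_KEY_SECRET"])
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-eval-bucket")

prefix = "llm-eval/results/"  # the Result Output Path you configured
for obj in oss2.ObjectIterator(bucket, prefix=prefix):
    if obj.key.endswith("/"):  # skip directory placeholder objects
        continue
    local_path = os.path.join("results", os.path.relpath(obj.key, prefix))
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    bucket.get_object_to_file(obj.key, local_path)
    print("downloaded", obj.key)
```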
Use case 2: Evaluate models with a public dataset for algorithm researchers
Algorithm research often relies on public datasets. When researchers select an open-source model or fine-tune a model, they refer to its performance on authoritative public benchmarks. However, due to the vast variety of public datasets for LLMs, researchers often spend significant time selecting relevant datasets for their domain and learning their corresponding evaluation procedures. To simplify this, PAI integrates multiple public datasets and fully replicates the official evaluation metrics for each. This provides accurate feedback on model performance and helps accelerate LLM research.
For evaluation with a public dataset, the PAI model evaluation platform categorizes open-source datasets by domain to assess an LLM's comprehensive capabilities, such as math, knowledge, and reasoning. A higher score indicates a better model. This is the most common method for evaluating LLMs.
The following steps highlight the key points of the process. For detailed instructions, see Model evaluation.
Supported public datasets:
PAI currently maintains public datasets including MMLU, TriviaQA, HellaSwag, GSM8K, C-Eval, CMMLU, and TruthfulQA. More public datasets are being added.
Dataset | Size | Data volume | Realm
MMLU | 166 MB | 14042 | Knowledge
TriviaQA | 14.3 MB | 17944 | Knowledge
C-Eval | 1.55 MB | 12342 | Chinese
CMMLU | 1.08 MB | 11582 | Chinese
GSM8K | 4.17 MB | 1319 | Math
HellaSwag | 47.5 MB | 10042 | Reasoning
TruthfulQA | 0.284 MB | 816 | Security
Select a model suitable for your use case.
Use an open source model
In the navigation pane on the left of the PAI console, choose QuickStart > Model Gallery. Hover the mouse over a model card. If the model supports evaluation, the Evaluate button appears.

Use a fine-tuned model
In the navigation pane on the left of the PAI console, choose QuickStart > Model Gallery. Hover over a model card; the Evaluate button appears on models that support evaluation. After you fine-tune such a model, go to the QuickStart > Model Gallery > Job Management > Training Jobs page and click the successfully trained job. The Evaluate button then appears in the upper-right corner.

Model evaluation currently supports all AutoModelForCausalLM models from Hugging Face.
Create and run an evaluation job.
On the model details page, click Evaluate in the upper-right corner to create an evaluation job.

Configure the key parameters as follows:
Basic Configuration
Result Output Path: Specify the OSS path where the final evaluation results are saved.
Public Dataset Configuration
Public Dataset: Select a public dataset.
Resource Configuration
Resource Group Type: Select a public resource group, general-purpose computing resources, or Lingjun resources based on your needs.
Job Resource: If you select a public resource group, a suitable resource specification is recommended by default based on your model size.
Click OK to start the job.
View the evaluation results.
Single-job results
On the QuickStart > Model Gallery > Job Management > Evaluation Jobs page, when the Status of the evaluation job is Succeeded, click View Report in the Operation column. On the Public Dataset Evaluation Results page, you can view the model's scores across various realms and datasets.

Multi-job comparison results
On the QuickStart > Model Gallery > Job Management > Evaluation Jobs page, select the model evaluation jobs that you want to compare and click Compare in the upper-right corner. On the Public Dataset Evaluation Results page, you can view the comparison results.

Evaluation results analysis:
The chart on the left shows the model's scores across different ability domains. A single ability domain may cover multiple datasets. To calculate the final domain score, the PAI model evaluation platform averages the model's scores from all datasets within that domain.
The chart on the right shows the model's scores on individual public datasets. For information about the evaluation scope of each public dataset, see Description of supported public datasets.
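As a simple illustration of the aggregation described above, the following sketch averages hypothetical per-dataset scores into domain scores. The numbers are made up and are not real evaluation output.

```python
# Illustration of the domain-score aggregation: a domain score is the
# average of the model's scores on all datasets in that domain.
# The per-dataset scores below are hypothetical.
dataset_scores = {
    "C-Eval": 71.2,
    "CMMLU": 69.8,
    "GSM8K": 54.3,
    "HellaSwag": 78.5,
}
dataset_domain = {
    "C-Eval": "Chinese",
    "CMMLU": "Chinese",
    "GSM8K": "Math",
    "HellaSwag": "Reasoning",
}

domain_scores = {}
for name, score in dataset_scores.items():
    domain_scores.setdefault(dataset_domain[name], []).append(score)

for domain, scores in sorted(domain_scores.items()):
    print(domain, round(sum(scores) / len(scores), 2))
# Chinese -> (71.2 + 69.8) / 2 = 70.5
```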
The final evaluation results are saved to the Result Output Path you specified.