
Platform for AI: Best practices for LLM evaluation

Last Updated: Mar 18, 2026

Evaluate LLM performance by comparing foundation models, fine-tuned versions, and quantized versions using custom or public datasets with automated metrics.

Background

Introduction

As LLMs become more capable, scientific and efficient evaluation becomes increasingly important for measuring and comparing model performance. Evaluation guides model selection and optimization, accelerating AI innovation and application.

PAI provides best practices for LLM evaluation to help AI developers build evaluation processes that reflect true model performance and meet specific industry needs. Topics covered:

  • Prepare and select evaluation datasets

  • Select open source or fine-tuned models

  • Create evaluation tasks and select evaluation methods

  • Interpret task results in single-task or multi-task scenarios

Platform features

PAI LLM evaluation compares model performance across scenarios:

  • Compare foundation models: Qwen2-7B-Instruct vs. Baichuan2-7B-Chat

  • Compare fine-tuned versions of the same model (e.g., different epoch versions of Qwen2-7B-Instruct trained on private data)

  • Compare quantized versions: Qwen2-7B-Instruct-GPTQ-Int4 vs. Qwen2-7B-Instruct-GPTQ-Int8

PAI addresses needs of enterprise developers and algorithm researchers by combining custom datasets with public datasets (MMLU, C-Eval) for comprehensive, accurate, and targeted model evaluation. Features:

  • End-to-end evaluation pipeline with no code development. Supports mainstream open source LLMs and one-click evaluation after fine-tuning.

  • Upload custom datasets. 10+ built-in NLP evaluation methods with consolidated results display.

  • Evaluate on public datasets from multiple domains. Fully reproduces official evaluation methods with panoramic radar chart view.

  • Simultaneous evaluation of multiple models and tasks with comparative charts and detailed per-item results.

  • Transparent, reproducible evaluation. Evaluation code is open source in the eval-scope repository, co-built with ModelScope.


Scenario 1: Custom dataset evaluation for enterprise developers

Enterprises often accumulate rich private data. A key part of using LLMs for algorithm optimization is leveraging this data. Enterprise developers evaluate open source or fine-tuned LLMs using custom datasets from private data to better understand model performance in specific domains.

For custom dataset evaluation, PAI uses standard text-matching methods from NLP to calculate similarity between model output and ground truth. Higher values indicate better models.

Key process steps (for details, see Model evaluation):

  1. Prepare a custom evaluation set.

    1. Format:

      Prepare the evaluation set file in JSONL format, with one JSON object per line. Example: llmuses_general_qa_test.jsonl (76 KB):

      {"question": "Is it true that China invented papermaking?", "answer": "True"}
      {"question": "Is it true that China invented gunpowder?", "answer": "True"}

      Use the question field to identify the question column and the answer field to identify the answer column.

    2. Upload evaluation set file to OSS. For more information, see Upload files to OSS.

    3. Create dataset from OSS file. For more information, see Create a dataset from an Alibaba Cloud product.
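    The data-preparation step above can be sketched in Python: the snippet writes a JSONL evaluation set and validates that every line parses and contains the question and answer columns before you upload it to OSS. The file name and rows follow the sample above; this is an illustrative sketch, not part of the PAI pipeline.

```python
import json

# Sample rows matching the format above (illustrative data).
rows = [
    {"question": "Is it true that China invented papermaking?", "answer": "True"},
    {"question": "Is it true that China invented gunpowder?", "answer": "True"},
]

path = "llmuses_general_qa_test.jsonl"

# JSONL: one JSON object per line, UTF-8 encoded.
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Validate before uploading to OSS: every line must parse and
# contain the question and answer columns.
with open(path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        record = json.loads(line)
        missing = {"question", "answer"} - record.keys()
        assert not missing, f"line {lineno} is missing columns: {missing}"
print("validated", lineno, "lines")
```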

  2. Select a model.

    Use an open source model

    In the PAI console, go to Quick Start > Model Gallery. Hover over a model card to display the Evaluate button for supported models.


    Use a fine-tuned model

    In the PAI console, go to Quick Start > Model Gallery. Hover over a model card to display the Evaluate button. After fine-tuning, go to Quick Start > Model Gallery > Job Management > Training Jobs. Click a completed training job to display the Evaluate button.


    Model evaluation currently supports all AutoModelForCausalLM type models from Hugging Face.

  3. Create and run evaluation task.

    Click Evaluate in the model detail page to create an evaluation task.


    Key parameters:

    Base configuration

      • Result Output Path: The OSS path where evaluation results are saved.

    Custom Dataset Configuration

      • Evaluation Method: Options:

        • General Metric Evaluation: Calculates text similarity between model predictions and reference answers (ROUGE, BLEU). Suitable for questions with definitive answers.

        • Judge Model Evaluation: Uses an LLM as a judge to automatically score answers. Reference answers are not required. Suitable for complex or non-unique answers. Results include an overall score and 5 specific metrics.

      • LLM-as-a-Judge Service Token: Required when Evaluation Method is Judge Model Evaluation. Obtain the token from the LLM-as-a-Judge page.

      • Dataset Source: Select an existing dataset.

      • Select an existing dataset: Select the custom dataset created earlier.

    Resource Configuration

      • Resource Group Type: Select public resource group, general computing resources, or Lingjun resources.

      • Job Resource: If Resource Group Type is public resource group, the system recommends resources based on model specifications.

    Click Submit to start the task.

  4. View evaluation results.

    Single task result

    When the Status of an evaluation task on the Quick Start > Model Gallery > Job Management > Evaluation Jobs page is Succeeded, choose Actions > View Report to view ROUGE and BLEU scores on the Custom Dataset Evaluation Result page.


    The page also displays detailed evaluation results for each data entry.

    Multi-task comparison result

    On the Quick Start > Model Gallery > Job Management > Evaluation Jobs page, select the model evaluation tasks to compare, then click Compare to view comparison results on the Custom Dataset Evaluation Result page.


    Result analysis:

    Default evaluation methods for custom datasets: rouge-1-f, rouge-1-p, rouge-1-r, rouge-2-f, rouge-2-p, rouge-2-r, rouge-l-f, rouge-l-p, rouge-l-r, bleu-1, bleu-2, bleu-3, and bleu-4.

    • ROUGE-n metrics calculate overlap of N-grams (N consecutive words). ROUGE-1 and ROUGE-2 are most common, corresponding to unigrams and bigrams. ROUGE-L is based on Longest Common Subsequence (LCS).

    • BLEU (Bilingual Evaluation Understudy) evaluates machine translation quality by measuring N-gram overlap between machine translation output and reference translations. BLEU-n calculates N-gram match rate.
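    As a concrete illustration of these metrics, the following sketch computes ROUGE-1 precision, recall, and F1, plus the BLEU-n modified n-gram precision, from whitespace tokens. PAI's built-in implementations may tokenize, smooth, and apply BLEU's brevity penalty differently, so treat this only as an approximation of the reported scores.

```python
from collections import Counter

def rouge_1(prediction: str, reference: str):
    """Unigram ROUGE-1 precision, recall, and F1 (simplified sketch)."""
    pred, ref = prediction.split(), reference.split()
    # Clipped unigram overlap between prediction and reference.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    p = overlap / len(pred) if pred else 0.0
    r = overlap / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def bleu_n(prediction: str, reference: str, n: int = 1):
    """Modified n-gram precision used by BLEU-n (no brevity penalty here)."""
    pred, ref = prediction.split(), reference.split()
    pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(pred_ngrams.values())
    matched = sum((pred_ngrams & ref_ngrams).values())
    return matched / total if total else 0.0

p, r, f = rouge_1("China invented papermaking",
                  "China invented papermaking and gunpowder")
print(round(p, 2), round(r, 2), round(f, 2))  # 1.0 0.6 0.75
print(bleu_n("China invented papermaking",
             "China invented papermaking and gunpowder", n=2))  # 1.0
```

    A prediction shorter than the reference can score perfect precision but low recall, which is why the consolidated results report the f, p, and r variants separately.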

    Final evaluation results are saved to the Output Path set earlier.

Scenario 2: Public dataset evaluation for algorithm researchers

Algorithm research often relies on public datasets. When researchers select open source models or fine-tune models, they refer to evaluation performance on authoritative public datasets. PAI provides access to public datasets from multiple domains and fully reproduces official evaluation metrics to obtain accurate performance feedback, facilitating efficient LLM research.

Public dataset evaluation assesses comprehensive LLM capabilities (mathematical, knowledge, reasoning) by classifying open source evaluation datasets by domain. Higher values indicate better models.

Key process steps (for details, see Model evaluation):

  1. Supported public datasets:

    PAI currently maintains public datasets including MMLU, TriviaQA, HellaSwag, GSM8K, C-Eval, CMMLU, and TruthfulQA. More datasets are being added.

    Dataset      Size       Data volume   Domain
    MMLU         166 MB     14,042        Knowledge
    TriviaQA     14.3 MB    17,944        Knowledge
    C-Eval       1.55 MB    12,342        Chinese
    CMMLU        1.08 MB    11,582        Chinese
    GSM8K        4.17 MB    1,319         Math
    HellaSwag    47.5 MB    10,042        Reasoning
    TruthfulQA   0.284 MB   816           Safety

  2. Select a model.

    Use an open source model

    In the PAI console, go to Quick Start > Model Gallery. Hover over a model card to display the Evaluate button for supported models.


    Use a fine-tuned model

    In the PAI console, go to Quick Start > Model Gallery. Hover over a model card to display the Evaluate button. After fine-tuning an evaluable model, go to Quick Start > Model Gallery > Job Management > Training Jobs. Click the successfully trained job to display the Evaluate button.


    Model evaluation currently supports all AutoModelForCausalLM type models from Hugging Face.

  3. Create and run evaluation task.

    Click Evaluate in the model detail page to create an evaluation task.


    Key parameters:

    Base configuration

      • Result Output Path: The OSS path where evaluation results are saved.

    Public Dataset Configuration

      • Public Dataset: Select a public dataset.

    Resource Configuration

      • Resource Group Type: Select public resource group, general computing resources, or Lingjun resources.

      • Job Resource: If Resource Group Type is public resource group, the system recommends resources based on model specifications.

    Click Submit to start the task.

  4. View evaluation results.

    Single task result

    When the Status of an evaluation task on the Quick Start > Model Gallery > Job Management > Evaluation Jobs page changes to Succeeded, click View Report in the Actions column to view model scores across domains and datasets on the Evaluation Results of Public Datasets page.


    Multi-task comparison result

    On Quick Start > Model Gallery > Job Management > Evaluation Jobs page, select model evaluation tasks to compare and click Compare to view comparison results on the Evaluation Results of Public Datasets page.


    Result analysis:

    • Left chart shows model scores in different domains. Each domain may have multiple related datasets. For datasets in the same domain, PAI calculates the mean of model scores as the domain score.

    • Right chart shows model scores on each public dataset. For more information about evaluation scope of each dataset, see Supported public datasets.
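    The domain aggregation described above can be reproduced in a few lines of Python. The scores below are made-up placeholders, and the dataset-to-domain mapping follows the Supported public datasets table in step 1.

```python
from statistics import mean

# Hypothetical per-dataset scores for one model (placeholder values).
dataset_scores = {"MMLU": 0.62, "TriviaQA": 0.58, "C-Eval": 0.55, "CMMLU": 0.53}

# Dataset-to-domain mapping from the Supported public datasets table.
dataset_domain = {"MMLU": "Knowledge", "TriviaQA": "Knowledge",
                  "C-Eval": "Chinese", "CMMLU": "Chinese"}

# Domain score = mean of the model's scores on all datasets in that domain.
by_domain = {}
for name, score in dataset_scores.items():
    by_domain.setdefault(dataset_domain[name], []).append(score)
domain_scores = {d: round(mean(s), 2) for d, s in by_domain.items()}
print(domain_scores)  # {'Knowledge': 0.6, 'Chinese': 0.54}
```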

    Final evaluation results are saved to the Output Path set earlier.

References

Model evaluation