
Platform for AI: Best practices for LLM evaluation

Last Updated: Mar 18, 2026

Evaluate LLM performance by comparing foundation models, fine-tuned versions, and quantized versions using custom or public datasets with automated metrics.

Background

Introduction

As LLMs become more capable, scientific and efficient evaluation becomes increasingly important for measuring and comparing model performance. Evaluation guides model selection and optimization, accelerating AI innovation and application.

PAI provides best practices for LLM evaluation to help AI developers build evaluation processes that reflect true model performance and meet specific industry needs. Topics covered:

  • Prepare and select evaluation datasets

  • Select open source or fine-tuned models

  • Create evaluation tasks and select evaluation methods

  • Interpret task results in single-task or multi-task scenarios

Platform features

PAI LLM evaluation compares model performance across scenarios:

  • Compare foundation models: Qwen2-7B-Instruct vs. Baichuan2-7B-Chat

  • Compare fine-tuned versions of the same model (e.g., different epoch versions of Qwen2-7B-Instruct trained on private data)

  • Compare quantized versions: Qwen2-7B-Instruct-GPTQ-Int4 vs. Qwen2-7B-Instruct-GPTQ-Int8

PAI addresses needs of enterprise developers and algorithm researchers by combining custom datasets with public datasets (MMLU, C-Eval) for comprehensive, accurate, and targeted model evaluation. Features:

  • End-to-end evaluation pipeline with no code development. Supports mainstream open source LLMs and one-click evaluation after fine-tuning.

  • Upload custom datasets. 10+ built-in NLP evaluation methods with consolidated results display.

  • Evaluate on public datasets from multiple domains. Fully reproduces official evaluation methods with panoramic radar chart view.

  • Simultaneous evaluation of multiple models and tasks with comparative charts and detailed per-item results.

  • Transparent, reproducible evaluation. Evaluation code is open source in the eval-scope repository, co-built with ModelScope.


Scenario 1: Custom dataset evaluation for enterprise developers

Enterprises often accumulate rich private data. A key part of using LLMs for algorithm optimization is leveraging this data. Enterprise developers evaluate open source or fine-tuned LLMs using custom datasets from private data to better understand model performance in specific domains.

For custom dataset evaluation, PAI uses standard text-matching methods from NLP to calculate similarity between model output and ground truth. Higher values indicate better models.

Key process steps (for details, see Model evaluation):

  1. Prepare a custom evaluation set.

    1. Format:

      Prepare the evaluation set file in JSONL format, with one JSON object per line. Example: llmuses_general_qa_test.jsonl (76 KB):

      {"question": "Is it true that China invented papermaking?", "answer": "True"}
      {"question": "Is it true that China invented gunpowder?", "answer": "True"}

      Use the question field to identify the question column and the answer field to identify the answer column.

    2. Upload evaluation set file to OSS. For more information, see Upload files to OSS.

    3. Create dataset from OSS file. For more information, see Create a dataset from an Alibaba Cloud product.
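    The data-preparation step above can be sketched in Python: the snippet writes a JSONL evaluation set and validates that every line parses and contains the question and answer columns before you upload it to OSS. The file name and rows follow the sample above; this is an illustrative sketch, not part of the PAI pipeline.

```python
import json

# Sample rows matching the format above (illustrative data).
rows = [
    {"question": "Is it true that China invented papermaking?", "answer": "True"},
    {"question": "Is it true that China invented gunpowder?", "answer": "True"},
]

path = "llmuses_general_qa_test.jsonl"

# JSONL: one JSON object per line, UTF-8 encoded.
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Validate before uploading to OSS: every line must parse and
# contain the question and answer columns.
with open(path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        record = json.loads(line)
        missing = {"question", "answer"} - record.keys()
        assert not missing, f"line {lineno} is missing columns: {missing}"
print("validated", lineno, "lines")
```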

  2. Select a model.

    Use an open source model

    In the PAI console, go to Quick Start > Model Gallery. Hover over a model card to display the Evaluate button for supported models.


    Use a fine-tuned model

    In the PAI console, go to Quick Start > Model Gallery. Hover over a model card to display the Evaluate button. After fine-tuning, go to Quick Start > Model Gallery > Job Management > Training Jobs. Click a completed training job to display the Evaluate button.


    Model evaluation currently supports all AutoModelForCausalLM type models from Hugging Face.

  3. Create and run evaluation task.

    Click Evaluate in the model detail page to create an evaluation task.


    Key parameters:

    Base configuration

      • Result Output Path: The OSS path where evaluation results are saved.

    Custom Dataset Configuration

      • Evaluation Method: Options:

        • General Metric Evaluation: Calculates text similarity between model predictions and reference answers (ROUGE, BLEU). Suitable for questions with definitive answers.

        • Judge Model Evaluation: Uses an LLM as a judge to automatically score answers. Reference answers are not required. Suitable for complex or non-unique answers. Results include an overall score and 5 specific metrics.

      • LLM-as-a-Judge Service Token: Required when Evaluation Method is Judge Model Evaluation. Obtain the token from the LLM-as-a-Judge page.

      • Dataset Source: Select an existing dataset.

      • Select an existing dataset: Select the custom dataset created earlier.

    Resource Configuration

      • Resource Group Type: Select public resource group, general computing resources, or Lingjun resources.

      • Job Resource: If Resource Group Type is public resource group, the system recommends resources based on model specifications.

    Click Submit to start the task.

  4. View evaluation results.

    Single task result

    When the Status of an evaluation task on the Quick Start > Model Gallery > Job Management > Evaluation Jobs page is Succeeded, choose Actions > View Report to view ROUGE and BLEU scores on the Custom Dataset Evaluation Result page.


    The page also displays detailed evaluation results for each data entry.

    Multi-task comparison result

    On the Quick Start > Model Gallery > Job Management > Evaluation Jobs page, select the model evaluation tasks to compare, then click Compare to view comparison results on the Custom Dataset Evaluation Result page.


    Result analysis:

    Default evaluation methods for custom datasets: rouge-1-f, rouge-1-p, rouge-1-r, rouge-2-f, rouge-2-p, rouge-2-r, rouge-l-f, rouge-l-p, rouge-l-r, bleu-1, bleu-2, bleu-3, and bleu-4.

    • ROUGE-n metrics calculate overlap of N-grams (N consecutive words). ROUGE-1 and ROUGE-2 are most common, corresponding to unigrams and bigrams. ROUGE-L is based on Longest Common Subsequence (LCS).

    • BLEU (Bilingual Evaluation Understudy) evaluates machine translation quality by measuring N-gram overlap between machine translation output and reference translations. BLEU-n calculates N-gram match rate.
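    As a concrete illustration of these metrics, the following sketch computes ROUGE-1 precision, recall, and F1, plus the BLEU-n modified n-gram precision, from whitespace tokens. PAI's built-in implementations may tokenize, smooth, and apply BLEU's brevity penalty differently, so treat this only as an approximation of the reported scores.

```python
from collections import Counter

def rouge_1(prediction: str, reference: str):
    """Unigram ROUGE-1 precision, recall, and F1 (simplified sketch)."""
    pred, ref = prediction.split(), reference.split()
    # Clipped unigram overlap between prediction and reference.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    p = overlap / len(pred) if pred else 0.0
    r = overlap / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def bleu_n(prediction: str, reference: str, n: int = 1):
    """Modified n-gram precision used by BLEU-n (no brevity penalty here)."""
    pred, ref = prediction.split(), reference.split()
    pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(pred_ngrams.values())
    matched = sum((pred_ngrams & ref_ngrams).values())
    return matched / total if total else 0.0

p, r, f = rouge_1("China invented papermaking",
                  "China invented papermaking and gunpowder")
print(round(p, 2), round(r, 2), round(f, 2))  # 1.0 0.6 0.75
print(bleu_n("China invented papermaking",
             "China invented papermaking and gunpowder", n=2))  # 1.0
```

    A prediction shorter than the reference can score perfect precision but low recall, which is why the consolidated results report the f, p, and r variants separately.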

    Final evaluation results are saved to the Output Path set earlier.

Scenario 2: Public dataset evaluation for algorithm researchers

Algorithm research often relies on public datasets. When researchers select open source models or fine-tune models, they refer to evaluation performance on authoritative public datasets. PAI provides access to public datasets from multiple domains and fully reproduces official evaluation metrics to obtain accurate performance feedback, facilitating efficient LLM research.

Public dataset evaluation assesses comprehensive LLM capabilities (mathematical, knowledge, reasoning) by classifying open source evaluation datasets by domain. Higher values indicate better models.

Key process steps (for details, see Model evaluation):

  1. Supported public datasets:

    PAI currently maintains public datasets including MMLU, TriviaQA, HellaSwag, GSM8K, C-Eval, CMMLU, and TruthfulQA. More datasets are being added.

    Dataset      Size       Data volume   Domain
    MMLU         166 MB     14,042        Knowledge
    TriviaQA     14.3 MB    17,944        Knowledge
    C-Eval       1.55 MB    12,342        Chinese
    CMMLU        1.08 MB    11,582        Chinese
    GSM8K        4.17 MB    1,319         Math
    HellaSwag    47.5 MB    10,042        Reasoning
    TruthfulQA   0.284 MB   816           Safety

  2. Select a model.

    Use an open source model

    In the PAI console, go to Quick Start > Model Gallery. Hover over a model card to display the Evaluate button for supported models.


    Use a fine-tuned model

    In the PAI console, go to Quick Start > Model Gallery. Hover over a model card to display the Evaluate button. After fine-tuning an evaluable model, go to Quick Start > Model Gallery > Job Management > Training Jobs. Click the successfully trained job to display the Evaluate button.


    Model evaluation currently supports all AutoModelForCausalLM type models from Hugging Face.

  3. Create and run evaluation task.

    Click Evaluate in the model detail page to create an evaluation task.


    Key parameters:

    Base configuration

      • Result Output Path: The OSS path where evaluation results are saved.

    Public Dataset Configuration

      • Public Dataset: Select a public dataset.

    Resource Configuration

      • Resource Group Type: Select public resource group, general computing resources, or Lingjun resources.

      • Job Resource: If Resource Group Type is public resource group, the system recommends resources based on model specifications.

    Click Submit to start the task.

  4. View evaluation results.

    Single task result

    When the Status of an evaluation task on the Quick Start > Model Gallery > Job Management > Evaluation Jobs page changes to Succeeded, click View Report in the Actions column to view model scores across domains and datasets on the Evaluation Results of Public Datasets page.


    Multi-task comparison result

    On Quick Start > Model Gallery > Job Management > Evaluation Jobs page, select model evaluation tasks to compare and click Compare to view comparison results on the Evaluation Results of Public Datasets page.


    Result analysis:

    • Left chart shows model scores in different domains. Each domain may have multiple related datasets. For datasets in the same domain, PAI calculates the mean of model scores as the domain score.

    • Right chart shows model scores on each public dataset. For more information about evaluation scope of each dataset, see Supported public datasets.
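    The domain aggregation described above can be reproduced in a few lines of Python. The scores below are made-up placeholders, and the dataset-to-domain mapping follows the Supported public datasets table in step 1.

```python
from statistics import mean

# Hypothetical per-dataset scores for one model (placeholder values).
dataset_scores = {"MMLU": 0.62, "TriviaQA": 0.58, "C-Eval": 0.55, "CMMLU": 0.53}

# Dataset-to-domain mapping from the Supported public datasets table.
dataset_domain = {"MMLU": "Knowledge", "TriviaQA": "Knowledge",
                  "C-Eval": "Chinese", "CMMLU": "Chinese"}

# Domain score = mean of the model's scores on all datasets in that domain.
by_domain = {}
for name, score in dataset_scores.items():
    by_domain.setdefault(dataset_domain[name], []).append(score)
domain_scores = {d: round(mean(s), 2) for d, s in by_domain.items()}
print(domain_scores)  # {'Knowledge': 0.6, 'Chinese': 0.54}
```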

    Final evaluation results are saved to the Output Path set earlier.

References

Model evaluation