Platform For AI: Model evaluation (ModelEval)

Last Updated: Dec 18, 2025

Model evaluation (ModelEval) is a PAI tool that evaluates the performance of large language models (LLMs) in specific or general scenarios. You can use authoritative public datasets or your custom datasets to quantitatively analyze the capabilities of a model. This analysis provides data to support model selection, fine-tuning, and version iteration.

Getting started: Complete your first model evaluation in 5 minutes

This section describes how to complete a simple evaluation task with minimal configuration. You will evaluate the Qwen3-4B model using the public CMMLU dataset.

  1. Log in to the PAI console. In the left navigation pane, choose Model Application > Model Evaluation (ModelEval).

  2. On the Model Evaluation page, click Create Task.

  3. Basic Configuration: The default Task Name and Result Output Path are system-generated.

    Note

    If a default OSS storage path is not set for the workspace, you must manually specify a result output path.

  4. Configure Evaluation Object:

    • Set Evaluation Object Type to Public Model.

    • In the Public Model drop-down list, search for and select Qwen3-4B.

  5. Configure Evaluation Method:

    • Select Public Dataset Evaluation.

    • In the dataset list, select CMMLU.

  6. Configure Resources:

    • Set Resource Group Type to Public Resource Group (Pay-As-You-Go) and Resource Configuration Method to General Resource.

    • From the Resource Specification drop-down list, select a GPU specification, such as ecs.gn7i-c8g1.2xlarge (24 GB).

  7. Submit the task: Click OK at the bottom of the page.

After the task is submitted, the page automatically redirects to the task details. Once the task status changes to Succeeded, you can view the performance of the Qwen3-4B model on the CMMLU dataset on the Evaluation Report tab.

Features

Configure the evaluation object

Model evaluation supports four types of evaluation objects. Select a type based on where your model or service is deployed.

  • Public model: Models from the Model Gallery in PAI. Use this type to quickly evaluate the performance of mainstream open source large language models.

  • Custom model: Custom models registered in AI Asset Management > Models. Use this type to evaluate fine-tuned or customized models.

    Important

    Ensure that the model is compatible with the vLLM framework.

  • PAI-EAS service: Deployed PAI-EAS online inference services. Use this type to evaluate model services in a production environment.

  • Custom service: Any model service that complies with the OpenAI API specifications. Use this type to evaluate third-party or self-built model services.
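As a quick compatibility check before registering a custom service, the following sketch sends an OpenAI-style chat completions request with the openai Python SDK. The base URL, API key, and model name are placeholders for illustration only; replace them with your service's actual values.

    # Sketch: verify that a self-built service accepts OpenAI-style requests.
    # base_url, api_key, and model are placeholders, not values defined by ModelEval.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://your-service.example.com/v1",  # hypothetical endpoint
        api_key="YOUR_API_KEY",                          # hypothetical credential
    )

    response = client.chat.completions.create(
        model="your-model-name",  # placeholder model identifier
        messages=[{"role": "user", "content": "What is the capital of China?"}],
    )
    print(response.choices[0].message.content)

If this request succeeds against your service, it follows the request and response conventions that an OpenAI-compatible evaluation client expects.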

Configure the evaluation method

You can use a custom dataset, a public dataset, or a combination of both for the evaluation.

Custom dataset evaluation

Use your own dataset to evaluate your model in a way that best reflects your actual business scenarios.

  • Dataset format: The dataset must be in the JSONL format and use UTF-8 encoding. Each line must be a single JSON object.

  • Dataset upload: You must upload the prepared dataset file to OSS and enter its OSS path on the configuration page.
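The following is a minimal sketch that writes a dataset in the required format (UTF-8, one JSON object per line) and uploads it to OSS with the oss2 Python SDK. The bucket name, endpoint, credentials, and object path are placeholders; the resulting OSS path is what you enter on the configuration page.

    # Sketch: build a JSONL dataset and upload it to OSS with the oss2 SDK.
    # Bucket name, endpoint, credentials, and object path are placeholders.
    import json
    import oss2

    samples = [
        {"question": "What is the capital of China?", "answer": "Beijing"},
        {"question": "Please describe the history of artificial intelligence"},
    ]

    # Write one JSON object per line, UTF-8 encoded.
    with open("eval_dataset.jsonl", "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")

    auth = oss2.Auth("YOUR_ACCESS_KEY_ID", "YOUR_ACCESS_KEY_SECRET")
    bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "your-bucket")
    bucket.put_object_from_file("datasets/eval_dataset.jsonl", "eval_dataset.jsonl")
    # Enter oss://your-bucket/datasets/eval_dataset.jsonl on the configuration page.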

General metric evaluation

  • Use cases: Use this method when you have clear, standard answers. It calculates the text similarity between the model's output and the standard answer. This is suitable for tasks such as translation, summarization, and knowledge base Q&A.

  • Dataset format: Each JSON object must contain the question and answer (standard answer) fields. For example:

    {"question": "What is the capital of China?", "answer": "Beijing"}

  • Core metrics (an illustrative computation sketch follows this comparison):

    • ROUGE (ROUGE-1, ROUGE-2, ROUGE-L): Based on recall, this metric measures how many information points from the standard answer are covered by the model's output.

    • BLEU (BLEU-1, BLEU-2, BLEU-3, BLEU-4): Based on precision, this metric measures how much of the model's output matches the standard answer.

Judge model evaluation

  • Use cases: Use this method when there is no single standard answer, such as in open-ended conversations or content creation. A powerful "judge model" scores the quality of the evaluated model's responses.

  • Dataset format: Each JSON object must contain the question field and can optionally include an answer (standard answer) field. For example:

    {"question": "Please describe the history of artificial intelligence"}

  • Scoring: The system sends the question and the output from the model being evaluated to the judge model. The judge model then provides a comprehensive score based on multiple dimensions, such as relevance, accuracy, and fluency.
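To build intuition for the general metrics above, the following sketch computes ROUGE and BLEU for a single prediction against a standard answer using the open source rouge-score and nltk packages. This is for illustration only; ModelEval computes these metrics itself, and its exact scoring configuration may differ.

    # Sketch: compute ROUGE and BLEU for one prediction against a reference,
    # using the open source rouge-score and nltk packages (illustrative only).
    from rouge_score import rouge_scorer
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "Beijing is the capital of China"
    prediction = "The capital of China is Beijing"

    # ROUGE: recall-oriented overlap between the prediction and the reference.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, prediction)
    print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

    # BLEU: precision-oriented n-gram overlap (BLEU-1 through BLEU-4).
    smooth = SmoothingFunction().method1
    ref_tokens = [reference.split()]
    pred_tokens = prediction.split()
    for n in range(1, 5):
        weights = tuple(1.0 / n for _ in range(n))
        score = sentence_bleu(ref_tokens, pred_tokens, weights=weights,
                              smoothing_function=smooth)
        print(f"BLEU-{n}: {score:.3f}")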

Public dataset evaluation

Use industry-recognized and authoritative datasets to compare the capabilities of your model against industry benchmarks.

  • Use cases: Comparing models for selection, performing pre-release benchmark testing, and evaluating the general capabilities of a model.

  • Configuration: Select Public Dataset Evaluation and choose one or more datasets from the list.

  • Supported datasets:

    • LiveCodeBench: Evaluates code processing capabilities.

    • Math500: Evaluates mathematical reasoning capabilities (500 difficult math competition problems).

    • AIME25: Evaluates mathematical reasoning capabilities (based on problems from the 2025 American Invitational Mathematics Examination).

    • AIME24: Evaluates mathematical reasoning capabilities (based on problems from the 2024 American Invitational Mathematics Examination).

    • CMMLU: Evaluates Chinese multi-disciplinary language understanding.

    • MMLU: Evaluates English multi-disciplinary language understanding.

    • C-Eval: Evaluates comprehensive Chinese language capabilities.

    • GSM8K: Evaluates mathematical reasoning capabilities.

    • HellaSwag: Evaluates commonsense reasoning capabilities.

    • TruthfulQA: Evaluates truthfulness.

Task management

On the Model Evaluation page, you can manage the lifecycle of your evaluation tasks.

  • View Report: For tasks with a status of Succeeded, click this button to view the detailed evaluation report.

  • Compare: You can select 2 to 5 successful tasks and click the Compare button to perform a side-by-side comparison of their performance on various metrics.

  • Stop: You can manually stop tasks that are Running. This operation is irreversible. The task cannot be resumed, and the consumed compute resources will not be refunded.

  • Delete: Deletes the task record. This operation cannot be undone.