Platform for AI: Judge model

Last Updated: Apr 23, 2025

A judge model is an auxiliary model that acts as a judge, evaluating and scoring the outputs of large language models (LLMs). Platform for AI (PAI) provides the judge model feature, an out-of-the-box LLM-as-a-Judge service that delivers an accurate, efficient, and user-friendly solution for model evaluation.

Background information

Model evaluation is a critical step in the development and deployment of LLMs: it verifies that a model's performance aligns with expectations, guides tuning and optimization, and helps deliver high-quality, dependable model services. Common evaluation methods include:

  1. Metric evaluation: Uses metrics such as BLEU, ROUGE, and METEOR to calculate the similarity between generated text and reference text, offering quick results (see the sketch after this list). However, this method works only for specific scenarios, such as text summarization and machine translation, because it relies on reference texts and may ignore deeper semantics and coherence.

  2. Benchmark evaluation: Uses standardized datasets, such as MMLU, GSM8K, and HumanEval, to test models against predefined tasks. Benchmarks provide standardized, comparable results, which makes them suitable for building LLM leaderboards. However, this method falls short when assessing subjective and open-ended questions.

  3. Manual evaluation: Human reviewers score model outputs against established standards. Unlike rule-based methods, manual evaluation can handle subjective and open-ended questions that lack clear reference answers, because reviewers understand complex semantics and their judgments align with human preferences. However, this method is resource-intensive and time-consuming.
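To make the reference-based approach in item 1 concrete, the following sketch scores a candidate sentence against a reference with BLEU and ROUGE. It uses the open-source sacrebleu and rouge-score packages, which are shown here only for illustration and are not part of the PAI judge model service.

```python
# Reference-based metric evaluation: compare generated text against a
# reference text with BLEU and ROUGE.
# Requires: pip install sacrebleu rouge-score
import sacrebleu
from rouge_score import rouge_scorer

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# Corpus-level BLEU over a single sentence pair; higher means more
# n-gram overlap with the reference (scale 0-100).
bleu = sacrebleu.corpus_bleu([candidate], [[reference]])
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.2f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.2f}")
```

Because both metrics measure surface overlap with the reference, a semantically correct paraphrase can still score low, which is exactly the limitation noted in item 1.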

To address these limitations, PAI introduces the judge model service, which automatically evaluates LLMs on both subjective and objective questions, requires no manual labeling, and is not limited to specific tasks.

Overview

The judge model simplifies the evaluation process: you provide only the questions and the corresponding model answers, and the judge model returns scores automatically, as shown in the following figure.

(Figure: questions and model answers are submitted to the judge model, which automatically returns evaluation scores.)
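For illustration, a single evaluation exchange might carry data shaped like the following sketch. The field names are hypothetical placeholders; the actual request and response formats are defined in the API reference.

```python
# Hypothetical shape of one judge-model evaluation exchange.
# All field names below are illustrative placeholders, not the
# actual API schema; see the API reference for the real formats.
request = {
    "question": "Explain the difference between a list and a tuple in Python.",
    # The answer produced by the model under evaluation:
    "answer": "A list is mutable, while a tuple is immutable; ...",
}

response = {
    "score": 4,  # rating assigned by the judge model
    "explanation": "The answer is accurate and concise but omits ...",
}
```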

Key features of the judge model include:

  • Accuracy: The judge model can classify subjective questions into scenarios such as open-ended discussions, creative writing, code generation, and role-playing. It then develops tailored criteria for each scenario, significantly enhancing evaluation accuracy.

  • Efficiency: Without the need for manual data labeling, the judge model can independently analyze and evaluate LLMs based on questions and model answers, greatly boosting evaluation efficiency.

  • Ease of use: PAI offers various usage methods, such as task creation in the console, API calls, and SDK calls. This allows for both quick trials and flexible integration for developers.

  • Cost-effectiveness: The judge model provides evaluation at a competitive price, with performance comparable to that of GPT-4 in Chinese-language scenarios.

Use a judge model

After the judge model feature is activated, you can use the judge model service in any of the following ways.

  • Activate and experience the service online

    For beginners, the PAI console is the quickest way to get started with the judge model feature.

  • Call API operations and API feature description

    Use the Python SDK or HTTP requests to call the judge model service online, or prepare batch data to call the service offline. After you submit questions and the corresponding model answers, the service returns evaluation scores and explanations (see the sketch after this list).

  • Model evaluation

    Select one of the preset LLMs in the console and run an integrated inference and evaluation process.
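As a concrete starting point for the API method above, the following is a minimal sketch of an online call over HTTP using the third-party requests package. The endpoint URL, header name, and payload fields are hypothetical placeholders; take the real values from the judge model API reference and your PAI console.

```python
# Minimal sketch of calling the judge model service over HTTP.
# Assumptions: the endpoint URL, token header, and JSON fields below
# are hypothetical placeholders; consult the API reference for the
# actual request and response schema.
import requests

ENDPOINT = "https://<your-service-endpoint>/evaluate"  # placeholder URL
TOKEN = "<your-service-token>"                         # placeholder credential

payload = {
    "question": "Summarize the plot of 'Journey to the West' in one sentence.",
    "answer": "A monk and his three disciples travel west to retrieve sacred scriptures.",
}

resp = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Authorization": TOKEN},
    timeout=30,
)
resp.raise_for_status()

result = resp.json()
print(result.get("score"), result.get("explanation"))
```

A batch (offline) call would follow the same pattern with a file of question-answer pairs; see the API feature description for the supported formats.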