A judge model is an auxiliary model that assesses the output quality of other models: it acts as a judge that evaluates and scores the outputs of large language models (LLMs). Platform for AI (PAI) provides the judge model feature, an out-of-the-box LLM-as-a-Judge service that delivers an accurate, efficient, and easy-to-use solution for model evaluation.
Background information
Model evaluation is a critical step in the development and deployment of LLMs. It verifies that a model's performance meets expectations, guides tuning and optimization, and helps deliver high-quality, dependable model services. Common evaluation methods include:
Metric evaluation: Uses metrics such as BLEU, ROUGE, and METEOR to calculate the similarity between generated text and a reference text, producing results quickly (a minimal sketch follows these methods). However, because it depends on reference texts and may miss deeper semantics and coherence, this method suits only specific tasks such as text summarization and machine translation.
Benchmark evaluation: Uses standardized datasets, such as MMLU, GSM8k, and HumanEval, to test models against predefined tasks. Benchmarks provide standardized, comparable results, facilitating the creation of LLM leaderboards. However, this method falls short in assessing performance on subjective and open-ended questions.
Manual evaluation: Human reviewers score model outputs against established standards. Unlike rule-based methods, manual evaluation can handle subjective and open-ended questions that lack clear reference answers, capture complex semantics, and align with human judgment. However, it is resource-intensive and time-consuming.
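To make the reference-text dependence of metric evaluation concrete, the following is a minimal, self-contained sketch of a ROUGE-1-style unigram recall. It is a toy illustration only; real evaluations would use established BLEU, ROUGE, or METEOR implementations rather than this simplified metric.

```python
# Toy sketch of metric-based evaluation: a ROUGE-1-style unigram recall
# between a model's output and a reference text. Illustrative only; use
# established BLEU/ROUGE/METEOR implementations for real evaluations.
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(cand_counts[w], count) for w, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

reference = "The quick brown fox jumps over the lazy dog"
candidate = "A quick brown fox jumped over a lazy dog"
print(f"ROUGE-1 recall: {rouge1_recall(candidate, reference):.2f}")
```

Because the score is driven entirely by word overlap with the reference, such metrics cannot reward a correct answer phrased differently from the reference, which is why they are limited to tasks where a reference text exists.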
To address these limitations, PAI provides the judge model service, which automatically evaluates LLMs on both subjective and objective questions, requires no manual labeling, and is not limited to specific tasks.
Overview
The judge model simplifies the evaluation process: you only need to provide the questions and the answers produced by the model under evaluation, and the judge model automatically returns scores, as shown in the following figure.
Key features of the judge model include:
Accuracy: The judge model can classify subjective questions into scenarios such as open-ended discussions, creative writing, code generation, and role-playing. It then develops tailored criteria for each scenario, significantly enhancing evaluation accuracy.
Efficiency: Without the need for manual data labeling, the judge model can independently analyze and evaluate LLMs based on questions and model answers, greatly boosting evaluation efficiency.
Ease of use: PAI offers various usage methods, such as task creation in the console, API calls, and SDK calls. This allows for both quick trials and flexible integration for developers.
Cost-effectiveness: The judge model provides performance evaluation at a competitive price. Its evaluation quality in Chinese-language scenarios is comparable to that of GPT-4.
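Conceptually, an LLM-as-a-Judge service like the one described above wraps each question and model answer in a grading prompt and asks a judge model to return a score with an explanation. The sketch below illustrates the general idea only; the prompt template, the 1-5 scoring scale, and the call_llm helper are assumptions made for illustration and do not reflect the PAI judge model's internal prompts or criteria.

```python
# Conceptual sketch of LLM-as-a-Judge scoring. The prompt template, scoring
# scale, and call_llm() helper are illustrative assumptions; they do not
# reflect the PAI judge model's internal prompts or criteria.
JUDGE_PROMPT = """You are an impartial judge. Rate the answer to the question
on a scale of 1-5 and explain your reasoning.

Question: {question}
Answer: {answer}

Reply in the form: Score: <1-5>. Explanation: <one or two sentences>."""

def judge(question: str, answer: str, call_llm) -> str:
    """Build the judge prompt and delegate scoring to an LLM of your choice."""
    return call_llm(JUDGE_PROMPT.format(question=question, answer=answer))

# Example usage with a stand-in LLM call (replace with a real client):
if __name__ == "__main__":
    fake_llm = lambda prompt: "Score: 4. Explanation: Accurate but brief."
    print(judge("What causes tides?", "The Moon's gravity.", fake_llm))
```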
Use a judge model
After the judge model feature is activated, you can use the judge model service in the following ways.
Activate and experience the service online
For beginners, the PAI console allows you to quickly get started with the judge model features.
Call API operations
Use the Python SDK or HTTP requests to call the judge model service online, or prepare batch data to call it offline. After you submit questions and the corresponding model answers, the judge model service returns evaluation scores and explanations, as sketched at the end of this section.
You can also select a model from multiple preset LLMs in the console and run inference and evaluation as one integrated process.
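The following is a minimal sketch of what an online HTTP call to a judge-model service might look like, using the standard Python requests library. The endpoint URL, header, and payload and response field names are placeholders and assumptions; refer to the PAI API reference for the actual operations, parameters, and authentication method.

```python
# Minimal sketch of calling a judge-model service over HTTP with the standard
# `requests` library. The endpoint URL, header name, and payload/response
# fields below are placeholders/assumptions -- consult the PAI API reference
# for the actual operations, parameters, and authentication method.
import requests

ENDPOINT = "https://<your-service-endpoint>/evaluate"   # placeholder endpoint
TOKEN = "<your-service-token>"                           # placeholder credential

payload = {
    "question": "Summarize the benefits of unit testing.",
    "answer": "Unit tests catch regressions early and document intended behavior.",
}

response = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Authorization": TOKEN},
    timeout=30,
)
response.raise_for_status()

result = response.json()
# Field names below are assumptions about the response shape.
print("Score:", result.get("score"))
print("Explanation:", result.get("explanation"))
```

For offline batch evaluation, the same kind of request would typically be repeated over a prepared file of question-answer pairs, or the prepared data would be submitted as a batch task.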