An LLM-as-a-Judge is an auxiliary model that evaluates the output quality of other models. It acts as a "judge" to assess and score the results from a large language model (LLM). PAI provides an out-of-the-box LLM-as-a-Judge service that offers an accurate, efficient, and easy-to-use intelligent solution for model evaluation.
Background
Model evaluation is a key part of developing and deploying an LLM. It verifies that a model performs as expected, guides model selection, helps optimize model calls, and tests the reliability of the model service. Common model evaluation methods include the following:
Metric-based evaluation
This method uses evaluation metrics, such as BLEU, ROUGE, and METEOR. It quickly provides evaluation results by calculating the similarity between the generated text and a reference text. However, this method has limitations. It applies only to limited scenarios, such as text summarization and machine translation. It requires a reference text. It also often considers only surface-level similarity and may ignore deeper semantics or contextual coherence.
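To illustrate how metric-based scoring works, the sketch below computes a simplified ROUGE-1-style unigram recall between a generated text and a reference text. A real evaluation would use an established metric library; this minimal version only shows why such metrics need a reference and capture only surface-level overlap.

```python
def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate.

    A simplified ROUGE-1 recall: real implementations also handle
    stemming, tokenization rules, and longer n-grams.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    # Count candidate tokens so repeated words are not double-matched.
    cand_counts = {}
    for tok in cand_tokens:
        cand_counts[tok] = cand_counts.get(tok, 0) + 1
    overlap = 0
    for tok in ref_tokens:
        if cand_counts.get(tok, 0) > 0:
            cand_counts[tok] -= 1
            overlap += 1
    return overlap / len(ref_tokens)

score = rouge1_recall("the cat sat on the mat", "the cat is on the mat")
```

Note that two answers with the same meaning but different wording score poorly under such a metric, which is the surface-level limitation described above.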
Benchmark evaluation
This method uses standardized datasets, such as MMLU, GSM8k, and HumanEval, to test models on a series of predefined tasks. Many popular benchmarks are available. Because these benchmarks focus on objective questions, the results are standardized and comparable. This makes it easy to create LLM leaderboards. However, this method cannot evaluate a model's performance on subjective or open-ended questions.
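Because benchmark questions are objective, scoring usually reduces to comparing each model answer with a gold answer. The sketch below shows an exact-match accuracy calculation of the kind used for multiple-choice benchmarks; the normalization (trimming and lowercasing) is a simplifying assumption, as real benchmark harnesses apply task-specific rules.

```python
def exact_match_accuracy(predictions, references):
    """Share of predictions that exactly match the gold answer,
    after trimming whitespace and ignoring case."""
    if not references:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

# Three multiple-choice answers scored against the gold labels.
acc = exact_match_accuracy(["B", "c ", "A"], ["B", "C", "D"])
```

This standardized, deterministic scoring is what makes benchmark results comparable across models, and also why the method cannot handle open-ended answers that have no single gold reference.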
Human evaluation
This method involves setting scoring criteria, having human reviewers assign scores, and then compiling the results for statistical analysis. The previous two methods are rule-based and cannot evaluate subjective questions that lack clear reference answers. In contrast, human evaluation can interpret complex and diverse semantics, and the results align with human expectations. However, human evaluation requires significant resources and time.
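Compiling human scores for statistical analysis typically means aggregating each item's ratings across reviewers. The sketch below computes a per-item mean and spread from aligned reviewer score lists; the data layout is an illustrative assumption.

```python
from statistics import mean, stdev

def aggregate_scores(scores_by_reviewer):
    """Combine per-reviewer score lists (aligned by item) into a
    per-item mean and spread. A high spread flags items where
    reviewers disagree and the criteria may need refinement."""
    per_item = list(zip(*scores_by_reviewer))
    return [
        {"mean": mean(item), "spread": stdev(item) if len(item) > 1 else 0.0}
        for item in per_item
    ]

# Three reviewers each score two answers on a 1-5 scale.
result = aggregate_scores([[4, 2], [5, 3], [4, 2]])
```

Even this small example hints at the cost problem: every new answer requires fresh scores from every reviewer.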
LLM-as-a-Judge was created to address the limitations of these methods. An LLM-as-a-Judge requires no manual annotation and is not limited to specific tasks. It can perform automated, efficient, and batch evaluations for both subjective and objective questions.
Function overview
PAI provides an LLM-as-a-Judge service. To obtain automated scores, you provide a question and the answer from the model that you want to evaluate.
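The core principle can be sketched as prompting a judge model with the question, the answer under evaluation, and a scoring rubric. The prompt wording and rubric below are illustrative only, not the service's actual judge prompt.

```python
def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble an evaluation prompt for a judge model.

    The rubric and wording are illustrative assumptions; the PAI
    service manages its own prompts and scoring criteria internally.
    """
    return (
        "You are an impartial judge. Rate the answer to the question "
        "on a 1-5 scale and explain your reasoning.\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        "Reply in the format:\n"
        "Score: <1-5>\n"
        "Reason: <one paragraph>"
    )

prompt = build_judge_prompt("What is 2 + 2?", "4")
```

The judge model's reply is then parsed into a numeric score and a textual rationale, which is what the service returns for each evaluated answer.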
The key features of the LLM-as-a-Judge are:
Accurate: The LLM-as-a-Judge excels at evaluating subjective questions. It can intelligently classify questions into scenarios, such as open-ended questions (chat, consultation, or recommendation), creative writing, code generation, and role-play. It applies different evaluation criteria for each scenario, which greatly improves accuracy.
Efficient: The LLM-as-a-Judge does not require manually annotated data. You can simply input a question and the model's answer, and the service automatically analyzes and evaluates the LLM. This significantly improves evaluation efficiency.
Easy to use: It offers multiple ways to use the service, including creating evaluation tasks in the console, calling the API, and using the software development kit (SDK). This allows new users to quickly get started and provides developers with flexible integration options.
Low cost: It delivers evaluation performance comparable to GPT-4 in Chinese-language scenarios at a much lower cost.
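The scenario-based evaluation described above amounts to routing each question to a matching rubric before judging. The sketch below uses the scenario names from the feature list, but the keyword routing and rubric wording are illustrative assumptions, not the service's actual classification logic.

```python
# Scenario names follow the feature list above; the keyword matching
# and rubric text are illustrative assumptions only.
RUBRICS = {
    "code_generation": "Judge correctness, readability, and edge-case handling.",
    "creative_writing": "Judge originality, coherence, and style.",
    "open_ended": "Judge helpfulness, relevance, and factual accuracy.",
}

def pick_rubric(question: str) -> str:
    """Choose evaluation criteria based on a crude keyword scan of the
    question; the real service classifies scenarios with a model."""
    q = question.lower()
    if any(k in q for k in ("function", "code", "implement")):
        return RUBRICS["code_generation"]
    if any(k in q for k in ("story", "poem", "essay")):
        return RUBRICS["creative_writing"]
    return RUBRICS["open_ended"]
```

Applying scenario-specific criteria this way is what lets a single judge score a coding answer and a chat answer on different, appropriate dimensions.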
Using a judge model
After you enable the LLM-as-a-Judge feature, you can use the service in the following ways:
New users can quickly use the LLM-as-a-Judge feature in the console.
You can use the Python SDK or HTTP for online calls, or prepare batch data to call the LLM-as-a-Judge algorithm service offline. Input a question and the model's inference result to receive an evaluation score and the reasoning. For more information, see the API call examples and the API reference.
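As a rough sketch of the online call pattern, the code below serializes one evaluation request with a question and the model's inference result. The field names are hypothetical placeholders; check the API reference for the actual request schema, endpoint, and authentication before sending anything.

```python
import json

def build_eval_request(question: str, answer: str) -> str:
    """Serialize one evaluation request as a JSON body.

    The field names "question" and "answer" are hypothetical
    placeholders, not the service's documented schema.
    """
    payload = {
        "question": question,
        "answer": answer,
    }
    return json.dumps(payload, ensure_ascii=False)

body = build_eval_request("Explain recursion.", "Recursion is ...")
```

For batch (offline) evaluation, the same shape would typically be repeated once per line in an input file, with each response carrying the score and reasoning back.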
In the console, you can select from various predefined LLM models. This enables an integrated flow for both inference and evaluation.