Evaluating Large Language Model (LLM) outputs typically requires writing evaluation scripts and preparing labeled datasets. The Playground of the PAI Judge Model service removes those barriers: enter a question and one or two model responses in the browser, and the judge model scores the result immediately, with no code or datasets required.
With the Playground, you can:
Score a single model's response against a reference answer
Compare two models head-to-head to determine which performs better
Adjust the scoring range, evaluation criteria, and scenario to match your use case
Inspect the full prompt the judge model uses to understand how it reasons about the evaluation
Prerequisites
Before you begin, ensure that you have activated the Judge Model service.
Activate the service
Log on to the PAI console. On the Judge Model page, click Activate Now and follow the on-screen instructions.

After activation, go to the Overview tab to view the Host and Token values and check call statistics.
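If you later copy the Host and Token values into your own scripts, it is safer to keep them out of source code. The following is a minimal sketch of one way to do that; the environment variable names PAI_JUDGE_HOST and PAI_JUDGE_TOKEN and the helper load_credentials are hypothetical names chosen for this example, not something the service defines.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeCredentials:
    host: str
    token: str


def load_credentials(env=None):
    """Read the Host and Token values from environment variables.

    PAI_JUDGE_HOST and PAI_JUDGE_TOKEN are hypothetical names used only
    in this sketch; set them to the values shown on the Overview tab.
    """
    env = os.environ if env is None else env
    host = env.get("PAI_JUDGE_HOST", "").strip()
    token = env.get("PAI_JUDGE_TOKEN", "").strip()
    # Collect the names of any variables that are unset or blank.
    missing = [
        name
        for name, value in (("PAI_JUDGE_HOST", host), ("PAI_JUDGE_TOKEN", token))
        if not value
    ]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return JudgeCredentials(host=host, token=token)
```

Calling load_credentials() with no arguments reads from the process environment; passing a dictionary instead makes the helper easy to test.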
Experience the service online
On the Judge Model page, switch to the Playground tab.
Select an evaluation mode.
Mode | When to use
Single-answer Grading | Score one model's response. Use this when you want an absolute quality score for a single output.
Dual-model Competition | Compare responses from two models. Use this when you want to determine which model performs better on a given question.
Configure the evaluation content.
Parameter | Description
Judge Model | The model that acts as the judge. pai-judge is a smaller model optimized for cost-effectiveness. pai-judge-plus is a larger model that produces higher-quality inference results.
Question | The question to evaluate.
Model Response | The model output(s) to evaluate. For Single-answer Grading, enter one response. For Dual-model Competition, enter responses from two models.
Reference Answer | (Optional) A known correct answer. Providing a reference answer improves accuracy for deterministic questions, math problems, and translations.
(Optional) Configure advanced settings.
Evaluation scenario
Parameter | Description
Question Scenario | The scenario that best describes your question. The judge model auto-detects the scenario, but you can also set it manually. Supported scenarios include text rewriting, role assumption, code generation, code modification, and code analysis. Each scenario applies different evaluation criteria, which helps the judge model score more precisely.
Scenario Description | A description of the scenario.
Evaluation Criteria | The criteria the judge model uses to score responses. Specify custom criteria to align the evaluation with your requirements.
Evaluation score
Parameter | Description
Score Range | The scoring scale for the judge model. Valid values: 2–10.
Score Description | The meaning assigned to each score value.
Generation parameters
Parameter | Description
Temperature | Controls the randomness of the generated output. Lower values produce more deterministic scores; higher values produce more varied scores. Valid values: [0, 2).
Top_p | Controls the range of candidate tokens. The model samples from the smallest set of tokens whose cumulative probability reaches the Top_p value. Valid values: [0, 1].
Click Evaluate. The judge model streams the evaluation result to the Evaluation Result tab. Submit feedback on the result to help improve the judge model.
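To build intuition for the Temperature and Top_p settings described above, the sketch below shows the general technique on a toy distribution. It illustrates standard temperature scaling and nucleus (top-p) selection, not the judge model's actual implementation; the function names and the example numbers are made up for this illustration, and the sketch handles only temperatures greater than 0.

```python
import math


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("this sketch only handles temperature > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Return indices of the smallest set of tokens whose cumulative
    probability reaches top_p, most likely token first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


# Toy example: four candidate tokens with made-up probabilities.
toy_probs = [0.5, 0.3, 0.15, 0.05]
print(nucleus(toy_probs, 0.7))  # keeps the two most likely tokens: [0, 1]
```

A lower temperature concentrates probability on the most likely token, and a lower Top_p shrinks the candidate set, which is why both make the judge's scores more repeatable.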
To understand how the judge model reasons about the evaluation, switch to the Prompt Preview tab. This tab shows the complete prompt sent to the judge model, with your inputs inserted into the prompt template.
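The idea of inserting your inputs into a prompt template can be sketched as follows. The template text, field names, and placeholder strings here are hypothetical and exist only to show the mechanism; the real template is whatever the Prompt Preview tab displays.

```python
from string import Template

# Hypothetical template for illustration only. The actual judge prompt
# is the one shown on the Prompt Preview tab.
PROMPT_TEMPLATE = Template(
    "Scenario: $scenario\n"
    "Evaluation criteria: $criteria\n"
    "Question: $question\n"
    "Reference answer: $reference\n"
    "Model response: $response\n"
)


def build_prompt(question, response, reference="", scenario="", criteria=""):
    """Fill the template with user inputs, marking optional fields left empty."""
    return PROMPT_TEMPLATE.substitute(
        scenario=scenario or "(auto-detect)",
        criteria=criteria or "(default)",
        question=question,
        reference=reference or "(none)",
        response=response,
    )


print(build_prompt("What is 2 + 2?", "4", reference="4"))
```

Comparing the filled prompt with your raw inputs is a quick way to check that optional fields such as the reference answer were picked up as intended.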
To explore the Playground without entering your own data, click Fill In Random Example. The page loads a pre-configured example so you can see the judge model in action.
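If you keep evaluation inputs in a script before pasting them into the Playground, the constraints described in the steps above can be checked up front. The following is a minimal sketch under stated assumptions: the class name, the mode labels "single" and "dual", and the default values are hypothetical, and the score range is assumed to be an integer.

```python
from dataclasses import dataclass
from typing import List, Optional

# Number of model responses each evaluation mode expects.
# "single" and "dual" are labels invented for this sketch.
RESPONSES_PER_MODE = {"single": 1, "dual": 2}
JUDGE_MODELS = ("pai-judge", "pai-judge-plus")


@dataclass
class PlaygroundInput:
    mode: str
    judge_model: str
    question: str
    responses: List[str]
    reference_answer: Optional[str] = None  # optional in the Playground
    score_range: int = 5      # placeholder default, not a documented one
    temperature: float = 0.5  # placeholder default
    top_p: float = 0.5        # placeholder default


def validate(cfg: PlaygroundInput) -> List[str]:
    """Return a list of problems; an empty list means the input is
    consistent with the ranges stated in this guide."""
    problems = []
    if cfg.mode not in RESPONSES_PER_MODE:
        problems.append(f"unknown mode: {cfg.mode!r}")
    elif len(cfg.responses) != RESPONSES_PER_MODE[cfg.mode]:
        expected = RESPONSES_PER_MODE[cfg.mode]
        problems.append(f"mode {cfg.mode!r} expects {expected} response(s)")
    if cfg.judge_model not in JUDGE_MODELS:
        problems.append(f"unknown judge model: {cfg.judge_model!r}")
    if not cfg.question.strip():
        problems.append("question is empty")
    if not 2 <= cfg.score_range <= 10:
        problems.append("score range must be between 2 and 10")
    if not 0 <= cfg.temperature < 2:
        problems.append("temperature must be in [0, 2)")
    if not 0 <= cfg.top_p <= 1:
        problems.append("top_p must be in [0, 1]")
    return problems
```

For example, validate(PlaygroundInput("dual", "pai-judge", "What is 2 + 2?", ["4"])) reports that the dual mode expects two responses.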
