Platform For AI: Activate and experience the service online

Last Updated: Apr 01, 2026

Evaluating Large Language Model (LLM) outputs typically requires writing evaluation scripts and preparing labeled datasets. The PAI Judge Model service removes those barriers: enter a question and one or two model responses in the browser, and the judge model scores the result immediately, with no code or datasets required.

With the Playground, you can:

  • Score a single model's response against a reference answer

  • Compare two models head-to-head to determine which performs better

  • Adjust the scoring range, evaluation criteria, and scenario to match your use case

  • Inspect the full prompt the judge model uses to understand how it reasons about the evaluation

Prerequisites

Before you begin, ensure that you have:

Activate the service

  1. Log on to the PAI console. On the Judge Model page, click Activate Now and follow the on-screen instructions.

  2. After activation, go to the Overview tab to view the Host and Token values and check call statistics.

Experience the service online

  1. On the Judge Model page, switch to the Playground tab.

  2. Select an evaluation mode.

    Mode | When to use
    Single-answer Grading | Score one model's response. Use this when you want an absolute quality score for a single output.
    Dual-model Competition | Compare responses from two models. Use this when you want to determine which model performs better on a given question.

  3. Configure the evaluation content.

    Parameter | Description
    Judge Model | The model that acts as the judge. pai-judge is a smaller model optimized for cost-effectiveness. pai-judge-plus is a larger model that produces higher-quality inference results.
    Question | The question to evaluate.
    Model Response | The model output(s) to evaluate. For Single-answer Grading, enter one response. For Dual-model Competition, enter responses from two models.
    Reference Answer | (Optional) A known correct answer. Providing a reference answer improves accuracy for deterministic questions, math problems, and translations.

  4. (Optional) Configure advanced settings.

    Evaluation scenario

    Parameter | Description
    Question Scenario | The scenario that best describes your question. The judge model auto-detects the scenario, but you can also set it manually. Supported scenarios include text rewriting, role assumption, code generation, code modification, and code analysis. Each scenario applies different evaluation criteria, which helps the judge model score more precisely.
    Scenario Description | A description of the scenario.
    Evaluation Criteria | The criteria the judge model uses to score responses. Specify custom criteria to align the evaluation with your requirements.

    Evaluation score

    Parameter | Description
    Score Range | The scoring scale for the judge model. Valid values: 2–10.
    Score Description | The meaning assigned to each score value.

    Generation parameters

    Parameter | Description
    Temperature | Controls the randomness of the generated output. Lower values produce more deterministic scores; higher values produce more varied scores. Valid values: [0, 2).
    Top_p | Controls the range of candidate tokens. The model samples from the smallest set of tokens whose cumulative probability reaches the Top_p value. Valid values: [0, 1].

  5. Click Evaluate. The judge model streams the evaluation result to the Evaluation Result tab. Submit feedback on the result to help improve the judge model.

    To understand how the judge model reasons about the evaluation, switch to the Prompt Preview tab. This tab shows the complete prompt sent to the judge model, with your inputs inserted into the prompt template.

  6. To explore the Playground without entering your own data, click Fill In Random Example. The page loads a pre-configured example so you can see the judge model in action.
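
The Temperature and Top_p settings in step 4 follow the usual sampling conventions for language models. The Python sketch below illustrates those conventions on a toy distribution. It is not the judge model's implementation: the function names and the example logits are hypothetical, and treating a temperature of 0 as greedy selection is an assumption.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Turn raw logits into probabilities after temperature scaling."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumption: temperature 0 means greedy selection of the top token.
        best = max(logits, key=logits.get)
        return {tok: (1.0 if tok == best else 0.0) for tok in logits}
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in ranked:
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Hypothetical logits for three candidate score tokens.
toy_logits = {"3": 2.0, "4": 1.0, "5": 0.0}
candidates = top_p_filter(apply_temperature(toy_logits, 1.0), 0.9)
print(sorted(candidates))
```

With these toy logits, lowering the temperature concentrates probability on the top-ranked token, and lowering Top_p shrinks the candidate set. Both effects make repeated evaluations of the same input more likely to return the same score.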
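
The value ranges and response counts documented in steps 2 through 4 can be checked locally before inputs are entered in the Playground. The sketch below is one way to do that. The class, field, and mode names are illustrative and are not part of the service, and reading Score Range as an integer from 2 to 10 is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical identifiers for the two Playground modes.
SINGLE = "single-answer-grading"
DUAL = "dual-model-competition"


@dataclass
class PlaygroundSettings:
    mode: str
    question: str
    responses: List[str]
    score_range: int
    temperature: float
    top_p: float
    reference_answer: Optional[str] = None  # optional in the Playground


def validate(settings: PlaygroundSettings) -> List[str]:
    """Return a list of problems; an empty list means the settings
    fall inside the documented ranges."""
    problems: List[str] = []

    # Single-answer Grading takes one response, Dual-model Competition two.
    expected = {SINGLE: 1, DUAL: 2}.get(settings.mode)
    if expected is None:
        problems.append(f"unknown mode: {settings.mode!r}")
    elif len(settings.responses) != expected:
        problems.append(f"{settings.mode} expects {expected} response(s)")

    if not settings.question.strip():
        problems.append("question is empty")
    if not 2 <= settings.score_range <= 10:
        problems.append("score_range must be between 2 and 10")
    if not 0 <= settings.temperature < 2:
        problems.append("temperature must be in [0, 2)")
    if not 0 <= settings.top_p <= 1:
        problems.append("top_p must be in [0, 1]")
    return problems


example = PlaygroundSettings(
    mode=SINGLE,
    question="What is 2 + 2?",
    responses=["4"],
    score_range=5,
    temperature=0.2,
    top_p=0.9,
    reference_answer="4",
)
print(validate(example))
```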