This topic provides examples of single (online) and batch (offline) calls to judge models.
Prerequisites
The judge model feature is enabled. For more information, see Enable the service and try it online.
Obtain the Host and Token values on the Judge Model page, and then determine the endpoint of the judge model from the Host value. You use this endpoint to call the judge model for evaluation.
The following table lists the endpoints of judge models in different use scenarios.
Use scenario | Feature | BASE_URL/endpoint |
Call a judge model by using an SDK for Python | - | https://aiservice.cn-hangzhou.aliyuncs.com/v1 |
Call a judge model over HTTP | Chat Completions | https://aiservice.cn-hangzhou.aliyuncs.com/v1/chat/completions |
Call a judge model over HTTP | Files | https://aiservice.cn-hangzhou.aliyuncs.com/v1/files |
Call a judge model over HTTP | Batch | https://aiservice.cn-hangzhou.aliyuncs.com/v1/batches |
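All endpoints in the table share one base URL, so they can be derived from the Host value. The following is a minimal sketch, assuming that Host is a host name such as aiservice.cn-hangzhou.aliyuncs.com (with or without a scheme). The helper name judge_endpoints is hypothetical and not part of any SDK.

```python
def judge_endpoints(host: str) -> dict:
    """Build the judge model endpoints from a Host value.

    Assumption: `host` is a host name, optionally prefixed with a scheme
    and optionally followed by a trailing slash.
    """
    host = host.strip().rstrip("/")
    if not host.startswith(("http://", "https://")):
        host = "https://" + host
    base_url = host + "/v1"
    return {
        "base_url": base_url,  # used by the SDK for Python
        "chat_completions": base_url + "/chat/completions",
        "files": base_url + "/files",
        "batches": base_url + "/batches",
    }


endpoints = judge_endpoints("aiservice.cn-hangzhou.aliyuncs.com")
print(endpoints["chat_completions"])
```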
Models
The following table describes the supported judge models.
Model name | Description | Context length | Maximum input | Maximum output |
pai-judge (standard edition) | A small, cost-effective judge model. | 32768 | 32768 | 32768 |
pai-judge-plus (advanced edition) | A large judge model with excellent inference performance. | 32768 | 32768 | 32768 |
Single call (online call)
Judge models support single model evaluation and dual model competition. If these evaluation modes do not meet your business requirements, you can use custom templates.
For more information about the parameters, see Input parameter and Response parameter.
Single model evaluation
Single model evaluation refers to the evaluation of the answer quality of a single large language model.
Sample request
import os

from openai import OpenAI


def main():
    base_url = "https://aiservice.cn-hangzhou.aliyuncs.com/v1"
    judge_model_token = os.getenv("JUDGE_MODEL_TOKEN")

    client = OpenAI(
        api_key=f'Authorization: Bearer {judge_model_token}',
        base_url=base_url
    )
    completion = client.chat.completions.create(
        model='pai-judge',
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "mode": "single",
                        "type": "json",
                        "json": {
                            "question": "According to the first couplet, give the second couplet. first couplet: To climb the mountain, reach the peak",
                            "answer": "To cross the river, find the creek."
                        }
                    }
                ]
            }
        ]
    )
    print(completion.model_dump())


if __name__ == '__main__':
    main()
$ curl -X POST https://aiservice.cn-hangzhou.aliyuncs.com/v1/chat/completions \
    -H "Authorization: Bearer ${JUDGE_MODEL_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "pai-judge",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "mode": "single",
                        "type": "json",
                        "json": {
                            "question": "According to the first couplet, give the second couplet. first couplet: To climb the mountain, reach the peak",
                            "answer": "To cross the river, find the creek."
                        }
                    }
                ]
            }
        ]
    }'
Response
{
    "id": "3b7c3822-1e51-4dc9-b2ad-18b9649a7f19",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": "I think the overall score of the answer is [[2]] due to the following reasons:\nAdvantages of the answer: \n1. Relevance: The answer directly addresses the question. This meets the relevance criteria. [[4]]\n2. Harmlessness: The answer is appropriate and does not contain offensive content. This meets the harmlessness criteria. [[5]]\n\nShortcomings of the answer:\n1. Accuracy: The content \"To cross the river, find the creek\" in the answer does not fully align with the logical sequence of \"climbing a mountain\" and \"reaching the summit\", which does not completely correspond to the concept of \"climbing a mountain\" in the user's instruction, affecting the accuracy. [[2]]\n2. Completeness: The answer does not fully address all aspects of the question because the answer does not provide a complete story or completely align with the question. This affects the completeness of the answer. [[2]]\n3. Source reliability: The answer does not provide source information. Although the source information may not be necessary in some scenarios, the information can enhance the credibility of the answer. [[3]]\n4. Clarity and structure: Although the answer is simple in structure, its clarity and comprehensibility are affected because the answer does not fully correspond to the question. [[3]]\n5. Adaptability to the user level: Although the answer directly addresses the question, the answer may not be completely suitable for users who have a certain understanding of couplets or traditional literature due to inaccuracy. [[3]]\n\n In summary, although the answer performs well in relevance and harmlessness, the answer shows shortcomings in accuracy, completeness, source reliability, clarity and structure, and adaptability to the user level, which results in an overall rating of 2.",
                "role": "assistant",
                "function_call": null,
                "tool_calls": null,
                "refusal": ""
            }
        }
    ],
    "created": 1733260,
    "model": "pai-judge",
    "object": "chat.completion",
    "service_tier": "",
    "system_fingerprint": "",
    "usage": {
        "completion_tokens": 333,
        "prompt_tokens": 790,
        "total_tokens": 1123
    }
}
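The judge marks each score in the response text with double square brackets. The following is a minimal sketch of extracting those scores, assuming that the output follows the layout of the sample response: the first [[n]] marker is the overall score, and each criterion is written as "1. Name: explanation [[n]]". The function name parse_single_scores is hypothetical, and the patterns may need adjusting if the judge output is worded differently.

```python
import re


def parse_single_scores(content: str) -> dict:
    """Extract the overall score and per-criterion scores from judge output."""
    # Assumption: the first [[n]] marker in the text is the overall score.
    overall_match = re.search(r"\[\[(\d+)\]\]", content)
    overall = int(overall_match.group(1)) if overall_match else None

    # Assumption: each criterion is written as "<number>. <name>: ... [[n]]".
    pattern = re.compile(r"^\s*\d+\.\s*([^:\n]+):.*?\[\[(\d+)\]\]", re.M | re.S)
    criteria = {name.strip(): int(score) for name, score in pattern.findall(content)}
    return {"overall": overall, "criteria": criteria}


sample = (
    "I think the overall score of the answer is [[2]] due to the following reasons:\n"
    "Advantages of the answer: \n"
    "1. Relevance: The answer directly addresses the question. [[4]]\n"
    "2. Harmlessness: The answer does not contain offensive content. [[5]]\n\n"
    "Shortcomings of the answer:\n"
    "1. Accuracy: The answer does not fully align with the question. [[2]]"
)
print(parse_single_scores(sample))
```

With the SDK for Python, the text to parse is available as completion.choices[0].message.content.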
Dual model competition
Dual model competition refers to the evaluation of the answer quality of two large language models for the same question.
Sample request
import os

from openai import OpenAI


def main():
    base_url = "https://aiservice.cn-hangzhou.aliyuncs.com/v1"
    judge_model_token = os.getenv("JUDGE_MODEL_TOKEN")

    client = OpenAI(
        api_key=f'Authorization: Bearer {judge_model_token}',
        base_url=base_url
    )
    completion = client.chat.completions.create(
        model='pai-judge',
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "mode": "pairwise",
                        "type": "json",
                        "json": {
                            "question": "According to the first couplet, give the second couplet. first couplet: To climb the mountain, reach the peak",
                            "answer1": "To cross the river, find the creek.",
                            "answer2": "To chase the dream, grasp the star."
                        }
                    }
                ]
            }
        ]
    )
    print(completion.model_dump())


if __name__ == '__main__':
    main()
$ curl -X POST https://aiservice.cn-hangzhou.aliyuncs.com/v1/chat/completions \
    -H "Authorization: Bearer ${JUDGE_MODEL_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "pai-judge",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "mode": "pairwise",
                        "type": "json",
                        "json": {
                            "question": "According to the first couplet, give the second couplet. first couplet: To climb the mountain, reach the peak",
                            "answer1": "To cross the river, find the creek.",
                            "answer2": "To chase the dream, grasp the star."
                        }
                    }
                ]
            }
        ]
    }'
Response
{
    'id': 'a7026e5a-64c5-4726-9b10-27072ff34d46',
    'choices': [{
        'finish_reason': 'stop',
        'index': 0,
        'logprobs': None,
        'message': {
            'content': '***\n I regard [[the two answers as equivalent]]. The overall score of Answer 1 is [[4]] and the overall score of Answer 2 is [[4]] for the following reasons: \n1. Accuracy: The two answers accurately address the question and do not contain incorrect or misleading information. [[Rating for Answer 1: 5]] [[Rating for Answer 2: 5]]\n2. Relevance: The two answers directly address the question without including unnecessary information or background and completely meet the user requirements. [[Rating for Answer 1: 5]] [[Rating for Answer 2: 5]]\n3. Harmlessness: The two answers do not contain offensive content. The two answers are positive and appropriate in their expression and meet the requirements for appropriateness and cultural sensitivity. [[Rating for Answer 1: 5]] [[Rating for Answer 2: 5]]\n4. Completeness: The two answers completely provide a right couplet to the question without missing key points. [[Rating for Answer 1: 5]] [[Rating for Answer 2: 5]]\n5. Source reliability: Although the two answers do not cite external authoritative sources, the creation and sharing of couplets often do not require external validation in this scenario. In this case, the source reliability can be ignored. [[Rating for Answer 1: 5]] [[Rating for Answer 2: 5]]\n6. Clarity and structure: The two answers are concise, clearly structured, and easy to understand. [[Rating for Answer 1: 4]] [[Rating for Answer 2: 4]]\n7. Timeliness: This criteria is inapplicable in this scenario because couplet culture has a rich history, and both answers conform to traditional expressions. [[Rating for Answer 1: N/A]] [[Rating for Answer 2: N/A]]\n8. Adaptability to the user level: The two answers use simple and understandable language. In this case, both answers are suitable for users of any level. [[Rating for Answer 1: N/A]] [[Rating for Answer 2: N/A]]\n\n In summary, the two answers perform equally well across all criteria and adequately meet the user requirements. I regard the two answers as equivalent.\n***',
            'role': 'assistant',
            'function_call': None,
            'tool_calls': None,
            'refusal': ''
        }
    }],
    'created': 1734557,
    'model': 'pai-judge',
    'object': 'chat.completion',
    'service_tier': '',
    'system_fingerprint': '',
    'usage': {
        'completion_tokens': 408,
        'prompt_tokens': 821,
        'total_tokens': 1229
    }
}
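In pairwise mode, the response text carries a verdict, one overall score per answer, and per-criterion ratings. The following is a minimal sketch of pulling these out, assuming that the output is worded like the sample response ("I regard [[...]]", "overall score of Answer 1 is [[4]]", "[[Rating for Answer 1: 5]]"). The function name parse_pairwise is hypothetical, and other verdict wordings are not covered by the source, so treat the patterns as a starting point.

```python
import re


def parse_pairwise(content: str) -> dict:
    """Extract the verdict, overall scores, and ratings from pairwise output."""
    # Assumption: the verdict is the first "I regard [[...]]" phrase.
    verdict_match = re.search(r"I regard \[\[(.+?)\]\]", content)
    verdict = verdict_match.group(1) if verdict_match else None

    # Overall scores, keyed by answer number (1 or 2).
    overall = {
        int(answer): int(score)
        for answer, score in re.findall(
            r"overall score of Answer ([12]) is \[\[(\d+)\]\]", content
        )
    }

    # Per-criterion ratings in order of appearance; "N/A" becomes None.
    ratings = {1: [], 2: []}
    for answer, value in re.findall(
        r"\[\[Rating for Answer ([12]): (\d+|N/A)\]\]", content
    ):
        ratings[int(answer)].append(None if value == "N/A" else int(value))

    return {"verdict": verdict, "overall": overall, "ratings": ratings}


sample = (
    "***\n I regard [[the two answers as equivalent]]. The overall score of "
    "Answer 1 is [[4]] and the overall score of Answer 2 is [[4]] for the "
    "following reasons: \n1. Accuracy: Both answers are accurate. "
    "[[Rating for Answer 1: 5]] [[Rating for Answer 2: 5]]\n"
    "7. Timeliness: Not applicable. "
    "[[Rating for Answer 1: N/A]] [[Rating for Answer 2: N/A]]"
)
print(parse_pairwise(sample))
```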
Custom templates
When you call a judge model as shown in the preceding examples, the system generates the corresponding prompt template for you. If these templates do not meet your requirements, you can supply your own evaluation template in the system and user messages. The following example uses a custom template to score a single answer on a scale of 1 to 5.
Sample request
import os

from openai import OpenAI


def main():
    base_url = "https://aiservice.cn-hangzhou.aliyuncs.com/v1"
    judge_model_token = os.getenv("JUDGE_MODEL_TOKEN")

    client = OpenAI(
        api_key=f'Authorization: Bearer {judge_model_token}',
        base_url=base_url
    )

    system = "Please evaluate the quality of the following answers to the question from AI assistants as a judge.\n\n" \
             "The following description provides basic character introductions to the AI assistants:\n" \
             "AI assistants do not evaluate, compare, or do anything harmful to people. The AI assistants have a personality that leans towards being independent and autonomous.\n"
    user = \
        "Please score the following answer to the question on a scale of 1 to 5: \n" \
        "Question: What do you think is the effect of social media on relationships?\n" \
        "Answer: Social media allows people to easily keep in contact with each other. However, social media can also lead to alienation.\n" \
        "Scoring criteria: \n" \
        "1: The answer is completely irrelevant, has no content, or is completely incorrect.\n" \
        "2: Specific content of the answer is relevant. However, the content is superficial or excessively brief.\n" \
        "3: The answer is relevant and provides insights. However, the answer lacks in-depth analysis.\n" \
        "4: The answer is relevant, in-depth, and provides clear insights and examples.\n" \
        "5: The answer is very relevant and profound and provides comprehensive insights and rich examples."

    completion = client.chat.completions.create(
        model='pai-judge',
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user}
        ]
    )
    print(completion.model_dump())


if __name__ == '__main__':
    main()
$ curl -X POST https://aiservice.cn-hangzhou.aliyuncs.com/v1/chat/completions \
    -H "Authorization: Bearer ${JUDGE_MODEL_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "pai-judge",
        "messages": [
            {
                "role": "system",
                "content": "Please evaluate the quality of the following answer to the question from AI assistants as a judge.\n\nThe following section provides basic character introductions to the AI assistants: \n The AI assistants do not evaluate, compare, or do anything harmful to people."
            },
            {
                "role": "user",
                "content": "Please score the following answer to the question on a scale of 1 to 5: \nQuestion: What do you think is the effect of social media on relationships?\nAnswer: Social media allows people to easily keep in contact with each other. However, social media can also lead to alienation.\nScoring criteria: \n1: The answer is completely irrelevant, has no content, or is completely incorrect.\n2: Specific content of the answer is relevant. However, the content is superficial or excessively brief.\n3: The answer is relevant and provides insights. However, the answer lacks in-depth analysis.\n4: The answer is relevant, in-depth, and provides clear insights and examples.\n5: The answer is very relevant and profound and provides comprehensive insights and rich examples."
            }
        ]
    }'
Response
{
    "id": "e2f72777-ddf5-4ff8-b7dd-4ecefd6e4014",
    "object": "chat.completion",
    "created": 1153092,
    "model": "pai-judge",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Based on the provided scoring criteria, I rate this answer as 3. The answer \"Social media allows people to easily keep in contact with each other. However, social media can also lead to alienation.\" is explicit in relevance and directly addresses the effect of social media on relationships. The answer mentions two opposite effects: strengthening relationships and alienation, which reveals some insights. However, the answer is relatively brief, does not further develop the specific effect of the two aspects or provide examples to support its views, and lacks in-depth analysis. The answer is scored 3 based on the criteria. The answer is relevant to the question and provides insights. However, the answer lacks in-depth analysis.",
                "refusal": "",
                "function_call": null,
                "tool_calls": null
            },
            "finish_reason": "stop",
            "logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 910,
        "completion_tokens": 411,
        "total_tokens": 1321
    },
    "system_fingerprint": "",
    "service_tier": ""
}
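A custom template is an ordinary pair of system and user messages, so the messages can be assembled from reusable parts. The following is a minimal sketch that rebuilds a user prompt in the same layout as the sample request. The function name build_custom_messages and its parameters are hypothetical and only illustrate one way to organize a template.

```python
def build_custom_messages(system_prompt, question, answer, criteria):
    """Assemble system and user messages for a custom scoring template.

    `criteria` maps each integer score to its description.
    """
    scores = sorted(criteria)
    lines = [
        "Please score the following answer to the question "
        f"on a scale of {scores[0]} to {scores[-1]}: ",
        f"Question: {question}",
        f"Answer: {answer}",
        "Scoring criteria: ",
    ]
    # One line per score, in ascending order.
    lines.extend(f"{score}: {criteria[score]}" for score in scores)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "\n".join(lines)},
    ]


messages = build_custom_messages(
    system_prompt="Please evaluate the quality of the following answer as a judge.",
    question="What do you think is the effect of social media on relationships?",
    answer="Social media allows people to easily keep in contact with each other.",
    criteria={
        1: "The answer is completely irrelevant, has no content, or is completely incorrect.",
        3: "The answer is relevant and provides insights. However, the answer lacks in-depth analysis.",
        5: "The answer is very relevant and profound and provides comprehensive insights and rich examples.",
    },
)
print(messages[1]["content"])
```

The returned list can be passed as the messages argument of client.chat.completions.create.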
Batch call (offline call)
Step 1: Prepare batch data
The batch data file must meet the following requirements:
The size of a single file cannot exceed 10 MB. If the file is large, you can split it into multiple files for upload.
The total size of all files uploaded by an account cannot exceed 100 GB.
The file must be in the JSON Lines (.jsonl) format.
Each line describes a single API request and must include a body field and a unique custom_id value. For more information about the parameters supported by the body field, see Input parameter.
For reference, the file format is as follows:
{"custom_id": "request-1", "body": {"model": "pai-judge", "messages": [{"role": "user", "content": [{"mode": "single", "type": "json", "json": {"question": "According to the first couplet, give the second couplet. first couplet: To climb the mountain, reach the peak", "answer": "To cross the river, find the creek."}}]}]}}
{"custom_id": "request-2", "body": {"model": "pai-judge-plus", "messages": [{"role": "user", "content": [{"mode": "single", "type": "json", "json": {"question": "According to the first couplet, give the second couplet. first couplet: To climb the mountain, reach the peak", "answer": "To cross the river, find the creek."}}]}]}}
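A batch file in this format can be generated from a list of question and answer pairs. The following is a minimal sketch for single model evaluation requests. The function names are hypothetical, the custom_id values follow the request-N pattern of the sample, and the 10 MB limit is interpreted as 10 * 1024 * 1024 bytes, which is an assumption.

```python
import json
from pathlib import Path

# Assumption: "10 MB" means 10 * 1024 * 1024 bytes.
MAX_FILE_BYTES = 10 * 1024 * 1024


def build_batch_lines(samples, model="pai-judge"):
    """Return one JSON string per request, each with a unique custom_id."""
    lines = []
    for index, sample in enumerate(samples, start=1):
        record = {
            "custom_id": f"request-{index}",
            "body": {
                "model": model,
                "messages": [{
                    "role": "user",
                    "content": [{
                        "mode": "single",
                        "type": "json",
                        "json": {
                            "question": sample["question"],
                            "answer": sample["answer"],
                        },
                    }],
                }],
            },
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


def encode_batch_file(lines, max_bytes=MAX_FILE_BYTES):
    """Join the lines into .jsonl bytes and enforce the file size limit."""
    data = ("\n".join(lines) + "\n").encode("utf-8")
    if len(data) > max_bytes:
        raise ValueError(f"{len(data)} bytes exceeds the limit; split the file")
    return data


def write_batch_file(lines, path):
    """Write the encoded batch data to `path`."""
    Path(path).write_bytes(encode_batch_file(lines))


samples = [
    {"question": "First couplet: To climb the mountain, reach the peak",
     "answer": "To cross the river, find the creek."},
    {"question": "First couplet: To climb the mountain, reach the peak",
     "answer": "To chase the dream, grasp the star."},
]
batch_lines = build_batch_lines(samples)
print(batch_lines[0])
```

Calling write_batch_file(batch_lines, "input.jsonl") would then produce a file that can be uploaded in the next step.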
Step 2: Upload batch data
Upload the batch data to the server using the data upload interface of the judge model to obtain a unique file_id value.
Sample request:
import os

from openai import OpenAI


def main():
    base_url = "https://aiservice.cn-hangzhou.aliyuncs.com/v1"
    judge_model_token = os.getenv("JUDGE_MODEL_TOKEN")

    client = OpenAI(
        api_key=f'Authorization: Bearer {judge_model_token}',
        base_url=base_url
    )

    # Open the file in a with block so that it is closed after the upload.
    with open("/home/xxx/input.jsonl", "rb") as input_file:
        upload_files = client.files.create(
            file=input_file,
            purpose="batch",
        )
    print(upload_files.model_dump_json(indent=4))


if __name__ == '__main__':
    main()
$ curl -XPOST https://aiservice.cn-hangzhou.aliyuncs.com/v1/files \
    -H "Authorization: Bearer ${JUDGE_MODEL_TOKEN}" \
    -F purpose="batch" \
    -F file="@/home/xxx/input.jsonl"
Response:
{
    "id": "file-batch-EC043540BE1C7BE3F9F2F0A8F47D1713",
    "object": "file",
    "bytes": 698,
    "created_at": 1742454203,
    "filename": "input.jsonl",
    "purpose": "batch"
}
Step 3: Create a batch task
After you upload the file, create a batch task using the file_id of the input file.
In this example, the file_id is file-batch-EC043540BE1C7BE3F9F2F0A8F47D1713. The completion window (completion_window) can currently be set only to 24 hours (24h). After the task is created, a unique batch_id value is returned.
Sample request:
import os

from openai import OpenAI


def main():
    base_url = "https://aiservice.cn-hangzhou.aliyuncs.com/v1"
    judge_model_token = os.getenv("JUDGE_MODEL_TOKEN")

    client = OpenAI(
        api_key=f'Authorization: Bearer {judge_model_token}',
        base_url=base_url
    )

    create_batches = client.batches.create(
        endpoint="/v1/chat/completions",
        input_file_id="file-batch-EC043540BE1C7BE3F9F2F0A8F47D1713",
        completion_window="24h",
    )
    print(create_batches.model_dump_json(indent=4))


if __name__ == '__main__':
    main()
$ curl -XPOST https://aiservice.cn-hangzhou.aliyuncs.com/v1/batches \
    -H "Authorization: Bearer ${JUDGE_MODEL_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{
        "input_file_id": "file-batch-EC043540BE1C7BE3F9F2F0A8F47D1713",
        "endpoint": "/v1/chat/completions",
        "completion_window": "24h"
    }'
Response:
{
    "id": "batch_66f245a0-88d1-458c-8e1c-a819a5943022",
    "object": "batch",
    "endpoint": "/v1/chat/completions",
    "errors": null,
    "input_file_id": "file-batch-EC043540BE1C7BE3F9F2F0A8F47D1713",
    "completion_window": "24h",
    "status": "Creating",
    "output_file_id": null,
    "error_file_id": null,
    "created_at": 1742455213,
    "in_process_at": null,
    "expires_at": null,
    "FinalizingAt": null,
    "completed_at": null,
    "failed_at": null,
    "expired_at": null,
    "cancelling_at": null,
    "cancelled_at": null,
    "request_counts": {
        "total": 3,
        "completed": 0,
        "failed": 0
    },
    "metadata": null
}
Step 4: View the task status
Query the running status of the task using the batch_id. When the running status is Succeeded, the response contains the ID of the generated file: output_file_id.
Sample request:
import os

from openai import OpenAI


def main():
    base_url = "https://aiservice.cn-hangzhou.aliyuncs.com/v1"
    judge_model_token = os.getenv("JUDGE_MODEL_TOKEN")

    client = OpenAI(
        api_key=f'Authorization: Bearer {judge_model_token}',
        base_url=base_url
    )

    retrieve_batches = client.batches.retrieve(
        batch_id="batch_66f245a0-88d1-458c-8e1c-a819a5943022",
    )
    print(retrieve_batches.model_dump_json(indent=4))


if __name__ == '__main__':
    main()
$ curl -XGET https://aiservice.cn-hangzhou.aliyuncs.com/v1/batches/batch_66f245a0-88d1-458c-8e1c-a819a5943022 \
    -H "Authorization: Bearer ${JUDGE_MODEL_TOKEN}"
Response:
For more information about the fields in the response, see Description of batch task objects.
{
    "id": "batch_66f245a0-88d1-458c-8e1c-a819a5943022",
    "object": "batch",
    "endpoint": "/v1/chat/completions",
    "errors": null,
    "input_file_id": "file-batch-EC043540BE1C7BE3F9F2F0A8F47D1713",
    "completion_window": "24h",
    "status": "Succeeded",
    "output_file_id": "file-batch_output-66f245a0-88d1-458c-8e1c-a819a5943022",
    "error_file_id": null,
    "created_at": 1742455213,
    "in_process_at": 1742455640,
    "expires_at": 1742455640,
    "FinalizingAt": 1742455889,
    "completed_at": 1742455889,
    "failed_at": null,
    "expired_at": null,
    "cancelling_at": null,
    "cancelled_at": null,
    "request_counts": {
        "total": 3,
        "completed": 3,
        "failed": 0
    },
    "metadata": null
}
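Because a batch task runs asynchronously, the status query is usually repeated until the task finishes. The following is a minimal polling sketch. It accepts any client object that exposes batches.retrieve, as the OpenAI client in the sample request does. Only the Creating and Succeeded statuses appear in this topic. The other terminal status names in the code (Failed, Expired, Cancelled) are assumptions and should be checked against the batch task object description. The function name wait_for_batch is hypothetical.

```python
import time


def wait_for_batch(
    client,
    batch_id,
    poll_seconds=30.0,
    timeout_seconds=24 * 3600,
    # Assumption: only "Succeeded" is confirmed by this topic; the other
    # terminal status names are guesses.
    terminal_statuses=("Succeeded", "Failed", "Expired", "Cancelled"),
    sleep=time.sleep,
):
    """Poll a batch task until it reaches a terminal status or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in terminal_statuses:
            return batch
        if time.monotonic() >= deadline:
            raise TimeoutError(f"batch {batch_id} is still {batch.status}")
        sleep(poll_seconds)
```

A call such as wait_for_batch(client, "batch_66f245a0-88d1-458c-8e1c-a819a5943022") returns the final batch object, whose output_file_id is set when the status is Succeeded.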
Step 5: Obtain the task result
Query and download the content of the generated file using the output_file_id.
Sample request:
import os

from openai import OpenAI


def main():
    base_url = "https://aiservice.cn-hangzhou.aliyuncs.com/v1"
    judge_model_token = os.getenv("JUDGE_MODEL_TOKEN")

    client = OpenAI(
        api_key=f'Authorization: Bearer {judge_model_token}',
        base_url=base_url
    )

    content_files = client.files.content(
        file_id="file-batch_output-66f245a0-88d1-458c-8e1c-a819a5943022",
    )
    # Print the file body (JSON Lines text) rather than the response object.
    print(content_files.text)


if __name__ == '__main__':
    main()
$ curl -XGET https://aiservice.cn-hangzhou.aliyuncs.com/v1/files/file-batch_output-66f245a0-88d1-458c-8e1c-a819a5943022/content \
    -H "Authorization: Bearer ${JUDGE_MODEL_TOKEN}" > output.jsonl
Response:
{"id":"dcee3584-6f30-9541-a855-873a6d86b7d9","custom_id":"request-1","response":{"status_code":200,"request_id":"dcee3584-6f30-9541-a855-873a6d86b7d9","body":{"created":1737446797,"usage":{"completion_tokens":7,"prompt_tokens":26,"total_tokens":33},"model":"pai-judge","id":"chatcmpl-dcee3584-6f30-9541-a855-873a6d86b7d9","choices":[{"finish_reason":"stop","index":0,"message":{"content":"2+2 equals 4."}}],"object":"chat.completion"}},"error":null}
{"id":"dcee3584-6f30-9541-a855-873a6d86b7d9","custom_id":"request-2","response":{"status_code":200,"request_id":"dcee3584-6f30-9541-a855-873a6d86b7d9","body":{"created":1737446797,"usage":{"completion_tokens":7,"prompt_tokens":26,"total_tokens":33},"model":"pai-judge-plus","id":"chatcmpl-dcee3584-6f30-9541-a855-873a6d86b7d9","choices":[{"finish_reason":"stop","index":0,"message":{"content":"2+2 equals 4."}}],"object":"chat.completion"}},"error":null}
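Each line of the output file is a JSON object that can be matched back to its request through custom_id. The following is a minimal sketch of reading the results, assuming that the output has the structure of the sample response. Treating a non-null error field or a status_code other than 200 as a failure is also an assumption. The function name parse_batch_output is hypothetical.

```python
import json


def parse_batch_output(jsonl_text):
    """Map custom_id to judge output, collecting failed requests separately."""
    results, errors = {}, {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        custom_id = record["custom_id"]
        response = record.get("response") or {}
        # Assumption: a request failed if "error" is set or the status is not 200.
        if record.get("error") is not None or response.get("status_code") != 200:
            errors[custom_id] = record.get("error") or response
            continue
        results[custom_id] = response["body"]["choices"][0]["message"]["content"]
    return results, errors


sample_output = "\n".join([
    json.dumps({"custom_id": "request-1", "error": None, "response": {
        "status_code": 200,
        "body": {"choices": [{"index": 0, "message": {"content": "2+2 equals 4."}}]},
    }}),
    json.dumps({"custom_id": "request-2", "error": {"message": "hypothetical failure"},
                "response": None}),
])
results, errors = parse_batch_output(sample_output)
print(results, errors)
```

The text to parse can come from the downloaded output.jsonl file or from the file content returned by the SDK.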