This topic provides examples of single calls and batch calls to the judge model.
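Prerequisites
The judge model feature is activated. For more information, see Activate the service and try it online.
The online call parameters Host and Token are obtained from the judge model page. Use Host to build the access address, then call the judge model through that address to evaluate models.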
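The access addresses are as follows:
Call method | Feature | BASE_URL/endpoint |
Call the judge model by using the Python SDK | - | https://aiservice.cn-hangzhou.aliyuncs.com/v1 |
Call the judge model over HTTP | Chat Completions | https://aiservice.cn-hangzhou.aliyuncs.com/v1/chat/completions |
Call the judge model over HTTP | Files | https://aiservice.cn-hangzhou.aliyuncs.com/v1/files |
Call the judge model over HTTP | Batch | https://aiservice.cn-hangzhou.aliyuncs.com/v1/batches |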
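The addresses in the table can also be derived from the Host value in code. The following is a minimal sketch rather than part of any SDK: the helper name build_endpoints is hypothetical, and it assumes that Host is a bare domain such as aiservice.cn-hangzhou.aliyuncs.com, possibly already prefixed with a scheme.

```python
def build_endpoints(host: str) -> dict:
    """Build the judge model access addresses from the Host value."""
    host = host.strip().rstrip("/")
    # Assumption: a Host without a scheme is served over HTTPS.
    if not host.startswith(("http://", "https://")):
        host = "https://" + host
    base_url = f"{host}/v1"
    return {
        "base_url": base_url,  # used by the Python SDK
        "chat_completions": f"{base_url}/chat/completions",
        "files": f"{base_url}/files",
        "batches": f"{base_url}/batches",
    }


if __name__ == "__main__":
    endpoints = build_endpoints("aiservice.cn-hangzhou.aliyuncs.com")
    for name, url in endpoints.items():
        print(f"{name}: {url}")
```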
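Model list
The judge model currently supports the following models:
Model name | Description | Context length | Maximum input | Maximum output |
Judge model Standard Edition (pai-judge) | Smaller model that is more cost-effective. | 32768 | 32768 | 32768 |
Judge model Premium Edition (pai-judge-plus) | Larger model with better inference quality. | 32768 | 32768 | 32768 |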
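Single calls (online calls)
The judge model supports two evaluation modes: single-model evaluation and pairwise (two-model) comparison. If neither mode meets your business requirements, you can use a custom template.
Single-model evaluation
Single-model evaluation assesses the answer quality of one large language model.
Sample request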
Python SDK:
import os

from openai import OpenAI


def main():
    base_url = "https://aiservice.cn-hangzhou.aliyuncs.com/v1"
    # Read the token from the environment instead of hard-coding it.
    judge_model_token = os.getenv("JUDGE_MODEL_TOKEN")

    client = OpenAI(
        api_key=f'Authorization: Bearer {judge_model_token}',
        base_url=base_url
    )
    completion = client.chat.completions.create(
        model='pai-judge',
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        # "single" selects single-model evaluation.
                        "mode": "single",
                        "type": "json",
                        "json": {
                            "question": "According to the first couplet, give the second couplet. first couplet: To climb the mountain, reach the peak",
                            "answer": "To cross the river, find the creek."
                        }
                    }
                ]
            }
        ]
    )
    print(completion.model_dump())


if __name__ == '__main__':
    main()

HTTP:
$ curl -X POST https://aiservice.cn-hangzhou.aliyuncs.com/v1/chat/completions \
    -H "Authorization: Bearer ${JUDGE_MODEL_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "pai-judge",
      "messages": [
        {
          "role": "user",
          "content": [
            {
              "mode": "single",
              "type": "json",
              "json": {
                "question": "According to the first couplet, give the second couplet. first couplet: To climb the mountain, reach the peak",
                "answer": "To cross the river, find the creek."
              }
            }
          ]
        }
      ]
    }'

Response
{
  "id": "3b7c3822-1e51-4dc9-b2ad-18b9649a7f19",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "我认为该回复的综合评分为[[2]],理由如下。\n当前回复的优点:\n1. 相关性:回复直接针对用户的指令,提供了与第一句相对应的第二句,符合相关性标准。[[4]]\n2. 无害性:回复内容适宜,没有包含任何可能冒犯的内容,符合无害性标准。[[5]]\n\n当前回复的不足:\n1. 准确性:回复中的内容“To cross the river, find the creek”并不完全符合“爬山”和“登顶”的逻辑顺序,这与用户指令中的“爬山”概念不完全对应,影响了准确性的体现。[[2]]\n2. 完整性:回复没有全面覆盖问题的各个方面,即没有提供一个完整的故事或对联的第二句,这影响了完整性的实现。[[2]]\n3. 来源可靠性:回复没有提供任何来源信息,虽然这在某些情况下可能不是必须的,但提供来源可以增加回复的可信度。[[3]]\n4. 清晰度和结构:虽然回复结构简单,但因为内容与用户指令的不完全对应,影响了其清晰度和易于理解性的评价。[[3]]\n5. 适应用户水平:回复直接性较强,但因为内容的准确性问题,可能不完全适合对联或传统文学有一定了解的用户。[[3]]\n\n综上所述,虽然回复在相关性和无害性方面做得不错,但在准确性、完整性、来源可靠性、清晰度和结构以及适应用户水平方面存在不足,因此综合评分为2。",
        "role": "assistant",
        "function_call": null,
        "tool_calls": null,
        "refusal": ""
      }
    }
  ],
  "created": 1733260,
  "model": "pai-judge",
  "object": "chat.completion",
  "service_tier": "",
  "system_fingerprint": "",
  "usage": {
    "completion_tokens": 333,
    "prompt_tokens": 790,
    "total_tokens": 1123
  }
}
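The verdict comes back as free text, with scores wrapped in double square brackets. To use the overall score programmatically, you can extract it with a regular expression. This is a minimal sketch: the function name extract_overall_score is hypothetical, and the pattern assumes the phrasing seen in the sample response above (综合评分为[[n]]), which the service does not guarantee.

```python
import re

# Assumption: the overall score follows the phrase used in the sample response.
OVERALL_SCORE = re.compile(r"综合评分为\[\[(\d+)\]\]")


def extract_overall_score(content: str):
    """Return the overall score as an int, or None if the phrase is absent."""
    match = OVERALL_SCORE.search(content)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = "我认为该回复的综合评分为[[2]],理由如下。\n1. 相关性:... [[4]]"
    print(extract_overall_score(sample))
```

Returning None rather than raising leaves it to the caller to decide how to handle a reply that does not follow the expected phrasing.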
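Pairwise comparison
Pairwise comparison assesses the quality of the answers that two large language models give to the same question.
Sample request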
Python SDK:
import os

from openai import OpenAI


def main():
    base_url = "https://aiservice.cn-hangzhou.aliyuncs.com/v1"
    judge_model_token = os.getenv("JUDGE_MODEL_TOKEN")

    client = OpenAI(
        api_key=f'Authorization: Bearer {judge_model_token}',
        base_url=base_url
    )
    completion = client.chat.completions.create(
        model='pai-judge',
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        # "pairwise" compares answer1 and answer2.
                        "mode": "pairwise",
                        "type": "json",
                        "json": {
                            "question": "According to the first couplet, give the second couplet. first couplet: To climb the mountain, reach the peak",
                            "answer1": "To cross the river, find the creek.",
                            "answer2": "To chase the dream, grasp the star."
                        }
                    }
                ]
            }
        ]
    )
    print(completion.model_dump())


if __name__ == '__main__':
    main()

HTTP:
$ curl -X POST https://aiservice.cn-hangzhou.aliyuncs.com/v1/chat/completions \
    -H "Authorization: Bearer ${JUDGE_MODEL_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "pai-judge",
      "messages": [
        {
          "role": "user",
          "content": [
            {
              "mode": "pairwise",
              "type": "json",
              "json": {
                "question": "According to the first couplet, give the second couplet. first couplet: To climb the mountain, reach the peak",
                "answer1": "To cross the river, find the creek.",
                "answer2": "To chase the dream, grasp the star."
              }
            }
          ]
        }
      ]
    }'

Response
{
  "id": "a7026e5a-64c5-4726-9b10-27072ff34d46",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "***\n我认为[[两个回复并列]],其中回复1的综合评分为[[4]],回复2的综合评分为[[4]],理由如下:\n1. 准确性:两个回复都准确地提供了与用户指令相关的第二对联,没有出现事实错误或误导信息。[[回复1评分:5]] [[回复2评分:5]]\n2. 相关性:两个回复都直接回应了用户的指令,没有包含任何不必要的信息或背景,完全符合用户的需求。[[回复1评分:5]] [[回复2评分:5]]\n3. 无害性:两个回复均未包含任何可能冒犯的内容,都是积极且正面的表达,符合适宜性和文化敏感性的要求。[[回复1评分:5]] [[回复2评分:5]]\n4. 完整性:两个回复都完整地提供了所需的第二对联,没有遗漏任何关键点。[[回复1评分:5]] [[回复2评分:5]]\n5. 来源可靠性:虽然两个回复均未明确引用外部权威来源,但在此场景下,对联的创作和传递通常不需要外部验证,因此这一点可以适当忽略。[[回复1评分:4]] [[回复2评分:4]]\n6. 清晰度和结构:两个回复都简洁明了,结构清晰,易于理解。[[回复1评分:5]] [[回复2评分:5]]\n7. 时效性:此标准在本场景下不太适用,因为对联文化历史悠久,且两个回复均符合传统表达方式。[[回复1评分:N/A]] [[回复2评分:N/A]]\n8. 适应用户水平:两个回复都考虑到了用户可能的知识水平,使用了易于理解的语言和表达。[[回复1评分:5]] [[回复2评分:5]]\n\n综上所述,两个回复在各项标准下的表现相当,都能很好地满足用户的需求,因此我认为两个回复并列。\n***",
        "role": "assistant",
        "function_call": null,
        "tool_calls": null,
        "refusal": ""
      }
    }
  ],
  "created": 1734557,
  "model": "pai-judge",
  "object": "chat.completion",
  "service_tier": "",
  "system_fingerprint": "",
  "usage": {
    "completion_tokens": 408,
    "prompt_tokens": 821,
    "total_tokens": 1229
  }
}
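A pairwise verdict carries three values worth extracting: the overall verdict and the overall score of each reply. The sketch below pulls them out with regular expressions. The function name parse_pairwise is hypothetical, and the patterns are inferred only from the single sample response above, so treat them as assumptions and adjust them to the replies you actually receive.

```python
import re


def parse_pairwise(content: str) -> dict:
    """Extract the verdict and both overall scores from a pairwise reply."""
    # Assumption: the verdict is the first [[...]] after "我认为".
    verdict = re.search(r"我认为\[\[(.+?)\]\]", content)
    # Assumption: each overall score follows "回复N的综合评分为".
    score1 = re.search(r"回复1的综合评分为\[\[(\d+)\]\]", content)
    score2 = re.search(r"回复2的综合评分为\[\[(\d+)\]\]", content)
    return {
        "verdict": verdict.group(1) if verdict else None,
        "score1": int(score1.group(1)) if score1 else None,
        "score2": int(score2.group(1)) if score2 else None,
    }


if __name__ == "__main__":
    sample = "***\n我认为[[两个回复并列]],其中回复1的综合评分为[[4]],回复2的综合评分为[[4]],理由如下:"
    print(parse_pairwise(sample))
```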
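Custom template
When you call the judge model as shown in the preceding examples, the system generates a corresponding prompt template. If that template does not meet your requirements, you can define your own evaluation template by sending system and user messages directly. The following example scores a single reply on a scale of 1 to 5.
Sample request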
Python SDK:
import os

from openai import OpenAI


def main():
    base_url = "https://aiservice.cn-hangzhou.aliyuncs.com/v1"
    judge_model_token = os.getenv("JUDGE_MODEL_TOKEN")

    client = OpenAI(
        api_key=f'Authorization: Bearer {judge_model_token}',
        base_url=base_url
    )

    # System message: the judge's role and background.
    system = (
        "请作为一名公正的裁判,评价人工智能助手对下面用户问题的回答质量。\n\n"
        "以下是这些人工智能助手的基本性格描述:\n"
        "不会对人进行评价或比较,不会做任何伤害人类的事情。性格上偏向于独立自主的人格。\n"
    )
    # User message: the question, the reply to score, and the scoring rubric.
    user = (
        "请对以下问题-回答按照1-5分进行打分:\n"
        "问题: 你认为社交媒体对人际关系的影响是什么?\n"
        "回复:社交媒体使得人们可以更轻松地联系,但也可能导致疏远。\n"
        "评分标准:\n"
        "1分: 回复完全不相关、无内容或者完全错误。\n"
        "2分: 回复有一些相关性,但内容肤浅或过于简略。\n"
        "3分: 回复相关,提供了一些见解,但缺乏深入分析。\n"
        "4分: 回复相关且有深度,提供了清晰的见解和例证。\n"
        "5分: 回复非常相关且深刻,提供了全面的观点和丰富的例证。"
    )

    completion = client.chat.completions.create(
        model='pai-judge',
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user}
        ]
    )
    print(completion.model_dump())


if __name__ == '__main__':
    main()

HTTP:
$ curl -X POST https://aiservice.cn-hangzhou.aliyuncs.com/v1/chat/completions \
    -H "Authorization: Bearer ${JUDGE_MODEL_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "pai-judge",
      "messages": [
        {
          "role": "system",
          "content": "请作为一名公正的裁判,评价人工智能助手对下面用户问题的回答质量。\n\n以下是这些人工智能助手的基本性格描述:\n不会对人进行评价或比较,不会做任何伤害人类的事情。性格上偏向于独立自主的人格。\n"
        },
        {
          "role": "user",
          "content": "请对以下问题-回答按照1-5分进行打分:\n问题: 你认为社交媒体对人际关系的影响是什么?\n回复:社交媒体使得人们可以更轻松地联系,但也可能导致疏远。\n评分标准:\n1分: 回复完全不相关、无内容或者完全错误。\n2分: 回复有一些相关性,但内容肤浅或过于简略。\n3分: 回复相关,提供了一些见解,但缺乏深入分析。\n4分: 回复相关且有深度,提供了清晰的见解和例证。\n5分: 回复非常相关且深刻,提供了全面的观点和丰富的例证。"
        }
      ]
    }'

Response
{
  "id": "e2f72777-ddf5-4ff8-b7dd-4ecefd6e4014",
  "object": "chat.completion",
  "created": 1153092,
  "model": "pai-judge",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "根据提供的评分标准,我会给这个回答打3分。回复“社交媒体使得人们可以更轻松地联系,但也可能导致疏远。”相关性是明确的,直接针对了社交媒体对人际关系的影响。它提到了两个相反的效果:增强联系和导致疏远,这展现了一定的见解。然而,这个回答相对简短,没有进一步展开这两个方面的具体影响或提供实例来支持其观点,因此缺乏深入分析。所以,根据标准,它达到了3分的标准,相关且提供了一些见解,但没有更深层次的探讨。",
        "refusal": "",
        "function_call": null,
        "tool_calls": null
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 910,
    "completion_tokens": 411,
    "total_tokens": 1321
  },
  "system_fingerprint": "",
  "service_tier": ""
}
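When many question-answer pairs share one rubric, it can help to generate the system and user messages from data instead of writing the prompt by hand. The following is one possible sketch. The helper build_custom_messages is hypothetical and simply reproduces the wording of the custom template shown in this topic; the judge model does not require this exact layout.

```python
def build_custom_messages(system_prompt: str, question: str, answer: str, rubric: dict) -> list:
    """Build system and user messages for a custom scoring template.

    rubric maps each integer score to its description.
    """
    low, high = min(rubric), max(rubric)
    lines = [
        f"请对以下问题-回答按照{low}-{high}分进行打分:",
        f"问题: {question}",
        f"回复:{answer}",
        "评分标准:",
    ]
    # One rubric line per score, in ascending order.
    lines += [f"{score}分: {text}" for score, text in sorted(rubric.items())]
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "\n".join(lines)},
    ]


if __name__ == "__main__":
    messages = build_custom_messages(
        "请作为一名公正的裁判,评价人工智能助手对下面用户问题的回答质量。",
        "你认为社交媒体对人际关系的影响是什么?",
        "社交媒体使得人们可以更轻松地联系,但也可能导致疏远。",
        {1: "回复完全不相关、无内容或者完全错误。", 5: "回复非常相关且深刻,提供了全面的观点和丰富的例证。"},
    )
    print(messages[1]["content"])
```

The returned list can be passed as the messages argument of client.chat.completions.create.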
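Batch calls (offline calls)
Step 1: Prepare batch data
A batch data file must meet the following conditions:
A single file cannot exceed 10 MB. If your data is larger, split it into multiple files before uploading.
The total size of all files uploaded by one account cannot exceed 100 GB.
The batch API accepts files in .jsonl format.
Each line of the file describes one API request and must contain a body field and a unique custom_id value. For the parameters that body supports, see Input parameters.
Sample file: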
{"custom_id": "request-1", "body": {"model": "pai-judge", "messages": [{"role": "user", "content": [{"mode": "single", "type": "json", "json": {"question": "According to the first couplet, give the second couplet. first couplet: To climb the mountain, reach the peak", "answer": "To cross the river, find the creek."}}]}]}}
{"custom_id": "request-2", "body": {"model": "pai-judge-plus", "messages": [{"role": "user", "content": [{"mode": "single", "type": "json", "json": {"question": "According to the first couplet, give the second couplet. first couplet: To climb the mountain, reach the peak", "answer": "To cross the river, find the creek."}}]}]}}
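Generating the file from code helps enforce the conditions above: one JSON object per line, a unique custom_id per request, and a size below 10 MB. The following is a minimal sketch. The names to_jsonl and single_eval_body are hypothetical, and the size check assumes that 10 MB means 10 × 1024 × 1024 bytes, which this topic does not specify.

```python
import json

# Assumption: the 10 MB limit is counted in binary megabytes.
MAX_FILE_BYTES = 10 * 1024 * 1024


def single_eval_body(model: str, question: str, answer: str) -> dict:
    """Build the body of one single-model evaluation request."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [{
                "mode": "single",
                "type": "json",
                "json": {"question": question, "answer": answer},
            }],
        }],
    }


def to_jsonl(requests) -> str:
    """Serialize (custom_id, body) pairs into JSONL text, one request per line."""
    seen = set()
    lines = []
    for custom_id, body in requests:
        if custom_id in seen:
            raise ValueError(f"duplicate custom_id: {custom_id}")
        seen.add(custom_id)
        lines.append(json.dumps({"custom_id": custom_id, "body": body}, ensure_ascii=False))
    text = "\n".join(lines) + "\n"
    if len(text.encode("utf-8")) > MAX_FILE_BYTES:
        raise ValueError("file exceeds 10 MB; split the requests across several files")
    return text


if __name__ == "__main__":
    body = single_eval_body("pai-judge", "What is 2+2?", "4")
    # Writes the batch file to the current directory.
    with open("input.jsonl", "w", encoding="utf-8") as f:
        f.write(to_jsonl([("request-1", body), ("request-2", body)]))
```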
Step 2: Upload batch data
Call the judge model file upload API to upload the batch data to the server and obtain a unique file_id.
Sample request
Python SDK:
import os

from openai import OpenAI


def main():
    base_url = "https://aiservice.cn-hangzhou.aliyuncs.com/v1"
    judge_model_token = os.getenv("JUDGE_MODEL_TOKEN")

    client = OpenAI(
        api_key=f'Authorization: Bearer {judge_model_token}',
        base_url=base_url
    )
    # Open the file in a with block so it is closed after the upload.
    with open("/home/xxx/input.jsonl", "rb") as f:
        upload_files = client.files.create(
            file=f,
            purpose="batch",
        )
    print(upload_files.model_dump_json(indent=4))


if __name__ == '__main__':
    main()

HTTP:
$ curl -XPOST https://aiservice.cn-hangzhou.aliyuncs.com/v1/files \
    -H "Authorization: Bearer ${JUDGE_MODEL_TOKEN}" \
    -F purpose="batch" \
    -F file="@/home/xxx/input.jsonl"

Response
{
  "id": "file-batch-EC043540BE1C7BE3F9F2F0A8F47D1713",
  "object": "file",
  "bytes": 698,
  "created_at": 1742454203,
  "filename": "input.jsonl",
  "purpose": "batch"
}
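Step 3: Create a batch job
After the file is uploaded, use the file_id of the input file to create a batch job.
This topic assumes that the file_id is file-batch-EC043540BE1C7BE3F9F2F0A8F47D1713. Currently, the completion window can be set only to 24 hours. After the job is created, a unique batch_id is returned.
Sample request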
Python SDK:
import os

from openai import OpenAI


def main():
    base_url = "https://aiservice.cn-hangzhou.aliyuncs.com/v1"
    judge_model_token = os.getenv("JUDGE_MODEL_TOKEN")

    client = OpenAI(
        api_key=f'Authorization: Bearer {judge_model_token}',
        base_url=base_url
    )
    create_batches = client.batches.create(
        endpoint="/v1/chat/completions",
        input_file_id="file-batch-EC043540BE1C7BE3F9F2F0A8F47D1713",
        # Only a 24-hour completion window is supported.
        completion_window="24h",
    )
    print(create_batches.model_dump_json(indent=4))


if __name__ == '__main__':
    main()

HTTP:
$ curl -XPOST https://aiservice.cn-hangzhou.aliyuncs.com/v1/batches \
    -H "Authorization: Bearer ${JUDGE_MODEL_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{
      "input_file_id": "file-batch-EC043540BE1C7BE3F9F2F0A8F47D1713",
      "endpoint": "/v1/chat/completions",
      "completion_window": "24h"
    }'

Response
{
  "id": "batch_66f245a0-88d1-458c-8e1c-a819a5943022",
  "object": "batch",
  "endpoint": "/v1/chat/completions",
  "errors": null,
  "input_file_id": "file-batch-EC043540BE1C7BE3F9F2F0A8F47D1713",
  "completion_window": "24h",
  "status": "Creating",
  "output_file_id": null,
  "error_file_id": null,
  "created_at": 1742455213,
  "in_process_at": null,
  "expires_at": null,
  "FinalizingAt": null,
  "completed_at": null,
  "failed_at": null,
  "expired_at": null,
  "cancelling_at": null,
  "cancelled_at": null,
  "request_counts": {
    "total": 3,
    "completed": 0,
    "failed": 0
  },
  "metadata": null
}
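Step 4: Check the job status
Use the batch_id to query the status of the job. After the status changes to Succeeded, the response contains the ID of the generated file: output_file_id.
Sample request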
Python SDK:
import os

from openai import OpenAI


def main():
    base_url = "https://aiservice.cn-hangzhou.aliyuncs.com/v1"
    judge_model_token = os.getenv("JUDGE_MODEL_TOKEN")

    client = OpenAI(
        api_key=f'Authorization: Bearer {judge_model_token}',
        base_url=base_url
    )
    # Look up the batch job by the batch_id returned in step 3.
    retrieve_batches = client.batches.retrieve(
        batch_id="batch_66f245a0-88d1-458c-8e1c-a819a5943022",
    )
    print(retrieve_batches.model_dump_json(indent=4))


if __name__ == '__main__':
    main()

HTTP:
$ curl -XGET https://aiservice.cn-hangzhou.aliyuncs.com/v1/batches/batch_66f245a0-88d1-458c-8e1c-a819a5943022 \
    -H "Authorization: Bearer ${JUDGE_MODEL_TOKEN}"

Response
{
  "id": "batch_66f245a0-88d1-458c-8e1c-a819a5943022",
  "object": "batch",
  "endpoint": "/v1/chat/completions",
  "errors": null,
  "input_file_id": "file-batch-EC043540BE1C7BE3F9F2F0A8F47D1713",
  "completion_window": "24h",
  "status": "Succeeded",
  "output_file_id": "file-batch_output-66f245a0-88d1-458c-8e1c-a819a5943022",
  "error_file_id": null,
  "created_at": 1742455213,
  "in_process_at": 1742455640,
  "expires_at": 1742455640,
  "FinalizingAt": 1742455889,
  "completed_at": 1742455889,
  "failed_at": null,
  "expired_at": null,
  "cancelling_at": null,
  "cancelled_at": null,
  "request_counts": {
    "total": 3,
    "completed": 3,
    "failed": 0
  },
  "metadata": null
}
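In practice, checking the status usually means polling until the job reaches a final state. The following sketch shows one way to do that. wait_for_batch is a hypothetical helper, not part of the SDK. Only the Creating and Succeeded statuses appear in this topic; the Failed, Expired, and Cancelled names are assumptions inferred from the failed_at, expired_at, and cancelled_at fields, so verify them against the statuses the service actually returns.

```python
import time


def wait_for_batch(retrieve, batch_id,
                   terminal=("Succeeded", "Failed", "Expired", "Cancelled"),
                   interval=30.0, max_polls=2880, sleep=time.sleep):
    """Poll retrieve(batch_id) until the batch status is in terminal.

    retrieve is any callable that returns an object with a status attribute,
    for example client.batches.retrieve. The defaults (30 s x 2880 polls)
    cover the 24-hour completion window.
    """
    for _ in range(max_polls):
        batch = retrieve(batch_id)
        if batch.status in terminal:
            return batch
        sleep(interval)
    raise TimeoutError(f"batch {batch_id} did not finish after {max_polls} polls")


if __name__ == "__main__":
    # Stand-in for client.batches.retrieve so the sketch runs offline.
    class FakeBatch:
        def __init__(self, status):
            self.status = status

    statuses = iter(["Creating", "Creating", "Succeeded"])
    result = wait_for_batch(lambda _id: FakeBatch(next(statuses)),
                            "batch_demo", sleep=lambda _s: None)
    print(result.status)
```

With a real client, the call would look like wait_for_batch(client.batches.retrieve, "batch_66f245a0-88d1-458c-8e1c-a819a5943022"). Passing the retrieve function as a parameter keeps the loop testable without network access.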
Step 5: Get the job results
Use the output_file_id to query and download the content of the generated file.
Sample request
Python SDK:
import os

from openai import OpenAI


def main():
    base_url = "https://aiservice.cn-hangzhou.aliyuncs.com/v1"
    judge_model_token = os.getenv("JUDGE_MODEL_TOKEN")

    client = OpenAI(
        api_key=f'Authorization: Bearer {judge_model_token}',
        base_url=base_url
    )
    content_files = client.files.content(
        file_id="file-batch_output-66f245a0-88d1-458c-8e1c-a819a5943022",
    )
    # Print the file body as text rather than the response object itself.
    print(content_files.text)


if __name__ == '__main__':
    main()

HTTP:
$ curl -XGET https://aiservice.cn-hangzhou.aliyuncs.com/v1/files/file-batch_output-66f245a0-88d1-458c-8e1c-a819a5943022/content \
    -H "Authorization: Bearer ${JUDGE_MODEL_TOKEN}" > output.jsonl

Response
{"id":"dcee3584-6f30-9541-a855-873a6d86b7d9","custom_id":"request-1","response":{"status_code":200,"request_id":"dcee3584-6f30-9541-a855-873a6d86b7d9","body":{"created":1737446797,"usage":{"completion_tokens":7,"prompt_tokens":26,"total_tokens":33},"model":"pai-judge","id":"chatcmpl-dcee3584-6f30-9541-a855-873a6d86b7d9","choices":[{"finish_reason":"stop","index":0,"message":{"content":"2+2 equals 4."}}],"object":"chat.completion"}},"error":null}
{"id":"dcee3584-6f30-9541-a855-873a6d86b7d9","custom_id":"request-2","response":{"status_code":200,"request_id":"dcee3584-6f30-9541-a855-873a6d86b7d9","body":{"created":1737446797,"usage":{"completion_tokens":7,"prompt_tokens":26,"total_tokens":33},"model":"pai-judge-plus","id":"chatcmpl-dcee3584-6f30-9541-a855-873a6d86b7d9","choices":[{"finish_reason":"stop","index":0,"message":{"content":"2+2 equals 4."}}],"object":"chat.completion"}},"error":null}
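Each line of the output file is a separate JSON record, so results can be matched back to the original requests by custom_id. The following is a minimal sketch. parse_batch_output is a hypothetical helper that relies on the record layout shown in the sample response above; the structure of a non-null error value is not documented in this topic, so the sketch stores it as-is.

```python
import json


def parse_batch_output(jsonl_text: str):
    """Map custom_id to the reply text; collect failed records separately."""
    results, errors = {}, {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        custom_id = record["custom_id"]
        if record.get("error") is not None:
            # Assumption: a failed request carries a non-null error field.
            errors[custom_id] = record["error"]
            continue
        body = record["response"]["body"]
        results[custom_id] = body["choices"][0]["message"]["content"]
    return results, errors


if __name__ == "__main__":
    # Assumes output.jsonl was downloaded as shown in the request above.
    with open("output.jsonl", encoding="utf-8") as f:
        results, errors = parse_batch_output(f.read())
    for custom_id, content in results.items():
        print(custom_id, content)
```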