The BladeLLM server provides interfaces compatible with the OpenAI /v1/completions and /v1/chat/completions APIs: clients invoke the service by sending HTTP POST requests to the /v1/completions or /v1/chat/completions path. This topic describes the parameters you can configure when calling the service and the fields in the returned results.
Completions interface
Call example
Command line
# Call EAS service
# Replace <Your EAS Token> with the service Token; replace <service_url> with the service endpoint.
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: <Your EAS Token>" \
-d '{"prompt":"hello world", "stream":"true"}' \
<service_url>/v1/completions
Python
# Python script
import json
import requests
# <service_url>: Replace with the service endpoint.
url = "<service_url>/v1/completions"
prompt = "hello world"
req = {
"prompt": prompt,
"stream": True,
"temperature": 0.0,
"top_p": 0.5,
"top_k": 10,
"max_tokens": 300,
}
response = requests.post(
url,
json=req,
# <Your EAS Token>: Replace with the service Token.
headers={"Content-Type": "application/json", "Authorization": "<Your EAS Token>"},
stream=True,
)
for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
    msg = chunk.decode("utf-8")
    if msg.startswith('data'):
        info = msg[6:]
        if info == '[DONE]':
            break
        else:
            resp = json.loads(info)
            print(resp['choices'][0]['text'], end='', flush=True)
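The stream-parsing logic above can be factored into a small helper that handles the `data:` prefix and the `[DONE]` end marker in one place. This is a sketch; `parse_sse_line` is an illustrative name, not part of the BladeLLM API:

```python
import json


def parse_sse_line(line: bytes):
    """Parse one line of a server-sent-event stream.

    Returns the decoded JSON payload, the string '[DONE]' for the
    end marker, or None for lines that carry no data.
    """
    msg = line.decode("utf-8")
    if not msg.startswith("data:"):
        return None
    info = msg[5:].strip()  # drop the "data:" prefix and padding
    if info == "[DONE]":
        return "[DONE]"
    return json.loads(info)
```

The streaming loop then reduces to calling `parse_sse_line` on each chunk from `response.iter_lines(...)` and printing the text until the helper returns `[DONE]`.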
Request parameter configuration description

| Parameter | Required | Type | Default value | Description |
| --- | --- | --- | --- | --- |
| model | No | string | None | Model name, used to specify the LoRA name. |
| prompt | Yes | string | None | Input prompt. |
| max_tokens | No | integer | 16 | Maximum number of tokens to generate for the request. |
| echo | No | boolean | False | Whether to return the prompt together with the generated result. |
| seed | No | integer | None | Random seed. |
| stream | No | boolean | False | Whether to return results in a streaming manner. |
| temperature | No | number | 1.0 | Controls the randomness and diversity of the generated text. Valid range: [0, 1.0]. |
| top_p | No | number | 1.0 | Samples from the smallest set of candidate tokens whose cumulative probability reaches top_p. Valid range: [0, 1.0]. |
| top_k | No | integer | -1 | Keeps only the top_k tokens with the highest probability. |
| repetition_penalty | No | number | 1.0 | Penalizes repeated tokens to control the diversity of the generated text. |
| presence_penalty | No | number | 0.0 | Controls vocabulary diversity by penalizing tokens that have already appeared. |
| frequency_penalty | No | number | 0.0 | Penalizes tokens in proportion to how frequently they have already appeared in the generated text. |
| stop (stop_sequences) | No | [string] | None | Stops generation when any of the specified strings is encountered, for example ["</s>"]. |
| stop_tokens | No | [int] | None | Stops generation when any of the specified token IDs is encountered. |
| ignore_eos | No | boolean | False | Whether to ignore the end-of-sequence token and continue generating. |
| logprobs | No | integer | None | Returns the log probabilities of the output tokens during generation. |
| response_format | No | string | None | Specifies the output format. |
| guided_regex | No | string | None | Regular expression used to guide decoding. |
| guided_json | No | string (valid JSON string) | None | JSON Schema, passed as a string, used to constrain decoding to produce matching JSON. |
| guided_choice | No | [string] | None | Constrains decoding to one of the given outputs. |
| guided_grammar | No | string | None | EBNF grammar used to guide decoding. |
| guided_whitespace_pattern | No | string | None | Regular expression defining the whitespace allowed in JSON mode during guided decoding. |
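The documented value ranges can be enforced client-side before a request is sent, which surfaces bad parameters without a round trip. `build_completion_request` below is a hypothetical helper, not part of the service API; it simply applies the defaults and ranges from the table above:

```python
def build_completion_request(prompt: str, *, max_tokens: int = 16,
                             temperature: float = 1.0, top_p: float = 1.0,
                             top_k: int = -1, stream: bool = False) -> dict:
    """Build a /v1/completions request body, checking documented ranges."""
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be in [0, 1.0]")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be in [0, 1.0]")
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "stream": stream,
    }
```

The returned dict can be passed directly as the `json=` argument of `requests.post`.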
Return result parameter description

| Parameter | Sub-parameter | Description |
| --- | --- | --- |
| id | | Unique identifier of the completion request. |
| model | | Model name. |
| choices | finish_reason | Reason the model stopped generating tokens. |
| choices | index | Index of the choice. Integer. |
| choices | logprobs | Confidence of the prediction results. For details, see the content parameter description. |
| choices | text | Generated text. |
| object | | Object type. String; defaults to text_completion. |
| usage | prompt_tokens | Number of tokens in the input prompt. |
| usage | completion_tokens | Number of tokens in the generated content. |
| usage | total_tokens | Total number of tokens, input plus output. |
| error_info | code | Error code. |
| error_info | message | Error message. |
Chat Completions interface
Call example
Command line
# Call EAS service
# Replace <Your EAS Token> with the service Token; replace <service_url> with the service endpoint.
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: <Your EAS Token>" \
-d '{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
]
}' \
<service_url>/v1/chat/completions
Python
# Python script
import json
import requests
# <service_url>: Replace with the service endpoint.
url = "<service_url>/v1/chat/completions"
messages = [{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': 'Hello!'}]
req = {
"messages": messages,
"stream": True,
"temperature": 0.0,
"top_p": 0.5,
"top_k": 10,
"max_tokens": 300,
}
response = requests.post(
url,
json=req,
# <Your EAS Token>: Replace with the service Token.
headers={"Content-Type": "application/json", "Authorization": "<Your EAS Token>"},
stream=True,
)
for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
    msg = chunk.decode("utf-8")
    if msg.startswith('data'):
        info = msg[6:]
        if info == '[DONE]':
            break
        else:
            resp = json.loads(info)
            print(resp['choices'][0]['delta']['content'], end='', flush=True)
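In a streaming chat response, each chunk carries only an incremental `delta`, so the full assistant reply has to be assembled client-side. A minimal sketch (the helper name is illustrative, and the chunks are assumed to be the already-parsed JSON payloads):

```python
def accumulate_deltas(chunks: list) -> str:
    """Concatenate the delta.content fields of streamed chat chunks
    into the complete assistant reply."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content", ""))
    return "".join(parts)
```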
Request parameter configuration description

| Parameter | Required | Type | Default value | Description |
| --- | --- | --- | --- | --- |
| model | No | string | None | Model name, used to specify the LoRA name. |
| messages | No | array | None | List of conversation messages. |
| resume_response | No | string | None | When resuming an interrupted chat, provide the original messages and pass in the partial reply produced before the interruption to continue the conversation. |
| max_tokens | No | integer | 16 | Maximum number of tokens to generate for the request. |
| echo | No | boolean | False | Whether to return the prompt together with the generated result. |
| seed | No | integer | None | Random seed. |
| stream | No | boolean | False | Whether to return results in a streaming manner. |
| temperature | No | number | 1.0 | Controls the randomness and diversity of the generated text. Valid range: [0, 1.0]. |
| top_p | No | number | 1.0 | Samples from the smallest set of candidate tokens whose cumulative probability reaches top_p. Valid range: [0, 1.0]. |
| top_k | No | integer | -1 | Keeps only the top_k tokens with the highest probability. |
| repetition_penalty | No | number | 1.0 | Penalizes repeated tokens to control the diversity of the generated text. |
| presence_penalty | No | number | 0.0 | Controls vocabulary diversity by penalizing tokens that have already appeared. |
| frequency_penalty | No | number | 0.0 | Penalizes tokens in proportion to how frequently they have already appeared in the generated text. |
| stop (stop_sequences) | No | string | None | Stops generation when the specified text is encountered. |
| stop_tokens | No | [int] | None | Stops generation when any of the specified token IDs is encountered. |
| ignore_eos | No | boolean | False | Whether to ignore the end-of-sequence token and continue generating. |
| logprobs | No | integer | None | Returns the log probabilities of the output tokens during generation. |
| top_logprobs | No | integer | None | Number of most likely tokens to return at each token position. |
| response_format | No | string | None | Specifies the output format. |
| guided_regex | No | string | None | Regular expression used to guide decoding. |
| guided_json | No | string (valid JSON string) | None | JSON Schema, passed as a string, used to constrain decoding to produce matching JSON. |
| guided_choice | No | [string] | None | Constrains decoding to one of the given outputs. |
| guided_grammar | No | string | None | EBNF grammar used to guide decoding. |
| guided_whitespace_pattern | No | string | None | Regular expression defining the whitespace allowed in JSON mode during guided decoding. |
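Because guided_json takes the JSON Schema as a string, a schema built as a Python dict must be serialized before it goes into the request body. A sketch, where the schema itself is an illustrative example:

```python
import json

# Example schema: constrain the reply to an object with two required fields.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temperature_c": {"type": "number"},
    },
    "required": ["city", "temperature_c"],
}

req = {
    "messages": [
        {"role": "user", "content": "Report the weather in Paris as JSON."}
    ],
    "max_tokens": 128,
    # guided_json expects a valid JSON string, so serialize the dict first.
    "guided_json": json.dumps(schema),
}
```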
Return result parameter description

| Parameter | Sub-parameter | Description |
| --- | --- | --- |
| id | | Unique identifier of the completion request. |
| choices | finish_reason | Reason the model stopped generating tokens. |
| choices | index | Index of the choice. Integer. |
| choices | logprobs | Confidence of the prediction results. For details, see the content parameter description. |
| choices | message | Returned for non-streaming requests. The conversation message generated by the model. |
| choices | delta | Returned for streaming requests. The incremental conversation message generated by the model. |
| object | | Object type. String; chat.completion for non-streaming responses and chat.completion.chunk for streaming responses. |
| usage | prompt_tokens | Number of tokens in the input prompt. |
| usage | completion_tokens | Number of tokens in the generated content. |
| usage | total_tokens | Total number of tokens, input plus output. |
| error_info | code | Error code. |
| error_info | message | Error message. |
The following code provides an example of returned results:
Streaming request return result
{
    "id": "78544a80-6224-4b0f-a0c4-4bad94005eb1",
    "choices": [{
        "finish_reason": "",
        "index": 0,
        "logprobs": null,
        "delta": {
            "role": "assistant",
            "content": ""
        }
    }],
    "object": "chat.completion.chunk",
    "usage": {
        "prompt_tokens": 21,
        "completion_tokens": 1,
        "total_tokens": 22
    },
    "error_info": null
}
Non-streaming request return result
{
    "id": "1444c346-3d35-4505-ae73-7ff727d00e8a",
    "choices": [{
        "finish_reason": "",
        "index": 0,
        "logprobs": null,
        "message": {
            "role": "assistant",
            "content": "Hello! How can I assist you today?\n"
        }
    }],
    "object": "chat.completion",
    "usage": {
        "prompt_tokens": 21,
        "completion_tokens": 16,
        "total_tokens": 37
    },
    "error_info": null
}
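Working with the non-streaming result above amounts to indexing into the parsed JSON. A short sketch using the sample response shown in this topic; note that the usage fields are internally consistent (total_tokens equals prompt_tokens plus completion_tokens):

```python
# The non-streaming sample result from this topic, as a Python dict.
result = {
    "id": "1444c346-3d35-4505-ae73-7ff727d00e8a",
    "choices": [{
        "finish_reason": "",
        "index": 0,
        "logprobs": None,
        "message": {
            "role": "assistant",
            "content": "Hello! How can I assist you today?\n",
        },
    }],
    "object": "chat.completion",
    "usage": {"prompt_tokens": 21, "completion_tokens": 16, "total_tokens": 37},
    "error_info": None,
}

reply = result["choices"][0]["message"]["content"]
usage = result["usage"]
# total_tokens counts both input and output tokens.
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
```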