The BladeLLM server provides interfaces compatible with the OpenAI /v1/completions and /v1/chat/completions APIs, so clients can invoke a service by sending HTTP POST requests to the /v1/completions or /v1/chat/completions path. This topic describes the parameters you can configure when calling the service and the parameters in the returned results.
Completions interface
Call example
Command line
# Call EAS service
# Replace <Your EAS Token> with the service Token; replace <service_url> with the service endpoint.
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: <Your EAS Token>" \
-d '{"prompt":"hello world", "stream":"true"}' \
<service_url>/v1/completions
Python
# Python script
import json
import requests
from typing import Dict, List
# <service_url>: Replace with the service endpoint.
url = "<service_url>/v1/completions"
prompt = "hello world"
req = {
    "prompt": prompt,
    "stream": True,
    "temperature": 0.0,
    "top_p": 0.5,
    "top_k": 10,
    "max_tokens": 300,
}
response = requests.post(
    url,
    json=req,
    # <Your EAS Token>: Replace with the service Token.
    headers={"Content-Type": "application/json", "Authorization": "<Your EAS Token>"},
    stream=True,
)
for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
    msg = chunk.decode("utf-8")
    if msg.startswith('data'):
        info = msg[6:]
        if info == '[DONE]':
            break
        else:
            resp = json.loads(info)
            print(resp['choices'][0]['text'], end='', flush=True)
Request parameter configuration description
Parameter | Required | Type | Default value | Description |
model | No | string | None | Model name, used to specify the LoRA name. |
prompt | Yes | string | None | Input prompt. |
max_tokens | No | integer | 16 | Maximum number of tokens to generate in the request. |
echo | No | boolean | False | Whether to return the prompt with the generated result. |
seed | No | integer | None | The random seed. |
stream | No | boolean | False | Whether to obtain the returned results in a streaming manner. |
temperature | No | number | 1.0 | Controls the randomness and diversity of the generated text. Value range [0,1.0]. |
top_p | No | number | 1.0 | From all possible tokens predicted by the model, select the most likely tokens whose probability sum reaches top_p. Value range [0,1.0]. |
top_k | No | integer | -1 | Keep the top_k tokens with the highest probability. |
repetition_penalty | No | number | 1.0 | Penalty for repetition in the generated text. Values greater than 1.0 penalize tokens that have already appeared, reducing repetition. |
presence_penalty | No | number | 0.0 | Controls vocabulary diversity in the generated text. Positive values penalize tokens that have already appeared at least once, encouraging new tokens. |
frequency_penalty | No | number | 0.0 | Penalizes tokens in proportion to how frequently they have already appeared in the generated text. Positive values reduce verbatim repetition. |
stop (stop_sequences) | No | [string] | None | Stops generation when any of the specified strings appears in the generated text. For example, ["</s>"]. |
stop_tokens | No | [int] | None | Stops generation when any of the specified token IDs is generated. |
ignore_eos | No | boolean | False | Ignore the end marker when generating text. |
logprobs | No | integer | None | Return the probability distribution of each possible output token during text generation. |
response_format | No | string | None | Used to specify the output format. |
guided_regex | No | string | None | Regular expression used to guide decoding. |
guided_json | No | string (valid JSON string) | None | JSON Schema in string form, used to constrain guided decoding to produce JSON that conforms to the schema (see the sketch after this table). |
guided_choice | No | [string] | None | Used to guide decoding to produce one of the given candidate outputs. |
guided_grammar | No | string | None | EBNF grammar rules used to guide decoding. |
guided_whitespace_pattern | No | string | None | Regular expression representing whitespace in JSON mode for guided decoding. |
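The guided_* parameters constrain decoding to a restricted output space. The sketch below is only an illustration, not taken from the official documentation: it issues two non-streaming requests, one limiting the answer to a fixed set of candidates with guided_choice and one constraining the output to a JSON Schema passed as a string with guided_json. The prompts and the schema are hypothetical; <service_url> and <Your EAS Token> are placeholders as above.
# Hedged sketch: guided decoding with guided_choice and guided_json (illustrative only).
import json
import requests

url = "<service_url>/v1/completions"
headers = {"Content-Type": "application/json", "Authorization": "<Your EAS Token>"}

# Restrict the generated text to one of a fixed set of answers.
choice_req = {
    "prompt": "Is the sky blue? Answer yes or no.",
    "max_tokens": 8,
    "guided_choice": ["yes", "no"],
}
choice_resp = requests.post(url, json=choice_req, headers=headers).json()
print(choice_resp["choices"][0]["text"])

# Constrain the generated text to match a JSON Schema (passed as a JSON string).
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "country": {"type": "string"}},
    "required": ["city", "country"],
}
json_req = {
    "prompt": "Describe the capital of France as JSON.",
    "max_tokens": 64,
    "guided_json": json.dumps(schema),
}
json_resp = requests.post(url, json=json_req, headers=headers).json()
print(json_resp["choices"][0]["text"])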
Return result parameter description
Parameter | Sub-parameter | Description |
id | | Unique identifier for the request completion. |
model | | Model name. |
choices | finish_reason | The reason why the model stopped generating tokens. |
 | index | Index of the choice. Integer type. |
 | logprobs | Confidence of the prediction results. For detailed parameter descriptions, see the content parameter description. |
 | text | Generated text. |
object | | Object type, string. Default is text_completion. |
usage | prompt_tokens | Number of tokens in the input prompt. |
 | completion_tokens | Number of tokens in the generated content. |
 | total_tokens | Total number of tokens, including both input and output. |
error_info | code | Error code. |
 | message | Error message. |
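When stream is omitted or set to False, the service returns a single JSON object containing the fields above. A minimal sketch of a non-streaming call that reads the generated text and token usage (placeholders as in the earlier examples):
import requests

url = "<service_url>/v1/completions"
headers = {"Content-Type": "application/json", "Authorization": "<Your EAS Token>"}

req = {"prompt": "hello world", "max_tokens": 32}  # stream defaults to False
resp = requests.post(url, json=req, headers=headers).json()

print(resp["choices"][0]["text"])            # generated text
print(resp["choices"][0]["finish_reason"])   # why generation stopped
print(resp["usage"]["total_tokens"])         # input tokens + output tokens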
Chat Completions interface
Call example
Command line
# Call EAS service
# Replace <Your EAS Token> with the service Token; replace <service_url> with the service endpoint.
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: <Your EAS Token>" \
-d '{
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }' \
<service_url>/v1/chat/completions
Python
# Python script
import json
import requests
from typing import Dict, List
# <service_url>: Replace with the service endpoint.
url = "<service_url>/v1/chat/completions"
messages = [{'role': 'system', 'content': 'You are a helpful assistant.'},
            {'role': 'user', 'content': 'Hello!'}]
req = {
    "messages": messages,
    "stream": True,
    "temperature": 0.0,
    "top_p": 0.5,
    "top_k": 10,
    "max_tokens": 300,
}
response = requests.post(
    url,
    json=req,
    # <Your EAS Token>: Replace with the service Token.
    headers={"Content-Type": "application/json", "Authorization": "<Your EAS Token>"},
    stream=True,
)
for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
    msg = chunk.decode("utf-8")
    if msg.startswith('data'):
        info = msg[6:]
        if info == '[DONE]':
            break
        else:
            resp = json.loads(info)
            print(resp['choices'][0]['delta']['content'], end='', flush=True)
Request parameter configuration description
Parameter | Required | Type | Default value | Description |
model | No | string | None | Model name, used to specify the LoRA name. |
messages | No | array | None | List of conversation messages. |
resume_response | No | string | None | Used to resume an interrupted conversation: provide the original messages and pass the partly generated reply in this field so that generation continues from it. |
max_tokens | No | integer | 16 | Maximum number of tokens to generate in the request. |
echo | No | boolean | False | Whether to return the prompt with the generated result. |
seed | No | integer | None | The random seed. |
stream | No | boolean | False | Whether to obtain the returned results in a streaming manner. |
temperature | No | number | 1.0 | Controls the randomness and diversity of the generated text. Value range [0,1.0]. |
top_p | No | number | 1.0 | From all possible tokens predicted by the model, select the most likely tokens whose probability sum reaches top_p. Value range [0,1.0]. |
top_k | No | integer | -1 | Keep the top_k tokens with the highest probability. |
repetition_penalty | No | number | 1.0 | Penalty for repetition in the generated text. Values greater than 1.0 penalize tokens that have already appeared, reducing repetition. |
presence_penalty | No | number | 0.0 | Controls vocabulary diversity in the generated text. Positive values penalize tokens that have already appeared at least once, encouraging new tokens. |
frequency_penalty | No | number | 0.0 | Penalizes tokens in proportion to how frequently they have already appeared in the generated text. Positive values reduce verbatim repetition. |
stop (stop_sequences) | No | string | None | Stops generation when the specified string appears in the generated text (see the sketch after this table). |
stop_tokens | No | [int] | None | Stops generation when any of the specified token IDs is generated. |
ignore_eos | No | boolean | False | Ignore the end marker when generating text. |
logprobs | No | integer | None | Return the probability distribution of each possible output token during text generation. |
top_logprobs | No | integer | None | Number of most likely tokens at each token position. |
response_format | No | string | None | Used to specify the output format. |
guided_regex | No | string | None | Regular expression used to guide decoding. |
guided_json | No | string (valid JSON string) | None | JSON Schema in string form, used to constrain guided decoding to produce JSON that conforms to the schema. |
guided_choice | No | [string] | None | Used to guide decoding to produce one of the given candidate outputs. |
guided_grammar | No | string | None | EBNF grammar rules used to guide decoding. |
guided_whitespace_pattern | No | string | None | Regular expression representing whitespace in JSON mode for guided decoding. |
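stop and stop_tokens end generation early once the given text or token IDs appear in the output. A minimal non-streaming sketch that combines a stop string with explicit sampling settings (the message content and stop string are only illustrations; placeholders as above):
import requests

url = "<service_url>/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "<Your EAS Token>"}

req = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "List three colors, one per line."},
    ],
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9,
    "stop": "\n\n",  # stop once a blank line is produced (illustrative)
}
resp = requests.post(url, json=req, headers=headers).json()
print(resp["choices"][0]["message"]["content"])
print(resp["choices"][0]["finish_reason"])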
Return result parameter description
Parameter | Sub-parameter | Description |
id | | Unique identifier for the request completion. |
choices | finish_reason | The reason why the model stopped generating tokens. |
 | index | Index of the choice. Integer type. |
 | logprobs | Confidence of the prediction results. For detailed parameter descriptions, see the content parameter description. |
 | message | Returned for non-streaming requests. The conversation message generated by the model. |
 | delta | Returned for streaming requests. The incremental conversation message generated by the model. |
object | | Object type, string. chat.completion for non-streaming responses and chat.completion.chunk for streaming responses. |
usage | prompt_tokens | Number of tokens in the input prompt. |
 | completion_tokens | Number of tokens in the generated content. |
 | total_tokens | Total number of tokens, including both input and output. |
error_info | code | Error code. |
 | message | Error message. |
The following code provides an example of returned results:
Streaming request return result
{
    "id": "78544a80-6224-4b0f-a0c4-4bad94005eb1",
    "choices": [{
        "finish_reason": "",
        "index": 0,
        "logprobs": null,
        "delta": {
            "role": "assistant",
            "content": ""
        }
    }],
    "object": "chat.completion.chunk",
    "usage": {
        "prompt_tokens": 21,
        "completion_tokens": 1,
        "total_tokens": 22
    },
    "error_info": null
}
Non-streaming request return result
{
    "id": "1444c346-3d35-4505-ae73-7ff727d00e8a",
    "choices": [{
        "finish_reason": "",
        "index": 0,
        "logprobs": null,
        "message": {
            "role": "assistant",
            "content": "Hello! How can I assist you today?\n"
        }
    }],
    "object": "chat.completion",
    "usage": {
        "prompt_tokens": 21,
        "completion_tokens": 16,
        "total_tokens": 37
    },
    "error_info": null
}
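In both return formats, error_info is null on success, as shown above; on failure it carries an error code and message instead. A minimal defensive sketch that checks error_info before reading choices (request body and placeholders as in the earlier examples):
import requests

url = "<service_url>/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "<Your EAS Token>"}
req = {"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 16}

resp = requests.post(url, json=req, headers=headers).json()
if resp.get("error_info"):
    # error_info carries "code" and "message" when the request fails
    raise RuntimeError(f"{resp['error_info']['code']}: {resp['error_info']['message']}")
print(resp["choices"][0]["message"]["content"])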