The BladeLLM server provides interfaces compatible with the OpenAI /v1/completions and /v1/chat/completions APIs: clients invoke the service by sending HTTP POST requests to the /v1/completions or /v1/chat/completions path. This topic describes the parameters you can configure when calling the service and the fields in the returned results.
Completions interface
Call example
Command line
# Call EAS service
# Replace <Your EAS Token> with the service Token; replace <service_url> with the service endpoint.
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: <Your EAS Token>" \
-d '{"prompt":"hello world", "stream":"true"}' \
<service_url>/v1/completions
Python
# Python script
import json
import requests
# <service_url>: Replace with the service endpoint.
url = "<service_url>/v1/completions"
prompt = "hello world"
req = {
"prompt": prompt,
"stream": True,
"temperature": 0.0,
"top_p": 0.5,
"top_k": 10,
"max_tokens": 300,
}
response = requests.post(
url,
json=req,
# <Your EAS Token>: Replace with the service Token.
headers={"Content-Type": "application/json", "Authorization": "<Your EAS Token>"},
stream=True,
)
for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
    msg = chunk.decode("utf-8")
    if msg.startswith('data'):
        info = msg[6:]
        if info == '[DONE]':
            break
        else:
            resp = json.loads(info)
            print(resp['choices'][0]['text'], end='', flush=True)
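The stream-parsing logic above can be factored into a small helper that handles the `data:` prefix and the `[DONE]` end marker in one place. This is a sketch; `parse_sse_line` is an illustrative name, not part of the BladeLLM API:

```python
import json


def parse_sse_line(line: bytes):
    """Parse one line of a server-sent-event stream.

    Returns the decoded JSON payload, the string '[DONE]' for the
    end marker, or None for lines that carry no data.
    """
    msg = line.decode("utf-8")
    if not msg.startswith("data:"):
        return None
    info = msg[5:].strip()  # drop the "data:" prefix and padding
    if info == "[DONE]":
        return "[DONE]"
    return json.loads(info)
```

The streaming loop then reduces to calling `parse_sse_line` on each chunk from `response.iter_lines(...)` and printing the text until the helper returns `[DONE]`.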
Request parameter configuration description

| Parameter | Required | Type | Default value | Description |
| --- | --- | --- | --- | --- |
| model | No | string | None | Model name, used to specify the LoRA name. |
| prompt | Yes | string | None | Input prompt. |
| max_tokens | No | integer | 16 | Maximum number of tokens to generate for the request. |
| echo | No | boolean | False | Whether to return the prompt together with the generated result. |
| seed | No | integer | None | Random seed. |
| stream | No | boolean | False | Whether to return results in a streaming manner. |
| temperature | No | number | 1.0 | Controls the randomness and diversity of the generated text. Valid range: [0, 1.0]. |
| top_p | No | number | 1.0 | Samples from the smallest set of candidate tokens whose cumulative probability reaches top_p. Valid range: [0, 1.0]. |
| top_k | No | integer | -1 | Keeps only the top_k tokens with the highest probability. |
| repetition_penalty | No | number | 1.0 | Penalizes repeated tokens to control the diversity of the generated text. |
| presence_penalty | No | number | 0.0 | Controls vocabulary diversity by penalizing tokens that have already appeared. |
| frequency_penalty | No | number | 0.0 | Penalizes tokens in proportion to how frequently they have already appeared in the generated text. |
| stop (stop_sequences) | No | [string] | None | Stops generation when any of the specified strings is encountered, for example ["</s>"]. |
| stop_tokens | No | [int] | None | Stops generation when any of the specified token IDs is encountered. |
| ignore_eos | No | boolean | False | Whether to ignore the end-of-sequence token and continue generating. |
| logprobs | No | integer | None | Returns the log probabilities of the output tokens during generation. |
| response_format | No | string | None | Specifies the output format. |
| guided_regex | No | string | None | Regular expression used to guide decoding. |
| guided_json | No | string (valid JSON string) | None | JSON Schema, passed as a string, used to constrain decoding to produce matching JSON. |
| guided_choice | No | [string] | None | Constrains decoding to one of the given outputs. |
| guided_grammar | No | string | None | EBNF grammar used to guide decoding. |
| guided_whitespace_pattern | No | string | None | Regular expression defining the whitespace allowed in JSON mode during guided decoding. |
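The documented value ranges can be enforced client-side before a request is sent, which surfaces bad parameters without a round trip. `build_completion_request` below is a hypothetical helper, not part of the service API; it simply applies the defaults and ranges from the table above:

```python
def build_completion_request(prompt: str, *, max_tokens: int = 16,
                             temperature: float = 1.0, top_p: float = 1.0,
                             top_k: int = -1, stream: bool = False) -> dict:
    """Build a /v1/completions request body, checking documented ranges."""
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be in [0, 1.0]")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be in [0, 1.0]")
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "stream": stream,
    }
```

The returned dict can be passed directly as the `json=` argument of `requests.post`.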
Return result parameter description

| Parameter | Sub-parameter | Description |
| --- | --- | --- |
| id | | Unique identifier of the completion request. |
| model | | Model name. |
| choices | finish_reason | Reason the model stopped generating tokens. |
| choices | index | Index of the choice. Integer. |
| choices | logprobs | Confidence of the prediction results. For details, see the content parameter description. |
| choices | text | Generated text. |
| object | | Object type. String; defaults to text_completion. |
| usage | prompt_tokens | Number of tokens in the input prompt. |
| usage | completion_tokens | Number of tokens in the generated content. |
| usage | total_tokens | Total number of tokens, input plus output. |
| error_info | code | Error code. |
| error_info | message | Error message. |
Chat Completions interface
Call example
Command line
# Call EAS service
# Replace <Your EAS Token> with the service Token; replace <service_url> with the service endpoint.
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: <Your EAS Token>" \
-d '{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
]
}' \
<service_url>/v1/chat/completions
Python
# Python script
import json
import requests
# <service_url>: Replace with the service endpoint.
url = "<service_url>/v1/chat/completions"
messages = [{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': 'Hello!'}]
req = {
"messages": messages,
"stream": True,
"temperature": 0.0,
"top_p": 0.5,
"top_k": 10,
"max_tokens": 300,
}
response = requests.post(
url,
json=req,
# <Your EAS Token>: Replace with the service Token.
headers={"Content-Type": "application/json", "Authorization": "<Your EAS Token>"},
stream=True,
)
for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
    msg = chunk.decode("utf-8")
    if msg.startswith('data'):
        info = msg[6:]
        if info == '[DONE]':
            break
        else:
            resp = json.loads(info)
            print(resp['choices'][0]['delta']['content'], end='', flush=True)
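In a streaming chat response, each chunk carries only an incremental `delta`, so the full assistant reply has to be assembled client-side. A minimal sketch (the helper name is illustrative, and the chunks are assumed to be the already-parsed JSON payloads):

```python
def accumulate_deltas(chunks: list) -> str:
    """Concatenate the delta.content fields of streamed chat chunks
    into the complete assistant reply."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content", ""))
    return "".join(parts)
```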
Request parameter configuration description

| Parameter | Required | Type | Default value | Description |
| --- | --- | --- | --- | --- |
| model | No | string | None | Model name, used to specify the LoRA name. |
| messages | No | array | None | List of conversation messages. |
| resume_response | No | string | None | When resuming an interrupted chat, provide the original messages and pass in the partial reply produced before the interruption to continue the conversation. |
| max_tokens | No | integer | 16 | Maximum number of tokens to generate for the request. |
| echo | No | boolean | False | Whether to return the prompt together with the generated result. |
| seed | No | integer | None | Random seed. |
| stream | No | boolean | False | Whether to return results in a streaming manner. |
| temperature | No | number | 1.0 | Controls the randomness and diversity of the generated text. Valid range: [0, 1.0]. |
| top_p | No | number | 1.0 | Samples from the smallest set of candidate tokens whose cumulative probability reaches top_p. Valid range: [0, 1.0]. |
| top_k | No | integer | -1 | Keeps only the top_k tokens with the highest probability. |
| repetition_penalty | No | number | 1.0 | Penalizes repeated tokens to control the diversity of the generated text. |
| presence_penalty | No | number | 0.0 | Controls vocabulary diversity by penalizing tokens that have already appeared. |
| frequency_penalty | No | number | 0.0 | Penalizes tokens in proportion to how frequently they have already appeared in the generated text. |
| stop (stop_sequences) | No | string | None | Stops generation when the specified text is encountered. |
| stop_tokens | No | [int] | None | Stops generation when any of the specified token IDs is encountered. |
| ignore_eos | No | boolean | False | Whether to ignore the end-of-sequence token and continue generating. |
| logprobs | No | integer | None | Returns the log probabilities of the output tokens during generation. |
| top_logprobs | No | integer | None | Number of most likely tokens to return at each token position. |
| response_format | No | string | None | Specifies the output format. |
| guided_regex | No | string | None | Regular expression used to guide decoding. |
| guided_json | No | string (valid JSON string) | None | JSON Schema, passed as a string, used to constrain decoding to produce matching JSON. |
| guided_choice | No | [string] | None | Constrains decoding to one of the given outputs. |
| guided_grammar | No | string | None | EBNF grammar used to guide decoding. |
| guided_whitespace_pattern | No | string | None | Regular expression defining the whitespace allowed in JSON mode during guided decoding. |
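Because guided_json takes the JSON Schema as a string, a schema built as a Python dict must be serialized before it goes into the request body. A sketch, where the schema itself is an illustrative example:

```python
import json

# Example schema: constrain the reply to an object with two required fields.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temperature_c": {"type": "number"},
    },
    "required": ["city", "temperature_c"],
}

req = {
    "messages": [
        {"role": "user", "content": "Report the weather in Paris as JSON."}
    ],
    "max_tokens": 128,
    # guided_json expects a valid JSON string, so serialize the dict first.
    "guided_json": json.dumps(schema),
}
```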
Return result parameter description

| Parameter | Sub-parameter | Description |
| --- | --- | --- |
| id | | Unique identifier of the completion request. |
| choices | finish_reason | Reason the model stopped generating tokens. |
| choices | index | Index of the choice. Integer. |
| choices | logprobs | Confidence of the prediction results. For details, see the content parameter description. |
| choices | message | Returned for non-streaming requests. The conversation message generated by the model. |
| choices | delta | Returned for streaming requests. The incremental conversation message generated by the model. |
| object | | Object type. String; chat.completion for non-streaming responses and chat.completion.chunk for streaming responses. |
| usage | prompt_tokens | Number of tokens in the input prompt. |
| usage | completion_tokens | Number of tokens in the generated content. |
| usage | total_tokens | Total number of tokens, input plus output. |
| error_info | code | Error code. |
| error_info | message | Error message. |
The following code provides an example of returned results:
Streaming request return result
{
    "id": "78544a80-6224-4b0f-a0c4-4bad94005eb1",
    "choices": [{
        "finish_reason": "",
        "index": 0,
        "logprobs": null,
        "delta": {
            "role": "assistant",
            "content": ""
        }
    }],
    "object": "chat.completion.chunk",
    "usage": {
        "prompt_tokens": 21,
        "completion_tokens": 1,
        "total_tokens": 22
    },
    "error_info": null
}
Non-streaming request return result
{
    "id": "1444c346-3d35-4505-ae73-7ff727d00e8a",
    "choices": [{
        "finish_reason": "",
        "index": 0,
        "logprobs": null,
        "message": {
            "role": "assistant",
            "content": "Hello! How can I assist you today?\n"
        }
    }],
    "object": "chat.completion",
    "usage": {
        "prompt_tokens": 21,
        "completion_tokens": 16,
        "total_tokens": 37
    },
    "error_info": null
}
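Working with the non-streaming result above amounts to indexing into the parsed JSON. A short sketch using the sample response shown in this topic; note that the usage fields are internally consistent (total_tokens equals prompt_tokens plus completion_tokens):

```python
# The non-streaming sample result from this topic, as a Python dict.
result = {
    "id": "1444c346-3d35-4505-ae73-7ff727d00e8a",
    "choices": [{
        "finish_reason": "",
        "index": 0,
        "logprobs": None,
        "message": {
            "role": "assistant",
            "content": "Hello! How can I assist you today?\n",
        },
    }],
    "object": "chat.completion",
    "usage": {"prompt_tokens": 21, "completion_tokens": 16, "total_tokens": 37},
    "error_info": None,
}

reply = result["choices"][0]["message"]["content"]
usage = result["usage"]
# total_tokens counts both input and output tokens.
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
```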