In streaming output mode, the model generates and returns intermediate results in real time instead of a single final response. This reduces wait time and the risk of request timeouts.
Overview
In streaming output mode, the model returns intermediate results in real time. You can read the output as the model generates it, which shortens the wait for the model's response. This is particularly effective at reducing the risk of request timeouts for lengthy outputs.
Typical error messages for a request timeout: "Request timed out, please try again later." or "Response timeout".
How to use
Prerequisites
You must first obtain an API key and set it as an environment variable. If you use the OpenAI SDK or DashScope SDK, you must also install the SDK.
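As a quick sanity check before running the samples below, you can confirm that the key is visible to your process. A minimal sketch in Python:
import os

# The samples below read the API key from this environment variable.
if not os.getenv("DASHSCOPE_API_KEY"):
    raise RuntimeError("DASHSCOPE_API_KEY is not set; configure it before running the samples")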
Get started
OpenAI compatible
To enable streaming output mode, set the stream parameter to true.
Python
By default, streaming output does not return the number of tokens used by the request. To enable this, set stream_options to {"include_usage": True}, which makes the last returned chunk include the token usage of the request.
In the future, the default value of stream_options will be {"include_usage": True}. As a result, the choices field of the last chunk will be an empty list. We recommend referencing the latest code in this topic and adding a conditional check such as if chunk.choices: to your business code.
import os
from openai import OpenAI

client = OpenAI(
    # If the environment variable is not configured, replace the following line with: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-plus",  # qwen-plus is used as an example. You can use other models in the model list: https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"}
    ],
    stream=True
)
full_content = ""
print("Streaming output content is:")
for chunk in completion:
    # If stream_options.include_usage is True, the choices field of the last chunk is empty and needs to be skipped. Obtain token usage from chunk.usage instead.
    if chunk.choices:
        full_content += chunk.choices[0].delta.content
        print(chunk.choices[0].delta.content)
print(f"Full content is: {full_content}")
Sample response
Streaming output content is:
I am a
large
language model
from Alibaba Cloud
. I am
called
Qwen.
Full content is: I am a large language model from Alibaba Cloud. I am called Qwen.
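To also retrieve token usage, a minimal variation of the sample above (reusing the client defined there) sets stream_options and reads the final usage-only chunk:
completion = client.chat.completions.create(
    model="qwen-plus",
    messages=[{"role": "user", "content": "Who are you?"}],
    stream=True,
    # Ask the server to append a final chunk that carries token usage
    stream_options={"include_usage": True},
)
for chunk in completion:
    if chunk.choices:
        print(chunk.choices[0].delta.content, end="")
    elif chunk.usage:
        # The final chunk has an empty choices list and carries the usage
        print(f"\nTokens used: {chunk.usage.total_tokens}")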
Node.js
By default, streaming output does not return the number of tokens used by the request. To enable this, set stream_options to {"include_usage": true}, which makes the last returned chunk include the token usage of the request.
In the future, the default value of stream_options will be {"include_usage": true}. As a result, the choices field of the last chunk will be an empty list. We recommend referencing the latest code in this topic and adding a conditional check such as if (Array.isArray(chunk.choices) && chunk.choices.length > 0) to your business code.
import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If the environment variable is not configured, replace the following line with: apiKey: "sk-xxx",
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen-plus", // qwen-plus is used as an example. You can use other models in the model list: https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    messages: [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"}
    ],
    stream: true,
    stream_options: {
        include_usage: true
    }
});
let fullContent = "";
console.log("Streaming output content is:");
for await (const chunk of completion) {
    // If stream_options.include_usage is true, the choices field of the last chunk is empty and needs to be skipped. Obtain token usage from chunk.usage instead.
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        fullContent = fullContent + chunk.choices[0].delta.content;
        console.log(chunk.choices[0].delta.content);
    }
}
console.log("\nFull content is:");
console.log(fullContent);
Sample response
Streaming output content is:
I am a
large
language model
from Alibaba Cloud
. I am
called
Qwen
Full content is:
I am a large language model from Alibaba Cloud. I am called Qwen.
cURL
curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen-plus",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Who are you?"
        }
    ],
    "stream": true,
    "stream_options": {
        "include_usage": true
    }
}'
Sample response
data: {"choices":[{"delta":{"content":"","role":"assistant"},"index":0,"logprobs":null,"finish_reason":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-max","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[{"finish_reason":null,"delta":{"content":"I am"},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-max","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[{"delta":{"content":"a"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-max","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[{"delta":{"content":"large language"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-max","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[{"delta":{"content":"model from"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-max","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[{"delta":{"content":"Alibaba Cloud"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-max","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[{"delta":{"content":", called Qwen."},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-max","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[{"finish_reason":"stop","delta":{"content":""},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-max","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[],"object":"chat.completion.chunk","usage":{"prompt_tokens":22,"completion_tokens":17,"total_tokens":39},"created":1726132850,"system_fingerprint":null,"model":"qwen-max","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: [DONE]
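If you consume this endpoint without an SDK, you must parse the SSE stream yourself: read line by line, strip the data: prefix, stop at the [DONE] sentinel, and concatenate the deltas. A minimal sketch in Python, assuming the requests package is installed:
import json
import os
import requests

url = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions"
headers = {"Authorization": f"Bearer {os.getenv('DASHSCOPE_API_KEY')}"}
payload = {
    "model": "qwen-plus",
    "messages": [{"role": "user", "content": "Who are you?"}],
    "stream": True,
}
with requests.post(url, headers=headers, json=payload, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        if chunk["choices"]:
            print(chunk["choices"][0]["delta"].get("content") or "", end="")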
DashScope
For the Python SDK, set the stream parameter to True.
For the Java SDK, use the streamCall interface.
For HTTP, set the header parameter X-DashScope-SSE to enable.
By default, streaming output is non-incremental: each returned chunk contains all content generated so far. To use incremental streaming output, set the incremental_output parameter (incrementalOutput for Java) to true.
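The difference matters when you assemble the final text. A minimal sketch of the two accumulation strategies, using the responses generator from the Python sample below:
# Incremental (incremental_output=True): each chunk is a new fragment, so concatenate.
full_content = ""
for response in responses:
    full_content += response.output.choices[0].message.content

# Non-incremental (default): each chunk repeats all content generated so far,
# so the last chunk already holds the complete response.
full_content = ""
for response in responses:
    full_content = response.output.choices[0].message.content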
Python
import os
from dashscope import Generation
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Who are you?'}
]
responses = Generation.call(
    # If the environment variable is not configured, replace the following line with: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen-plus",
    messages=messages,
    result_format='message',
    stream=True,
    # Return incremental streaming data
    incremental_output=True
)
full_content = ""
print("Streaming output content is:")
for response in responses:
    full_content += response.output.choices[0].message.content
    print(response.output.choices[0].message.content)
print(f"Full content is: {full_content}")
Sample response
Streaming output content is:
I am a
large
language model
from Alibaba Cloud
. I am
called
Qwen
Full content is: I am a large language model from Alibaba Cloud. I am called Qwen.
Java
import java.util.Arrays;
import java.lang.System;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.alibaba.dashscope.aigc.generation.Generation;
import com.alibaba.dashscope.aigc.generation.GenerationParam;
import com.alibaba.dashscope.aigc.generation.GenerationResult;
import com.alibaba.dashscope.common.Message;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.protocol.Protocol;

public class Main {
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    private static StringBuilder fullContent = new StringBuilder();

    private static void handleGenerationResult(GenerationResult message) {
        String content = message.getOutput().getChoices().get(0).getMessage().getContent();
        fullContent.append(content);
        System.out.println(content);
    }

    public static void streamCallWithMessage(Generation gen, Message userMsg)
            throws NoApiKeyException, ApiException, InputRequiredException {
        GenerationParam param = buildGenerationParam(userMsg);
        System.out.println("Streaming output content is:");
        Flowable<GenerationResult> result = gen.streamCall(param);
        result.blockingForEach(message -> handleGenerationResult(message));
        System.out.println("Full content is: " + fullContent.toString());
    }

    private static GenerationParam buildGenerationParam(Message userMsg) {
        return GenerationParam.builder()
                // If the environment variable is not configured, replace the following line with: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-plus")
                .messages(Arrays.asList(userMsg))
                .resultFormat(GenerationParam.ResultFormat.MESSAGE)
                // Enable incremental streaming output
                .incrementalOutput(true)
                .build();
    }

    public static void main(String[] args) {
        try {
            Generation gen = new Generation(Protocol.HTTP.getValue(), "https://dashscope-intl.aliyuncs.com/api/v1");
            Message userMsg = Message.builder().role(Role.USER.getValue()).content("Who are you?").build();
            streamCallWithMessage(gen, userMsg);
        } catch (ApiException | NoApiKeyException | InputRequiredException e) {
            logger.error("An exception occurred: {}", e.getMessage());
        }
        System.exit(0);
    }
}
Sample response
Streaming output content is:
I am a
large
language model
from Alibaba Cloud
. I am
called
Qwen
Full content is:
I am a large language model from Alibaba Cloud. I am called Qwen.
cURL
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/text-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-SSE: enable" \
-d '{
    "model": "qwen-plus",
    "input": {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "Who are you?"
            }
        ]
    },
    "parameters": {
        "result_format": "message",
        "incremental_output": true
    }
}'
Sample response
id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"I am","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":23,"input_tokens":22,"output_tokens":1},"request_id":"xxx"}
id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"Qwen","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":24,"input_tokens":22,"output_tokens":2},"request_id":"xxx"}
id:3
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":", an","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":25,"input_tokens":22,"output_tokens":3},"request_id":"xxx"}
id:4
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"AI","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":30,"input_tokens":22,"output_tokens":8},"request_id":"xxx"}
id:5
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"assistant developed by Alibaba","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":38,"input_tokens":22,"output_tokens":16},"request_id":"xxx"}
id:6
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"Cloud. I am designed to answer various questions, provide information","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":46,"input_tokens":22,"output_tokens":24},"request_id":"xxx"}
id:7
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"and engage in conversations with users. How can I","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":54,"input_tokens":22,"output_tokens":32},"request_id":"xxx"}
id:8
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"assist you?","role":"assistant"},"finish_reason":"stop"}]},"usage":{"total_tokens":58,"input_tokens":22,"output_tokens":36},"request_id":"xxx"}
Error code
If the call fails and an error message is returned, see Error messages.
FAQ
Q1: Does streaming output mode affect the model's response quality?
A1: No. Enabling streaming output mode does not affect response quality.
Q2: Is there an additional charge for streaming output mode?
A2: No. Streaming output mode is billed in the same way as non-streaming output mode, based on the number of input and output tokens.