In real-time chat or long-text generation applications, long wait times degrade user experience and may trigger server-side timeouts, causing tasks to fail. Streaming output addresses these issues by continuously returning fragments of text as the model generates them.
How it works
Streaming output uses the Server-Sent Events (SSE) protocol. After a streaming request starts, the server keeps a persistent HTTP connection open to the client. Each time the model generates a block of text (called a chunk), the server immediately pushes that chunk through the connection. Once all content is generated, the server sends an end signal.
The client listens to the event stream and processes text chunks as they arrive, for example by rendering characters one by one on the interface. In contrast, a non-streaming call returns all content at once.
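The flow described above can be sketched without any network call. This is only an illustration: fake_model_stream is a hypothetical stand-in for the server side, and consume_stream plays the role of a client that renders each chunk as soon as it arrives.

```python
from typing import Iterator


def fake_model_stream(text: str, chunk_size: int = 4) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields the text a few characters at a time."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def consume_stream(chunks: Iterator[str]) -> str:
    """Render each chunk immediately, then return the assembled full text."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)  # show the fragment right away
        parts.append(chunk)
    print()
    return "".join(parts)


full_text = consume_stream(fake_model_stream("Streaming returns text piece by piece."))
```

A non-streaming call corresponds to waiting for the whole string before printing anything.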
Billing
Streaming output uses the same billing rule as non-streaming calls, charging based on the number of input tokens and output tokens in the request.
If a request is interrupted, output tokens are counted only for the portion generated before the server received the termination request.
How to use
Qwen3 open-source edition, QwQ commercial and open-source editions, QVQ, and Qwen-Omni support only streaming output.
Step 1: Configure your API key and select a region
You must have obtained an API key and configured it as an environment variable.
Configuring your API key as an environment variable (DASHSCOPE_API_KEY) is more secure than hard coding it in your code.
Step 2: Make a streaming request
OpenAI compatible
- How to enable
  Set stream to true.
- View token usage
  The OpenAI protocol does not return token usage by default. Set stream_options={"include_usage": true} so that the last returned data chunk includes token usage information.
Python
import os
from openai import OpenAI
# 1. Prepare: Initialize the client
client = OpenAI(
# Configure the API key using an environment variable to avoid hard coding.
api_key=os.environ["DASHSCOPE_API_KEY"],
# The API key is tightly bound to a region. Ensure base_url matches the region of your API key.
# Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
base_url="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1",
)
# 2. Make a streaming request
completion = client.chat.completions.create(
model="qwen-plus",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Please introduce yourself"}
],
stream=True,
stream_options={"include_usage": True}
)
# 3. Handle the streaming response
# Store response fragments in a list. Joining them at the end is more efficient than repeated string concatenation.
content_parts = []
print("AI: ", end="", flush=True)
for chunk in completion:
    if chunk.choices:
        content = chunk.choices[0].delta.content or ""
        print(content, end="", flush=True)
        content_parts.append(content)
    elif chunk.usage:
        print("\n--- Request usage ---")
        print(f"Input Tokens: {chunk.usage.prompt_tokens}")
        print(f"Output Tokens: {chunk.usage.completion_tokens}")
        print(f"Total Tokens: {chunk.usage.total_tokens}")
full_response = "".join(content_parts)
# print(f"\n--- Full response ---\n{full_response}")
Response
AI: Hello! I am Qwen, a large-scale language model independently developed by Tongyi Lab under Alibaba Group. I can answer questions, create content such as stories, official documents, emails, scripts, perform logical reasoning, programming, express opinions, play games, and more. I support multiple languages, including but not limited to Chinese, English, German, French, and Spanish. If you have any questions or need help, feel free to ask me anytime!
--- Request usage ---
Input Tokens: 26
Output Tokens: 87
Total Tokens: 113
Node.js
import OpenAI from "openai";
async function main() {
// 1. Prepare: Initialize the client
// Configure the API key using an environment variable to avoid hard coding.
if (!process.env.DASHSCOPE_API_KEY) {
throw new Error("Set the DASHSCOPE_API_KEY environment variable");
}
// The API key is tightly bound to a region. Ensure baseURL matches the region of your API key.
// Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
const client = new OpenAI({
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1",
});
try {
// 2. Make a streaming request
const stream = await client.chat.completions.create({
model: "qwen-plus",
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Please introduce yourself" },
],
stream: true,
// Purpose: Get token usage in the last chunk.
stream_options: { include_usage: true },
});
// 3. Handle the streaming response
const contentParts = [];
process.stdout.write("AI: ");
for await (const chunk of stream) {
// The last chunk contains no choices but includes usage information.
if (chunk.choices && chunk.choices.length > 0) {
const content = chunk.choices[0]?.delta?.content || "";
process.stdout.write(content);
contentParts.push(content);
} else if (chunk.usage) {
// Request complete. Print token usage.
console.log("\n--- Request usage ---");
console.log(`Input Tokens: ${chunk.usage.prompt_tokens}`);
console.log(`Output Tokens: ${chunk.usage.completion_tokens}`);
console.log(`Total Tokens: ${chunk.usage.total_tokens}`);
}
}
const fullResponse = contentParts.join("");
// console.log(`\n--- Full response ---\n${fullResponse}`);
} catch (error) {
console.error("Request failed:", error);
}
}
main();
Response
AI: Hello! I am Qwen, a large-scale language model independently developed by Tongyi Lab under Alibaba Group. I can answer questions, create content such as stories, official documents, emails, scripts, perform logical reasoning, programming, express opinions, play games, and more. I support multiple languages, including but not limited to Chinese, English, German, French, and Spanish. If you have any questions or need help, feel free to ask me anytime!
--- Request usage ---
Input Tokens: 26
Output Tokens: 89
Total Tokens: 115
curl
Request
# ======= Important notes =======
# Ensure the DASHSCOPE_API_KEY environment variable is set
# API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# === Delete this comment before running ===
curl -X POST https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
--no-buffer \
-d '{
"model": "qwen-plus",
"messages": [
{"role": "user", "content": "Who are you?"}
],
"stream": true,
"stream_options": {"include_usage": true}
}'
Response
The response follows the SSE protocol. Each line starting with data: represents a data chunk.
data: {"choices":[{"delta":{"content":"","role":"assistant"},"index":0,"logprobs":null,"finish_reason":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[{"finish_reason":null,"delta":{"content":"I am"},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[{"delta":{"content":" from"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[{"delta":{"content":" Alibaba"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[{"delta":{"content":"'s large-scale language"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[{"delta":{"content":" model, my name is Qwen"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[{"delta":{"content":"."},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[{"finish_reason":"stop","delta":{"content":""},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: {"choices":[],"object":"chat.completion.chunk","usage":{"prompt_tokens":22,"completion_tokens":17,"total_tokens":39},"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}
data: [DONE]
- data: The message payload, usually a JSON string.
- [DONE]: Indicates the end of the entire streaming response.
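If you read the raw response yourself instead of using an SDK, the data: lines can be parsed roughly as follows. This is a minimal sketch: parse_sse_lines is a hypothetical helper, and the sample lines are abridged from the response above.

```python
import json
from typing import Iterable, Optional, Tuple


def parse_sse_lines(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Collect delta content and the final usage object from 'data:' lines."""
    parts = []
    usage = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore blank lines and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end of the streaming response
        chunk = json.loads(payload)
        for choice in chunk.get("choices") or []:
            parts.append(choice.get("delta", {}).get("content") or "")
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(parts), usage


sample = [
    'data: {"choices":[{"delta":{"content":"I am"},"index":0}],"usage":null}',
    'data: {"choices":[{"delta":{"content":" from"},"index":0}],"usage":null}',
    'data: {"choices":[{"delta":{"content":" Alibaba"},"index":0}],"usage":null}',
    'data: {"choices":[],"usage":{"prompt_tokens":22,"completion_tokens":17,"total_tokens":39}}',
    "data: [DONE]",
]
text, usage = parse_sse_lines(sample)
print(text)
print(usage)
```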
DashScope
- How to enable
  The method to enable streaming output varies by SDK or tool:
  - Python SDK: Set the stream parameter to True.
  - Java SDK: Use the streamCall interface.
  - cURL: Set the X-DashScope-SSE header to enable.
- Enable incremental output
  The DashScope protocol supports both incremental and non-incremental streaming output:
  - Incremental (recommended): Each data chunk contains only newly generated content. Set incremental_output to true to enable incremental streaming output.
    Example: ["I love","eating","apples"]
  - Non-incremental: Each data chunk contains all previously generated content, which wastes network bandwidth and increases client processing load. Set incremental_output to false to enable non-incremental streaming output.
    Example: ["I love","I love eating","I love eating apples"]
- View token usage
  Each data chunk includes real-time token usage information.
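To illustrate how the two output modes relate, the sketch below turns non-incremental (cumulative) chunks into incremental ones. to_incremental is a hypothetical helper, not part of the DashScope SDK, and it assumes each chunk extends the previous one. The whitespace inside each fragment depends on the actual model output.

```python
from typing import Iterable, Iterator


def to_incremental(cumulative_chunks: Iterable[str]) -> Iterator[str]:
    """Yield only the newly added text of each cumulative chunk."""
    seen = ""
    for chunk in cumulative_chunks:
        delta = chunk[len(seen):]  # assumption: chunk starts with everything seen so far
        seen = chunk
        if delta:
            yield delta


deltas = list(to_incremental(["I love", "I love eating", "I love eating apples"]))
print(deltas)
```

With incremental_output set to true, the client receives only the new fragments and does not need this kind of post-processing.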
Python
import os
from http import HTTPStatus
import dashscope
from dashscope import Generation
# 1. Prepare: Configure the API key and region
# Configure the API key using an environment variable to avoid hard coding.
try:
    dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]
except KeyError:
    raise ValueError("Set the DASHSCOPE_API_KEY environment variable")
# The API key is tightly bound to a region. Ensure base_url matches the region of your API key.
dashscope.base_http_api_url = "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1"
# 2. Make a streaming request
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Please introduce yourself"},
]
try:
    responses = Generation.call(
        model="qwen-plus",
        messages=messages,
        result_format="message",
        stream=True,
        # Key: Set to True for incremental output, which offers better performance.
        incremental_output=True,
    )
    # 3. Handle the streaming response
    content_parts = []
    print("AI: ", end="", flush=True)
    for resp in responses:
        if resp.status_code == HTTPStatus.OK:
            content = resp.output.choices[0].message.content
            print(content, end="", flush=True)
            content_parts.append(content)
            # Check if this is the last packet
            if resp.output.choices[0].finish_reason == "stop":
                usage = resp.usage
                print("\n--- Request usage ---")
                print(f"Input Tokens: {usage.input_tokens}")
                print(f"Output Tokens: {usage.output_tokens}")
                print(f"Total Tokens: {usage.total_tokens}")
        else:
            # Handle errors
            print(
                f"\nRequest failed: request_id={resp.request_id}, code={resp.code}, message={resp.message}"
            )
            break
    full_response = "".join(content_parts)
    # print(f"\n--- Full response ---\n{full_response}")
except Exception as e:
    print(f"An unknown error occurred: {e}")
Response
AI: Hello! I am Qwen, a large-scale language model independently developed by Tongyi Lab under Alibaba Group. I can help you answer questions, create content such as stories, official documents, emails, scripts, perform logical reasoning, programming, express opinions, play games, and more. I support multiple languages, including but not limited to Chinese, English, German, French, and Spanish. If you have any questions or need help, feel free to ask me anytime!
--- Request usage ---
Input Tokens: 26
Output Tokens: 91
Total Tokens: 117
Java
import com.alibaba.dashscope.aigc.generation.Generation;
import com.alibaba.dashscope.aigc.generation.GenerationParam;
import com.alibaba.dashscope.aigc.generation.GenerationResult;
import com.alibaba.dashscope.common.Message;
import com.alibaba.dashscope.common.Role;
import io.reactivex.Flowable;
import io.reactivex.schedulers.Schedulers;
import java.util.Arrays;
import java.util.concurrent.CountDownLatch;
import com.alibaba.dashscope.protocol.Protocol;
public class Main {
public static void main(String[] args) {
// 1. Get the API key
String apiKey = System.getenv("DASHSCOPE_API_KEY");
if (apiKey == null || apiKey.isEmpty()) {
System.err.println("Set the DASHSCOPE_API_KEY environment variable");
return;
}
// 2. Initialize the Generation instance
// The API key is tightly bound to a region. Ensure baseUrl matches the region of your API key.
Generation gen = new Generation(Protocol.HTTP.getValue(), "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1");
CountDownLatch latch = new CountDownLatch(1);
// 3. Build request parameters
GenerationParam param = GenerationParam.builder()
.apiKey(apiKey)
.model("qwen-plus")
.messages(Arrays.asList(
Message.builder()
.role(Role.USER.getValue())
.content("Introduce yourself")
.build()
))
.resultFormat(GenerationParam.ResultFormat.MESSAGE)
.incrementalOutput(true) // Enable incremental output for streaming
.build();
// 4. Make a streaming call and handle the response
try {
Flowable<GenerationResult> result = gen.streamCall(param);
StringBuilder fullContent = new StringBuilder();
System.out.print("AI: ");
result
.subscribeOn(Schedulers.io()) // Execute request on IO thread
.observeOn(Schedulers.computation()) // Process response on computation thread
.subscribe(
// onNext: Handle each response fragment
message -> {
String content = message.getOutput().getChoices().get(0).getMessage().getContent();
String finishReason = message.getOutput().getChoices().get(0).getFinishReason();
// Output content
System.out.print(content);
fullContent.append(content);
// When finishReason is not null, it indicates the last chunk. Output usage info.
if (finishReason != null && !"null".equals(finishReason)) {
System.out.println("\n--- Request usage ---");
System.out.println("Input Tokens: " + message.getUsage().getInputTokens());
System.out.println("Output Tokens: " + message.getUsage().getOutputTokens());
System.out.println("Total Tokens: " + message.getUsage().getTotalTokens());
}
System.out.flush(); // Flush output immediately
},
// onError: Handle errors
error -> {
System.err.println("\nRequest failed: " + error.getMessage());
latch.countDown();
},
// onComplete: Completion callback
() -> {
System.out.println(); // New line
// System.out.println("Full response: " + fullContent.toString());
latch.countDown();
}
);
// Main thread waits for the async task to complete
latch.await();
System.out.println("Program execution complete");
} catch (Exception e) {
System.err.println("Request exception: " + e.getMessage());
e.printStackTrace();
}
}
}
Response
AI: Hello! I am Qwen, a large-scale language model independently developed by Tongyi Lab under Alibaba Group. I can help you answer questions, create content such as stories, official documents, emails, scripts, perform logical reasoning, programming, express opinions, play games, and more. I support multiple languages, including but not limited to Chinese, English, German, French, and Spanish. If you have any questions or need help, feel free to ask me anytime!
--- Request usage ---
Input Tokens: 26
Output Tokens: 91
Total Tokens: 117
curl
Request
# ======= Important notes =======
# Ensure the DASHSCOPE_API_KEY environment variable is set
# API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
# === Delete this comment before running ===
curl -X POST https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/text-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-SSE: enable" \
-d '{
"model": "qwen-plus",
"input":{
"messages":[
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
},
"parameters": {
"result_format": "message",
"incremental_output":true
}
}'
Response
The response follows the Server-Sent Events (SSE) format. Each message includes:
- id: Data chunk number.
- event: Event type, always "result".
- :HTTP_STATUS/200: HTTP status code information.
- data: JSON-formatted data.
id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"I am","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":27,"output_tokens":1,"input_tokens":26,"prompt_tokens_details":{"cached_tokens":0}},"request_id":"d30a9914-ac97-9102-b746-ce0cb35e3fa2"}
id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"Qwen","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":30,"output_tokens":4,"input_tokens":26,"prompt_tokens_details":{"cached_tokens":0}},"request_id":"d30a9914-ac97-9102-b746-ce0cb35e3fa2"}
id:3
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":" from Alibaba","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":33,"output_tokens":7,"input_tokens":26,"prompt_tokens_details":{"cached_tokens":0}},"request_id":"d30a9914-ac97-9102-b746-ce0cb35e3fa2"}
...
id:13
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"or need help, feel free to","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":90,"output_tokens":64,"input_tokens":26,"prompt_tokens_details":{"cached_tokens":0}},"request_id":"d30a9914-ac97-9102-b746-ce0cb35e3fa2"}
id:14
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"ask me!","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":92,"output_tokens":66,"input_tokens":26,"prompt_tokens_details":{"cached_tokens":0}},"request_id":"d30a9914-ac97-9102-b746-ce0cb35e3fa2"}
id:15
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","role":"assistant"},"finish_reason":"stop"}]},"usage":{"total_tokens":92,"output_tokens":66,"input_tokens":26,"prompt_tokens_details":{"cached_tokens":0}},"request_id":"d30a9914-ac97-9102-b746-ce0cb35e3fa2"}
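A raw DashScope event stream like the one above could be grouped into messages as sketched below. iter_dashscope_events is a hypothetical helper. It assumes that data is the last line of each message, as in the sample, and the sample payloads are abridged from the response above.

```python
import json
from typing import Iterable, Iterator


def iter_dashscope_events(lines: Iterable[str]) -> Iterator[dict]:
    """Group id/event/status/data lines into one dict per message."""
    event = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            event["status"] = line[1:]  # e.g. HTTP_STATUS/200
            continue
        field, _, value = line.partition(":")
        if field == "data":
            event["data"] = json.loads(value)
            yield event  # assumption: data closes the message
            event = {}
        else:
            event[field] = value.strip()


sample = [
    "id:1",
    "event:result",
    ":HTTP_STATUS/200",
    'data:{"output":{"choices":[{"message":{"content":"I am","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":27,"output_tokens":1,"input_tokens":26}}',
    "id:15",
    "event:result",
    ":HTTP_STATUS/200",
    'data:{"output":{"choices":[{"message":{"content":"","role":"assistant"},"finish_reason":"stop"}]},"usage":{"total_tokens":92,"output_tokens":66,"input_tokens":26}}',
]
events = list(iter_dashscope_events(sample))
text = "".join(ev["data"]["output"]["choices"][0]["message"]["content"] for ev in events)
final_usage = events[-1]["data"]["usage"]
print(text, final_usage)
```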
Streaming output for multimodal models
Multimodal models support adding images, audio, and other content to conversations. Their streaming output implementation differs from text-only models in the following ways:
- User message construction: Multimodal model inputs include not only text but also images, audio, and other multimodal information.
- DashScope SDK interface: Use the MultiModalConversation interface in the DashScope Python SDK. Use the MultiModalConversation class in the DashScope Java SDK.
For details on specific multimodal models, see Image and video understanding, Text extraction, Audio understanding - Qwen3-Omni-Captioner, Kimi, and related topics. The Qwen-Omni model supports only streaming output because its output can include multimodal content such as text and audio, and its results are parsed differently from those of other models. For details, see Omni-modal.
OpenAI compatible
Python
from openai import OpenAI
import os
client = OpenAI(
# API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# If you haven't configured an environment variable, replace the next line with your Model Studio API key: api_key="sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
# Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
base_url="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen3-vl-plus", # Replace with other multimodal models as needed and adjust messages accordingly
messages=[
{"role": "user",
"content": [{"type": "image_url",
"image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},},
{"type": "text", "text": "What scene is depicted in the image?"}]}],
stream=True,
# stream_options={"include_usage": True}
)
full_content = ""
print("Streaming output content:")
for chunk in completion:
    # If stream_options.include_usage is True, the last chunk's choices field is an empty list and should be skipped (token usage can be obtained via chunk.usage)
    # delta.content can also be None or empty, so skip those chunks as well
    if chunk.choices and chunk.choices[0].delta.content:
        full_content += chunk.choices[0].delta.content
        print(chunk.choices[0].delta.content)
print(f"Full content: {full_content}")
Node.js
import OpenAI from "openai";
const openai = new OpenAI(
{
// API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
// If you haven't configured an environment variable, replace the next line with your Model Studio API key: apiKey: "sk-xxx"
apiKey: process.env.DASHSCOPE_API_KEY,
// Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
baseURL: "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1"
}
);
const completion = await openai.chat.completions.create({
model: "qwen3-vl-plus", // Replace with other multimodal models as needed and adjust messages accordingly
messages: [
{role: "user",
content: [{"type": "image_url",
"image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},},
{"type": "text", "text": "What scene is depicted in the image?"}]}],
stream: true,
// stream_options: { include_usage: true },
});
let fullContent = ""
console.log("Streaming output content:")
for await (const chunk of completion) {
// If stream_options.include_usage is true, the last chunk's choices field is an empty array and should be skipped (token usage can be obtained via chunk.usage)
if (chunk.choices[0] && chunk.choices[0].delta.content != null) {
fullContent += chunk.choices[0].delta.content;
console.log(chunk.choices[0].delta.content);
}
}
console.log(`Full output content: ${fullContent}`)
curl
# ======= Important notes =======
# Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
# API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# === Delete this comment before running ===
curl --location 'https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model": "qwen3-vl-plus",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
}
},
{
"type": "text",
"text": "What scene is depicted in the image?"
}
]
}
],
"stream":true,
"stream_options":{"include_usage":true}
}'
DashScope
Python
import os
from dashscope import MultiModalConversation
import dashscope
# Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
dashscope.base_http_api_url = 'https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1'
messages = [
{
"role": "user",
"content": [
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
{"text": "What scene is depicted in the image?"}
]
}
]
responses = MultiModalConversation.call(
# API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# If you haven't configured an environment variable, replace the next line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
model='qwen3-vl-plus', # Replace with other multimodal models as needed and adjust messages accordingly
messages=messages,
stream=True,
incremental_output=True)
full_content = ""
print("Streaming output content:")
for response in responses:
    # Skip chunks whose message content is empty
    content = response.output.choices[0].message.content
    if content:
        print(content[0]['text'])
        full_content += content[0]['text']
print(f"Full content: {full_content}")
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
Constants.baseHttpApiUrl="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1";
}
public static void streamCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
// must create mutable map.
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"),
Collections.singletonMap("text", "What scene is depicted in the image?"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
// If you haven't configured an environment variable, replace the next line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen3-vl-plus") // Replace with other multimodal models as needed and adjust messages accordingly
.messages(Arrays.asList(userMessage))
.incrementalOutput(true)
.build();
Flowable<MultiModalConversationResult> result = conv.streamCall(param);
result.blockingForEach(item -> {
try {
List<Map<String, Object>> content = item.getOutput().getChoices().get(0).getMessage().getContent();
// Check if content exists and is not empty
if (content != null && !content.isEmpty()) {
System.out.println(content.get(0).get("text"));
}
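} catch (Exception e){
// Report the failure instead of exiting silently
System.err.println("Failed to process chunk: " + e.getMessage());
System.exit(1);
}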
});
}
public static void main(String[] args) {
try {
streamCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
# ======= Important notes =======
# API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
# === Delete this comment before running ===
curl -X POST https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
"model": "qwen3-vl-plus",
"input":{
"messages":[
{
"role": "user",
"content": [
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
{"text": "What scene is depicted in the image?"}
]
}
]
},
"parameters": {
"incremental_output": true
}
}'
Streaming output for thinking models
Thinking models first return reasoning_content (the thought process), then return content (the response). To tell whether the model is currently thinking or responding, check which of these two fields is populated in each data packet.
For details on thinking models, see Deep thinking, Image and video understanding, Visual reasoning.
For streaming output implementation of Qwen3-Omni-Flash (thinking mode), see Omni-modal.
OpenAI compatible
Below is the response format when calling the thinking mode of the qwen-plus model using the OpenAI Python SDK in streaming mode:
# Thinking stage
...
ChoiceDelta(content=None, function_call=None, refusal=None, role=None, tool_calls=None, reasoning_content='Cover all key points while')
ChoiceDelta(content=None, function_call=None, refusal=None, role=None, tool_calls=None, reasoning_content='remaining natural and fluent.')
# Response stage
ChoiceDelta(content='Hello! I am **Qwen', function_call=None, refusal=None, role=None, tool_calls=None, reasoning_content=None)
ChoiceDelta(content='** (', function_call=None, refusal=None, role=None, tool_calls=None, reasoning_content=None)
...
- If reasoning_content is not None and content is None, the current stage is thinking.
- If reasoning_content is None and content is not None, the current stage is responding.
- If both are None, the stage remains the same as the previous packet.
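The three rules above can be expressed as a small state tracker. This is a sketch only: label_stages is a hypothetical helper that takes (reasoning_content, content) pairs rather than SDK objects, and it assumes the stream starts in the thinking stage.

```python
from typing import Iterable, Iterator, Optional, Tuple

Delta = Tuple[Optional[str], Optional[str]]  # (reasoning_content, content)


def label_stages(deltas: Iterable[Delta]) -> Iterator[Tuple[str, str]]:
    """Yield (stage, text) for each packet, applying the three rules above."""
    stage = "thinking"  # assumption: thinking models begin by reasoning
    for reasoning_content, content in deltas:
        if reasoning_content is not None and content is None:
            stage = "thinking"
        elif reasoning_content is None and content is not None:
            stage = "responding"
        # both None: keep the stage of the previous packet
        yield stage, reasoning_content or content or ""


sample = [
    ("Cover all key points while", None),
    ("remaining natural and fluent.", None),
    (None, "Hello! I am **Qwen"),
    (None, "** ("),
    (None, None),
]
labeled = list(label_stages(sample))
for stage, text in labeled:
    print(f"[{stage}] {text}")
```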
Python
Example code
from openai import OpenAI
import os
# Initialize the OpenAI client
client = OpenAI(
# If you haven't configured an environment variable, replace with your Alibaba Cloud Model Studio API key: api_key="sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1",
)
messages = [{"role": "user", "content": "Who are you"}]
completion = client.chat.completions.create(
model="qwen-plus", # Replace with other deep-thinking models as needed
messages=messages,
# The enable_thinking parameter enables the thinking process. This parameter has no effect on models qwen3-30b-a3b-thinking-2507, qwen3-235b-a22b-thinking-2507, and QwQ.
extra_body={"enable_thinking": True},
stream=True,
# stream_options={
# "include_usage": True
# },
)
reasoning_content = "" # Full thought process
answer_content = "" # Full response
is_answering = False # Whether in the response stage
print("\n" + "=" * 20 + "Thought process" + "=" * 20 + "\n")
for chunk in completion:
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
        continue
    delta = chunk.choices[0].delta
    # Collect only thinking content
    if hasattr(delta, "reasoning_content") and delta.reasoning_content is not None:
        if not is_answering:
            print(delta.reasoning_content, end="", flush=True)
        reasoning_content += delta.reasoning_content
    # Received content, start responding
    if hasattr(delta, "content") and delta.content:
        if not is_answering:
            print("\n" + "=" * 20 + "Full response" + "=" * 20 + "\n")
            is_answering = True
        print(delta.content, end="", flush=True)
        answer_content += delta.content
Response
====================Thought process====================
Okay, the user asked "Who are you," so I need to give an accurate and friendly answer. First, I should confirm my identity as Qwen, developed by Tongyi Lab under Alibaba Group. Next, explain my main functions, like answering questions, creating text, logical reasoning, etc. Keep the tone approachable and avoid overly technical terms so the user feels comfortable. Also, avoid complex jargon and ensure the answer is concise. Additionally, include some interactive elements to encourage further questions. Finally, check for any missing key information, such as my Chinese name "Tongyi Qianwen" and English name "Qwen," along with my company and lab. Make sure the response is comprehensive and meets user expectations.
====================Full response====================
Hello! I am Qwen, a large-scale language model independently developed by Tongyi Lab under Alibaba Group. I can answer questions, create text, perform logical reasoning, programming, and more, aiming to provide high-quality information and services. You can call me Qwen or simply Tongyi Qianwen. How can I help you?
Node.js
Example code
import OpenAI from "openai";
import process from 'process';
// Initialize the openai client
const openai = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY, // Read from environment variable
  baseURL: 'https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1'
});
let reasoningContent = '';
let answerContent = '';
let isAnswering = false;
async function main() {
  try {
    const messages = [{ role: 'user', content: 'Who are you' }];
    const stream = await openai.chat.completions.create({
      // Replace with other Qwen3 models or QwQ models as needed
      model: 'qwen-plus',
      messages,
      stream: true,
      // The enable_thinking parameter enables the thinking process. This parameter has no effect on models qwen3-30b-a3b-thinking-2507, qwen3-235b-a22b-thinking-2507, and QwQ.
      enable_thinking: true
    });
    console.log('\n' + '='.repeat(20) + 'Thought process' + '='.repeat(20) + '\n');
    for await (const chunk of stream) {
      // A chunk without choices carries only token usage
      if (!chunk.choices?.length) {
        console.log('\nUsage:');
        console.log(chunk.usage);
        continue;
      }
      const delta = chunk.choices[0].delta;
      // Collect only thinking content
      if (delta.reasoning_content !== undefined && delta.reasoning_content !== null) {
        if (!isAnswering) {
          process.stdout.write(delta.reasoning_content);
        }
        reasoningContent += delta.reasoning_content;
      }
      // Received content, start responding
      if (delta.content !== undefined && delta.content) {
        if (!isAnswering) {
          console.log('\n' + '='.repeat(20) + 'Full response' + '='.repeat(20) + '\n');
          isAnswering = true;
        }
        process.stdout.write(delta.content);
        answerContent += delta.content;
      }
    }
  } catch (error) {
    console.error('Error:', error);
  }
}
main();
Response
====================Thought process====================
Okay, the user asked "Who are you," so I need to state my identity. First, I should clearly say I am Qwen, a large-scale language model developed by Alibaba Cloud. Next, mention my main functions, like answering questions, creating text, logical reasoning, etc. Also emphasize my multilingual support, including Chinese and English, so users know I can handle requests in different languages. Additionally, explain my application scenarios, such as helping with learning, work, and daily life. However, since the user's question is direct, detailed information might not be necessary—keep it concise. Also, ensure a friendly tone and invite further questions. Check for any missing key information, like my version or latest updates, but the user probably doesn't need that much detail. Finally, confirm the response is accurate and error-free.
====================Full response====================
I am Qwen, a large-scale language model independently developed by Tongyi Lab under Alibaba Group. I can handle various tasks like answering questions, creating text, logical reasoning, and programming, supporting multiple languages including Chinese and English. If you have any questions or need help, feel free to tell me anytime!
HTTP
Example code
curl
For Qwen3 open-source models, set enable_thinking to true to enable thinking mode. The enable_thinking parameter has no effect on models qwen3-30b-a3b-thinking-2507, qwen3-235b-a22b-thinking-2507, and QwQ.
curl -X POST https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-plus",
    "messages": [
      {
        "role": "user",
        "content": "Who are you"
      }
    ],
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "enable_thinking": true
  }'
Response
data: {"choices":[{"delta":{"content":null,"role":"assistant","reasoning_content":""},"index":0,"logprobs":null,"finish_reason":null}],"object":"chat.completion.chunk","usage":null,"created":1745485391,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-e2edaf2c-8aaf-9e54-90e2-b21dd5045503"}
.....
data: {"choices":[{"finish_reason":"stop","delta":{"content":"","reasoning_content":null},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1745485391,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-e2edaf2c-8aaf-9e54-90e2-b21dd5045503"}
data: {"choices":[],"object":"chat.completion.chunk","usage":{"prompt_tokens":10,"completion_tokens":360,"total_tokens":370},"created":1745485391,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-e2edaf2c-8aaf-9e54-90e2-b21dd5045503"}
data: [DONE]
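The raw stream above can also be consumed without an SDK: read each data: line, stop at [DONE], and decode the JSON payload. The following is a minimal sketch under stated assumptions, not the SDK's implementation. The helper names iter_sse_payloads and collect are hypothetical, and the two middle sample lines are invented stand-ins for the chunks elided in the response above.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each `data:` line until `[DONE]`."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # Skip blank lines and anything that is not a data field.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # End-of-stream signal.
        yield json.loads(payload)


def collect(lines: Iterable[str]) -> Tuple[str, str, Optional[dict]]:
    """Assemble the thinking text, the answer text, and the usage block."""
    reasoning, answer, usage = [], [], None
    for chunk in iter_sse_payloads(lines):
        if not chunk.get("choices"):
            usage = chunk.get("usage")  # Usage-only chunk (include_usage enabled).
            continue
        delta = chunk["choices"][0]["delta"]
        reasoning.append(delta.get("reasoning_content") or "")
        answer.append(delta.get("content") or "")
    return "".join(reasoning), "".join(answer), usage


# Abbreviated sample; lines 2 and 3 are hypothetical stand-ins for elided chunks.
SAMPLE_LINES = [
    'data: {"choices":[{"delta":{"content":null,"role":"assistant","reasoning_content":""},"index":0,"finish_reason":null}],"usage":null}',
    'data: {"choices":[{"delta":{"content":null,"reasoning_content":"Okay"},"index":0,"finish_reason":null}],"usage":null}',
    'data: {"choices":[{"delta":{"content":"I am Qwen","reasoning_content":null},"index":0,"finish_reason":null}],"usage":null}',
    'data: {"choices":[{"finish_reason":"stop","delta":{"content":"","reasoning_content":null},"index":0}],"usage":null}',
    'data: {"choices":[],"usage":{"prompt_tokens":10,"completion_tokens":360,"total_tokens":370}}',
    "data: [DONE]",
]

if __name__ == "__main__":
    thinking, reply, token_usage = collect(SAMPLE_LINES)
    print("Thinking:", thinking)
    print("Answer:", reply)
    print("Usage:", token_usage)
```

In a real client, the lines would come from the HTTP response body read line by line rather than from a list.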
DashScope
Below is the streaming response format when calling the thinking mode of the qwen-plus model using the DashScope Python SDK:
# Thinking stage
...
{"role": "assistant", "content": "", "reasoning_content": "High information density,"}
{"role": "assistant", "content": "", "reasoning_content": "making users feel helped."}
# Response stage
{"role": "assistant", "content": "I am Qwen", "reasoning_content": ""}
{"role": "assistant", "content": ", developed by Tongyi Lab", "reasoning_content": ""}
...
- If reasoning_content is not "" and content is "", the current stage is thinking.
- If reasoning_content is "" and content is not "", the current stage is responding.
- If both are "", the stage remains the same as the previous packet.
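The three rules above can be expressed as a small state function. This is an illustrative sketch only; the function name next_stage and the stage labels are hypothetical. The rules do not say what to do when both fields are non-empty, so the sketch assumes that any non-empty content means the response stage, which matches the Python example in this section.

```python
def next_stage(previous: str, reasoning_content: str, content: str) -> str:
    """Return the stage ("thinking" or "responding") for one streamed packet."""
    if reasoning_content != "" and content == "":
        return "thinking"
    if content != "":
        # Assumption: non-empty content always marks the response stage.
        return "responding"
    return previous  # Both fields empty: keep the stage of the previous packet.


if __name__ == "__main__":
    packets = [
        {"content": "", "reasoning_content": "High information density,"},
        {"content": "", "reasoning_content": ""},
        {"content": "I am Qwen", "reasoning_content": ""},
        {"content": "", "reasoning_content": ""},
    ]
    stage = "thinking"
    for packet in packets:
        stage = next_stage(stage, packet["reasoning_content"], packet["content"])
        print(stage)
```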
Python
Example code
import os
from dashscope import Generation
import dashscope
dashscope.base_http_api_url = "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/"
messages = [{"role": "user", "content": "Who are you?"}]
completion = Generation.call(
    # If you haven't configured an environment variable, replace the next line with your Alibaba Cloud Model Studio API key: api_key = "sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Replace with other deep-thinking models as needed
    model="qwen-plus",
    messages=messages,
    result_format="message",  # Qwen3 open-source models only support "message"; for better experience, we recommend setting this to "message" for other models too.
    # Enable deep thinking. This parameter has no effect on models qwen3-30b-a3b-thinking-2507, qwen3-235b-a22b-thinking-2507, and QwQ.
    enable_thinking=True,
    stream=True,
    incremental_output=True,  # Qwen3 open-source models only support true; for better experience, we recommend setting this to true for other models too.
)
# Define full thought process
reasoning_content = ""
# Define full response
answer_content = ""
# Determine if finished thinking and started responding
is_answering = False
print("=" * 20 + "Thought process" + "=" * 20)
for chunk in completion:
    # If both thought process and response are empty, skip
    if (
        chunk.output.choices[0].message.content == ""
        and chunk.output.choices[0].message.reasoning_content == ""
    ):
        pass
    else:
        # If currently in thinking stage
        if (
            chunk.output.choices[0].message.reasoning_content != ""
            and chunk.output.choices[0].message.content == ""
        ):
            print(chunk.output.choices[0].message.reasoning_content, end="", flush=True)
            reasoning_content += chunk.output.choices[0].message.reasoning_content
        # If currently in response stage
        elif chunk.output.choices[0].message.content != "":
            if not is_answering:
                print("\n" + "=" * 20 + "Full response" + "=" * 20)
                is_answering = True
            print(chunk.output.choices[0].message.content, end="", flush=True)
            answer_content += chunk.output.choices[0].message.content
# To print the full thought process and full response, uncomment the following lines and run
# print("=" * 20 + "Full thought process" + "=" * 20 + "\n")
# print(f"{reasoning_content}")
# print("=" * 20 + "Full response" + "=" * 20 + "\n")
# print(f"{answer_content}")
Response
====================Thought process====================
Okay, the user asked: "Who are you?" I need to answer this question. First, clarify my identity as Qwen, a large-scale language model developed by Alibaba Cloud. Next, explain my functions and purposes, like answering questions, creating text, logical reasoning, etc. Also, emphasize my goal of being a helpful assistant.
Keep the expression conversational, avoiding professional jargon or complex sentence structures. Add friendly phrases like "Hello there~" to make the conversation natural. Also, ensure accuracy and don't omit key points like my developer, main functions, and usage scenarios.
Consider possible follow-up questions from the user, such as specific application examples or technical details, so subtly hint at further inquiries in the response. For example, mention "Whether it's everyday questions or professional issues, I'll do my best to help," which is both comprehensive and open-ended.
Finally, check for fluency, repetition, or redundant information to keep the response concise. Maintain a balance between friendliness and professionalism so users feel both approachable and reliable.
====================Full response====================
Hello there~ I'm Qwen, a large-scale language model developed by Alibaba Cloud. I can answer questions, create text, perform logical reasoning, programming, and more, aiming to provide help and support. Whether it's everyday questions or professional issues, I'll do my best to help. How can I assist you?
Java
Example code
// dashscope SDK version >= 2.19.4
import java.util.Arrays;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.alibaba.dashscope.aigc.generation.Generation;
import com.alibaba.dashscope.aigc.generation.GenerationParam;
import com.alibaba.dashscope.aigc.generation.GenerationResult;
import com.alibaba.dashscope.common.Message;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.Flowable;
import java.lang.System;
import com.alibaba.dashscope.utils.Constants;
public class Main {
    static {
        Constants.baseHttpApiUrl="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1";
    }
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    private static StringBuilder reasoningContent = new StringBuilder();
    private static StringBuilder finalContent = new StringBuilder();
    private static boolean isFirstPrint = true;
    private static void handleGenerationResult(GenerationResult message) {
        String reasoning = message.getOutput().getChoices().get(0).getMessage().getReasoningContent();
        String content = message.getOutput().getChoices().get(0).getMessage().getContent();
        // Guard against null so a packet without a field does not throw a NullPointerException
        if (reasoning != null && !reasoning.isEmpty()) {
            reasoningContent.append(reasoning);
            if (isFirstPrint) {
                System.out.println("====================Thought process====================");
                isFirstPrint = false;
            }
            System.out.print(reasoning);
        }
        if (content != null && !content.isEmpty()) {
            finalContent.append(content);
            if (!isFirstPrint) {
                System.out.println("\n====================Full response====================");
                isFirstPrint = true;
            }
            System.out.print(content);
        }
    }
    private static GenerationParam buildGenerationParam(Message userMsg) {
        return GenerationParam.builder()
                // If you haven't configured an environment variable, replace the next line with your Alibaba Cloud Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-plus")
                .enableThinking(true)
                .incrementalOutput(true)
                .resultFormat("message")
                .messages(Arrays.asList(userMsg))
                .build();
    }
    public static void streamCallWithMessage(Generation gen, Message userMsg)
            throws NoApiKeyException, ApiException, InputRequiredException {
        GenerationParam param = buildGenerationParam(userMsg);
        Flowable<GenerationResult> result = gen.streamCall(param);
        result.blockingForEach(message -> handleGenerationResult(message));
    }
    public static void main(String[] args) {
        try {
            Generation gen = new Generation();
            Message userMsg = Message.builder().role(Role.USER.getValue()).content("Who are you?").build();
            streamCallWithMessage(gen, userMsg);
            // Print final result
            // if (reasoningContent.length() > 0) {
            //     System.out.println("\n====================Full response====================");
            //     System.out.println(finalContent.toString());
            // }
        } catch (ApiException | NoApiKeyException | InputRequiredException e) {
            logger.error("An exception occurred: {}", e.getMessage());
        }
        System.exit(0);
    }
}
Response
====================Thought process====================
Okay, the user asked "Who are you?", so I need to answer based on previous settings. First, my role is Qwen, a large-scale language model under Alibaba Group. Keep it conversational and simple.
The user might be new to me or confirming my identity. Start by directly stating who I am, then briefly explain my functions and purposes, like answering questions, creating text, programming, etc. Also mention multilingual support so users know I handle different languages.
Also, per guidelines, maintain human-like qualities, so use a friendly tone, maybe add emojis for warmth. Guide users to ask further questions or use my features, like asking what they need help with.
Avoid complex terms and keep it concise. Check for missing key points like multilingual support and specific capabilities. Ensure the response meets all requirements, including conversational style and simplicity.
====================Full response====================
Hello! I'm Qwen, a large-scale language model under Alibaba Group. I can answer questions, create text like stories, official documents, emails, scripts, perform logical reasoning, programming, express opinions, play games, and more. I'm proficient in multiple languages, including but not limited to Chinese, English, German, French, and Spanish. How can I help you?
HTTP
Example code
curl
For hybrid thinking models, set enable_thinking to true to enable thinking mode. The enable_thinking parameter has no effect on models qwen3-30b-a3b-thinking-2507, qwen3-235b-a22b-thinking-2507, and QwQ.
# ======= Important notes =======
# API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
# === Delete this comment before running ===
curl -X POST "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/text-generation/generation" \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-DashScope-SSE: enable" \
  -d '{
    "model": "qwen-plus",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": "Who are you?"
        }
      ]
    },
    "parameters": {
      "enable_thinking": true,
      "incremental_output": true,
      "result_format": "message"
    }
  }'
Response
id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","reasoning_content":"Hmm","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":14,"input_tokens":11,"output_tokens":3},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}
id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","reasoning_content":",","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":15,"input_tokens":11,"output_tokens":4},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}
id:3
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","reasoning_content":"user","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":16,"input_tokens":11,"output_tokens":5},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}
id:4
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","reasoning_content":"asked","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":17,"input_tokens":11,"output_tokens":6},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}
id:5
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","reasoning_content":"\"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":18,"input_tokens":11,"output_tokens":7},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}
......
id:358
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"help","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":373,"input_tokens":11,"output_tokens":362},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}
id:359
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":",","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":374,"input_tokens":11,"output_tokens":363},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}
id:360
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"welcome","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":375,"input_tokens":11,"output_tokens":364},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}
id:361
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"anytime","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":376,"input_tokens":11,"output_tokens":365},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}
id:362
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"tell","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":377,"input_tokens":11,"output_tokens":366},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}
id:363
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"me","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":378,"input_tokens":11,"output_tokens":367},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}
id:364
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"!","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":378,"input_tokens":11,"output_tokens":367},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}
id:365
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","reasoning_content":"","role":"assistant"},"finish_reason":"stop"}]},"usage":{"total_tokens":378,"input_tokens":11,"output_tokens":367},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}
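Each event in the response above carries an id, an event name, a comment line (:HTTP_STATUS/200), and a JSON data payload. The sketch below shows one way to read that format without the SDK. It is a simplified reader, not a full SSE parser: it assumes each data field fits on one line and arrives last within its event, and the names iter_dashscope_events and summarize are hypothetical. The sample lines are abbreviated from the response above (request_id omitted).

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_dashscope_events(lines: Iterable[str]) -> Iterator[Tuple[str, str, dict]]:
    """Yield (id, event name, decoded data) for each event in the stream."""
    event_id, event_name = "", ""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(":"):
            continue  # Blank separator or SSE comment such as ":HTTP_STATUS/200".
        field, _, value = line.partition(":")
        value = value.strip()
        if field == "id":
            event_id = value
        elif field == "event":
            event_name = value
        elif field == "data":
            yield event_id, event_name, json.loads(value)


def summarize(lines: Iterable[str]) -> dict:
    """Assemble text and keep the most recent usage block."""
    reasoning, answer, usage, finished = [], [], None, False
    for _event_id, _name, data in iter_dashscope_events(lines):
        choice = data["output"]["choices"][0]
        reasoning.append(choice["message"].get("reasoning_content") or "")
        answer.append(choice["message"].get("content") or "")
        usage = data.get("usage", usage)
        # In the sample, unfinished packets carry the string "null", not JSON null.
        finished = choice.get("finish_reason") == "stop"
    return {"reasoning": "".join(reasoning), "answer": "".join(answer),
            "usage": usage, "finished": finished}


SAMPLE_LINES = [
    "id:1", "event:result", ":HTTP_STATUS/200",
    'data:{"output":{"choices":[{"message":{"content":"","reasoning_content":"Hmm","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":14,"input_tokens":11,"output_tokens":3}}',
    "......",
    "id:364", "event:result", ":HTTP_STATUS/200",
    'data:{"output":{"choices":[{"message":{"content":"!","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":378,"input_tokens":11,"output_tokens":367}}',
    "id:365", "event:result", ":HTTP_STATUS/200",
    'data:{"output":{"choices":[{"message":{"content":"","reasoning_content":"","role":"assistant"},"finish_reason":"stop"}]},"usage":{"total_tokens":378,"input_tokens":11,"output_tokens":367}}',
]

if __name__ == "__main__":
    print(summarize(SAMPLE_LINES))
```

In the sample, the usage counters grow with each event, so the sketch keeps only the latest block. Treat that as an observation about this sample rather than a documented guarantee.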
Going live
- Performance and resource management: In backend services, maintaining an HTTP persistent connection for each streaming request consumes resources. Configure your service with appropriate connection pool size and timeout values. In high concurrency scenarios, monitor file descriptor usage to prevent exhaustion.
- Client-side rendering: On web frontends, use the ReadableStream and TextDecoderStream APIs to smoothly handle and render SSE event streams, delivering the best user experience.
- Key metrics: Monitor Time to First Token (TTFT), the core metric for streaming experience. Also monitor API error rate and average response time.
- Alerting: Set alerts for abnormal API error rates, especially 4xx and 5xx errors.
- Nginx proxy configuration: If using Nginx as a reverse proxy, its default output buffering (proxy_buffering) breaks the real-time nature of streaming responses. To ensure data is pushed to clients immediately, disable this feature by setting proxy_buffering off in your Nginx configuration file.
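As an illustration of the TTFT metric mentioned above, the following sketch times how long a stream takes to deliver its first non-empty text chunk. It is one possible approach, not a prescribed monitoring method. The names measure_ttft and fake_stream are hypothetical, and fake_stream only simulates a model stream with short delays.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_ttft(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, str]:
    """Return (time to first non-empty chunk, total time, assembled text)."""
    start = clock()
    ttft: Optional[float] = None
    parts = []
    for text in chunks:
        if ttft is None and text:
            ttft = clock() - start  # First chunk that actually carries text.
        parts.append(text)
    total = clock() - start
    return ttft, total, "".join(parts)


def fake_stream():
    """Hypothetical stand-in for a model stream; yields text after short delays."""
    for piece in ["", "Hello", ", world"]:
        time.sleep(0.01)
        yield piece


if __name__ == "__main__":
    ttft, total, text = measure_ttft(fake_stream())
    print(f"TTFT={ttft:.3f}s total={total:.3f}s text={text!r}")
```

For a real measurement, pass a generator that sends the request on its first iteration; otherwise the time spent opening the connection is excluded from the result.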
Error codes
If the model call fails and returns an error message, see Error codes for resolution.
FAQ
Q: Why is there no usage information in the response?
A: The OpenAI protocol does not return usage information by default. Set the stream_options parameter to include usage information in the final returned packet.
Q: Does enabling streaming output affect the model's response quality?
A: No. However, some models support only streaming output, and non-streaming calls might cause timeout errors. We recommend using streaming output.
Q: What is the difference between non-streaming and streaming calls?
A: Key differences:
- Timeout limit: Non-streaming calls have a fixed maximum timeout of 300 seconds. If the model does not finish generating within 300 seconds, the request times out and fails.
- Output structure: Non-streaming calls return the complete response (a single JSON object) at once. Streaming calls return data chunks progressively via the SSE protocol, with each chunk containing part of the generated content. The client must assemble these chunks.
- Feature compatibility: Both support features like JSON Mode and Function Call with no functional differences.
We recommend using streaming output to avoid timeouts and improve user experience.
Q: Does streaming output support JSON Mode (structured output)?
A: Yes. Set stream to true and response_format to {"type": "json_object"} in the request. The model will return JSON-formatted content fragments progressively. The final assembled output will be valid JSON.
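Because individual fragments are usually not valid JSON on their own, a client would typically buffer them and parse only once the stream ends. The sketch below illustrates that idea; the helper names try_parse and assemble_json and the sample fragments are hypothetical and do not come from a real response.

```python
import json
from typing import Iterable, Optional


def try_parse(buffer: str) -> Optional[dict]:
    """Return the parsed object, or None while the buffer is still incomplete."""
    try:
        return json.loads(buffer)
    except json.JSONDecodeError:
        return None


def assemble_json(fragments: Iterable[str]) -> dict:
    """Concatenate streamed content fragments and parse the final result."""
    buffer = "".join(fragments)
    return json.loads(buffer)  # Raises json.JSONDecodeError if the stream was cut short.


if __name__ == "__main__":
    # Hypothetical fragments as they might arrive in successive chunks.
    fragments = ['{"name": "Qw', 'en", "sup', 'ports_stream": true}']
    print(try_parse(fragments[0]))   # None: a single fragment is incomplete
    print(assemble_json(fragments))  # Parsed only after all fragments arrive
```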