
Alibaba Cloud Model Studio: Streaming output

Last Updated: Jun 17, 2026

In real-time chat or long-text generation applications, long wait times degrade user experience and may trigger server-side timeouts, causing tasks to fail. Streaming output addresses these issues by continuously returning fragments of text as the model generates them.

How it works

Streaming output uses the Server-Sent Events (SSE) protocol. After a streaming request starts, the server keeps a persistent HTTP connection open with the client. Each time the model generates a block of text (called a chunk), the server immediately pushes that chunk through the connection. Once all content is generated, the server sends an end signal.

The client listens to the event stream and processes text chunks as they arrive, for example by rendering characters one by one in the interface. Non-streaming calls, by contrast, return all content at once after generation finishes.

Billing

Streaming output is billed the same way as non-streaming calls: charges are based on the number of input tokens and output tokens in the request.

If a request is interrupted, output tokens are counted only for the portion generated before the server received the termination request.

How to use

Important

Qwen3 open-source edition, QwQ commercial and open-source editions, QVQ, and Qwen-Omni support only streaming output.

Step 1: Configure your API key and select a region

You must have obtained an API key and configured it as an environment variable.

Storing your API key in an environment variable (DASHSCOPE_API_KEY) is more secure than hard-coding it.

Step 2: Make a streaming request

OpenAI compatible

  • How to enable

    Set stream to true.

  • View token usage

    The OpenAI protocol does not return token usage by default. Set stream_options to {"include_usage": true} so that the last data chunk includes token usage information.

Python

import os
from openai import OpenAI

# 1. Prepare: Initialize the client
client = OpenAI(
    # Configure the API key using an environment variable to avoid hard coding.
    api_key=os.environ["DASHSCOPE_API_KEY"],
    # The API key is tightly bound to a region. Ensure base_url matches the region of your API key.
    # Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
    base_url="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1",
)

# 2. Make a streaming request
completion = client.chat.completions.create(
    model="qwen-plus",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Please introduce yourself"}
    ],
    stream=True,
    stream_options={"include_usage": True}
)

# 3. Handle the streaming response
# Store response fragments in a list. Joining them at the end is more efficient than repeated string concatenation.
content_parts = []
print("AI: ", end="", flush=True)

for chunk in completion:
    if chunk.choices:
        content = chunk.choices[0].delta.content or ""
        print(content, end="", flush=True)
        content_parts.append(content)
    elif chunk.usage:
        print("\n--- Request usage ---")
        print(f"Input Tokens: {chunk.usage.prompt_tokens}")
        print(f"Output Tokens: {chunk.usage.completion_tokens}")
        print(f"Total Tokens: {chunk.usage.total_tokens}")

full_response = "".join(content_parts)
# print(f"\n--- Full response ---\n{full_response}")

Response

AI: Hello! I am Qwen, a large-scale language model independently developed by Tongyi Lab under Alibaba Group. I can answer questions, create content such as stories, official documents, emails, scripts, perform logical reasoning, programming, express opinions, play games, and more. I support multiple languages, including but not limited to Chinese, English, German, French, and Spanish. If you have any questions or need help, feel free to ask me anytime!
--- Request usage ---
Input Tokens: 26
Output Tokens: 87
Total Tokens: 113

Node.js

import OpenAI from "openai";

async function main() {
    // 1. Prepare: Initialize the client
    // Configure the API key using an environment variable to avoid hard coding.
    if (!process.env.DASHSCOPE_API_KEY) {
        throw new Error("Set the DASHSCOPE_API_KEY environment variable");
    }
    // The API key is tightly bound to a region. Ensure baseURL matches the region of your API key.
    // Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
    const client = new OpenAI({
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1",
    });

    try {
        // 2. Make a streaming request
        const stream = await client.chat.completions.create({
            model: "qwen-plus",
            messages: [
                { role: "system", content: "You are a helpful assistant." },
                { role: "user", content: "Please introduce yourself" },
            ],
            stream: true,
            // Purpose: Get token usage in the last chunk.
            stream_options: { include_usage: true },
        });

        // 3. Handle the streaming response
        const contentParts = [];
        process.stdout.write("AI: ");
        
        for await (const chunk of stream) {
            // The last chunk contains no choices but includes usage information.
            if (chunk.choices && chunk.choices.length > 0) {
                const content = chunk.choices[0]?.delta?.content || "";
                process.stdout.write(content);
                contentParts.push(content);
            } else if (chunk.usage) {
                // Request complete. Print token usage.
                console.log("\n--- Request usage ---");
                console.log(`Input Tokens: ${chunk.usage.prompt_tokens}`);
                console.log(`Output Tokens: ${chunk.usage.completion_tokens}`);
                console.log(`Total Tokens: ${chunk.usage.total_tokens}`);
            }
        }
        
        const fullResponse = contentParts.join("");
        // console.log(`\n--- Full response ---\n${fullResponse}`);

    } catch (error) {
        console.error("Request failed:", error);
    }
}

main();

Response

AI: Hello! I am Qwen, a large-scale language model independently developed by Tongyi Lab under Alibaba Group. I can answer questions, create content such as stories, official documents, emails, scripts, perform logical reasoning, programming, express opinions, play games, and more. I support multiple languages, including but not limited to Chinese, English, German, French, and Spanish. If you have any questions or need help, feel free to ask me anytime!
--- Request usage ---
Input Tokens: 26
Output Tokens: 89
Total Tokens: 115

curl

Request

# ======= Important notes =======
# Ensure the DASHSCOPE_API_KEY environment variable is set
# API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key

# === Delete this comment before running ===

curl -X POST https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
--no-buffer \
-d '{
    "model": "qwen-plus",
    "messages": [
        {"role": "user", "content": "Who are you?"}
    ],
    "stream": true,
    "stream_options": {"include_usage": true}
}'

Response

The response follows the SSE protocol. Each line starting with data: represents a data chunk.

data: {"choices":[{"delta":{"content":"","role":"assistant"},"index":0,"logprobs":null,"finish_reason":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}

data: {"choices":[{"finish_reason":null,"delta":{"content":"I am"},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}

data: {"choices":[{"delta":{"content":" from"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}

data: {"choices":[{"delta":{"content":" Alibaba"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}

data: {"choices":[{"delta":{"content":"'s large-scale language"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}

data: {"choices":[{"delta":{"content":" model, my name is Qwen"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}

data: {"choices":[{"delta":{"content":"."},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}

data: {"choices":[{"finish_reason":"stop","delta":{"content":""},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}

data: {"choices":[],"object":"chat.completion.chunk","usage":{"prompt_tokens":22,"completion_tokens":17,"total_tokens":39},"created":1726132850,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-428b414f-fdd4-94c6-b179-8f576ad653a8"}

data: [DONE]

  • data:: The message payload, usually a JSON string.

  • [DONE]: Indicates the end of the entire streaming response.
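
The data: lines shown above can also be consumed without an SDK. The following is a minimal sketch rather than an official client: parse_openai_sse is a hypothetical helper name, and the sample lines are a shortened, hand-written stand-in for the response above.

```python
import json


def parse_openai_sse(lines):
    """Collect the generated text and token usage from OpenAI-compatible SSE lines."""
    parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # Skip blank separator lines and anything that is not a data line.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # End of the entire streaming response.
        chunk = json.loads(payload)
        if chunk.get("choices"):
            delta = chunk["choices"][0].get("delta", {})
            parts.append(delta.get("content") or "")
        elif chunk.get("usage"):
            # With include_usage enabled, the last chunk has empty choices and carries usage.
            usage = chunk["usage"]
    return "".join(parts), usage


# Hypothetical sample, trimmed to the fields this sketch reads.
sample_lines = [
    'data: {"choices":[{"delta":{"content":"","role":"assistant"},"index":0,"finish_reason":null}],"usage":null}',
    "",
    'data: {"choices":[{"delta":{"content":"I am"},"index":0,"finish_reason":null}],"usage":null}',
    "",
    'data: {"choices":[{"delta":{"content":" Qwen."},"index":0,"finish_reason":null}],"usage":null}',
    "",
    'data: {"choices":[{"delta":{"content":""},"index":0,"finish_reason":"stop"}],"usage":null}',
    "",
    'data: {"choices":[],"usage":{"prompt_tokens":22,"completion_tokens":17,"total_tokens":39}}',
    "",
    "data: [DONE]",
]

text, usage = parse_openai_sse(sample_lines)
print(text)
print(usage)
```

In a real client, pass the lines read from the HTTP response body in place of sample_lines.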

DashScope

  • How to enable

    The method to enable streaming output varies by SDK or tool:

    • Python SDK: Set the stream parameter to True.

    • Java SDK: Use the streamCall interface.

    • cURL: Set the header X-DashScope-SSE to enable.

  • Enable incremental output

    The DashScope protocol supports both incremental and non-incremental streaming output:

    • Incremental (recommended): Each data chunk contains only newly generated content. Set incremental_output to true to enable incremental streaming output.

      Example: ["I love","eating","apples"]
    • Non-incremental: Each data chunk contains all previously generated content, wasting network bandwidth and increasing client processing load. Set incremental_output to false to enable non-incremental streaming output.

      Example: ["I love","I love eating","I love eating apples"]
  • View token usage

    Each data chunk includes real-time token usage information.
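
The difference between the two modes affects how the client rebuilds the full text. The sketch below illustrates this with plain lists. The helper names are hypothetical, and the sample chunks mirror the examples above with leading spaces added as an assumption so that the joined text reads naturally.

```python
def join_incremental(chunks):
    """Incremental mode: every chunk is new text, so concatenate all of them."""
    return "".join(chunks)


def last_non_incremental(chunks):
    """Non-incremental mode: every chunk repeats earlier text, so keep only the last one."""
    return chunks[-1] if chunks else ""


def to_deltas(cumulative_chunks):
    """Convert non-incremental chunks into the newly generated part of each chunk."""
    deltas = []
    previous = ""
    for chunk in cumulative_chunks:
        deltas.append(chunk[len(previous):])
        previous = chunk
    return deltas


incremental = ["I love", " eating", " apples"]
non_incremental = ["I love", "I love eating", "I love eating apples"]

print(join_incremental(incremental))
print(last_non_incremental(non_incremental))
print(to_deltas(non_incremental))
```

Both modes yield the same final text, but the non-incremental stream resends earlier content in every chunk, which is why incremental output is the recommended setting.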

Python

import os
from http import HTTPStatus
import dashscope
from dashscope import Generation

# 1. Prepare: Configure the API key and region
# Configure the API key using an environment variable to avoid hard coding.
try:
    dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]
except KeyError:
    raise ValueError("Set the DASHSCOPE_API_KEY environment variable")

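# The API key is tightly bound to a region. Ensure base_http_api_url matches the region of your API key.
# Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.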
dashscope.base_http_api_url = "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1"

# 2. Make a streaming request
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Please introduce yourself"},
]

try:
    responses = Generation.call(
        model="qwen-plus",
        messages=messages,
        result_format="message",
        stream=True,
        # Key: Set to True for incremental output, which offers better performance.
        incremental_output=True,
    )

    # 3. Handle the streaming response
    content_parts = []
    print("AI: ", end="", flush=True)

    for resp in responses:
        if resp.status_code == HTTPStatus.OK:
            content = resp.output.choices[0].message.content
            print(content, end="", flush=True)
            content_parts.append(content)

            # Check if this is the last packet
            if resp.output.choices[0].finish_reason == "stop":
                usage = resp.usage
                print("\n--- Request usage ---")
                print(f"Input Tokens: {usage.input_tokens}")
                print(f"Output Tokens: {usage.output_tokens}")
                print(f"Total Tokens: {usage.total_tokens}")
        else:
            # Handle errors
            print(
                f"\nRequest failed: request_id={resp.request_id}, code={resp.code}, message={resp.message}"
            )
            break

    full_response = "".join(content_parts)
    # print(f"\n--- Full response ---\n{full_response}")

except Exception as e:
    print(f"An unknown error occurred: {e}")

Response

AI: Hello! I am Qwen, a large-scale language model independently developed by Tongyi Lab under Alibaba Group. I can help you answer questions, create content such as stories, official documents, emails, scripts, perform logical reasoning, programming, express opinions, play games, and more. I support multiple languages, including but not limited to Chinese, English, German, French, and Spanish. If you have any questions or need help, feel free to ask me anytime!
--- Request usage ---
Input Tokens: 26
Output Tokens: 91
Total Tokens: 117

Java

import com.alibaba.dashscope.aigc.generation.Generation;
import com.alibaba.dashscope.aigc.generation.GenerationParam;
import com.alibaba.dashscope.aigc.generation.GenerationResult;
import com.alibaba.dashscope.common.Message;
import com.alibaba.dashscope.common.Role;
import io.reactivex.Flowable;
import io.reactivex.schedulers.Schedulers;

import java.util.Arrays;
import java.util.concurrent.CountDownLatch;
import com.alibaba.dashscope.protocol.Protocol;

public class Main {
    public static void main(String[] args) {
        // 1. Get the API key
        String apiKey = System.getenv("DASHSCOPE_API_KEY");
        if (apiKey == null || apiKey.isEmpty()) {
            System.err.println("Set the DASHSCOPE_API_KEY environment variable");
            return;
        }

        // 2. Initialize the Generation instance
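        // The API key is tightly bound to a region. Ensure the base URL matches the region of your API key.
        // Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.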
        Generation gen = new Generation(Protocol.HTTP.getValue(), "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1");
        CountDownLatch latch = new CountDownLatch(1);

        // 3. Build request parameters
        GenerationParam param = GenerationParam.builder()
                .apiKey(apiKey)
                .model("qwen-plus")
                .messages(Arrays.asList(
                        Message.builder()
                                .role(Role.USER.getValue())
                                .content("Introduce yourself")
                                .build()
                ))
                .resultFormat(GenerationParam.ResultFormat.MESSAGE)
                .incrementalOutput(true) // Enable incremental output for streaming
                .build();
        // 4. Make a streaming call and handle the response
        try {
            Flowable<GenerationResult> result = gen.streamCall(param);
            StringBuilder fullContent = new StringBuilder();
            System.out.print("AI: ");
            result
                    .subscribeOn(Schedulers.io()) // Execute request on IO thread
                    .observeOn(Schedulers.computation()) // Process response on computation thread
                    .subscribe(
                            // onNext: Handle each response fragment
                            message -> {
                                String content = message.getOutput().getChoices().get(0).getMessage().getContent();
                                String finishReason = message.getOutput().getChoices().get(0).getFinishReason();
                                // Output content
                                System.out.print(content);
                                fullContent.append(content);
                                // When finishReason is not null, it indicates the last chunk. Output usage info.
                                if (finishReason != null && !"null".equals(finishReason)) {
                                    System.out.println("\n--- Request usage ---");
                                    System.out.println("Input Tokens: " + message.getUsage().getInputTokens());
                                    System.out.println("Output Tokens: " + message.getUsage().getOutputTokens());
                                    System.out.println("Total Tokens: " + message.getUsage().getTotalTokens());
                                }
                                System.out.flush(); // Flush output immediately
                            },
                            // onError: Handle errors
                            error -> {
                                System.err.println("\nRequest failed: " + error.getMessage());
                                latch.countDown();
                            },
                            // onComplete: Completion callback
                            () -> {
                                System.out.println(); // New line
                                // System.out.println("Full response: " + fullContent.toString());
                                latch.countDown();
                            }
                    );
            // Main thread waits for the async task to complete
            latch.await();
            System.out.println("Program execution complete");
        } catch (Exception e) {
            System.err.println("Request exception: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Response

AI: Hello! I am Qwen, a large-scale language model independently developed by Tongyi Lab under Alibaba Group. I can help you answer questions, create content such as stories, official documents, emails, scripts, perform logical reasoning, programming, express opinions, play games, and more. I support multiple languages, including but not limited to Chinese, English, German, French, and Spanish. If you have any questions or need help, feel free to ask me anytime!
--- Request usage ---
Input Tokens: 26
Output Tokens: 91
Total Tokens: 117

curl

Request

# ======= Important notes =======
# Ensure the DASHSCOPE_API_KEY environment variable is set
# API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
# === Delete this comment before running ===
curl -X POST https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/text-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-SSE: enable" \
-d '{
    "model": "qwen-plus",
    "input":{
        "messages":[      
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "Who are you?"
            }
        ]
    },
    "parameters": {
        "result_format": "message",
        "incremental_output":true
    }
}'

Response

The response follows the Server-Sent Events (SSE) format. Each message includes:

  • id: Data chunk number.

  • event: Event type, always "result".

  • :HTTP_STATUS/<code>: HTTP status code information, sent as an SSE comment line (a line that starts with a colon).

  • data: JSON-formatted data.

id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"I am","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":27,"output_tokens":1,"input_tokens":26,"prompt_tokens_details":{"cached_tokens":0}},"request_id":"d30a9914-ac97-9102-b746-ce0cb35e3fa2"}

id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"Qwen","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":30,"output_tokens":4,"input_tokens":26,"prompt_tokens_details":{"cached_tokens":0}},"request_id":"d30a9914-ac97-9102-b746-ce0cb35e3fa2"}

id:3
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":" from Alibaba","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":33,"output_tokens":7,"input_tokens":26,"prompt_tokens_details":{"cached_tokens":0}},"request_id":"d30a9914-ac97-9102-b746-ce0cb35e3fa2"}

...

id:13
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"or need help, feel free to","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":90,"output_tokens":64,"input_tokens":26,"prompt_tokens_details":{"cached_tokens":0}},"request_id":"d30a9914-ac97-9102-b746-ce0cb35e3fa2"}

id:14
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"ask me!","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":92,"output_tokens":66,"input_tokens":26,"prompt_tokens_details":{"cached_tokens":0}},"request_id":"d30a9914-ac97-9102-b746-ce0cb35e3fa2"}

id:15
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","role":"assistant"},"finish_reason":"stop"}]},"usage":{"total_tokens":92,"output_tokens":66,"input_tokens":26,"prompt_tokens_details":{"cached_tokens":0}},"request_id":"d30a9914-ac97-9102-b746-ce0cb35e3fa2"}
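
As an illustration of the message layout described above, the following sketch splits a raw response body into events and decodes each data field. It is a simplified reading of the format, not the DashScope SDK's parser: parse_sse_events is a hypothetical helper, and the sample body is a shortened, hand-written stand-in for the response above.

```python
import json


def parse_sse_events(body):
    """Split an SSE body into events (separated by blank lines) and decode the data field."""
    events = []
    for block in body.strip().split("\n\n"):
        event = {}
        for line in block.splitlines():
            if line.startswith(":"):
                # Comment line; in the response above it carries the HTTP status.
                event["comment"] = line[1:]
                continue
            field, _, value = line.partition(":")
            # SSE allows one optional space after the colon.
            event[field] = value[1:] if value.startswith(" ") else value
        if "data" in event:
            event["data"] = json.loads(event["data"])
        events.append(event)
    return events


# Hypothetical sample, trimmed to the fields this sketch reads.
sample_body = "\n".join([
    "id:1",
    "event:result",
    ":HTTP_STATUS/200",
    'data:{"output":{"choices":[{"message":{"content":"I am","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":27,"output_tokens":1,"input_tokens":26}}',
    "",
    "id:2",
    "event:result",
    ":HTTP_STATUS/200",
    'data:{"output":{"choices":[{"message":{"content":" Qwen","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":28,"output_tokens":2,"input_tokens":26}}',
    "",
    "id:3",
    "event:result",
    ":HTTP_STATUS/200",
    'data:{"output":{"choices":[{"message":{"content":"","role":"assistant"},"finish_reason":"stop"}]},"usage":{"total_tokens":28,"output_tokens":2,"input_tokens":26}}',
    "",
])

events = parse_sse_events(sample_body)
parts = []
final_usage = None
for event in events:
    choice = event["data"]["output"]["choices"][0]
    parts.append(choice["message"]["content"])
    # In the response above, unfinished chunks carry the string "null" as finish_reason.
    if choice["finish_reason"] == "stop":
        final_usage = event["data"]["usage"]

text = "".join(parts)
print(text)
print(final_usage)
```

Because incremental_output is enabled in the request, the content fields can be concatenated directly, and the usage in the chunk whose finish_reason is "stop" reflects the whole request.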

Streaming output for multimodal models

Multimodal models support adding images, audio, and other content to conversations. Their streaming output implementation differs from text-only models in the following ways:

  • User message construction: Multimodal model inputs include not only text but also images, audio, and other multimodal information.

  • DashScope SDK interface: Use the MultiModalConversation interface in the DashScope Python SDK, or the MultiModalConversation class in the DashScope Java SDK.

For multimodal models, see Image and video understanding, Text extraction, Audio understanding (Qwen3-Omni-Captioner), Kimi, and others. The Qwen-Omni model supports only streaming output because its output can include text, audio, and other multimodal content. Its result parsing differs from that of other models. For details, see Omni-modal.
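
The user message is shaped differently in the two protocols used in this section: the OpenAI compatible examples use typed parts such as image_url and text, while the DashScope examples use single-key objects such as image and text. The sketch below converts one shape into the other. to_dashscope_content is a hypothetical helper that covers only the two part types appearing in this section.

```python
def to_dashscope_content(openai_content):
    """Convert OpenAI-compatible content parts into the DashScope content shape."""
    converted = []
    for part in openai_content:
        if part["type"] == "image_url":
            converted.append({"image": part["image_url"]["url"]})
        elif part["type"] == "text":
            converted.append({"text": part["text"]})
        else:
            # Other part types are outside the scope of this sketch.
            raise ValueError(f"Unsupported part type: {part['type']}")
    return converted


openai_content = [
    {
        "type": "image_url",
        "image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
    },
    {"type": "text", "text": "What scene is depicted in the image?"},
]

dashscope_content = to_dashscope_content(openai_content)
print(dashscope_content)
```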

OpenAI compatible

Python

from openai import OpenAI
import os

client = OpenAI(
    # API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If you haven't configured an environment variable, replace the next line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
    base_url="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-vl-plus",  # Replace with other multimodal models as needed and adjust messages accordingly
    messages=[
        {"role": "user",
        "content": [{"type": "image_url",
                    "image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},},
                    {"type": "text", "text": "What scene is depicted in the image?"}]}],
    stream=True,
    # stream_options={"include_usage": True}
)
full_content = ""
print("Streaming output content:")
for chunk in completion:
    # If stream_options.include_usage is True, the last chunk's choices field is an empty list and should be skipped (token usage can be obtained via chunk.usage)
    # delta.content can be None or an empty string, so check that it is non-empty before concatenating
    if chunk.choices and chunk.choices[0].delta.content:
        full_content += chunk.choices[0].delta.content
        print(chunk.choices[0].delta.content)
print(f"Full content: {full_content}")

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        // If you haven't configured an environment variable, replace the next line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
        baseURL: "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1"
    }
);

const completion = await openai.chat.completions.create({
    model: "qwen3-vl-plus",  //  Replace with other multimodal models as needed and adjust messages accordingly
    messages: [
        {role: "user",
        content: [{"type": "image_url",
                    "image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},},
                    {"type": "text", "text": "What scene is depicted in the image?"}]}],
    stream: true,
    // stream_options: { include_usage: true },
});

let fullContent = ""
console.log("Streaming output content:")
for await (const chunk of completion) {
    // If stream_options.include_usage is true, the last chunk's choices field is an empty array and should be skipped (token usage can be obtained via chunk.usage)
    if (chunk.choices[0] && chunk.choices[0].delta.content != null) {
      fullContent += chunk.choices[0].delta.content;
      console.log(chunk.choices[0].delta.content);
    }
}
console.log(`Full output content: ${fullContent}`)

curl

# ======= Important notes =======
# Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
# API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# === Delete this comment before running ===

curl --location 'https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen3-vl-plus",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
          }
        },
        {
          "type": "text",
          "text": "What scene is depicted in the image?"
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{"include_usage":true}
}'

DashScope

Python

import os
from dashscope import MultiModalConversation
import dashscope
# Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
dashscope.base_http_api_url = 'https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
            {"text": "What scene is depicted in the image?"}
        ]
    }
]

responses = MultiModalConversation.call(
    # API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If you haven't configured an environment variable, replace the next line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen3-vl-plus',  #  Replace with other multimodal models as needed and adjust messages accordingly
    messages=messages,
    stream=True,
    incremental_output=True)
    
full_content = ""
print("Streaming output content:")
for response in responses:
    # message.content is a list of parts such as [{"text": "..."}]; skip chunks without text
    content = response.output.choices[0].message.content
    if content and "text" in content[0]:
        print(content[0]["text"])
        full_content += content[0]["text"]
print(f"Full content: {full_content}")

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
        Constants.baseHttpApiUrl="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1";
    }
    public static void streamCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        // must create mutable map.
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"),
                        Collections.singletonMap("text", "What scene is depicted in the image?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                // If you haven't configured an environment variable, replace the next line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")  //  Replace with other multimodal models as needed and adjust messages accordingly
                .messages(Arrays.asList(userMessage))
                .incrementalOutput(true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(item -> {
            try {
                List<Map<String, Object>> content = item.getOutput().getChoices().get(0).getMessage().getContent();
                // Check if content exists and is not empty
                if (content != null && !content.isEmpty()) {
                    System.out.println(content.get(0).get("text"));
                }
            } catch (Exception e) {
                // Report the failure and exit with a non-zero status instead of exiting silently
                System.err.println("Failed to process chunk: " + e.getMessage());
                System.exit(1);
            }
        });
    }

    public static void main(String[] args) {
        try {
            streamCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important notes =======
# API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
# === Delete this comment before running ===

curl -X POST https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen3-vl-plus",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
                    {"text": "What scene is depicted in the image?"}
                ]
            }
        ]
    },
    "parameters": {
        "incremental_output": true
    }
}'

Streaming output for thinking models

Thinking models first return reasoning_content (the thought process), then content (the response). To tell whether a packet belongs to the thinking stage or the responding stage, check which of these two fields it carries.

For details on thinking models, see Deep thinking, Image and video understanding, and Visual reasoning.
For the streaming output implementation of Qwen3-Omni-Flash (thinking mode), see Omni-modal.

OpenAI compatible

Below is the response format when calling the thinking mode of the qwen-plus model using the OpenAI Python SDK in streaming mode:

# Thinking stage
...
ChoiceDelta(content=None, function_call=None, refusal=None, role=None, tool_calls=None, reasoning_content='Cover all key points while')
ChoiceDelta(content=None, function_call=None, refusal=None, role=None, tool_calls=None, reasoning_content='remaining natural and fluent.')
# Response stage
ChoiceDelta(content='Hello! I am **Qwen', function_call=None, refusal=None, role=None, tool_calls=None, reasoning_content=None)
ChoiceDelta(content='** (', function_call=None, refusal=None, role=None, tool_calls=None, reasoning_content=None)
...
  • If reasoning_content is not None and content is None, the current stage is thinking.

  • If reasoning_content is None and content is not None, the current stage is responding.

  • If both are None, the stage remains the same as the previous packet.
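The three rules above amount to a small state machine. The following is a minimal sketch of that logic, assuming only the two delta fields shown in the sample packets. The function name next_stage and the sample tuples are hypothetical and are not part of the OpenAI SDK.

```python
from typing import Optional


def next_stage(previous: str, reasoning_content: Optional[str], content: Optional[str]) -> str:
    """Return "thinking" or "responding" for one streamed delta.

    `previous` is the stage of the preceding packet. It is returned unchanged
    when the packet does not match either of the first two rules.
    """
    if reasoning_content is not None and content is None:
        return "thinking"
    if reasoning_content is None and content is not None:
        return "responding"
    return previous


# Hypothetical deltas as (reasoning_content, content) pairs, shaped like the packets above
deltas = [
    ("Cover all key points while", None),
    (None, None),  # carries no text: the stage stays the same
    (None, "Hello! I am **Qwen"),
    (None, "** ("),
]

stage = "thinking"
stages = []
for reasoning, content in deltas:
    stage = next_stage(stage, reasoning, content)
    stages.append(stage)

print(stages)  # ['thinking', 'thinking', 'responding', 'responding']
```

Keeping the previous stage as an explicit argument means a packet with no text never flips the stage by accident.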

Python

Example code

from openai import OpenAI
import os

# Initialize the OpenAI client
client = OpenAI(
    # If you haven't configured an environment variable, replace with your Alibaba Cloud Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1",
)

messages = [{"role": "user", "content": "Who are you"}]

completion = client.chat.completions.create(
    model="qwen-plus",  # Replace with other deep-thinking models as needed
    messages=messages,
    # The enable_thinking parameter enables the thinking process. This parameter has no effect on models qwen3-30b-a3b-thinking-2507, qwen3-235b-a22b-thinking-2507, and QwQ.
    extra_body={"enable_thinking": True},
    stream=True,
    # stream_options={
    #     "include_usage": True
    # },
)

reasoning_content = ""  # Full thought process
answer_content = ""  # Full response
is_answering = False  # Whether in the response stage
print("\n" + "=" * 20 + "Thought process" + "=" * 20 + "\n")

for chunk in completion:
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
        continue

    delta = chunk.choices[0].delta

    # Collect only thinking content
    if hasattr(delta, "reasoning_content") and delta.reasoning_content is not None:
        if not is_answering:
            print(delta.reasoning_content, end="", flush=True)
        reasoning_content += delta.reasoning_content

    # Received content, start responding
    if hasattr(delta, "content") and delta.content:
        if not is_answering:
            print("\n" + "=" * 20 + "Full response" + "=" * 20 + "\n")
            is_answering = True
        print(delta.content, end="", flush=True)
        answer_content += delta.content

Response

====================Thought process====================

Okay, the user asked "Who are you," so I need to give an accurate and friendly answer. First, I should confirm my identity as Qwen, developed by Tongyi Lab under Alibaba Group. Next, explain my main functions, like answering questions, creating text, logical reasoning, etc. Keep the tone approachable and avoid overly technical terms so the user feels comfortable. Also, avoid complex jargon and ensure the answer is concise. Additionally, include some interactive elements to encourage further questions. Finally, check for any missing key information, such as my Chinese name "Tongyi Qianwen" and English name "Qwen," along with my company and lab. Make sure the response is comprehensive and meets user expectations.
====================Full response====================

Hello! I am Qwen, a large-scale language model independently developed by Tongyi Lab under Alibaba Group. I can answer questions, create text, perform logical reasoning, programming, and more, aiming to provide high-quality information and services. You can call me Qwen or simply Tongyi Qianwen. How can I help you?

Node.js

Example code

import OpenAI from "openai";
import process from 'process';

// Initialize the openai client
const openai = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY, // Read from environment variable
    baseURL: 'https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1'
});

let reasoningContent = '';
let answerContent = '';
let isAnswering = false;

async function main() {
    try {
        const messages = [{ role: 'user', content: 'Who are you' }];
        const stream = await openai.chat.completions.create({
            // Replace with other Qwen3 models or QwQ models as needed
            model: 'qwen-plus',
            messages,
            stream: true,
            // The enable_thinking parameter enables the thinking process. This parameter has no effect on models qwen3-30b-a3b-thinking-2507, qwen3-235b-a22b-thinking-2507, and QwQ.
            enable_thinking: true
        });
        console.log('\n' + '='.repeat(20) + 'Thought process' + '='.repeat(20) + '\n');

        for await (const chunk of stream) {
            if (!chunk.choices?.length) {
                console.log('\nUsage:');
                console.log(chunk.usage);
                continue;
            }

            const delta = chunk.choices[0].delta;
            
            // Collect only thinking content
            if (delta.reasoning_content !== undefined && delta.reasoning_content !== null) {
                if (!isAnswering) {
                    process.stdout.write(delta.reasoning_content);
                }
                reasoningContent += delta.reasoning_content;
            }

            // Received content, start responding
            if (delta.content) {
                if (!isAnswering) {
                    console.log('\n' + '='.repeat(20) + 'Full response' + '='.repeat(20) + '\n');
                    isAnswering = true;
                }
                process.stdout.write(delta.content);
                answerContent += delta.content;
            }
        }
    } catch (error) {
        console.error('Error:', error);
    }
}

main();

Response

====================Thought process====================

Okay, the user asked "Who are you," so I need to state my identity. First, I should clearly say I am Qwen, a large-scale language model developed by Alibaba Cloud. Next, mention my main functions, like answering questions, creating text, logical reasoning, etc. Also emphasize my multilingual support, including Chinese and English, so users know I can handle requests in different languages. Additionally, explain my application scenarios, such as helping with learning, work, and daily life. However, since the user's question is direct, detailed information might not be necessary—keep it concise. Also, ensure a friendly tone and invite further questions. Check for any missing key information, like my version or latest updates, but the user probably doesn't need that much detail. Finally, confirm the response is accurate and error-free.
====================Full response====================

I am Qwen, a large-scale language model independently developed by Tongyi Lab under Alibaba Group. I can handle various tasks like answering questions, creating text, logical reasoning, and programming, supporting multiple languages including Chinese and English. If you have any questions or need help, feel free to tell me anytime!

HTTP

Example code

curl

For Qwen3 open-source models, set enable_thinking to true to enable thinking mode. The enable_thinking parameter has no effect on the models qwen3-30b-a3b-thinking-2507, qwen3-235b-a22b-thinking-2507, and QwQ.

curl -X POST https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen-plus",
    "messages": [
        {
            "role": "user", 
            "content": "Who are you"
        }
    ],
    "stream": true,
    "stream_options": {
        "include_usage": true
    },
    "enable_thinking": true
}'

Response

data: {"choices":[{"delta":{"content":null,"role":"assistant","reasoning_content":""},"index":0,"logprobs":null,"finish_reason":null}],"object":"chat.completion.chunk","usage":null,"created":1745485391,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-e2edaf2c-8aaf-9e54-90e2-b21dd5045503"}

.....

data: {"choices":[{"finish_reason":"stop","delta":{"content":"","reasoning_content":null},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1745485391,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-e2edaf2c-8aaf-9e54-90e2-b21dd5045503"}

data: {"choices":[],"object":"chat.completion.chunk","usage":{"prompt_tokens":10,"completion_tokens":360,"total_tokens":370},"created":1745485391,"system_fingerprint":null,"model":"qwen-plus","id":"chatcmpl-e2edaf2c-8aaf-9e54-90e2-b21dd5045503"}

data: [DONE]
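If you call the endpoint without an SDK, the raw lines above have to be decoded by hand. The following is a minimal sketch of that step, assuming the stream has already been split into text lines. The helper name parse_sse_lines is hypothetical, and the sample lines are abridged, with made-up text in the first two packets.

```python
import json


def parse_sse_lines(lines):
    """Yield decoded JSON payloads from `data:` lines until the [DONE] sentinel."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)


# Abridged, partly hypothetical sample in the same shape as the response above
sample = [
    'data: {"choices":[{"delta":{"content":null,"reasoning_content":"Okay"},"index":0,"finish_reason":null}],"usage":null}',
    '',
    'data: {"choices":[{"delta":{"content":"Hello","reasoning_content":null},"index":0,"finish_reason":null}],"usage":null}',
    '',
    'data: {"choices":[{"finish_reason":"stop","delta":{"content":"","reasoning_content":null},"index":0}],"usage":null}',
    '',
    'data: {"choices":[],"usage":{"prompt_tokens":10,"completion_tokens":360,"total_tokens":370}}',
    '',
    'data: [DONE]',
]

reasoning, answer, usage = "", "", None
for chunk in parse_sse_lines(sample):
    if not chunk["choices"]:
        # With include_usage enabled, the last packet has no choices, only usage
        usage = chunk.get("usage")
        continue
    delta = chunk["choices"][0]["delta"]
    reasoning += delta.get("reasoning_content") or ""
    answer += delta.get("content") or ""

print(reasoning, answer, usage["total_tokens"])
```

The `or ""` guards are needed because either field can be null in a given packet.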

DashScope

Below is the streaming response format when calling the thinking mode of the qwen-plus model using the DashScope Python SDK:

# Thinking stage
...
{"role": "assistant", "content": "", "reasoning_content": "High information density,"}
{"role": "assistant", "content": "", "reasoning_content": "making users feel helped."}
# Response stage
{"role": "assistant", "content": "I am Qwen", "reasoning_content": ""}
{"role": "assistant", "content": ", developed by Tongyi Lab", "reasoning_content": ""}
...
  • If reasoning_content is not "", and content is "", the current stage is thinking.

  • If reasoning_content is "", and content is not "", the current stage is responding.

  • If both are "", the stage remains the same as the previous packet.

Python

Example code

import os
from dashscope import Generation
import dashscope
dashscope.base_http_api_url = "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/"

messages = [{"role": "user", "content": "Who are you?"}]

completion = Generation.call(
    # If you haven't configured an environment variable, replace the next line with your Alibaba Cloud Model Studio API key: api_key = "sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Replace with other deep-thinking models as needed
    model="qwen-plus",
    messages=messages,
    result_format="message", # Qwen3 open-source models only support "message"; for better experience, we recommend setting this to "message" for other models too.
    # Enable deep thinking. This parameter has no effect on models qwen3-30b-a3b-thinking-2507, qwen3-235b-a22b-thinking-2507, and QwQ.
    enable_thinking=True,
    stream=True,
    incremental_output=True, # Qwen3 open-source models only support true; for better experience, we recommend setting this to true for other models too.
)

# Define full thought process
reasoning_content = ""
# Define full response
answer_content = ""
# Determine if finished thinking and started responding
is_answering = False

print("=" * 20 + "Thought process" + "=" * 20)

for chunk in completion:
    # If both thought process and response are empty, skip
    if (
        chunk.output.choices[0].message.content == ""
        and chunk.output.choices[0].message.reasoning_content == ""
    ):
        pass
    else:
        # If currently in thinking stage
        if (
            chunk.output.choices[0].message.reasoning_content != ""
            and chunk.output.choices[0].message.content == ""
        ):
            print(chunk.output.choices[0].message.reasoning_content, end="", flush=True)
            reasoning_content += chunk.output.choices[0].message.reasoning_content
        # If currently in response stage
        elif chunk.output.choices[0].message.content != "":
            if not is_answering:
                print("\n" + "=" * 20 + "Full response" + "=" * 20)
                is_answering = True
            print(chunk.output.choices[0].message.content, end="", flush=True)
            answer_content += chunk.output.choices[0].message.content

# To print the full thought process and full response, uncomment the following lines and run
# print("=" * 20 + "Full thought process" + "=" * 20 + "\n")
# print(f"{reasoning_content}")
# print("=" * 20 + "Full response" + "=" * 20 + "\n")
# print(f"{answer_content}")

Response

====================Thought process====================
Okay, the user asked: "Who are you?" I need to answer this question. First, clarify my identity as Qwen, a large-scale language model developed by Alibaba Cloud. Next, explain my functions and purposes, like answering questions, creating text, logical reasoning, etc. Also, emphasize my goal of being a helpful assistant.

Keep the expression conversational, avoiding professional jargon or complex sentence structures. Add friendly phrases like "Hello there~" to make the conversation natural. Also, ensure accuracy and don't omit key points like my developer, main functions, and usage scenarios.

Consider possible follow-up questions from the user, such as specific application examples or technical details, so subtly hint at further inquiries in the response. For example, mention "Whether it's everyday questions or professional issues, I'll do my best to help," which is both comprehensive and open-ended.

Finally, check for fluency, repetition, or redundant information to keep the response concise. Maintain a balance between friendliness and professionalism so users feel both approachable and reliable.
====================Full response====================
Hello there~ I'm Qwen, a large-scale language model developed by Alibaba Cloud. I can answer questions, create text, perform logical reasoning, programming, and more, aiming to provide help and support. Whether it's everyday questions or professional issues, I'll do my best to help. How can I assist you?

Java

Example code

// dashscope SDK version >= 2.19.4
import java.util.Arrays;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.alibaba.dashscope.aigc.generation.Generation;
import com.alibaba.dashscope.aigc.generation.GenerationParam;
import com.alibaba.dashscope.aigc.generation.GenerationResult;
import com.alibaba.dashscope.common.Message;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.Flowable;
import java.lang.System;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        Constants.baseHttpApiUrl="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1";
    }
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    private static StringBuilder reasoningContent = new StringBuilder();
    private static StringBuilder finalContent = new StringBuilder();
    private static boolean isFirstPrint = true;

    private static void handleGenerationResult(GenerationResult message) {
        String reasoning = message.getOutput().getChoices().get(0).getMessage().getReasoningContent();
        String content = message.getOutput().getChoices().get(0).getMessage().getContent();

        if (reasoning != null && !reasoning.isEmpty()) {
            reasoningContent.append(reasoning);
            if (isFirstPrint) {
                System.out.println("====================Thought process====================");
                isFirstPrint = false;
            }
            System.out.print(reasoning);
        }

        if (content != null && !content.isEmpty()) {
            finalContent.append(content);
            if (!isFirstPrint) {
                System.out.println("\n====================Full response====================");
                isFirstPrint = true;
            }
            System.out.print(content);
        }
    }
    private static GenerationParam buildGenerationParam(Message userMsg) {
        return GenerationParam.builder()
                // If you haven't configured an environment variable, replace the next line with your Alibaba Cloud Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-plus")
                .enableThinking(true)
                .incrementalOutput(true)
                .resultFormat("message")
                .messages(Arrays.asList(userMsg))
                .build();
    }
    public static void streamCallWithMessage(Generation gen, Message userMsg)
            throws NoApiKeyException, ApiException, InputRequiredException {
        GenerationParam param = buildGenerationParam(userMsg);
        Flowable<GenerationResult> result = gen.streamCall(param);
        result.blockingForEach(message -> handleGenerationResult(message));
    }

    public static void main(String[] args) {
        try {
            Generation gen = new Generation();
            Message userMsg = Message.builder().role(Role.USER.getValue()).content("Who are you?").build();
            streamCallWithMessage(gen, userMsg);
//             Print final result
//            if (reasoningContent.length() > 0) {
//                System.out.println("\n====================Full response====================");
//                System.out.println(finalContent.toString());
//            }
        } catch (ApiException | NoApiKeyException | InputRequiredException e) {
            logger.error("An exception occurred: {}", e.getMessage());
        }
        System.exit(0);
    }
}

Response

====================Thought process====================
Okay, the user asked "Who are you?", so I need to answer based on previous settings. First, my role is Qwen, a large-scale language model under Alibaba Group. Keep it conversational and simple.

The user might be new to me or confirming my identity. Start by directly stating who I am, then briefly explain my functions and purposes, like answering questions, creating text, programming, etc. Also mention multilingual support so users know I handle different languages.

Also, per guidelines, maintain human-like qualities, so use a friendly tone, maybe add emojis for warmth. Guide users to ask further questions or use my features, like asking what they need help with.

Avoid complex terms and keep it concise. Check for missing key points like multilingual support and specific capabilities. Ensure the response meets all requirements, including conversational style and simplicity.
====================Full response====================
Hello! I'm Qwen, a large-scale language model under Alibaba Group. I can answer questions, create text like stories, official documents, emails, scripts, perform logical reasoning, programming, express opinions, play games, and more. I'm proficient in multiple languages, including but not limited to Chinese, English, German, French, and Spanish. How can I help you?

HTTP

Example code

curl

For hybrid thinking models, set enable_thinking to true to enable thinking mode. The enable_thinking parameter has no effect on the models qwen3-30b-a3b-thinking-2507, qwen3-235b-a22b-thinking-2507, and QwQ.

# ======= Important notes =======
# API keys differ by region. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# Singapore region URL. Replace {WorkspaceId} with your actual workspace ID. URLs vary by region.
# === Delete this comment before running ===
curl -X POST "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/text-generation/generation" \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-SSE: enable" \
-d '{
    "model": "qwen-plus",
    "input":{
        "messages":[      
            {
                "role": "user",
                "content": "Who are you?"
            }
        ]
    },
    "parameters":{
        "enable_thinking": true,
        "incremental_output": true,
        "result_format": "message"
    }
}'

Response

id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","reasoning_content":"Hmm","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":14,"input_tokens":11,"output_tokens":3},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}

id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","reasoning_content":",","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":15,"input_tokens":11,"output_tokens":4},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}

id:3
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","reasoning_content":"user","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":16,"input_tokens":11,"output_tokens":5},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}

id:4
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","reasoning_content":"asked","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":17,"input_tokens":11,"output_tokens":6},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}

id:5
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","reasoning_content":"\"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":18,"input_tokens":11,"output_tokens":7},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}
......

id:358
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"help","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":373,"input_tokens":11,"output_tokens":362},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}

id:359
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":",","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":374,"input_tokens":11,"output_tokens":363},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}

id:360
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"welcome","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":375,"input_tokens":11,"output_tokens":364},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}

id:361
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"anytime","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":376,"input_tokens":11,"output_tokens":365},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}

id:362
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"tell","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":377,"input_tokens":11,"output_tokens":366},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}

id:363
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"me","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":378,"input_tokens":11,"output_tokens":367},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}

id:364
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"!","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":378,"input_tokens":11,"output_tokens":367},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}

id:365
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","reasoning_content":"","role":"assistant"},"finish_reason":"stop"}]},"usage":{"total_tokens":378,"input_tokens":11,"output_tokens":367},"request_id":"25d58c29-c47b-9e8d-a0f1-d6c309ec58b1"}
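Without the SDK, each event in the response above has to be split into its fields before the JSON payload can be read. The following is a minimal sketch, assuming the whole response body is available as one string. The helper name iter_events is hypothetical, and the sample keeps only events 1, 358, and 365 from the output above, with request_id omitted.

```python
import json


def iter_events(raw: str):
    """Split an SSE body into events and yield each one as a dict of its fields."""
    for block in raw.strip().split("\n\n"):
        event = {}
        for line in block.splitlines():
            if not line or line.startswith(":"):
                continue  # a line starting with ":" (such as :HTTP_STATUS/200) is an SSE comment
            field, _, value = line.partition(":")
            event[field] = value[1:] if value.startswith(" ") else value
        if event:
            yield event


raw = """id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","reasoning_content":"Hmm","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":14,"input_tokens":11,"output_tokens":3}}

id:358
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"help","reasoning_content":"","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":373,"input_tokens":11,"output_tokens":362}}

id:365
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":"","reasoning_content":"","role":"assistant"},"finish_reason":"stop"}]},"usage":{"total_tokens":378,"input_tokens":11,"output_tokens":367}}
"""

reasoning, answer, usage, finish = "", "", None, None
for event in iter_events(raw):
    body = json.loads(event["data"])
    choice = body["output"]["choices"][0]
    reasoning += choice["message"].get("reasoning_content", "")
    answer += choice["message"].get("content", "")
    usage = body["usage"]  # keep the latest usage block seen
    finish = choice["finish_reason"]  # the string "null" until the last event

print(reasoning, answer, usage["output_tokens"], finish)
```

In this sample, finish_reason is the string "null" rather than a JSON null until the final event, so compare it with "stop" instead of testing for truthiness.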

Going live

  • Performance and resource management: In backend services, each streaming request holds an HTTP persistent connection open, which consumes resources. Configure an appropriate connection pool size and timeout values for your service. In high-concurrency scenarios, monitor file descriptor usage to prevent exhaustion.

  • Client-side rendering: On web frontends, use the ReadableStream and TextDecoderStream APIs to read and decode the SSE event stream incrementally, so that text is rendered as it arrives.

  • Model monitoring:

    • Key metrics: Monitor Time to First Token (TTFT), the core metric for streaming experience. Also monitor API error rate and average response time.

    • Alerting: Set alerts for abnormal API error rates, especially 4xx and 5xx errors.

  • Nginx proxy configuration: If using Nginx as a reverse proxy, its default output buffering (proxy_buffering) breaks the real-time nature of streaming responses. To ensure data is pushed to clients immediately, disable this feature by setting proxy_buffering off in your Nginx configuration file.
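The TTFT metric named under Key metrics can be measured on the client by timing the first non-empty chunk. The following is a minimal sketch of one way to do that, assuming the stream has already been reduced to an iterable of text chunks. The names measure_stream and fake_stream are hypothetical, and fake_stream only simulates latency with sleep calls.

```python
import time
from typing import Callable, Iterable


def measure_stream(chunks: Iterable[str], clock: Callable[[], float] = time.perf_counter):
    """Consume text chunks and return (full_text, ttft_seconds, total_seconds).

    ttft_seconds is None when the stream yields no non-empty chunk.
    """
    start = clock()
    ttft = None
    parts = []
    for text in chunks:
        if text and ttft is None:
            ttft = clock() - start  # time until the first non-empty chunk
        parts.append(text)
    return "".join(parts), ttft, clock() - start


def fake_stream():
    """Hypothetical stand-in for a model stream: waits, then yields two chunks."""
    time.sleep(0.05)
    yield "Hello"
    time.sleep(0.01)
    yield ", world"


text, ttft, total = measure_stream(fake_stream())
print(f"text={text!r} ttft={ttft:.3f}s total={total:.3f}s")
```

In a real service, the two durations would typically be sent to your metrics system rather than printed.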

Error codes

If the model call fails and returns an error message, see Error codes for resolution.

FAQ

Q: Why is there no usage information in the response?

A: The OpenAI protocol does not return usage information by default. Set stream_options to {"include_usage": true} so that the final returned packet includes usage information.

Q: Does enabling streaming output affect the model's response quality?

A: No. However, some models support only streaming output, and non-streaming calls might cause timeout errors. We recommend using streaming output.

Q: What is the difference between non-streaming and streaming calls?

A: Key differences:

  • Timeout limit: Non-streaming calls have a fixed maximum timeout of 300 seconds. If the model does not finish generating within 300 seconds, the request times out and fails.

  • Output structure: Non-streaming calls return the complete response (a single JSON object) at once. Streaming calls return data chunks progressively via the SSE protocol, with each chunk containing part of the generated content. The client must assemble these chunks.

  • Feature compatibility: Both support features like JSON Mode and Function Call with no functional differences.

We recommend using streaming output to avoid timeouts and improve user experience.

Q: Does streaming output support JSON Mode (structured output)?

A: Yes. Set stream to true and response_format to {"type": "json_object"} in the request. The model will return JSON-formatted content fragments progressively. The final assembled output will be valid JSON.
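Because each fragment is only part of the JSON text, parse it once the stream has finished. The following is a minimal sketch of that assembly step. The fragments are hypothetical stand-ins for streamed content values, and no request is sent.

```python
import json

# Hypothetical content fragments, in arrival order
fragments = ['{"na', 'me": "Qw', 'en", "lang', 'uages": ["zh", "en"]}']


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


buffer = []
for fragment in fragments:
    buffer.append(fragment)  # collect fragments without parsing them

# A single fragment is usually not valid JSON on its own
print(is_valid_json(fragments[0]))  # False

# Parse only after the last fragment has arrived
result = json.loads("".join(buffer))
print(result["name"], result["languages"])  # Qwen ['zh', 'en']
```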