Alibaba Cloud Model Studio: Audio understanding (Qwen3-Omni-Captioner)

Last Updated: Jan 06, 2026

Qwen3-Omni-Captioner is an open-source model built on Qwen3-Omni. Without requiring a prompt, it automatically generates accurate, comprehensive descriptions of complex audio, including speech, ambient sounds, music, and sound effects. It can identify speaker emotions, musical elements such as style and instruments, and sensitive information. This makes it well suited to audio content analysis, security audits, intent recognition, and video editing.

Scope

Supported regions

  • Singapore: Requires an API key from this region.

  • Beijing: Requires an API key from this region.
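Because each region has its own API key and its own endpoint, it can help to keep the endpoints in one place. The sketch below maps a region name to the base URLs that appear in the code samples of this document. The `ENDPOINTS` table and the `base_url_for` helper are illustrative names chosen for this example, not part of any SDK.

```python
# Base URLs as they appear in the code samples of this document.
# ENDPOINTS and base_url_for are hypothetical helper names, not SDK APIs.
ENDPOINTS = {
    "singapore": {
        "openai": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
        "dashscope": "https://dashscope-intl.aliyuncs.com/api/v1",
    },
    "beijing": {
        "openai": "https://dashscope.aliyuncs.com/compatible-mode/v1",
        "dashscope": "https://dashscope.aliyuncs.com/api/v1",
    },
}


def base_url_for(region: str, api_style: str = "openai") -> str:
    """Return the base URL for a region ("singapore" or "beijing")."""
    try:
        return ENDPOINTS[region.lower()][api_style]
    except KeyError:
        raise ValueError(
            f"Unsupported region/API style: {region!r}/{api_style!r}"
        ) from None


print(base_url_for("beijing"))
```

The returned value can be passed as `base_url` to the OpenAI client or assigned to `dashscope.base_http_api_url`, together with an API key issued for the same region.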

Supported models

International

In the International deployment mode, the endpoint and data storage are located in the Singapore region, and inference compute resources are scheduled dynamically worldwide (excluding the Chinese mainland).

  • Model name: qwen3-omni-30b-a3b-captioner

  • Context window: 65,536 tokens

  • Maximum input: 32,768 tokens

  • Maximum output: 32,768 tokens

  • Input cost: $3.81 per million tokens

  • Output cost: $3.06 per million tokens

  • Free quota: 1 million tokens, valid for 90 days after you activate Alibaba Cloud Model Studio

Chinese mainland

In the Chinese mainland deployment mode, the endpoint and data storage are both located in the Beijing region, and inference compute resources are limited to the Chinese mainland.

  • Model name: qwen3-omni-30b-a3b-captioner

  • Context window: 65,536 tokens

  • Maximum input: 32,768 tokens

  • Maximum output: 32,768 tokens

  • Input cost: $2.265 per million tokens

  • Output cost: $1.821 per million tokens

  • Free quota: None

Token conversion rule for audio: Total tokens = Audio duration (in seconds) × 12.5. Audio shorter than one second is counted as one second.
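The conversion rule can be sketched as a small estimator. This is a rough illustration, not a billing tool: the helper names are made up for this example, the result is left unrounded because the rounding behavior is not specified here, and the sample responses in this document also report a few text tokens on top of the audio tokens.

```python
TOKENS_PER_SECOND = 12.5  # conversion rate from the rule above


def estimate_audio_tokens(duration_seconds: float) -> float:
    """Estimate audio tokens; audio under one second counts as one second."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    billable_seconds = max(duration_seconds, 1.0)
    return billable_seconds * TOKENS_PER_SECOND


def estimate_input_cost(duration_seconds: float, price_per_million: float) -> float:
    """Estimate the input cost in USD for one audio file."""
    return estimate_audio_tokens(duration_seconds) / 1_000_000 * price_per_million


# A 10-minute recording at the International input price per million tokens.
tokens = estimate_audio_tokens(600)
cost = estimate_input_cost(600, 3.81)
print(f"{tokens:.0f} tokens, about ${cost:.4f}")
```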

Getting started

Prerequisites

Qwen3-Omni-Captioner supports API calls only. It is not available for online testing in the Alibaba Cloud Model Studio console.

The following sample code shows how to analyze online audio specified by a URL. For more information, see how to submit local files and the limits on audio files.

OpenAI compatible

Python

import os
from openai import OpenAI

client = OpenAI(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"                    
                    }
                }
            ]
        }
    ]
)
print(completion.choices[0].message.content)

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces—likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise—likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction—possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen3-omni-30b-a3b-captioner",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                     }
            }]
        }]
});

console.log(completion.choices[0].message.content)

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces—likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise—likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction—possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following request URL is for the Singapore region. If you use a model in the Beijing region, replace it with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete these comments before running ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
          }
        }
      ]
    }
  ]
}'

Response

{
  "choices": [
    {
      "message": {
        "content": "The audio clip is a brief, low-fidelity recording-approximately six seconds long—captured in a small, reverberant indoor space, likely a home office or bedroom. It opens with a rapid, metallic, rhythmic hammering sound, repeating every 0.5 to 0.6 seconds, with each strike slightly uneven and accompanied by a short echo. This sound dominates the left side of the stereo field and is close to the microphone, suggesting the hammering is occurring nearby and slightly to the left.\n\nOverlaid with the hammering, a single male voice speaks in Mandarin Chinese, his tone clearly one of frustration and exasperation. He says, “Oh, with this, how am I supposed to work quietly?”. His speech is clear despite the poor audio quality, and is delivered in a standard, unaccented Mandarin, indicative of a native speaker from northern or central China.\n\nThe voice is more distant and centered in the stereo field, with more room reverberation than the hammering. The emotional content is palpable: his voice rises slightly at the end, turning the phrase into a rhetorical complaint, underscoring his irritation. No other voices, music, or ambient sounds are present; the only non-speech sounds are the hammering and the faint hiss of the recording device.\n\nThe combination of the environmental sound, the speaker’s language, and his tone strongly suggests a scenario of home office disruption—perhaps someone working from home is being disturbed by renovation or repair work happening nearby. The recording ends abruptly, mid-hammer, further emphasizing the spontaneous and candid nature of the capture.\n\nIn summary, the audio is a realistic, low-fidelity snapshot of a Mandarin-speaking man, likely in China, expressing frustration at being unable to work in peace due to nearby construction or repair activity, captured in a personal, indoor setting.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 160,
    "completion_tokens": 387,
    "total_tokens": 547,
    "prompt_tokens_details": {
      "audio_tokens": 152,
      "text_tokens": 8
    },
    "completion_tokens_details": {
      "text_tokens": 387
    }
  },
  "created": 1758002134,
  "system_fingerprint": null,
  "model": "qwen3-omni-30b-a3b-captioner",
  "id": "chatcmpl-f4155bf9-b860-49d6-8ee2-092da7359097"
}
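To show how the fields in this response fit together, the following sketch extracts the caption and the token counts from a response body of this shape. It works on a trimmed copy of the JSON above with the caption shortened, and `summarize_response` is a hypothetical helper name used only for this example.

```python
import json

# A trimmed copy of the response shown above (caption text shortened).
raw = """
{
  "choices": [
    {
      "message": {"content": "The audio clip is a brief, low-fidelity recording...", "role": "assistant"},
      "finish_reason": "stop",
      "index": 0
    }
  ],
  "usage": {
    "prompt_tokens": 160,
    "completion_tokens": 387,
    "total_tokens": 547,
    "prompt_tokens_details": {"audio_tokens": 152, "text_tokens": 8}
  }
}
"""


def summarize_response(body: dict) -> dict:
    """Pull the caption and token counts out of a chat.completion body."""
    choice = body["choices"][0]
    usage = body.get("usage", {})
    details = usage.get("prompt_tokens_details", {})
    return {
        "caption": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "audio_tokens": details.get("audio_tokens"),
        "text_tokens": details.get("text_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }


summary = summarize_response(json.loads(raw))
print(summary["audio_tokens"], summary["total_tokens"])
```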

DashScope

Python

import dashscope
import os

# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url="https://dashscope-intl.aliyuncs.com/api/v1"

messages = [
    {
        "role": "user",
        "content": [
            {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages
)

print("Output:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces—likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise—likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction—possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

Java

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // The following is the base URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .model("qwen3-omni-30b-a3b-captioner")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("Output:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Response

The audio clip begins with a sudden, loud, metallic clanking that dominates the soundstage, immediately indicating an industrial or workshop environment. The clanking is rhythmic, consistent, and has a sharp, resonant quality, suggestive of metal tools striking metal surfaces—likely a hammer, wrench, or similar instrument being used on a hard metal object. The sound is harsh and slightly distorted, with audible clipping on each impact, likely due to the microphone’s proximity and the high volume of the sound.
As the initial clanking fades, a male voice enters, speaking in Mandarin Chinese with a tone of exasperation and complaint. His voice is clear, close-mic’d, and free from distortion. He says: “Oh my, how can I possibly work quietly like this?”. His intonation is conversational, informal, and marked by a rising, questioning inflection, typical of everyday speech rather than performance or formal address. The accent is standard Putonghua, with no strong regional markers, suggesting he is a native Mandarin speaker from the northern or central regions of China.
During the speaker’s utterance, the metallic clanking resumes, overlapping with his voice. The timing and nature of these sounds indicate the speaker is directly reacting to the ongoing noise—likely caused by another person in the same space. The environment is acoustically “dry” with minimal echo, implying a small or medium-sized room with sound-absorbing materials, further supporting the workshop or industrial setting. There are no other background noises, music, or ambient sounds, and no evidence of a public or commercial space.
The recording quality is moderate: the microphone captures both the low-end thuds and the sharp metallic transients, but the loud clanking causes digital clipping, resulting in a harsh, “crunchy” distortion during the impacts. The speaker’s voice, however, remains clear and intelligible. The overall impression is of a candid, real-world interaction—possibly a worker or office employee complaining about an interruption in a noisy environment.
In summary, the audio depicts a Mandarin-speaking man in a workshop or industrial setting, reacting with frustration to ongoing metallic clanking that disrupts his work. The recording is informal, clear, and grounded in a context of manual labor or technical work, with no evidence of scripted performance, music, or extraneous activity.

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following request URL is for the Singapore region. If you use a model in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete these comments before running ===

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
                ]
            }
        ]
    }
}'

Response

{
  "output":{
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "The audio clip is a 6-second, high-fidelity recording set in a quiet, indoor environment. The primary sound is a male speaker, likely in his late teens to mid-20s, speaking Mandarin Chinese in a tone of mild exasperation. His speech is clear and natural, delivered in a conversational manner: “Oh, how can I possibly work quietly like this?”. The room is acoustically neutral, with no noticeable echo or background noise, suggesting a small, well-furnished space.\n\nOverlaying the speech is a persistent, rhythmic mechanical sound—a series of sharp, metallic clicks or clatters that repeat every 0.6 seconds. The sound is dry and lacks any reverberation, further supporting the inference that it is produced by a mechanical device very close to the microphone. The regularity and timbre of the sound suggest a small, metallic object (such as a key, coin, or pen) being repeatedly tapped or struck on a hard surface, rather than a larger or more complex machine.\n\nThe speaker’s complaint is a direct response to the mechanical noise, expressing frustration at being unable to concentrate or work in peace due to the disturbance. The tone is not angry or urgent, but rather one of resigned annoyance, typical of someone encountering a minor, persistent annoyance in a personal or domestic setting.\n\nThere are no other voices, music, or environmental cues present. The overall impression is of a brief, candid moment—perhaps a student, office worker, or someone in a quiet home environment—caught on microphone while complaining (to themselves or a nearby companion) about a distracting, repetitive noise. The recording is technically clean and focused, with all attention on the speaker and the mechanical sound, making it highly plausible that the clip was captured intentionally, possibly for a voice note, social media post, or as a sample for a sound effect library."
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "input_tokens_details": {
      "audio_tokens": 152,
      "text_tokens": 8
    },
    "total_tokens": 559,
    "output_tokens": 399,
    "input_tokens": 160,
    "output_tokens_details": {
      "text_tokens": 399
    }
  },
  "request_id": "d532f72c-e75b-4ffb-a1ef-d2465e758958"
}

How it works

  • Single-turn interaction: The model does not support multi-turn conversations. Each request is an independent analysis task.

  • Fixed task: The model's core task is to generate audio descriptions, in English only. You cannot use instructions, such as a system message, to change its behavior, for example to control the output format or the focus of the content.

  • Audio-only input: The model accepts only audio as input. You do not need to send a text prompt. The message format is fixed.

    Message format example

    OpenAI compatible

    messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                        }
                    }
                ]
            }
        ]

    DashScope

    messages = [
        {
            "role": "user",
            "content": [
                {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
            ]
        }
    ]
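Since the message format is fixed, the two layouts above can be produced by one small function. The following is a minimal sketch; `build_messages` is a hypothetical helper written for this example and is not provided by either SDK.

```python
def build_messages(audio_url: str, api_style: str = "openai") -> list:
    """Build the fixed single-turn message list for one audio input.

    build_messages is a hypothetical helper, not part of either SDK.
    """
    if api_style == "openai":
        part = {"type": "input_audio", "input_audio": {"data": audio_url}}
    elif api_style == "dashscope":
        part = {"audio": audio_url}
    else:
        raise ValueError(f"Unknown api_style: {api_style!r}")
    # One user message, one audio part, no text prompt and no system message.
    return [{"role": "user", "content": [part]}]


print(build_messages("https://example.com/sample.wav", "dashscope"))
```

The URL `https://example.com/sample.wav` is a placeholder. The result can be passed as `messages` to either call style shown earlier.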

Streaming output

With streaming output, the model generates and returns its result incrementally. The final result is the concatenation of these intermediate results. This lets you read the response while it is still being generated, which reduces waiting time.

OpenAI compatible

To enable streaming output with the OpenAI-compatible method, set the stream parameter to true in your request.

Python

import os
from openai import OpenAI

client = OpenAI(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                    }
                }
            ]
        }
    ],
    stream=True,
    stream_options={"include_usage": True},

)
for chunk in completion:
    # When stream_options.include_usage is True, the choices field of the last chunk is an empty list and must be skipped. Token usage is available from chunk.usage.
    # The truthiness check also skips deltas whose content is None or an empty string.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
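The chunk handling in the loop above can be tried offline. The sketch below joins the text deltas and keeps the usage from the final chunk; it uses stand-in objects built with `SimpleNamespace` instead of real SDK chunks, and `collect_stream` and `fake_chunk` are hypothetical names used only for this illustration.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join streamed text deltas and return (text, usage).

    Chunks with an empty choices list carry only usage; deltas whose
    content is None or "" carry no text.
    """
    parts = []
    usage = None
    for chunk in chunks:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts), usage


def fake_chunk(content=None, usage=None, with_choice=True):
    """Build a stand-in chunk for offline testing (not a real SDK object)."""
    choices = (
        [SimpleNamespace(delta=SimpleNamespace(content=content))]
        if with_choice
        else []
    )
    return SimpleNamespace(choices=choices, usage=usage)


stream = [
    fake_chunk(""),
    fake_chunk("The audio clip "),
    fake_chunk("begins with hammering."),
    fake_chunk(usage=SimpleNamespace(total_tokens=547), with_choice=False),
]
text, usage = collect_stream(stream)
print(text)
print(usage.total_tokens)
```

With a real client, the `completion` iterator from the sample above would be passed to `collect_stream` in place of the stand-in list.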

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);
const completion = await openai.chat.completions.create({
    model: "qwen3-omni-30b-a3b-captioner",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
                     },
            }]
        }],
    stream: true,
    stream_options: {
        include_usage: true
    },
});

for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        console.log(chunk.choices[0].delta.content);
    } else {
        console.log(chunk.usage);
    }
}

curl

# ======= Important =======
# The following request URL is for the Singapore region. If you use a model in the Beijing region, replace it with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete these comments before running ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
          }
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{
        "include_usage":true
    }
}'

DashScope

To use streaming output, call the model through the DashScope SDK or over HTTP. Set the parameters according to your calling method:

  • Python SDK: Set the stream parameter to True.

  • Java SDK: Use the streamCall method.

  • HTTP: In the header, set X-DashScope-SSE to enable.

By default, streaming output is non-incremental: each returned chunk contains all of the content generated so far. For incremental streaming output, set the incremental_output parameter (incrementalOutput for Java) to true.
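The difference between the two modes can be illustrated without calling the API. In the sketch below, the non-incremental chunks are made-up sample strings, and `cumulative_to_deltas` is a hypothetical helper that recovers the incremental pieces from them.

```python
def cumulative_to_deltas(snapshots):
    """Turn non-incremental chunks (each holds all text so far) into deltas."""
    previous = ""
    for snapshot in snapshots:
        if not snapshot.startswith(previous):
            raise ValueError("snapshot does not extend the previous text")
        yield snapshot[len(previous):]
        previous = snapshot


# Non-incremental mode: every chunk repeats the text generated so far.
non_incremental = ["The audio", "The audio clip", "The audio clip begins."]

# Incremental mode: every chunk carries only the newly generated text.
incremental = list(cumulative_to_deltas(non_incremental))
print(incremental)
```

In non-incremental mode the last chunk already holds the full text, whereas in incremental mode the pieces must be concatenated to obtain it.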

Python

import os
import dashscope

# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url="https://dashscope-intl.aliyuncs.com/api/v1"

messages = [
    {
        "role": "user",
        "content": [
            {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages,
    stream=True,
    incremental_output=True
)

full_content = ""
print("Streaming output:")
# Use a separate loop variable so the response generator is not shadowed.
for chunk in response:
    content = chunk["output"]["choices"][0]["message"].content
    if content:
        print(content[0]["text"])
        full_content += content[0]["text"]
print(f"Full content: {full_content}")

Java

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void streamCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                // qwen3-omni-30b-a3b-captioner supports only one audio file as input.
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{put("audio", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav");}}
                )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-omni-30b-a3b-captioner")
                .message(userMessage)
                .incrementalOutput(true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(item -> {
            try {
                List<com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult.Output.Choice.Message.Content> content = item.getOutput().getChoices().get(0).getMessage().getContent();
                // Check that the content exists and is not empty.
                if (content != null &&  !content.isEmpty()) {
                    System.out.println(content.get(0).get("text"));
                }
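            } catch (Exception e){
                // Report the failure instead of exiting silently with a success code.
                System.err.println(e.getMessage());
                System.exit(1);
            }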
        });
    }

    public static void main(String[] args) {
        try {
            streamCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following base_url is for the Singapore region. If you use a model in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
                ]
            }
        ]
    },
    "parameters": {
      "incremental_output": true
    }
}'
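
If you leave streaming in its default non-incremental mode, each chunk repeats everything generated so far, so the client has to work out what is new. The following is a minimal sketch of that step, not part of the SDK: iter_deltas is a hypothetical helper, and the sample strings stand in for the text field of successive chunks.

```python
from typing import Iterable, Iterator


def iter_deltas(cumulative_texts: Iterable[str]) -> Iterator[str]:
    """Yield only the newly generated suffix of each cumulative chunk."""
    seen = ""
    for text in cumulative_texts:
        if text.startswith(seen):
            delta = text[len(seen):]
        else:
            # Defensive fallback: the chunk does not extend the earlier text,
            # so treat the whole chunk as new.
            delta = text
        seen = text
        if delta:
            yield delta


# Simulated non-incremental stream: every chunk repeats the earlier text.
chunks = ["A drill", "A drill whirs", "A drill whirs in a room."]
deltas = list(iter_deltas(chunks))
print(deltas)
```

Setting incremental_output to true, as in the examples above, avoids this bookkeeping because the service then sends only the new text in each chunk.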

Send a local file (Base64 encoding or file path)

The model supports two methods for uploading a local file:

  • Upload using Base64 encoding

  • Pass the file path directly (recommended for more stable transmission)

Upload methods:

Send via file path

You can pass the file path directly to the model. This method is supported only by the DashScope Python and Java SDKs and is not supported for HTTP calls. Use the following table to determine the file path format for your programming language and operating system.

Specify the file path

System | SDK | Input file path | Example
Linux or macOS | Python SDK | file://{absolute_file_path} | file:///home/images/test.mp3
Linux or macOS | Java SDK | file://{absolute_file_path} | file:///home/images/test.mp3
Windows | Python SDK | file://{absolute_file_path} | file://D:/images/test.mp3
Windows | Java SDK | file:///{absolute_file_path} | file:///D:/images/test.mp3
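
The Python SDK rows of the table above can be sketched as a small helper. This is an illustration only: python_sdk_file_uri is a hypothetical name, not an SDK function, and it covers just the Python SDK format (the Java SDK on Windows uses file:/// instead).

```python
from pathlib import PurePosixPath, PureWindowsPath


def python_sdk_file_uri(absolute_path: str, windows: bool = False) -> str:
    """Build the file:// string the table lists for the Python SDK."""
    path = PureWindowsPath(absolute_path) if windows else PurePosixPath(absolute_path)
    if not path.is_absolute():
        raise ValueError(f"expected an absolute path, got: {absolute_path}")
    # as_posix() converts backslashes to forward slashes on Windows paths.
    return "file://" + path.as_posix()


print(python_sdk_file_uri("/home/images/test.mp3"))
print(python_sdk_file_uri(r"D:\images\test.mp3", windows=True))
```

The pure path classes are used so that the Windows branch can be checked on any operating system.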

Send via Base64 encoding

Convert the file into a Base64-encoded string, then pass it to the model.

Steps for sending a Base64-encoded string

  1. Encode the file: Convert the local audio file into a Base64 string.

    Example: Converting an audio file into a Base64 string

    import base64
    
    # Encoding function: converts a local file into a Base64-encoded string
    def encode_audio(audio_path):
        with open(audio_path, "rb") as audio_file:
            return base64.b64encode(audio_file.read()).decode("utf-8")
    
    # Replace xxxx/test.mp3 with the absolute path of your local audio file
    base64_audio = encode_audio("xxxx/test.mp3")

  2. Build a Data URL in the following format: data:;base64,{base64_audio}, where base64_audio is the Base64 string generated in the previous step.

  3. Call the model: Pass the Data URL using the audio parameter (DashScope SDK) or the input_audio parameter (OpenAI-compatible SDK).

Limitations:

  • Passing the file path directly is recommended for better stability. You can also use Base64 encoding for files smaller than 1 MB.

  • When you pass the file path directly, the audio file must be smaller than 10 MB.

  • When you send the file using Base64 encoding, the encoded string must be smaller than 10 MB. Base64 encoding increases the data size by roughly one third.
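
Because Base64 turns every 3 bytes into 4 characters, you can estimate the encoded size before reading the file into memory. The sketch below is a rough pre-check under stated assumptions: "10 MB" is read as 10 × 1024 × 1024 bytes, the short data:;base64, prefix is ignored, and base64_length and fits_base64_limit are hypothetical helper names.

```python
import base64

# Assumption: the documented "10 MB" limit is interpreted as 10 MiB.
LIMIT_BYTES = 10 * 1024 * 1024


def base64_length(raw_size: int) -> int:
    """Length of the padded Base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


def fits_base64_limit(raw_size: int, limit: int = LIMIT_BYTES) -> bool:
    """True if the encoded string stays below the limit."""
    return base64_length(raw_size) < limit


# Cross-check the formula against the standard library for a small input.
sample = b"x" * 1000
print(base64_length(len(sample)), len(base64.b64encode(sample)))
print(fits_base64_limit(7 * 1024 * 1024), fits_base64_limit(8 * 1024 * 1024))
```

Under these assumptions, a raw file of about 7.5 MB is the most that stays below the limit after encoding.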

Send via file path

Passing a file path is supported only by the DashScope Python and Java SDKs and is not supported for HTTP calls.

Python

import dashscope
import os

# The following base_url is for the Singapore region. If you use a model in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
# The full local file path must be prefixed with file:// to be valid, for example: file:///home/images/test.mp3
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"
messages = [
    {
        "role": "user",
        # Pass the file path, prefixed with file://, in the audio parameter.
        "content": [{"audio": audio_file_path}],
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-omni-30b-a3b-captioner", 
    messages=messages)

print("Output:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.util.Arrays;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // The following base_url is for the Singapore region. If you use a model in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void callWithLocalFile()
            throws ApiException, NoApiKeyException, UploadFileException {

        // Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
        // The full local file path must be prefixed with file:// to be valid, for example: file:///home/images/test.mp3
        // This example was tested on macOS. On Windows, use "file:///ABSOLUTE_PATH/welcome.mp3" instead.

        String localFilePath = "file://ABSOLUTE_PATH/welcome.mp3";
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{put("audio", localFilePath);}}
                ))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-omni-30b-a3b-captioner")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("Output:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            callWithLocalFile();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Send via Base64 encoding

OpenAI compatible

Python

import os
from openai import OpenAI
import base64

client = OpenAI(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following URL is for the Singapore region. If you use a model in the Beijing region, replace it with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

def encode_audio(audio_path):
    with open(audio_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")


# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
audio_file_path = "xxx/ABSOLUTE_PATH/welcome.mp3"
base64_audio = encode_audio(audio_file_path)

completion = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        # When sending a local file with Base64 encoding, you must use the data: prefix so that the file URL is valid.
                        # The "base64" keyword must precede the Base64-encoded data (base64_audio); otherwise, an error occurs.
                        "data": f"data:;base64,{base64_audio}"
                    },
                }
            ],
        },
    ]
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace it with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeAudio = (audioPath) => {
    const audioFile = readFileSync(audioPath);
    return audioFile.toString('base64');
};
// Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
const base64Audio = encodeAudio("xxx/ABSOLUTE_PATH/welcome.mp3")

const completion = await openai.chat.completions.create({
    model: "qwen3-omni-30b-a3b-captioner",
    messages: [
        {
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": { "data": `data:;base64,${base64Audio}`}
            }]
        }]
});

console.log(completion.choices[0].message.content);

curl

  • For information about how to convert a file into a Base64-encoded string, see the code example.

  • For demonstration purposes, the Base64 string "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...." is truncated. In practice, you must pass the complete encoded string.

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following base_url is for the Singapore region. If you use a model in the Beijing region, replace it with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...."
            
          }
        }
      ]
    }
  ]
}'

DashScope

Python

import os
import base64
import dashscope 

dashscope.base_http_api_url="https://dashscope-intl.aliyuncs.com/api/v1"
# Encoding function: converts a local file into a Base64-encoded string
def encode_audio(audio_file_path):
    with open(audio_file_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
audio_file_path = "xxx/ABSOLUTE_PATH/welcome.mp3"
base64_audio = encode_audio(audio_file_path)

messages = [
    {
        "role": "user",
        # When sending a local file with Base64 encoding, you must use the data: prefix so that the file URL is valid.
        # The "base64" keyword must precede the Base64-encoded data (base64_audio); otherwise, an error occurs.
        "content": [{"audio":f"data:;base64,{base64_audio}"}],
    }
]

response = dashscope.MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-omni-30b-a3b-captioner",
    messages=messages,
    )
print(response.output.choices[0].message.content[0]["text"])

Java

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following base_url is for the Singapore region. If you use a model in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    private static String encodeAudioToBase64(String audioPath) throws IOException {
        Path path = Paths.get(audioPath);
        byte[] audioBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(audioBytes);
    }

    public static void callWithLocalFile()
            throws ApiException, NoApiKeyException, UploadFileException,IOException{
        // Replace ABSOLUTE_PATH/welcome.mp3 with the actual path of your local file.
        String localFilePath = "ABSOLUTE_PATH/welcome.mp3";
        String base64Audio = encodeAudioToBase64(localFilePath);

        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                // When sending a local file with Base64 encoding, you must use the data: prefix so that the file URL is valid.
                // The "base64" keyword must precede the Base64-encoded data (base64Audio); otherwise, an error occurs.
                .content(Arrays.asList(
                        new HashMap<String, Object>(){{put("audio", "data:;base64," + base64Audio);}}
                ))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .model("qwen3-omni-30b-a3b-captioner")
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("Output:\n" + result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            callWithLocalFile();
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For information about how to convert a file into a Base64-encoded string, see the code example.

  • For demonstration purposes, the Base64 string "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...." is truncated. In practice, you must pass the complete encoded string.

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following base_url is for the Singapore region. If you use a model in the Beijing region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-omni-30b-a3b-captioner",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"audio": "data:;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5...."}
                ]
            }
        ]
    }
}'

API reference

For the input and output parameters of Qwen3-Omni-Captioner, see Qwen.

Error codes

If a call fails, see Error messages for troubleshooting.

FAQ

How do I compress an audio file to the required size?

  • Online tools: You can use an online tool such as Compresss to compress audio files.

  • Programmatic: You can use FFmpeg. For usage details, see the official FFmpeg website.

    # Basic conversion command (general template)
    # -y: Overwrites the output file if it already exists (takes no value).
    # -i: Specifies the input file path. Example: input.mp3
    
    # -b:a: Sets the audio bitrate.
    # Common values: 64k (low quality, for speech and low-bandwidth streaming), 128k (medium quality, for general audio and podcasts), 192k (high quality, for music and broadcasting).
    # A higher bitrate gives better audio quality and a larger file.
    
    # -ar: Sets the audio sample rate, that is, the number of samples per second.
    # Common values: 8000 Hz, 22050 Hz, 44100 Hz (standard sample rate).
    # A higher sample rate gives a larger file.
    
    # -ac: Sets the number of audio channels. Common values: 1 (mono), 2 (stereo). Mono files are smaller.
    
    # output.mp3: Specifies the output file path.
    
    ffmpeg -y -i input.mp3 -b:a 128k -ar 44100 -ac 1 output.mp3
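
To pick a value for -b:a, it can help to estimate the output size first. The sketch below uses the usual approximation size ≈ bitrate × duration / 8; it assumes a constant bitrate, ignores container and metadata overhead, and uses hypothetical helper names, so treat the results as rough guidance rather than exact figures.

```python
def estimated_size_bytes(bitrate_kbps: float, duration_seconds: float) -> float:
    """Approximate audio size for a constant bitrate (1 kbps = 1000 bits/s)."""
    return bitrate_kbps * 1000 * duration_seconds / 8


def max_bitrate_kbps(limit_bytes: float, duration_seconds: float) -> float:
    """Highest constant bitrate whose estimated size stays within limit_bytes."""
    return limit_bytes * 8 / (duration_seconds * 1000)


# One minute at 128 kbps.
print(estimated_size_bytes(128, 60))

# Assumption: the 10 MB local-file limit is read as 10 * 1024 * 1024 bytes.
limit = 10 * 1024 * 1024
print(round(max_bitrate_kbps(limit, 40 * 60), 1))
```

For example, a full 40-minute recording would need a bitrate of roughly 35 kbps or lower to stay under that limit, which is below the 64k value suggested for speech.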

Limitations

The model has the following limitations for audio files:

  • Duration: 40 minutes or less.

  • Number of files: Only one audio file is supported per request.

  • File format: Supported formats include AMR, WAV (CodecID: GSM_MS), WAV (PCM), 3GP, 3GPP, AAC, and MP3.

  • File input methods: A publicly accessible audio URL, Base64 encoding, or a local file path.

  • File size:

    • Public URL: No larger than 1 GB.

    • File path: The audio file must be smaller than 10 MB.

    • Base64 encoding: The Base64-encoded string must be smaller than 10 MB. For more information, see Send a local file.

    To compress a file, see How do I compress an audio file to the required size?
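
The limits above can be checked on the client before a request is sent. The following is a minimal sketch, not an official validator: check_audio and the constants are hypothetical, the extension set is an assumed mapping from the listed format names, "10 MB" and "1 GB" are read as binary units, and the duration must be supplied by the caller. For Base64 uploads, apply the size limit to the encoded string rather than to the raw file.

```python
import os
from typing import List

MAX_DURATION_SECONDS = 40 * 60        # "40 minutes or less"
MAX_LOCAL_BYTES = 10 * 1024 * 1024    # assumption: 10 MB read as 10 MiB
MAX_URL_BYTES = 1024 ** 3             # assumption: 1 GB read as 1 GiB
# Assumption: file extensions that correspond to the listed format names.
ASSUMED_EXTENSIONS = {".amr", ".wav", ".3gp", ".3gpp", ".aac", ".mp3"}


def check_audio(name: str, size_bytes: int, duration_seconds: float,
                via_url: bool = False) -> List[str]:
    """Return a list of limit violations; an empty list means none were found."""
    problems = []
    extension = os.path.splitext(name)[1].lower()
    if extension not in ASSUMED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append("duration exceeds 40 minutes")
    if via_url:
        if size_bytes > MAX_URL_BYTES:
            problems.append("file exceeds 1 GB for a public URL")
    elif size_bytes >= MAX_LOCAL_BYTES:
        problems.append("file must be smaller than 10 MB for local upload")
    return problems


print(check_audio("welcome.mp3", 5_000_000, 120))
print(check_audio("welcome.flac", 12_000_000, 3000))
```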