
Intelligent Media Services: Create and manage workflow templates

Last Updated: Apr 22, 2025

In AI real-time interaction, a workflow consists of a sequence of nodes, each dedicated to a task such as speech-to-text (STT), text-to-speech (TTS), LLM text generation, or avatar integration. An AI agent follows a structured workflow to interact with end users. Intelligent Media Services (IMS) provides preset workflow templates for multiple scenarios, including audio calls and avatar calls. This topic describes how to create and configure a real-time workflow template.

Limitations

  • Preset templates are accessible to all users.

  • Custom templates are exclusive to their creators.

  • Preset templates cannot be deleted.

Workflow types

There are four types of workflows: audio call, avatar call, visual call, and messaging. Each workflow type has preset nodes, which you can configure as needed. A minimal sketch of how the nodes hand data to one another follows the table.


Audio call

Ideal for one-on-one or group calls. Users can talk with intelligent assistants and obtain instant feedback and services.

  1. Start (Output over RTC): Pulls the audio streams of users over RTC. 

  2. Speech-to-text: Converts audio streams into text.

  3. LLM: Sends the converted text to an LLM for processing.

  4. Text-to-speech: Converts the processed text to audio streams.

  5. End (Stream Ingest over RTC): Ingests the converted audio streams over RTC. 

Avatar call

Users can make video calls with a 3D avatar, which features rich body language and facial expressions. It creates more engaging and realistic interactions.

  1. Start (Output over RTC): Pulls the audio and video streams of users over RTC. 

  2. Speech-to-text: Converts audio streams into text.

  3. LLM: Sends the converted text to an LLM for processing.

  4. Text-to-speech: Converts the processed text to audio streams.

  5. 3D Avatar: Integrates an avatar from Faceunity or Tongyi Xingchen and generates video streams where the avatar moves, reacts, and performs lip movements synchronized with the processed text and audio.

  6. End (Stream Ingest over RTC): Packages the audio and video streams and ingests them over RTC. 

Visual call

In video calls with users, the AI agent analyzes camera feeds. This process involves object recognition, scene understanding and segmentation, semantic analysis, and behavior recognition. It can understand the contextual setting of objects.

  1. Start (Output over RTC): Pulls the audio and video streams of users over RTC. 

  2. Video processing:

    1. Frame Extractor: Extracts frames from video streams.

    2. Speech-to-text: Converts audio streams into text.

  3. Multimodal LLM: Processes the extracted frames and the converted text with a multimodal model.

  4. Text-to-speech: Converts the processed text to audio streams.

  5. End (Stream Ingest over RTC): Ingests the converted audio streams over RTC. 

Messaging

Users can communicate with agents through voice or text messages.

  1. LLM: Sends the input text to the LLM for processing and analysis.

  2. (Optional) Speech-to-text: Converts audio streams into text.

  3. (Optional) Text-to-speech: Converts the processed text to audio streams.
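
Conceptually, each workflow is a pipeline in which the output of one node feeds the next. The following minimal, runnable sketch illustrates the audio-call pipeline from the table above; every function in it is a hypothetical stand-in for a managed IMS node, not an actual IMS API.

  def pull_rtc_audio() -> bytes:
      # Start node: pull the user's audio over RTC.
      return b"...pcm frames..."

  def speech_to_text(audio: bytes) -> str:
      # STT node: transcribe the audio into text.
      return "What is the weather today?"

  def llm_generate(prompt: str) -> str:
      # LLM node: generate a text response.
      return f"Echo: {prompt}"

  def text_to_speech(text: str) -> bytes:
      # TTS node: synthesize the response as audio.
      return text.encode()

  def ingest_rtc_audio(audio: bytes) -> None:
      # End node: ingest the synthesized audio back over RTC.
      print(f"ingesting {len(audio)} bytes over RTC")

  # One turn of the workflow: audio in -> text -> LLM response -> audio out.
  ingest_rtc_audio(text_to_speech(llm_generate(speech_to_text(pull_rtc_audio()))))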

Create a template in the console

  1. Log on to the IMS console. In the left-side navigation pane, choose Intelligent Production > AI Real-time Interaction > Workflows. Click Create Workflow Template.


  2. Configure the basic information.

    Customize the workflow name. You will reference it when creating an AI agent.


  3. Choose a workflow type and configure the nodes.

    Each workflow comes with preset nodes. You cannot add or remove nodes, but you can edit them.

    Speech-to-text

    This node converts input audio into text. Multiple languages are supported. You can use the preset STT model, a third-party plug-in, or a self-developed model.


    • Preset: The preset model supports the following configurations:

      • Language Model: Select the language based on the input.

      • Silent Time: When there is no voice input, the STT node waits for the specified period before transmitting the transcribed text to the LLM.

      • Custom Hotword: Upload hotword files to ensure accurate recognition of specific terms or phrases related to your business.

    • Third-party plug-in: Only iFLYTEK is supported. Visit the iFLYTEK official website to learn more.

    • Self-developed: To integrate a self-developed STT model, see STT standard interface.

      Note

      To use a self-developed model, whitelisting is required. For more information, join the DingTalk group to contact us.

    LLM

    This node uses a pre-trained LLM to understand the input text converted by the STT node and generate text responses.


    You have the following options:

    • Use the preset Qwen model

    • Integrate a model from Alibaba Cloud Model Studio

    • Integrate a model from Tongyi Xingchen

    • Access a self-developed model

    Alibaba Cloud Model Studio

    Alibaba Cloud Model Studio is an all-in-one platform for foundation model development and application building. You can integrate a model from the Model Center or the Application Center of Model Studio.

    • Model Center: Deploy a model in Model Studio and get the ModelId and API key.

    • Application Center: Create an agent application in Model Studio and get its AppId.


      Click Call, and then click API-KEY in the upper-right corner to get the API key.

    Note

    To learn how to access Model Studio in AI real-time interaction, see Pass through business parameters to Alibaba Cloud Model Studio.
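
    Before entering the values in the console, you can optionally sanity-check the ModelId and API key with a direct call. The sketch below assumes Model Studio's OpenAI-compatible mode and the dashscope.aliyuncs.com endpoint; verify the endpoint URL and model name for your region against the Model Studio documentation.

      from openai import OpenAI

      client = OpenAI(
          api_key="sk-...",  # the Model Studio API key obtained above
          base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint; check your region
      )

      resp = client.chat.completions.create(
          model="qwen-plus",  # the ModelId you deployed or selected
          messages=[{"role": "user", "content": "Say hello in one sentence."}],
      )
      print(resp.choices[0].message.content)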

    Tongyi Xingchen

    Tongyi Xingchen lets you customize an AI agent with a special persona. Combined with the real-time voice communication capability, it can engage in interactions across various specified scenarios.

    To integrate a model from Tongyi Xingchen, specify ModelId and API-KEY.

    • ModelId: Tongyi Xingchen supports xingchen-lite, xingchen-base, xingchen-plus, xingchen-plus-v2, and xingchen-max.

    • API-KEY: Visit Tongyi Xingchen and create an API key.

    Self-developed model

    Integrate a self-developed model into the workflow based on the OpenAI specifications:

    • ModelId: The model field, which specifies the model name. Sample value: abc

    • API-KEY: The api_key field. Sample value: AUJH-pfnTNMPBm6iWXcJAcWsrscb5KYaLitQhHBLKrI

    • HTTPS URL of Destination Model: The base_url field, which specifies the service request address. Sample value: http://www.abc.com

    For more information on self-developed LLMs, see Access LLMs.
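
    The three fields map directly onto an OpenAI-style chat completion request. The following sketch shows the mapping using the sample values above (not working credentials); depending on how your service defines base_url, the route may also need a /v1 prefix.

      import requests

      BASE_URL = "http://www.abc.com"  # HTTPS URL of Destination Model -> base_url
      MODEL_ID = "abc"                 # ModelId -> model
      API_KEY = "AUJH-pfnTNMPBm6iWXcJAcWsrscb5KYaLitQhHBLKrI"  # API-KEY -> api_key

      resp = requests.post(
          f"{BASE_URL}/chat/completions",  # or f"{BASE_URL}/v1/chat/completions"
          headers={"Authorization": f"Bearer {API_KEY}"},
          json={
              "model": MODEL_ID,
              "messages": [{"role": "user", "content": "Hello"}],
              "stream": False,
          },
          timeout=30,
      )
      print(resp.json()["choices"][0]["message"]["content"])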

    Text-to-speech

    This node converts processed text back into speech, allowing users to hear the responses.


    The following options are available:

    • Preset Template: You can select a voice. For voice effects, see Intelligent voice samples.

    • Self-developed Template: Integrate your self-developed model into the workflow. For specific requirements, see Access TTS models.

    • Third-party Plug-in: Only the MiniMax Speech Model is supported. This model caters to complex productivity and multilingual dialogue scenarios, supporting a context window of 245k. For more information, see MiniMax model.

    • Alibaba Cloud Model Studio: If you want to customize the agent's voice, access Alibaba Cloud Model Studio.

    You can also select specific symbols to prevent the agent from vocalizing them.


    Avatar

    This node generates the video stream of the avatar, which moves and speaks according to the processed text and audio, with rich facial expressions.


    Avatars from Faceunity and Tongyi Xingchen are supported.

    • Faceunity: Contact Faceunity customer support to activate the service and obtain the AppId, AppKey, and AvatarId.

    • Tongyi Xingchen: Go to the Tongyi Xingchen platform to obtain the ModelId, AppKey, and AvatarId.

      Note

      To integrate an avatar from Tongyi Xingchen, you need to request whitelisting. For more information, join the DingTalk group to contact us.

    To learn more about avatar integration, see Avatar integration.

    Frame Extractor

    This node extracts single or multiple frames from the video for further processing.
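
    IMS runs this node for you inside the workflow; the following sketch only illustrates the operation itself, sampling roughly one frame per second with OpenCV from an assumed local file or stream URL.

      import cv2

      cap = cv2.VideoCapture("input.mp4")    # hypothetical input; a stream URL also works
      fps = cap.get(cv2.CAP_PROP_FPS) or 25  # fall back to 25 if FPS is unavailable
      interval = int(fps)                    # keep about one frame per second

      frames, index = [], 0
      while True:
          ok, frame = cap.read()
          if not ok:
              break
          if index % interval == 0:
              frames.append(frame)           # this frame would go to the MLLM node
          index += 1
      cap.release()
      print(f"extracted {len(frames)} frames")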


    Multimodal LLM

    The multimodal LLM (MLLM) interprets the input images and text to generate text responses.


    You can select a preset Qwen model or integrate a self-developed model following the OpenAI specifications. For more information, see Access MLLMs.
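
    For a self-developed model, following the OpenAI specifications means the request is a standard chat completion whose message content mixes text and image parts. The endpoint, model name, and key in this sketch are placeholders for your own deployment, not values from IMS.

      from openai import OpenAI

      client = OpenAI(api_key="sk-...", base_url="https://your-mllm-endpoint/v1")  # placeholders

      resp = client.chat.completions.create(
          model="your-mllm",  # placeholder model name
          messages=[{
              "role": "user",
              "content": [
                  {"type": "text", "text": "What is shown in this frame?"},
                  {"type": "image_url", "image_url": {"url": "https://example.com/frame.jpg"}},
              ],
          }],
      )
      print(resp.choices[0].message.content)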

  4. Click Save.

Manage workflow templates

All workflow templates are displayed on the Real-time Workflow Template page. You can perform the following operations on a template:

  • View details: Click Manage in the Actions column.

  • Modify: On the workflow details page, click Modify in the upper-right corner to change the workflow name and node configurations.

  • Delete: Click Delete in the Actions column.

    • The preset templates cannot be deleted. 

    • A custom workflow template cannot be deleted while an AI agent is running based on it.

Use workflow templates

Initiate a workflow in the IMS console:

When you create an AI agent, select a workflow template to automate the processing of audio and video streams over RTC. Tasks such as STT, TTS, and intelligent conversation are then performed according to the template.