Node Configuration Guide - SuperApp - Alibaba Cloud Documentation Center

Configure AI video processing workflows by composing nodes that handle face detection, speech transcription, frame sampling, and text extraction.

An AI workflow chains multiple nodes into an end-to-end processing pipeline. Each node performs a discrete AI task — face detection, speech transcription, frame sampling, or text extraction — and you combine them to match your use case.

AI workflows

AI workflows come in two forms: preset workflows and custom workflows.

Preset workflows

Three preset workflows are available. Each bundles a fixed set of nodes optimized for a common scenario:

Face Extraction + Video Summary (OTT)
Frame Extraction & Analysis (Custom)
Speech Extraction + ASR + Text Content Extraction + Frame Extraction & Analysis (Custom)

1. Face Extraction + Video Summary (OTT)

Use case: Extract facial information from video and generate a concise summary. Best for over-the-top (OTT) content such as celebrity highlight reels or news anchor compilations.
Advantage: Combines face detection and content summarization in a single run, reducing manual filtering effort.

2. Frame Extraction & Analysis (Custom)

Use case: Frame-by-frame video analysis — ad frame detection, visual content moderation, or keyframe extraction.
Advantage: Focused on frame-level content inspection, suited for high-precision visual analysis.

3. Speech Extraction + ASR + Text Content Extraction + Frame Extraction & Analysis (Custom)

Use case: End-to-end pipeline combining audio transcription, text extraction, and frame analysis. Suitable for content moderation of live stream replays or multi-dimensional analysis of long-form videos.
Advantage: Covers the full audio → text → visual pipeline in one workflow, removing the need to coordinate multiple tools.

Custom workflows

Custom workflows let you combine any of the five available AI processing nodes to fit your specific use case:

Face Recognition
Speech Extraction + ASR
Text Content Extraction
Video Summary (OTT)
Frame Extraction & Analysis (Custom)

For the capabilities of each node, see Node Capabilities Description.

Note

Use Speech Extraction + ASR (Automatic Speech Recognition), Text Content Extraction, and Frame Extraction & Analysis (Custom) together when your workflow requires text analysis from speech.
Speech Extraction + ASR produces the transcription that Text Content Extraction requires as input. If Text Content Extraction runs without a preceding Speech Extraction + ASR node, the task fails.
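
The dependency rule in this note can be checked before a task is submitted. The following Python code is a minimal sketch, not the product's API. It assumes a workflow is represented as a hypothetical mapping from each node name to the list of nodes that directly precede it. The function names `upstream_nodes` and `missing_asr_upstream` are illustrative only.

```python
from typing import Dict, List, Set

ASR = "Speech Extraction + ASR"
TEXT = "Text Content Extraction"


def upstream_nodes(workflow: Dict[str, List[str]], node: str) -> Set[str]:
    """Return every node that runs before `node`, directly or transitively."""
    seen: Set[str] = set()
    stack = list(workflow.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(workflow.get(current, []))
    return seen


def missing_asr_upstream(workflow: Dict[str, List[str]]) -> bool:
    """True if Text Content Extraction is present without an upstream ASR node."""
    if TEXT not in workflow:
        return False
    return ASR not in upstream_nodes(workflow, TEXT)


# Hypothetical workflows: each key lists the nodes that directly precede it.
valid = {ASR: [], TEXT: [ASR], "Frame Extraction & Analysis (Custom)": []}
invalid = {TEXT: [], "Frame Extraction & Analysis (Custom)": []}

print(missing_asr_upstream(valid))    # False
print(missing_asr_upstream(invalid))  # True
```

Running a check like this on the client side surfaces the misconfiguration before the task is created, rather than after it fails.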

Recommended workflow combinations

Use the following table to select a combination based on your scenario. For details on each combination, see the sections below.

Scenario	Pattern	Best for	Trade-off
1. Full-chain sequential deep analysis	Sequential	Step-by-step deep structuring where each stage feeds the next	Longer end-to-end latency; failures in early stages block later ones
2. Sequential audio-visual processing	Sequential	Semantic analysis without person identification (meeting transcripts, vlogs)	No face data; not suitable when identity tracking is required
3. Parallel audio and video processing	Parallel	Most media ingestion — identify who appears and what is said independently	Requires conflict resolution if results from both branches must be merged
4. Parallel person identification and frame analysis	Parallel	Time-sensitive identity capture alongside frame sampling and speech analysis	Higher resource usage from simultaneous execution
5. Frame-driven multimodal analysis	Visual-first + optional speech	Visual content mining — ad frames, PPT slide extraction — with optional audio	Not optimized for identity tracking
6. Complex cross-modal dependency processing	Sequential with cross-modal joins	Advanced fusion — retrieve clips where a specific person appears and says a keyword	Most complex to configure; higher compute cost

Scenario 1: Full-chain sequential deep analysis

Use case: Deep structuring of video content where each stage's output feeds the next — visual parsing, audio transcription, and semantic analysis in sequence.
Advantage: Linear, transparent pipeline that is easy to debug. Delivers comprehensive parsing across visual, audio, and semantic dimensions in a single task.
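
The sequential pattern described here, including its trade-off that an early failure blocks later stages, can be illustrated with a short Python sketch. The code does not call the actual service. The stage functions are hypothetical placeholders named after the three dimensions mentioned above, and `demo.mp4` is an invented file name.

```python
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Dict[str, Any]], Dict[str, Any]]]


def run_sequential(stages: List[Stage], data: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Run stages in order, feeding each stage's output to the next.

    Returns the last successful output and the names of completed stages.
    Stops at the first failure, so later stages never run.
    """
    completed: List[str] = []
    for name, stage in stages:
        try:
            data = stage(data)
        except Exception:
            break
        completed.append(name)
    return data, completed


# Placeholder stages (assumptions for illustration only).
def visual_parsing(d: Dict[str, Any]) -> Dict[str, Any]:
    return {**d, "frames": 3}


def audio_transcription(d: Dict[str, Any]) -> Dict[str, Any]:
    raise RuntimeError("simulated transcription failure")


def semantic_analysis(d: Dict[str, Any]) -> Dict[str, Any]:
    return {**d, "topics": []}


stages: List[Stage] = [
    ("visual parsing", visual_parsing),
    ("audio transcription", audio_transcription),
    ("semantic analysis", semantic_analysis),
]
result, completed = run_sequential(stages, {"video": "demo.mp4"})
print(completed)  # ['visual parsing']
```

Because the list of completed stages shows exactly where the chain stopped, a linear pipeline of this shape is straightforward to debug.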

Scenario 2: Sequential audio-visual processing

Use case: Semantic-focused analysis without person identification — meeting transcripts, vlog content analysis — while preserving keyframes and transcribed speech.
Advantage: Faster processing at lower compute cost. Suited for visual archiving combined with spoken-word semantics.

Scenario 3: Parallel audio and video processing

Use case: Most common media ingestion scenario — identify who appears in the video and what is said, with the two streams processed independently (fan-out/fan-in).
Advantage: High-efficiency parallel execution. A failure in one branch (such as face recognition) does not block results from the other (such as text extraction).
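
The fan-out/fan-in behavior and the failure isolation between branches can be sketched with Python's standard `concurrent.futures` module. This is an illustration under stated assumptions, not the service's implementation. The two branch functions are hypothetical stand-ins for face recognition and speech-to-text processing, and `demo.mp4` is an invented file name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def run_parallel(branches: Dict[str, Callable[[str], Any]], video: str) -> Dict[str, Dict[str, Any]]:
    """Fan out: run every branch on the same input.

    Fan in: collect one result per branch. A branch that raises is
    recorded as failed and does not affect the other branches.
    """
    results: Dict[str, Dict[str, Any]] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(branches))) as pool:
        futures = {name: pool.submit(fn, video) for name, fn in branches.items()}
        for name, future in futures.items():
            try:
                results[name] = {"ok": True, "value": future.result()}
            except Exception as exc:
                results[name] = {"ok": False, "error": str(exc)}
    return results


# Placeholder branches (assumptions for illustration only).
def face_branch(video: str) -> Any:
    raise RuntimeError("simulated face recognition failure")


def speech_branch(video: str) -> Any:
    return {"transcript": "hello world"}


results = run_parallel({"face": face_branch, "speech": speech_branch}, "demo.mp4")
print(results["face"]["ok"])     # False
print(results["speech"]["ok"])   # True
```

The speech branch still returns its transcript even though the face branch failed, which is the property this scenario relies on.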

Scenario 4: Parallel person identification and frame analysis

Use case: Capture identity information quickly while simultaneously running frame sampling and speech analysis.
Advantage: Decouples person identification from content analysis. Best suited for time-sensitive operations where latency matters.

Scenario 5: Frame-driven multimodal analysis

Use case: Visual content mining — extracting text from ad frames or recognizing PPT slides — with optional speech transcription.
Advantage: Maximizes visual data utilization. Best suited for visual-first content such as tutorials or advertisements.

Scenario 6: Complex cross-modal dependency processing

Use case: Advanced multimodal fusion requiring joint conditions — for example, retrieving clips where a specific person appears and says a particular keyword.
Advantage: Captures the richest set of cross-modal correlations, enabling sophisticated semantic and contextual queries across person identity, speech, and visual content simultaneously.
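
A joint condition such as "a specific person appears and says a particular keyword" amounts to intersecting time ranges from two modalities. The Python sketch below shows one way to do this on the client side. It assumes, hypothetically, that face results and keyword hits have already been reduced to sorted, non-overlapping `(start_seconds, end_seconds)` intervals. The actual output format of the nodes is not specified here, and the sample values are invented.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlaps between two sorted, non-overlapping interval lists."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Invented sample data: when the person is on screen, and when the keyword is spoken.
person_on_screen = [(0.0, 10.0), (20.0, 30.0)]
keyword_spoken = [(5.0, 8.0), (9.0, 22.0), (40.0, 41.0)]

clips = intersect(person_on_screen, keyword_spoken)
print(clips)  # [(5.0, 8.0), (9.0, 10.0), (20.0, 22.0)]
```

Each returned interval is a clip in which both conditions hold at the same time.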