This document introduces workflow composition and node configuration procedures.
An AI workflow assembles multiple nodes, like LEGO bricks, into an end-to-end processing pipeline that runs automatically.
AI Workflows
AI workflows include preset workflows and custom workflows.
Preset Workflows
The following preset workflows are currently available (more will be added as business needs evolve):
Face Extraction + Video Summary (OTT)
Frame Extraction & Analysis (Custom)
Speech Extraction + ASR + Text Content Extraction + Frame Extraction & Analysis (Custom)
1. Face Extraction + Video Summary (OTT)
Workflow:

Use Case:
Quickly extract facial information from videos and generate concise summaries. Ideal for celebrity highlight reels in entertainment content or news anchor compilations.
Advantage:
One-click execution of “face detection + content summarization,” significantly reducing manual filtering effort.
2. Frame Extraction & Analysis (Custom)
Workflow:

Use Case:
Frame-by-frame video analysis, such as ad frame detection, visual compliance checks, or keyframe extraction.
Advantage:
Focused on frame-level content inspection, meeting high-precision visual analysis requirements.
3. Speech Extraction + ASR + Text Content Extraction + Frame Extraction & Analysis (Custom)
Workflow:

Use Case:
End-to-end processing that combines audio transcription, text extraction, and frame analysis. Well suited to compliance review of live stream replays or multi-dimensional analysis of long-form videos.
Advantage:
Covers the full pipeline from audio → text → visuals, eliminating the need to switch between multiple tools.
Custom Workflows
Custom workflows allow users to freely combine any of the five available AI processing nodes based on business needs:
Face Recognition
Speech Extraction + ASR
Text Content Extraction
Video Summary (OTT)
Frame Extraction & Analysis (Custom)
For detailed capabilities of each node, see: Node Capabilities Description
It is strongly recommended to use Speech Extraction + ASR, Text Content Extraction, and Frame Extraction & Analysis (Custom) together.
Text Content Extraction takes its required input from Speech Extraction + ASR, so a workflow that includes Text Content Extraction without Speech Extraction + ASR will fail during processing.
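The dependency between Text Content Extraction and Speech Extraction + ASR can be checked before a custom workflow is submitted. The following is a minimal sketch, not the product's API: the dict-of-upstreams workflow layout, the `REQUIRED_UPSTREAM` rule table, and the `validate` helper are all assumptions introduced for illustration. Only the node names come from the list above.

```python
from collections import deque

# Node names come from the node list; the workflow layout (node -> list of
# direct upstream nodes) is an assumed representation, not a real config format.
ASR = "Speech Extraction + ASR"
TEXT = "Text Content Extraction"
FRAMES = "Frame Extraction & Analysis (Custom)"

# Hypothetical rule table: node -> nodes that must appear somewhere upstream.
REQUIRED_UPSTREAM = {TEXT: {ASR}}


def ancestors(workflow, node):
    """Return every node reachable by following upstream edges from `node`."""
    seen = set()
    queue = deque(workflow.get(node, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(workflow.get(current, []))
    return seen


def validate(workflow):
    """Return a list of problems; an empty list means the rules are satisfied."""
    problems = []
    for node, required in REQUIRED_UPSTREAM.items():
        if node in workflow:
            missing = required - ancestors(workflow, node)
            for dep in sorted(missing):
                problems.append(f"'{node}' requires '{dep}' upstream")
    return problems


valid = {ASR: [], TEXT: [ASR], FRAMES: []}
invalid = {TEXT: [], FRAMES: []}

print(validate(valid))
print(validate(invalid))
```

Returning a list of problems rather than raising on the first one lets a caller report every missing dependency at once.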
Recommended Workflow Combinations (Typical Scenarios)
Scenario 1: Full-Chain Sequential Deep Analysis
Workflow:

Use Case:
Standardized, step-by-step deep structuring of video content where output from each stage feeds into the next.
Advantage:
Linear and transparent pipeline—easy to debug; enables comprehensive parsing across visual, audio, and semantic dimensions in a single task.
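The idea that each stage's output feeds the next, with intermediate results kept for debugging, can be sketched as follows. This is an illustration under stated assumptions: the stage functions are placeholders with made-up payloads, `run_sequential` is a hypothetical helper, and the stage order shown is illustrative only and does not reproduce the workflow diagram.

```python
def run_sequential(stages, source):
    """Run stages in order; each stage receives the previous stage's output."""
    trace = []  # (stage name, output) pairs kept so each step can be inspected
    data = source
    for name, stage in stages:
        data = stage(data)
        trace.append((name, data))
    return data, trace


# Placeholder stages standing in for real nodes (assumed, not the real nodes).
def extract_frames(video):
    return {"video": video, "frames": ["f0", "f1"]}


def transcribe(payload):
    return {**payload, "transcript": "hello world"}


def extract_text(payload):
    return {**payload, "keywords": payload["transcript"].split()}


stages = [("frames", extract_frames), ("asr", transcribe), ("text", extract_text)]
result, trace = run_sequential(stages, "demo.mp4")

for name, output in trace:
    print(name, sorted(output))
```

Because the trace records the output of every stage, a failure or unexpected result can be traced back to the first stage whose output looks wrong.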
Scenario 2: Sequential Audio-Visual Processing
Workflow:

Use Case:
Semantic-focused analysis without person identification (e.g., meeting transcripts or Vlog content analysis) while preserving key frames and transcribed speech.
Advantage:
Faster processing with reduced compute cost; optimized for “visual archiving + spoken semantics.”
Scenario 3: Parallel Audio and Video Processing
Workflow:

Use Case:
Most common media ingestion scenarios: identify who appears in the video and what is said, with no dependency between the two streams.
Advantage:
High-efficiency parallel execution—failure in one branch (e.g., face recognition) does not block results from the other (e.g., text extraction).
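Failure isolation between independent branches can be sketched with Python's standard `concurrent.futures` module. This is a minimal illustration, not how the service schedules nodes: `run_parallel` and the two branch functions are hypothetical placeholders, and the face branch is made to fail on purpose.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(branches, source):
    """Run independent branches concurrently and keep their failures separate."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn, source) for name, fn in branches.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:  # a failed branch is recorded, not re-raised
                errors[name] = str(exc)
    return results, errors


# Placeholder branches (assumed); the face branch simulates a failure.
def face_branch(video):
    raise RuntimeError("face model unavailable")


def speech_branch(video):
    return {"transcript": "hello world"}


results, errors = run_parallel(
    {"faces": face_branch, "speech": speech_branch}, "demo.mp4"
)
print(results)
print(errors)
```

The speech branch still returns its transcript even though the face branch raised, which is the behavior the advantage above describes.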
Scenario 4: Parallel Person Identification and Frame Analysis
Workflow:

Use Case:
Rapidly capture identity information while simultaneously performing frame sampling and speech analysis.
Advantage:
Decouples “person identification” from “content analysis,” ideal for time-sensitive operations.
Scenario 5: Frame-Driven Multimodal Analysis
Workflow:

Use Case:
Emphasizes visual content mining (e.g., extracting text from ad frames or recognizing PPT slides) while optionally including speech transcription.
Advantage:
Maximizes utilization of visual data; best suited for “visual-first” content like tutorials or advertisements.
Scenario 6: Complex Cross-Modal Dependency Processing
Workflow:

Use Case:
Advanced multimodal fusion scenarios requiring joint conditions, e.g., “retrieve clips where Person A appears AND says Keyword X.”
Advantage:
Captures the richest set of cross-modal correlations, enabling sophisticated semantic and contextual queries.
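A joint condition such as "Person A appears AND says Keyword X" reduces to intersecting two sets of time intervals once each node has produced timestamps. The sketch below assumes each result is a sorted list of non-overlapping `(start, end)` pairs in seconds; that data shape, the sample values, and the `intersect` helper are assumptions for illustration, not the service's output format.

```python
def intersect(a, b):
    """Intersect two sorted lists of non-overlapping (start, end) intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:  # the two intervals overlap for a non-zero duration
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Made-up sample data: when Person A is on screen, and when Keyword X is spoken.
person_a = [(0.0, 12.0), (30.0, 45.0)]
keyword_x = [(10.0, 11.5), (20.0, 22.0), (40.0, 50.0)]

clips = intersect(person_a, keyword_x)
print(clips)  # [(10.0, 11.5), (40.0, 45.0)]
```

The two-pointer scan visits each interval once, so the cost grows linearly with the number of detections rather than with their product.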