SuperApp: Node Configuration Guide

Last Updated: Apr 21, 2026

This document introduces workflow composition and node configuration procedures.

An AI workflow assembles multiple nodes, like LEGO bricks, into an end-to-end processing pipeline that runs automatically.

AI Workflows

AI workflows include preset workflows and custom workflows.

Preset Workflows

The following preset workflows are currently available (more will be added as business needs evolve):

  1. Face Extraction + Video Summary (OTT)

  2. Frame Extraction & Analysis (Custom)

  3. Speech Extraction + ASR + Text Content Extraction + Frame Extraction & Analysis (Custom)

1. Face Extraction + Video Summary (OTT)

  • Workflow:
    image

  • Use Case:
    Quickly extract facial information from videos and generate concise summaries—ideal for celebrity highlight reels in entertainment content or news anchor compilations.

  • Advantage:
    One-click execution of “face detection + content summarization,” significantly reducing manual filtering effort.

2. Frame Extraction & Analysis (Custom)

  • Workflow:
    image

  • Use Case:
    Frame-by-frame video analysis, such as ad frame detection, visual compliance checks, or keyframe extraction.

  • Advantage:
    Focused on frame-level content inspection, meeting high-precision visual analysis requirements.

3. Speech Extraction + ASR + Text Content Extraction + Frame Extraction & Analysis (Custom)

  • Workflow:
    image

  • Use Case:
    End-to-end processing that combines audio transcription, text extraction, and frame analysis—perfect for compliance review of live stream replays or multi-dimensional analysis of long-form videos.

  • Advantage:
    Covers the full pipeline from audio → text → visuals, eliminating the need to switch between multiple tools.

Custom Workflows

Custom workflows allow users to freely combine any of the five available AI processing nodes based on business needs:

  • Face Recognition

  • Speech Extraction + ASR

  • Text Content Extraction

  • Video Summary (OTT)

  • Frame Extraction & Analysis (Custom)

For detailed capabilities of each node, see: Node Capabilities Description

Note
  • It is strongly recommended to use Speech Extraction + ASR, Text Content Extraction, and Frame Extraction & Analysis (Custom) together.

  • Speech Extraction + ASR provides the required input for Text Content Extraction. Using them separately will result in processing failure.
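The dependency described in the note can be checked before a workflow is submitted. The following is a minimal sketch, assuming a custom workflow is represented as an ordered list of node names. The node identifiers and the validate_workflow helper are hypothetical and not part of the product API. The sketch encodes only the stated input dependency (Text Content Extraction needs Speech Extraction + ASR upstream); it does not check whether Speech Extraction + ASR may run on its own.

```python
# Hypothetical node identifiers; the product's real identifiers may differ.
FACE = "face_recognition"
ASR = "speech_extraction_asr"
TEXT = "text_content_extraction"
SUMMARY = "video_summary_ott"
FRAMES = "frame_extraction_analysis"

AVAILABLE_NODES = {FACE, ASR, TEXT, SUMMARY, FRAMES}

# Dependency stated in the note: Text Content Extraction needs ASR output.
REQUIRES = {TEXT: {ASR}}


def validate_workflow(nodes):
    """Return a list of problems found in an ordered list of node names."""
    problems = []
    seen = set()
    for node in nodes:
        if node not in AVAILABLE_NODES:
            problems.append(f"unknown node: {node}")
            continue
        # Any required node that has not appeared earlier is reported.
        missing = REQUIRES.get(node, set()) - seen
        for dep in sorted(missing):
            problems.append(f"{node} requires {dep} earlier in the workflow")
        seen.add(node)
    return problems


print(validate_workflow([ASR, TEXT, FRAMES]))  # recommended trio, in order
print(validate_workflow([TEXT, FRAMES]))       # Text Content Extraction without ASR
```

An empty list means the sketch found no problems; a non-empty list names each node whose required input is missing.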

Recommended Workflow Combinations (Typical Scenarios)

Scenario 1: Full-Chain Sequential Deep Analysis

  • Workflow:
    image

  • Use Case:
    Standardized, step-by-step deep structuring of video content where output from each stage feeds into the next.

  • Advantage:
    Linear and transparent pipeline—easy to debug; enables comprehensive parsing across visual, audio, and semantic dimensions in a single task.
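A sequential chain of this kind can be pictured as function composition. The sketch below assumes each node is a function that takes a shared result dictionary and returns it with its own output added. The stage functions are placeholders, and the actual node order for this scenario is defined by the workflow diagram, not by this example.

```python
from functools import reduce


def run_sequential(stages, task):
    """Feed the result of each stage into the next, in order."""
    return reduce(lambda result, stage: stage(result), stages, task)


# Placeholder stages: each reads what earlier stages produced.
def transcribe(result):
    return {**result, "transcript": f"transcript of {result['video']}"}


def extract_text(result):
    # Depends on the transcript produced by the previous stage.
    return {**result, "text": result["transcript"].upper()}


output = run_sequential([transcribe, extract_text], {"video": "demo.mp4"})
print(output["text"])
```

Because every stage receives the accumulated result of the stages before it, a failure can be traced to a single stage by inspecting the dictionary at that point.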

Scenario 2: Sequential Audio-Visual Processing

  • Workflow:
    image

  • Use Case:
    Semantic-focused analysis without person identification—e.g., meeting transcripts or Vlog content analysis—while preserving key frames and transcribed speech.

  • Advantage:
    Faster processing with reduced compute cost; optimized for “visual archiving + spoken semantics.”

Scenario 3: Parallel Audio and Video Processing

  • Workflow:
    image

  • Use Case:
    Most common media ingestion scenarios: identify who appears in the video and what is said, with no dependency between the two streams.

  • Advantage:
    High-efficiency parallel execution—failure in one branch (e.g., face recognition) does not block results from the other (e.g., text extraction).
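The failure isolation described here can be illustrated with a small sketch. It assumes each branch is an independent callable and uses Python's standard thread pool; the branch functions and the run_parallel helper are hypothetical stand-ins, not the product's scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(branches, video):
    """Run independent branches concurrently and collect per-branch outcomes."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn, video) for name, fn in branches.items()}
        for name, future in futures.items():
            try:
                results[name] = {"ok": True, "value": future.result()}
            except Exception as exc:  # a failed branch must not hide the others
                results[name] = {"ok": False, "error": str(exc)}
    return results


# Placeholder branches standing in for the real nodes.
def face_branch(video):
    raise RuntimeError("face recognition failed")


def speech_branch(video):
    return f"text extracted from {video}"


outcome = run_parallel({"faces": face_branch, "speech": speech_branch}, "demo.mp4")
print(outcome["faces"]["ok"], outcome["speech"]["value"])
```

Each branch's exception is caught where its result is collected, so the speech branch still returns its output even though the face branch raised.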

Scenario 4: Parallel Person Identification and Frame Analysis

  • Workflow:
    image

  • Use Case:
    Rapidly capture identity information while simultaneously performing frame sampling and speech analysis.

  • Advantage:
    Decouples “person identification” from “content analysis,” ideal for time-sensitive operations.

Scenario 5: Frame-Driven Multimodal Analysis

  • Workflow:
    image

  • Use Case:
    Emphasizes visual content mining—e.g., extracting text from ad frames or recognizing PPT slides—while optionally including speech transcription.

  • Advantage:
    Maximizes utilization of visual data; best suited for “visual-first” content like tutorials or advertisements.

Scenario 6: Complex Cross-Modal Dependency Processing

  • Workflow:
    image

  • Use Case:
    Advanced multimodal fusion scenarios requiring joint conditions—e.g., “retrieve clips where Person A appears AND says Keyword X.”

  • Advantage:
    Captures the richest set of cross-modal correlations, enabling sophisticated semantic and contextual queries.
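A joint condition such as “Person A appears AND says Keyword X” amounts to intersecting two sets of time ranges. The sketch below assumes, hypothetically, that the face and speech results can each be reduced to a sorted list of non-overlapping (start, end) intervals in seconds; the real output format of the nodes is not specified here.

```python
def intersect_intervals(a, b):
    """Return overlaps between two sorted lists of (start, end) intervals."""
    overlaps = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            overlaps.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlaps


# Hypothetical node outputs, in seconds.
person_a_on_screen = [(0.0, 12.5), (40.0, 55.0)]
keyword_x_spoken = [(10.0, 11.0), (30.0, 31.0), (50.0, 52.0)]

print(intersect_intervals(person_a_on_screen, keyword_x_spoken))
# [(10.0, 11.0), (50.0, 52.0)]
```

The keyword occurrence at 30 to 31 seconds is dropped because Person A is not on screen at that time.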