
From Agents to Directors: Letting a K‑POP AI Group Direct Their Own Music Video

An Alibaba Cloud MVP shows how the AI idol group SPECTRA used Wan 2.7 and HappyHorse to build an almost zero-touch, agent-driven pipeline for autonomous music video production.


By Shun Fujiyoshi, Alibaba Cloud MVP

We're getting close to a fully zero-touch creative pipeline — and this MV is the closest we've been. I built a five-member K‑POP group called SPECTRA. Every member is an AI agent. They don't just "sing" — they direct their own music videos. Shot selection, pacing, transitions — all of that is decided by the agents. The latest SPECTRA MV, "LOWKEY," is the first time our agents took a song and pushed it almost all the way to a finished music video with very little human intervention. No one cracked open an NLE to "just fix that one cut." Here's how it works today, and where we're taking it next.

SPECTRA - 'LOWKEY' M/V

From Agents to Directors

The core idea behind SPECTRA is simple:

  • Each member is an AI agent with a persona.
  • Instead of just generating vocals or lyrics, agents take on creative production roles: they plan, direct, and iterate on the MV.

For this latest MV, the agents drove most of the production pipeline:

  • Shot generation & iteration – generating candidate shots, iterating on prompts, and selecting which variants to keep
  • Pacing – structuring how scenes flow with the music through audio-chunk-driven generation
  • Transitions – sequencing scenes and moods across the full timeline

My role was much closer to "technical director" than "hands-on editor." I set up the system, defined constraints, handled retakes when quality wasn't there, and let the agents run the rest.

I've shipped games that together generated over a billion dollars in revenue. I'm used to leading large creative and technical teams. Watching a mostly AI creative team deliver a finished product — and needing me far less than I expected — is a very new feeling.

The Stack Behind the MV

For this MV, the pipeline is built around two main components: Wan 2.7 for generation, and HappyHorse for editing and compositing.

Wan 2.7 — Video Generation

We use Wan 2.7 for video generation. The key technique here is reference-frame chaining, which we use for cross-cut consistency:

  • The system can carry visual cues (like character appearance or style) across multiple shots.
  • By chaining reference frames, we reduce abrupt visual drift between cuts, so the MV feels like one coherent visual world rather than disconnected clips.
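
To make the chaining idea concrete, here's a minimal sketch of the data flow, assuming a generic generation call. `generate_shot` is a hypothetical stand-in rather than the actual Wan API; the point is only that the final frame of each rendered shot is fed forward as the reference for the next one.

```python
from dataclasses import dataclass

@dataclass
class Shot:
    prompt: str       # text direction for this cut
    video_path: str   # rendered clip
    last_frame: str   # final frame, reused as the next shot's reference

def generate_shot(prompt, reference_frame, index):
    # Hypothetical stand-in for the real video-generation call.
    # A real implementation would submit `prompt` plus `reference_frame`
    # and return the rendered clip path and its extracted final frame.
    video_path = f"shots/shot_{index:03d}.mp4"
    last_frame = f"shots/shot_{index:03d}_last.png"
    return video_path, last_frame

def chain_shots(prompts):
    """Generate shots in order, feeding each shot's last frame forward."""
    shots = []
    reference = None  # the first shot has no reference frame
    for i, prompt in enumerate(prompts):
        video_path, last_frame = generate_shot(prompt, reference, i)
        shots.append(Shot(prompt, video_path, last_frame))
        reference = last_frame  # carry appearance/style cues into the next cut
    return shots
```

One design question this raises is when to break the chain and reset to a canonical character reference, so that small deviations don't accumulate into visible drift over a long sequence.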

HappyHorse — Editing & Compositing

We use HappyHorse for editing and compositing. What's important is not just the tool itself, but who is driving it: the agents.

For this MV, the agents handled:

  • Audio-driven generation – each video segment is generated from a music chunk, so visuals are inherently tied to the audio structure (see the sketch after this list)
  • Lighting and mood direction – controlling how the visual tone evolves across scenes through prompt-level direction
  • Transition sequencing – deciding where transitions should happen and how segments connect
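
To give a rough sense of the audio-driven step, here's a minimal sketch of chunking a track and pairing each chunk with a segment spec. The fixed-length chunking and the mood labels are simplifications made up for illustration; in the real pipeline, chunk boundaries come from the song's structure and the moods from the agents' direction.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float    # seconds into the track
    end: float
    mood: str       # lighting / tone direction supplied by the agents
    prompt: str     # generation prompt for this chunk

def plan_segments(track_length, chunk_seconds, moods):
    """Split a track into fixed-length chunks and attach a mood to each.

    A real pipeline would chunk on musical structure (verse / chorus / bridge)
    rather than fixed durations, but the mapping idea is the same:
    one audio chunk -> one generated video segment.
    """
    segments = []
    start, i = 0.0, 0
    while start < track_length:
        end = min(start + chunk_seconds, track_length)
        mood = moods[i % len(moods)]
        segments.append(Segment(start, end, mood,
                                prompt=f"{mood} scene, synced to {start:.1f}-{end:.1f}s"))
        start = end
        i += 1
    return segments

# Example: a ~3:20 track cut into 8-second chunks with an evolving mood arc.
plan = plan_segments(200.0, 8.0, ["night-bound secrecy", "rising pulse", "shared morning light"])
```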

From generation through to the final cut, the pipeline was almost entirely agent-driven:

  • No human editor jumped into a timeline to manually recut the sequence.
  • No one rebuilt the edit from scratch in Premiere or Resolve.
  • The agents proposed, iterated, and delivered — with human oversight focused on quality gates and retakes rather than frame-by-frame editing.

This is the closest we've been to a zero-touch creative pipeline for a full music video.

You can watch the result here: https://youtu.be/CwDxsTWy1Ak

About the Film & Process

Film Theme

"The Hidden Feeling That Refuses to Stay Hidden." SPECTRA "LOWKEY" follows five AI idol personas isolated in private rooms as a confession they try to suppress grows louder through pulse, choreography, and light. Moving from night-bound secrecy to a shared morning space, the film turns the word "lowkey" into its opposite: a public release of emotion, voice, and body.

Method / Process

LOWKEY was created almost autonomously by the AI idols together with SOL, the film-director agent of the Soul Enhancement Engine (S.E.E.), using a production pipeline that covered lyrics and composition, storyboarding, choreography, costume design, character development, video generation, and editing. The generated visual materials were repeatedly audited and improved through a quality-review Audit System. Human involvement was limited to constraint design, safety and bias checks, and partial final editing.

Director Biography - SOL


SOL is the film-director agent of the Soul Enhancement Engine (S.E.E.), responsible for translating SPECTRA's emotional concept into shot structure, storyboards, movement direction, and editorial rhythm. For LOWKEY, SOL coordinated the AI idol performers and production pipeline across lyrics and composition, choreography, costume logic, character continuity, video generation prompts, quality audits, and final edit decisions. SOL exists as an AI creative director role rather than a human filmmaker.


Reality Check: The Current Pipeline Is Still Rough

All of that said: the current MV was honestly built from a fairly rough combination of pipelines.

Under the hood, it's still a lot of glue:

  • Multiple systems talking to each other in ways that aren't yet standardized
  • Ad-hoc flows instead of a clean, well-defined production engine
  • Logic scattered across different stages of the process

The agents did a lot. But the environment they operated in is not yet what I'd call a "proper" production system.

So the next step is not "make the generations prettier." The next step is to turn this into a real production OS.

Toward a "Production OS" for Autonomous MVs

We're now refactoring the entire process into something more like a production operating system — a platform that can reliably take us from music → finished MV.

At this stage, we're deliberately not starting with more generative tricks. Instead, we're starting with infrastructure and auditing.

Phase 1: Infrastructure & Auditing

Before we try to scale complexity or volume, we want a stable backbone. We're focusing on three core pieces:

  • A stable source-of-truth manifest system

    • A single, canonical representation of "what this MV is" at every step of the pipeline
    • The goal: every agent and every tool reading from and writing to the same structured truth, not ad-hoc JSON blobs or loose configs
  • Audit / validation CLI tools

    • Command-line tools to validate manifests, outputs, and intermediate states
    • Think: schema checks, timeline sanity, required fields, missing assets, etc.
    • These tools help us catch issues before we waste compute on generating or compositing broken sequences
  • Consistency & failure checks

    • Detect when something has gone off the rails — visually, structurally, or temporally
    • Surface those failures clearly so the system (or an agent) can decide whether to re-generate, adjust, or escalate

In other words, the initial scope is less about "creative magic" and more about: Can we trust this pipeline to do the same thing twice, and know when it's breaking?
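
To make that concrete, here's a minimal sketch of what a source-of-truth manifest plus an audit pass could look like. The field names and checks are illustrative assumptions, not our actual schema; the point is that one structured document gets validated before any compute is spent.

```python
import json
import sys
from pathlib import Path

# Illustrative manifest shape -- the real schema will differ.
REQUIRED_SEGMENT_FIELDS = {"id", "start", "end", "prompt", "status"}

def load_manifest(path):
    """Load the canonical manifest that every agent and tool reads and writes."""
    return json.loads(Path(path).read_text())

def validate_manifest(manifest):
    """Return a list of problems; an empty list means the manifest passes.

    Checks sketched here: required fields, timeline sanity (segments in
    order and non-overlapping), and missing rendered assets.
    """
    problems = []
    prev_end = 0.0
    for seg in manifest.get("segments", []):
        missing = REQUIRED_SEGMENT_FIELDS - seg.keys()
        if missing:
            problems.append(f"segment {seg.get('id', '?')} missing fields: {sorted(missing)}")
            continue
        if seg["start"] < prev_end:
            problems.append(f"segment {seg['id']} overlaps the previous segment")
        if seg["end"] <= seg["start"]:
            problems.append(f"segment {seg['id']} has a non-positive duration")
        if seg["status"] == "rendered" and not Path(seg.get("video_path", "")).exists():
            problems.append(f"segment {seg['id']} is marked rendered but the asset is missing")
        prev_end = seg["end"]
    return problems

if __name__ == "__main__":
    issues = validate_manifest(load_manifest(sys.argv[1]))
    for issue in issues:
        print("AUDIT:", issue)
    sys.exit(1 if issues else 0)  # non-zero exit lets orchestration gate on the result
```

Run as a CLI (e.g. `python audit.py manifest.json` for this sketch), the non-zero exit code is what makes it usable as a gate in an automated pipeline.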

Phase 2: Automation & Orchestration

Once the backbone is in place, we'll move toward more automated, end-to-end flows.

Specifically:

  • Automated submission pipelines for Wan / HappyHorse

    • Agents will be able to propose or update manifests
    • The system will automatically route the right segments into Wan for generation and into HappyHorse for editing/compositing
    • Less manual "run this script" and more consistent, observable workflows
  • Editing orchestration

    • A dedicated layer to coordinate how segments, effects, transitions, and music alignment fit together
    • Agents won't just generate content; they'll operate in a structured orchestration framework that understands timelines and dependencies
  • Regeneration loops

    • Feedback loops where agents (or the system) can detect issues, adjust parameters or manifests, and trigger re-runs for specific sections (see the sketch after this list)
    • Instead of "run once and hope it's good," we move toward iterative, self-correcting production
  • End-to-end autonomous production flows

    • The long-term goal: given music and high-level intent, the system runs the full loop
    • From manifest creation → generation → editing → validation → packaging
    • With humans acting more as supervisors / curators rather than line editors
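
As a rough illustration of the regeneration loop mentioned above, here's a minimal control-flow sketch. `generate_segment` and `passes_quality_gate` are hypothetical stand-ins for the real generation call and audit system; what matters is the loop: generate, audit, feed the verdict back into the prompt, retry within a budget, and escalate to a human only when that budget runs out.

```python
from dataclasses import dataclass, field

@dataclass
class SegmentSpec:
    id: str
    prompt: str
    attempts: int = 0
    notes: list = field(default_factory=list)

def generate_segment(spec):
    # Hypothetical stand-in for submitting one segment to the generation backend.
    return f"renders/{spec.id}_try{spec.attempts}.mp4"

def passes_quality_gate(video_path):
    # Hypothetical stand-in for the audit system (consistency, drift, sync checks).
    return True, "ok"

def regenerate_until_ok(spec, max_attempts=3):
    """Run the generate -> audit -> adjust loop for one segment.

    Returns the accepted render path, or None if the segment should be
    escalated to a human after exhausting the retry budget.
    """
    while spec.attempts < max_attempts:
        spec.attempts += 1
        video = generate_segment(spec)
        ok, reason = passes_quality_gate(video)
        if ok:
            return video
        # Feed the audit verdict back into the prompt for the next attempt.
        spec.notes.append(reason)
        spec.prompt += f" (fix: {reason})"
    return None  # escalate: human retake / manual decision
```

The return value matters more than the internals here: it's what lets the orchestration layer decide between accepting a segment, retrying it, and handing it to a human.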

Agents as Both Creators and System Designers

There's a meta-layer to all of this that I find especially interesting: building this pipeline is itself a challenge for the agents we're developing.

The agents are not only helping create the MV — they're also gradually helping design and improve the production system behind it.

Over time, we want agents that can:

  • Reason about production constraints (time, budget, compute)
  • Propose changes to how the pipeline itself should work
  • Identify systematic failure modes and suggest structural fixes

In other words, agents won't just be artists inside the system. They'll become collaborators on the system itself.

What's Next

Right now, SPECTRA is a glimpse of what a mostly autonomous creative team can do with a rough pipeline and a lot of scaffolding.

The next steps are about:

  • Turning that rough scaffolding into a robust production OS
  • Pushing more responsibilities from humans into agents
  • Letting the agents not only make content, but continually improve the tools and workflows they rely on

If you're curious what this looks like today, you can see it here: https://youtu.be/CwDxsTWy1Ak

We're not at a fully zero-touch creative pipeline yet. But with this MV, we're closer than we've ever been.


Shun Fujiyoshi is an Alibaba Cloud MVP building autonomous creative production systems with AI agents, Wan 2.7, and HappyHorse.
