Foley Control

Video Guided Sound Effect Generation with a Frozen Latent Audio Model

Ciara Rowles¹, Varun Jampani¹, Simon Donné¹, Shimon Vainer¹, Julian Parker¹, Zach Evans¹
¹ Stability AI

Abstract

Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting compact video cross-attention after the model's existing text cross-attention, so prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text), and the bridge learns the audio–video dependency needed for synchronization without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video–audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving prompt-driven controllability and production-friendly modularity (swap/upgrade encoders or the T2A backbone without end-to-end retraining). Although we focus on Video-to-Foley, the same bridge design can potentially extend to other audio modalities (e.g., speech).
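The pooling step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the pooling operator or the token layout, so the code below assumes, purely for illustration, that the video encoder emits tokens shaped (time steps, patches per step, channels) and that pooling is a mean over the patch axis. The function name `pool_video_tokens` and all sizes are hypothetical.

```python
import numpy as np


def pool_video_tokens(tokens: np.ndarray) -> np.ndarray:
    """Average over the patch axis, keeping one token per time step.

    tokens: array of shape (time_steps, patches, channels).
    Returns an array of shape (time_steps, channels).
    Mean pooling over patches is an assumption, not the paper's stated operator.
    """
    if tokens.ndim != 3:
        raise ValueError("expected tokens of shape (time_steps, patches, channels)")
    return tokens.mean(axis=1)


# Hypothetical sizes, chosen only to show the effect on sequence length.
time_steps, patches, channels = 32, 256, 64
rng = np.random.default_rng(0)
video_tokens = rng.standard_normal((time_steps, patches, channels))

pooled = pool_video_tokens(video_tokens)

# Cross-attention scores have shape (audio_tokens, video_tokens), so the
# score matrix shrinks by the same factor as the video sequence.
audio_tokens = 1024
scores_before = audio_tokens * time_steps * patches
scores_after = audio_tokens * pooled.shape[0]
reduction = scores_before // scores_after
print(pooled.shape, reduction)
```

Under these assumed sizes the attention score matrix is smaller by a factor equal to the number of patches per time step, which is the kind of memory saving the abstract alludes to.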

Model Architecture

Overview of Foley Control: Video Guided Sound Effect Generation with a Frozen Latent Audio Model.

Diagram of the Foley Control cross-attention adapters between V-JEPA2 and Stable Audio DiT
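The adapter placement described above (video cross-attention inserted after the existing text cross-attention) can be sketched as follows. This is a simplified NumPy illustration, not the authors' implementation: it uses a single attention head, omits self-attention, normalization, and feed-forward layers, and every name and dimension (`dit_block_with_video`, `init_attention`, the token counts) is hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def init_attention(rng, query_dim, context_dim, inner_dim):
    """Random single-head projection weights (hypothetical initialization)."""
    scale = 0.02
    return {
        "w_q": rng.standard_normal((query_dim, inner_dim)) * scale,
        "w_k": rng.standard_normal((context_dim, inner_dim)) * scale,
        "w_v": rng.standard_normal((context_dim, inner_dim)) * scale,
        "w_o": rng.standard_normal((inner_dim, query_dim)) * scale,
    }


def cross_attention(x, context, p):
    """Single-head cross-attention: queries from x, keys/values from context."""
    q = x @ p["w_q"]
    k = context @ p["w_k"]
    v = context @ p["w_v"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v @ p["w_o"]


def dit_block_with_video(x, text_ctx, video_ctx, frozen_text_attn, video_attn):
    """Apply text cross-attention first, then the added video cross-attention."""
    x = x + cross_attention(x, text_ctx, frozen_text_attn)  # pretrained, kept frozen
    x = x + cross_attention(x, video_ctx, video_attn)       # new bridge, trainable
    return x


rng = np.random.default_rng(0)
audio_latents = rng.standard_normal((16, 32))  # (audio tokens, model width)
text_ctx = rng.standard_normal((8, 24))        # (text tokens, text width)
video_ctx = rng.standard_normal((12, 48))      # (pooled video tokens, video width)

frozen_text_attn = init_attention(rng, 32, 24, 16)
video_attn = init_attention(rng, 32, 48, 16)

out = dit_block_with_video(audio_latents, text_ctx, video_ctx,
                           frozen_text_attn, video_attn)
print(out.shape)
```

In a training setup following this sketch, only the weights in `video_attn` would receive gradient updates, while `frozen_text_attn` and the rest of the audio model stay fixed. Because the video branch is a residual addition, it contributes nothing when its output projection is zero, so the block then reduces to the text-conditioned path; whether the authors rely on this property is not stated in the source.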

Samples 1 to 5

Each sample is presented in six versions, in this order: Foley Control (Ours), MMAudio, Hunyuan Foley, ThinkSound, Frieren, Original.

Results: MovieBench Comparison

↑ higher is better, ↓ lower is better; * marks the best value in each column.

System               | KL-PANNs ↓ | KL-PaSST ↓ | IB ↑  | FD-VGG ↓ | FD-PANNs ↓ | FD-PaSST ↓ | DeSync ↓
Frieren              | 3.58       | 3.89       | 0.14  | 5.65     | 59.04      | 560.91     | 0.30
MMAudio              | 2.52*      | 2.35       | 0.25  | 4.14*    | 37.60      | 343.24*    | 0.29*
HunyuanVideo-Foley   | 2.58       | 2.11*      | 0.30* | 7.00     | 31.28      | 373.62     | 0.31
Foley Control (ours) | 2.93       | 2.59       | 0.20  | 5.89     | 31.10*     | 383.99     | 0.32
ThinkSound           | 3.16       | 2.90       | 0.18  | 6.62     | 33.62      | 468.25     | 0.30

Metrics: KL = mean KL divergence of classifier posteriors; IB = ImageBind audio–video similarity; FD = Fréchet Distance in different audio embedding spaces; DeSync = temporal offset (seconds).
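As a rough illustration of the KL entry in the legend above, the sketch below averages a per-clip KL divergence between class posteriors for reference and generated audio. The direction KL(reference || generated), the epsilon floor, and the toy three-class posteriors are assumptions made for illustration; in practice the posteriors would come from a pretrained audio classifier such as PANNs or PaSST, which is not run here.

```python
import math


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) in nats between two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            # Floor q_i so a zero probability does not cause division by zero.
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


def mean_kl(reference_posteriors, generated_posteriors):
    """Average the per-clip KL over paired reference/generated posteriors."""
    pairs = list(zip(reference_posteriors, generated_posteriors))
    if not pairs:
        raise ValueError("need at least one pair of posteriors")
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


# Toy posteriors over three classes for two clips (made-up numbers).
reference = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
generated = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]]

score = mean_kl(reference, generated)
print(round(score, 4))
```

The first clip contributes zero because its two posteriors are identical; the second contributes a positive value, so the mean is positive. Lower values indicate that generated audio is classified more like the reference, consistent with the ↓ marker in the table.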

Citation

@misc{rowles2025foleycontrol,
  title        = {Foley Control: Video Guided Sound Effect Generation with a Frozen Latent Audio Model},
  author       = {Ciara Rowles and Varun Jampani and Simon Donn{\'e} and Shimon Vainer and Julian Parker and Zach Evans},
  year         = {2025},
  url          = {https://stability-ai.github.io/foleycontrol.github.io/},
  note         = {Project page and code.}
}