Foley Control: Video Guided Sound Effect Generation with a Frozen Latent Audio Model

Abstract

Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting compact video cross-attention after the model’s existing text cross-attention, so prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text), and the bridge learns the audio–video dependency needed for synchronization without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video–audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving prompt-driven controllability and production-friendly modularity (encoders or the T2A backbone can be swapped or upgraded without end-to-end retraining). Although we focus on Video-to-Foley, the same bridge design may extend to other audio modalities (e.g., speech).
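The bridge described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the names `pool_video_tokens` and `VideoCrossAttention` are hypothetical, average pooling is only one possible way to pool video tokens, the attention is single-head, and the zero-initialised output projection is an assumption borrowed from common adapter practice so that the frozen audio model's behaviour is unchanged at the start of training.

```python
import numpy as np


def pool_video_tokens(tokens, factor):
    """Average-pool video tokens along time: (T, D) -> (T // factor, D).

    Assumption: mean pooling; the source only says tokens are pooled.
    """
    t, d = tokens.shape
    usable = (t // factor) * factor  # drop a ragged tail, if any
    return tokens[:usable].reshape(-1, factor, d).mean(axis=1)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class VideoCrossAttention:
    """Hypothetical single-head cross-attention: audio latents attend to video tokens."""

    def __init__(self, audio_dim, video_dim, rng):
        self.w_q = rng.normal(0.0, 0.02, (audio_dim, audio_dim))
        self.w_k = rng.normal(0.0, 0.02, (video_dim, audio_dim))
        self.w_v = rng.normal(0.0, 0.02, (video_dim, audio_dim))
        # Assumption: zero-initialised output projection, so the residual
        # branch contributes nothing until it has been trained.
        self.w_o = np.zeros((audio_dim, audio_dim))

    def __call__(self, audio, video):
        q = audio @ self.w_q                       # (Ta, Da)
        k = video @ self.w_k                       # (Tv, Da)
        v = video @ self.w_v                       # (Tv, Da)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (Ta, Tv)
        return audio + (attn @ v) @ self.w_o       # residual update


rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(64, 16))   # stand-in for video embeddings
audio_hidden = rng.normal(size=(32, 8))    # stand-in for DiT hidden states
pooled = pool_video_tokens(video_tokens, factor=4)
bridge = VideoCrossAttention(audio_dim=8, video_dim=16, rng=rng)
updated = bridge(audio_hidden, pooled)
```

In a full model, a layer of this kind would sit after each text cross-attention layer of the frozen backbone, and only the bridge weights would receive gradients.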

Samples 1–5

Each sample presents the same clip with audio from Foley Control (Ours), MMAudio, Hunyuan Foley, ThinkSound, and Frieren, alongside the original.

Results: MovieBench Comparison

↑ higher is better, ↓ lower is better.

System	KL-PANNs ↓	KL-PaSST ↓	IB ↑	FD-VGG ↓	FD-PANNs ↓	FD-PaSST ↓	DeSync ↓
Frieren	3.58	3.89	0.14	5.65	59.04	560.91	0.30
MMAudio	2.52 (best)	2.35	0.25	4.14 (best)	37.60	343.24 (best)	0.29 (best)
HunyuanVideo-Foley	2.58	2.11 (best)	0.30 (best)	7.00	31.28	373.62	0.31
Foley Control (ours)	2.93	2.59	0.20	5.89	31.10 (best)	383.99	0.32
ThinkSound	3.16	2.90	0.18	6.62	33.62	468.25	0.30

Metrics: KL = mean KL divergence of classifier posteriors; IB = ImageBind audio–video similarity; FD = Fréchet Distance in different audio embedding spaces; DeSync = temporal offset (seconds).
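To make the KL metric concrete, here is a small sketch of a mean KL divergence between classifier posteriors. It rests on assumptions: the function names are hypothetical, each posterior is taken to be a probability vector over classes, the direction is KL(reference || generated), and a small `eps` guards against division by zero. The actual evaluation code behind the table may differ in all of these choices.

```python
import math


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i = 0 contribute nothing
    )


def mean_kl(reference_posteriors, generated_posteriors):
    """Average KL over paired clips (assumed direction: reference || generated)."""
    pairs = list(zip(reference_posteriors, generated_posteriors))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


# Toy posteriors over two classes for two clips (illustrative values only).
reference = [[1.0, 0.0], [0.5, 0.5]]
generated = [[0.5, 0.5], [0.5, 0.5]]
score = mean_kl(reference, generated)
```

A lower score means the classifier sees the generated audio as closer to the reference audio, which matches the ↓ direction in the table.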

Citation

@misc{rowles2025foleycontrol,
  title        = {Foley Control: Video Guided Sound Effect Generation with a Frozen Latent Audio Model},
  author       = {Ciara Rowles and Varun Jampani and Simon Donn{\'e} and Shimon Vainer and Julian Parker and Zach Evans},
  year         = {2025},
  url          = {https://stability-ai.github.io/foleycontrol.github.io/},
  note         = {Project page and code.}
}
