Samples 1–5 (video comparisons): each sample is presented with audio from MMAudio, Hunyuan Foley, ThinkSound, Frieren, and the original soundtrack. Samples 2–5 also list Foley Control (Ours).
Results: MovieBench Comparison
↑ higher is better, ↓ lower is better. Bold marks the best value in each column.
| System | KL-PANNs ↓ | KL-PaSST ↓ | IB ↑ | FD-VGG ↓ | FD-PANNs ↓ | FD-PaSST ↓ | DeSync ↓ |
|---|---|---|---|---|---|---|---|
| FRIEREN | 3.58 | 3.89 | 0.14 | 5.65 | 59.04 | 560.91 | 0.30 |
| MMAudio | **2.52** | 2.35 | 0.25 | **4.14** | 37.60 | **343.24** | **0.29** |
| HunyuanVideo-Foley | 2.58 | **2.11** | **0.30** | 7.00 | 31.28 | 373.62 | 0.31 |
| Foley Control (Ours) | 2.93 | 2.59 | 0.20 | 5.89 | **31.10** | 383.99 | 0.32 |
| ThinkSound | 3.16 | 2.90 | 0.18 | 6.62 | 33.62 | 468.25 | 0.30 |
Metrics: KL = mean KL divergence of classifier posteriors; IB = ImageBind audio–video similarity; FD = Fréchet Distance in different audio embedding spaces; DeSync = temporal offset (seconds).
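To make the KL metric concrete, here is a minimal sketch of averaging the KL divergence between classifier posteriors over paired clips. It is an illustration under stated assumptions, not the evaluation code behind the table: the function names `kl_divergence` and `mean_kl` are hypothetical, the direction KL(reference || generated) is assumed, and the posteriors are assumed to be already-normalized class probabilities such as those a PANNs or PaSST classifier might output.

```python
import math


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    eps guards against log(0); it is an illustrative choice, not a value
    taken from the source.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of classes")
    # Terms with p_i == 0 contribute nothing to the sum.
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_kl(ref_posteriors, gen_posteriors):
    """Average KL(ref || gen) over paired clips (assumed direction)."""
    if not ref_posteriors or len(ref_posteriors) != len(gen_posteriors):
        raise ValueError("need the same non-zero number of clips on both sides")
    total = sum(
        kl_divergence(p, q) for p, q in zip(ref_posteriors, gen_posteriors)
    )
    return total / len(ref_posteriors)


# Toy posteriors over three classes for two clips (made-up numbers).
reference = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
generated = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]]
score = mean_kl(reference, generated)
```

A lower `score` means the classifier assigns similar class probabilities to the generated and reference audio, which matches the ↓ direction of the KL columns in the table.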
Citation
@misc{rowles2025foleycontrol,
title = {Foley Control: Video Guided Sound Effect Generation with a Frozen Latent Audio Model},
author = {Ciara Rowles and Varun Jampani and Simon Donn{\'e} and Shimon Vainer and Julian Parker and Zach Evans},
year = {2025},
url = {https://stability-ai.github.io/foleycontrol.github.io/},
note = {Project page and code.}
}