stable-audio-demo

⚠️ Warning: This website may not function properly on Safari. For the best experience, please use Google Chrome.

arXiv: Stable Audio’s paper

stable-audio-tools: code to reproduce Stable Audio

stable-audio-metrics: code to evaluate Stable Audio

Our model can generate variable-length and long-form stereo music at 44.1kHz:

Generated Stereo Music | Prompt
Berlin techno, rave, drum machine, kick, ARP synthesizer, dark, moody, hypnotic, evolving, 135 BPM. Loop.
Uplifting acoustic loop. 120 BPM.
Disco, Driving Drum Machine, Synthesizer, Bass, Piano, Guitars, Instrumental, Clubby, Euphoric, Chicago, New York, 115 BPM.
Calm meditation music to play in a spa lobby.
Drum solo.

Unlike previous state-of-the-art models, ours can generate stereo sound effects at 44.1kHz:

Generated Stereo Sounds | Prompt
Door slam. High-quality, stereo.
Sports car passing by. High-quality, stereo.
Motorbike passing by. High-quality, stereo.
Fireworks. High-quality, stereo.
Reverberant footsteps inside a large rocky cave. High-quality, stereo.

Note that all the examples on this website are generated with a single model, which can generate both variable-length music and sound effects in 44.1kHz stereo. We append “high-quality, stereo” to our sound effects prompts because it is generally helpful.
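As an illustration of the prompt convention described above, here is a minimal Python sketch that appends the suffix to a sound-effect prompt. The helper name `with_quality_suffix` and its behavior (skipping prompts that already end with the suffix, adding a missing full stop) are assumptions made for this example; they are not part of stable-audio-tools or of how the prompts on this page were actually prepared.

```python
# Suffix wording copied from the sound-effect prompts listed on this page.
SUFFIX = "High-quality, stereo."


def with_quality_suffix(prompt: str, suffix: str = SUFFIX) -> str:
    """Return the prompt with the suffix appended (hypothetical helper).

    Prompts that already end with the suffix are returned unchanged,
    and a full stop is added first if the prompt lacks end punctuation.
    """
    text = prompt.strip()
    if not text:
        return suffix
    if text.lower().endswith(suffix.lower()):
        return text
    if text[-1] not in ".!?":
        text += "."
    return f"{text} {suffix}"


if __name__ == "__main__":
    for p in ["Door slam", "Fireworks.", "Motorbike passing by. High-quality, stereo."]:
        print(with_quality_suffix(p))
```

Keeping the suffix in one constant makes it easy to apply the same wording to every sound-effect prompt while leaving music prompts untouched.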

Long-form stereo music: comparison with the state of the art using MusicCaps prompts

Prompt: This song contains someone strumming a melody on a mandolin while more people are whistling along. Then a mandolin, an e-bass and an acoustic guitar are playing a short melody in a lower key before breaking into the next part along with flutes and percussions. This song may be played outside by musicians performing.

Our Model | MusicGen-large | MusicGen-stereo | AudioLDM2
(stereo, 44.1kHz) | (mono, 32kHz) | (stereo, 32kHz) | (mono, 48kHz)

Prompt: The commercial music features a groovy piano melody played over snare rolls in the first half of the loop. Right after, there is a drop that consists of a punchy “4 on the floor” kick pattern, shimmering hi hats, claps, groovy piano and wide synth lead melody. It sounds happy, fun, euphoric and exciting.

Our Model | MusicGen-large | MusicGen-stereo | AudioLDM2
(stereo, 44.1kHz) | (mono, 32kHz) | (stereo, 32kHz) | (mono, 48kHz)

These prompts and audio clips were used in the qualitative study reported in our paper.

Sound effects: comparison with the state of the art using AudioCaps prompts

Prompt: Clicking and sputtering then eventual revving of an idling engine.

Our Model | AudioGen-medium | AudioLDM2
(stereo, 44.1kHz) | (mono, 32kHz) | (mono, 48kHz)

Prompt: Birds chirping loudly.

Our Model | AudioGen-medium | AudioLDM2
(stereo, 44.1kHz) | (mono, 32kHz) | (mono, 48kHz)

These prompts and audio clips were used in the qualitative study reported in our paper. Note that the randomly selected AudioCaps prompts did not call for substantial stereo movement, so the resulting renders are relatively non-spatial.

Autoencoder: reconstructions

This comparison helps evaluate the audio fidelity of the autoencoder. On the left is the ground truth recording. On the right is the same recording after it has been passed through the autoencoder. Note that the autoencoder reconstruction is fairly transparent, very close to the ground truth.

Ground truth | Autoencoder reconstruction