stable-audio-2-demo

Audio-based generative models for music have made great strides recently, but so far have not managed to produce full-length music tracks with coherent musical structure. We show that, by training a generative model on long temporal contexts, it is possible to produce long-form music of up to 4m 45s. Our model consists of a diffusion transformer operating on a highly downsampled continuous latent representation (latent rate of 21.5 Hz). It obtains state-of-the-art results on metrics of audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.
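As a back-of-the-envelope illustration (a sketch using only the numbers quoted above, not details from the paper), the 21.5 Hz latent rate is what makes such long contexts tractable:

```python
# Sequence lengths implied by the numbers above (44.1 kHz audio,
# 21.5 Hz latent rate, 4 m 45 s maximum duration).
sample_rate = 44_100                      # audio samples per second
latent_rate = 21.5                        # latents per second
duration_s = 4 * 60 + 45                  # 4 m 45 s = 285 s

downsampling = sample_rate / latent_rate  # ~2051x temporal compression
seq_len = duration_s * latent_rate        # ~6128 latents for a full track

print(f"{downsampling:.0f}x downsampling -> {seq_len:.0f}-token sequence")
```

A full track thus fits in a transformer context of roughly six thousand latents rather than millions of raw audio samples.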

Comparison with the state of the art (Song Describer Dataset prompts)

Prompt: An uplifting jazz song that makes your head shake.

Our Model (stereo, 44.1kHz) MusicGen-large-stereo (stereo, 32kHz)

Prompt: One cannot avoid moving the feet and neck listening to this fast and loopy brazilian tune.

Our Model MusicGen-large-stereo

Prompt: Ambiental song that evokes calm with a progression of stereo electronic elements.

Our Model MusicGen-large-stereo

Prompt: This song starts with a ukulele and builds up with percussion using claps and an acoustic guitar that plays the same rhythm as the ukulele with melody played on a xylophone and has a very upbeat feel to it.

Our Model (stereo, 44.1kHz) MusicGen-large-stereo (stereo, 32kHz) Ground-truth (stereo, 44.1kHz)

Prompt: Calming instrumental music primarily on piano can be used for relaxing.

Our Model MusicGen-large-stereo Ground-truth

Prompt: A dance music club banger, with a heavy kick, subtle supporting percussion like tabla and bongos, prominent pop synth lines, and a repetitive hook.

Our Model MusicGen-large-stereo Ground-truth

These prompts/audios were used for the qualitative study we report in our paper.

Additional creative capabilities

Audio-to-audio With diffusion models it is possible to perform some degree of style transfer by initializing the noise with audio during sampling. This capability can be used to modify the aesthetics of an existing recording based on a given text prompt, whilst maintaining the reference audio’s structure (e.g., a beatbox recording could be style-transferred to produce realistic-sounding drums). As a result, our model can be influenced not only by text prompts but also by audio inputs, enhancing its controllability and expressiveness. We noted that when initialized with voice recordings (such as beatbox or onomatopoeia), there is a sensation of control akin to playing an instrument. A minimal sketch of this initialization is shown below.
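The following is a hedged sketch of the idea, not the actual Stable Audio 2 implementation: the `vae`, `denoiser`, and `text_emb` arguments are hypothetical stand-ins for the model’s components, and the Euler loop assumes a denoiser that predicts the ODE derivative.

```python
import torch

def audio_to_audio(vae, denoiser, text_emb, ref_audio,
                   strength=0.7, num_steps=100):
    """Style transfer by starting sampling from a noised encoding of the
    reference audio instead of pure noise (img2img-style initialization)."""
    # Encode the reference recording into the continuous latent space.
    z = vae.encode(ref_audio)

    # Linear noise schedule; `strength` controls how much of it we run.
    sigmas = torch.linspace(1.0, 0.0, num_steps + 1)
    start = int(num_steps * (1.0 - strength))   # higher strength = noisier start
    sigmas = sigmas[start:]

    # Noising the reference latent preserves its coarse temporal structure
    # (rhythm, phrasing) while leaving room to rewrite timbre and texture.
    z_t = z + sigmas[0] * torch.randn_like(z)

    # Plain Euler sampling loop from the intermediate noise level, with the
    # text prompt steering each denoising step.
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = denoiser(z_t, sigma, text_emb)      # assumed to predict dz/dsigma
        z_t = z_t + (sigma_next - sigma) * d

    return vae.decode(z_t)
```

Lower `strength` keeps more of the reference (useful for subtle re-texturing), while higher values let the text prompt dominate.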

Input audio Output audio Prompt


Bass guitar


format: solo, instruments: vibraphone


Genre: UK Bass, Instruments: 707 Drum Machine, Strings, 808 bass stabs, Beautiful Synths


Guitar


Drums

Vocal music The training dataset contains a subset of music with vocals. Our focus is on the generation of instrumental music, so we do not provide any conditioning based on lyrics. As a result, when the model is prompted for vocals, its generations contain vocal-like melodies without intelligible words. Whilst not a substitute for intelligible vocals, these sounds have an artistic and textural value of their own.

Short-form audio generation The training set does not exclusively contain long-form music. It also contains shorter sounds like sound effects or instrument samples. As a consequence, our model is also capable of producing such sounds when prompted appropriately.

Generation by our model Prompt

Dog barking

Ringtone

Waves

Helicopter passing by from left to right

Fowl, chicken, rooster, crowing, cock-a-doodle-doo

Memorization analysis

Recent works have examined the potential of generative models to memorize training data, especially repeated elements in the training set. Further, MusicLM conducted a memorization analysis to address concerns about the potential misappropriation of creative content. Adhering to principles of responsible model development, we also ran a comprehensive study on memorization.

Considering the increased probability of memorizing repeated music within the dataset, we start by studying whether our training set contains repeated data. We embed all our training data using the LAION-CLAP audio encoder and select audios that are close in this space based on a manually set threshold. The threshold is set such that the selected audios correspond to exact replicas. With this process, we identify 5566 repeated audios in our training set.
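A hedged sketch of this duplicate search using the open-source `laion_clap` package (the file paths and the similarity threshold are placeholders; in practice the threshold is tuned manually until matches are exact replicas):

```python
import glob
import numpy as np
import laion_clap

# Embed every training clip with LAION-CLAP.
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # loads a default pretrained checkpoint

train_files = sorted(glob.glob("train/*.wav"))  # placeholder paths
emb = model.get_audio_embedding_from_filelist(x=train_files, use_tensor=False)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize

# Flag pairs whose cosine similarity exceeds a manually tuned threshold.
THRESHOLD = 0.95  # placeholder value; set so that matches are exact replicas
sim = emb @ emb.T
np.fill_diagonal(sim, 0.0)                # ignore self-similarity
dup_pairs = np.argwhere(sim > THRESHOLD)  # candidate replica pairs (i, j)
```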

We compare our model’s generations against the training set in LAION-CLAP space. Generations are produced from the 5566 prompts of the repeated training data (in-distribution) and from 586 prompts of the Song Describer Dataset (no-singing subset, out-of-distribution). We then identify the top-50 generations closest to the training data and listen to them.
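Continuing the sketch above, the nearest-neighbour ranking amounts to a cosine-similarity search in the same embedding space (again with placeholder paths):

```python
# Embed the generations and rank them by their nearest training neighbour.
gen_files = sorted(glob.glob("generations/*.wav"))  # placeholder paths
gen_emb = model.get_audio_embedding_from_filelist(x=gen_files, use_tensor=False)
gen_emb = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)

sim = gen_emb @ emb.T               # generations x training clips
nearest = sim.max(axis=1)           # similarity to the closest training clip
top50 = np.argsort(-nearest)[:50]   # the candidates we audition manually

for i in top50:
    j = sim[i].argmax()
    print(f"{gen_files[i]} -> {train_files[j]} (cos={sim[i, j]:.3f})")
```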

We listened extensively to the potential memorization candidates and could not find memorization. These are the most interesting candidates from the (repeated) training data prompts:

Generation by our model Closest #1 Closest #2 Closest #3 Prompt
427160 427105 140843 Birds chirping, forest birds, tropical, africa wild life, singing birds, sound effects.
978924 979616 978717 Totally rad 8-bit melodies and intense arps create that fearless throwback vibe.
979544 979695 979670 Totally rad 8-bit melodies and intense arps create that strong-willed throwback vibe.
972466 972983 973055 Pleasant strings create desire in this adamant scoring cue.

We found a fair amount of 8-bit/chiptune tracks that were repeated in the training dataset. Still, our model does not memorize them.

We also selected the most salient generations from Song Describer Dataset prompts, and again could not find memorization. These are the most interesting memorization candidates:

Generation by our model Closest #1 Closest #2 Closest #3 Prompt
796563 1083119 634461 One cannot avoid moving the feet and neck listening to this fast and loopy brazilian tune.
279428 1082095 326758 An uplifting jazz song that makes your head shake.
1024058 1023046 788950 Calming instrumental music primarily on piano can be used for relaxing.
470048 470047 696082 This song starts with a ukulele and builds up with percussion using claps and an acoustic guitar that plays the same rhythm as the ukulele with melody played on a xylophone and has a very upbeat feel to it.

Autoencoder: reconstructions

This comparison is useful to evaluate the audio fidelity of the autoencoder. On the left is the ground truth recording; on the right, the same recording after a round trip through the autoencoder. Note that the reconstruction is fairly transparent, staying very close to the ground truth.

Ground truth Autoencoder reconstruction
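Such a round trip can be sketched as follows (a minimal sketch assuming a hypothetical `autoencoder` module with `encode`/`decode` methods; the actual architecture is described in the paper):

```python
import torch

def reconstruct(autoencoder, audio: torch.Tensor) -> torch.Tensor:
    """Round-trip a recording through the autoencoder so the output can be
    compared against the input to judge transparency."""
    with torch.no_grad():
        z = autoencoder.encode(audio)  # waveform -> 21.5 Hz continuous latents
        return autoencoder.decode(z)   # latents -> 44.1 kHz stereo waveform
```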