stable-audio-2-demo

Audio-based generative models for music have made great strides recently, but so far have not managed to produce full-length music tracks with coherent musical structure. We show that, by training a generative model on long temporal contexts, it is possible to produce long-form music of up to 4m 45s. Our model consists of a diffusion transformer operating on a highly downsampled continuous latent representation (latent rate of 21.5 Hz). It obtains state-of-the-art results on metrics of audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.
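As a back-of-the-envelope illustration (a sketch using only the numbers quoted above, not details from the paper), the 21.5 Hz latent rate is what makes such long contexts tractable:

```python
# Sequence lengths implied by the numbers above (44.1 kHz audio,
# 21.5 Hz latent rate, 4 m 45 s maximum duration).
sample_rate = 44_100                      # audio samples per second
latent_rate = 21.5                        # latents per second
duration_s = 4 * 60 + 45                  # 4 m 45 s = 285 s

downsampling = sample_rate / latent_rate  # ~2051x temporal compression
seq_len = duration_s * latent_rate        # ~6128 latents for a full track

print(f"{downsampling:.0f}x downsampling -> {seq_len:.0f}-token sequence")
```

A full track thus fits in a transformer context of roughly six thousand latents rather than millions of raw audio samples.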

Comparison with the state of the art (Song Describer Dataset prompts)

Prompt: An uplifting jazz song that makes your head shake.

Our Model (stereo, 44.1kHz) MusicGen-large-stereo (stereo, 32kHz)

Prompt: One cannot avoid moving the feet and neck listening to this fast and loopy brazilian tune.

Our Model MusicGen-large-stereo

Prompt: Ambiental song that evokes calm with a progression of stereo electronic elements.

Our Model MusicGen-large-stereo

Prompt: This song starts with a ukulele and builds up with percussion using claps and an acoustic guitar that plays the same rhythm as the ukulele with melody played on a xylophone and has a very upbeat feel to it.

Our Model (stereo, 44.1kHz) MusicGen-large-stereo (stereo, 32kHz) Ground-truth (stereo, 44.1kHz)

Prompt: Calming instrumental music primarily on piano can be used for relaxing.

Our Model MusicGen-large-stereo Ground-truth

Prompt: A dance music club banger, with a heavy kick, subtle supporting percussion like tabla and bongos, prominent pop synth lines, and a repetitive hook.

Our Model MusicGen-large-stereo Ground-truth

These prompts/audios were used for the qualitative study we report in our paper.

Additional creative capabilities

Audio-to-audio With diffusion models it is possible to perform some degree of style transfer by initializing the noise with audio during sampling. This capability can be used to modify the aesthetics of an existing recording based on a given text prompt, whilst maintaining the reference audio’s structure (e.g., a beatbox recording could be style-transferred to produce realistic-sounding drums). As a result, our model can be influenced not only by text prompts but also by audio inputs, enhancing its controllability and expressiveness. We noted that when initialized with voice recordings (such as beatbox or onomatopoeia), there is a sensation of control akin to playing an instrument. A minimal sketch of this initialization is shown below.
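The following is a hedged sketch of the idea, not the actual Stable Audio 2 implementation: the `vae`, `denoiser`, and `text_emb` arguments are hypothetical stand-ins for the model’s components, and the Euler loop assumes a denoiser that predicts the ODE derivative.

```python
import torch

def audio_to_audio(vae, denoiser, text_emb, ref_audio,
                   strength=0.7, num_steps=100):
    """Style transfer by starting sampling from a noised encoding of the
    reference audio instead of pure noise (img2img-style initialization)."""
    # Encode the reference recording into the continuous latent space.
    z = vae.encode(ref_audio)

    # Linear noise schedule; `strength` controls how much of it we run.
    sigmas = torch.linspace(1.0, 0.0, num_steps + 1)
    start = int(num_steps * (1.0 - strength))   # higher strength = noisier start
    sigmas = sigmas[start:]

    # Noising the reference latent preserves its coarse temporal structure
    # (rhythm, phrasing) while leaving room to rewrite timbre and texture.
    z_t = z + sigmas[0] * torch.randn_like(z)

    # Plain Euler sampling loop from the intermediate noise level, with the
    # text prompt steering each denoising step.
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = denoiser(z_t, sigma, text_emb)      # assumed to predict dz/dsigma
        z_t = z_t + (sigma_next - sigma) * d

    return vae.decode(z_t)
```

Lower `strength` keeps more of the reference (useful for subtle re-texturing), while higher values let the text prompt dominate.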

Input audio Output audio Prompt


Bass guitar


format: solo, instruments: vibraphone


Genre: UK Bass, Instruments: 707 Drum Machine, Strings, 808 bass stabs, Beautiful Synths


Guitar


Drums

Vocal music The training dataset contains a subset of music with vocals. Our focus is on the generation of instrumental music, so we do not provide any conditioning based on lyrics. As a result, when the model is prompted for vocals, its generations contain vocal-like melodies without intelligible words. Whilst not a substitute for intelligible vocals, these sounds have an artistic and textural value of their own.

Short-form audio generation The training set does not exclusively contain long-form music. It also contains shorter sounds like sound effects or instrument samples. As a consequence, our model is also capable of producing such sounds when prompted appropriately.

Generation by our model Prompt

Dog barking

Ringtone

Waves

Helicopter passing by from left to right

Fowl, chicken, rooster, crowing, cock-a-doodle-doo

Memorization analysis

Recent works have examined the potential of generative models to memorize training data, especially repeated elements in the training set. Further, MusicLM conducted a memorization analysis to address concerns about the potential misappropriation of creative content. Adhering to principles of responsible model development, we also ran a comprehensive study on memorization.

Considering the increased probability of memorizing repeated music within the dataset, we start by studying whether our training set contains repeated data. We embed all our training data using the LAION-CLAP audio encoder and select audios that are close in this space based on a manually set threshold. The threshold is set such that the selected audios correspond to exact replicas. With this process, we identify 5566 repeated audios in our training set.
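A hedged sketch of this duplicate search using the open-source `laion_clap` package (the file paths and the similarity threshold are placeholders; in practice the threshold is tuned manually until matches are exact replicas):

```python
import glob
import numpy as np
import laion_clap

# Embed every training clip with LAION-CLAP.
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # loads a default pretrained checkpoint

train_files = sorted(glob.glob("train/*.wav"))  # placeholder paths
emb = model.get_audio_embedding_from_filelist(x=train_files, use_tensor=False)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize

# Flag pairs whose cosine similarity exceeds a manually tuned threshold.
THRESHOLD = 0.95  # placeholder value; set so that matches are exact replicas
sim = emb @ emb.T
np.fill_diagonal(sim, 0.0)                # ignore self-similarity
dup_pairs = np.argwhere(sim > THRESHOLD)  # candidate replica pairs (i, j)
```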

We compare our model’s generations against the training set in LAION-CLAP space. Generations are produced from the 5566 prompts of the repeated training data (in-distribution) and from 586 prompts of the Song Describer Dataset (no-singing subset, out-of-distribution). We then identify the top-50 generations closest to the training data and listen to them.
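Continuing the sketch above, the nearest-neighbour ranking amounts to a cosine-similarity search in the same embedding space (again with placeholder paths):

```python
# Embed the generations and rank them by their nearest training neighbour.
gen_files = sorted(glob.glob("generations/*.wav"))  # placeholder paths
gen_emb = model.get_audio_embedding_from_filelist(x=gen_files, use_tensor=False)
gen_emb = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)

sim = gen_emb @ emb.T               # generations x training clips
nearest = sim.max(axis=1)           # similarity to the closest training clip
top50 = np.argsort(-nearest)[:50]   # the candidates we audition manually

for i in top50:
    j = sim[i].argmax()
    print(f"{gen_files[i]} -> {train_files[j]} (cos={sim[i, j]:.3f})")
```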

We listened extensively to the potential memorization candidates and could not find memorization. These are the most interesting candidates from the (repeated) training data prompts:

Generation by our model Closest #1 Closest #2 Closest #3 Prompt
427160 427105 140843 Birds chirping, forest birds, tropical, africa wild life, singing birds, sound effects.
978924 979616 978717 Totally rad 8-bit melodies and intense arps create that fearless throwback vibe.
979544 979695 979670 Totally rad 8-bit melodies and intense arps create that strong-willed throwback vibe.
972466 972983 973055 Pleasant strings create desire in this adamant scoring cue.

We found a fair amount of 8-bit/chiptune tracks that were repeated in the training dataset. Still, our model does not memorize them.

We also selected the most salient generations from Song Describer Dataset prompts, and again could not find memorization. These are the most interesting memorization candidates:

Generation by our model Closest #1 Closest #2 Closest #3 Prompt
796563 1083119 634461 One cannot avoid moving the feet and neck listening to this fast and loopy brazilian tune.
279428 1082095 326758 An uplifting jazz song that makes your head shake.
1024058 1023046 788950 Calming instrumental music primarily on piano can be used for relaxing.
470048 470047 696082 This song starts with a ukulele and builds up with percussion using claps and an acoustic guitar that plays the same rhythm as the ukulele with melody played on a xylophone and has a very upbeat feel to it.

Autoencoder: reconstructions

This comparison is useful to evaluate the audio fidelity of the autoencoder. On the left is the ground truth recording; on the right, the same recording after a round trip through the autoencoder. Note that the reconstruction is fairly transparent, staying very close to the ground truth.

Ground truth Autoencoder reconstruction
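Such a round trip can be sketched as follows (a minimal sketch assuming a hypothetical `autoencoder` module with `encode`/`decode` methods; the actual architecture is described in the paper):

```python
import torch

def reconstruct(autoencoder, audio: torch.Tensor) -> torch.Tensor:
    """Round-trip a recording through the autoencoder so the output can be
    compared against the input to judge transparency."""
    with torch.no_grad():
        z = autoencoder.encode(audio)  # waveform -> 21.5 Hz continuous latents
        return autoencoder.decode(z)   # latents -> 44.1 kHz stereo waveform
```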