SAME: A Semantically-Aligned Music Autoencoder

Julian D Parker, Zach Evans, CJ Carr, Zack Zukowski, Josiah Taylor, Matthew Rice, and Jordi Pons
Stability AI

Abstract

Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically Aligned Music autoEncoder), a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard) while maintaining excellent reconstruction quality and strong downstream generative performance. We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses. The architecture also delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.

Architecture and Training Losses

Total compression: (P * S)x. Dashed boxes indicate loss components.
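
To make the compression arithmetic concrete, the following minimal sketch computes the total compression factor and the resulting latent frame rate. It rests on several assumptions that are not stated in this text: P and S are taken to be two integer temporal downsampling factors whose product gives the overall ratio, the values 256 and 16 are hypothetical and chosen only so that their product equals the 4096x figure from the abstract, and the 44.1 kHz sample rate is likewise an assumption. The function names are illustrative and are not part of any released SAME code.

```python
def total_compression(p: int, s: int) -> int:
    """Overall temporal compression, following the caption's (P * S)x rule."""
    return p * s


def latent_frame_rate(sample_rate_hz: float, compression: int) -> float:
    """Latent frames per second for a given input sample rate."""
    return sample_rate_hz / compression


def latent_frames(num_samples: int, compression: int) -> int:
    """Number of latent frames, assuming the input is padded up to a whole frame."""
    return -(-num_samples // compression)  # ceiling division


if __name__ == "__main__":
    # Hypothetical factorisation: only the product (4096) comes from the abstract.
    ratio = total_compression(p=256, s=16)
    print(ratio)  # 4096

    # Assumed sample rate of 44.1 kHz.
    print(round(latent_frame_rate(44_100, ratio), 2))  # 10.77

    # 30 seconds of audio at the assumed sample rate.
    print(latent_frames(30 * 44_100, ratio))  # 323
```

Under these assumptions, a 4096x ratio corresponds to roughly 11 latent frames per second of audio, which is what makes the sequence lengths seen by a downstream generative model so short.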

Results

Audio Reconstruction Comparison

Systems compared: Original, SAME-L, SAME-S, SAO, CoDiCodec, ACE-Step-1.5, ɛAR-VAE, 64 kbps MP3