SAME: A Semantically-Aligned Music Autoencoder
Abstract
Latent representations are at the heart of most modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically Aligned Music autoEncoder), a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard) while maintaining excellent reconstruction quality and strong downstream generative performance. We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses. The architecture also delivers substantial computational savings, through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
Architecture and Training Losses
Total compression: (P * S)x. Dashed boxes indicate loss components.
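As a rough illustration of the caption's arithmetic, the sketch below computes the total temporal compression as the product P * S and the number of latent frames that would result for a clip. Reading P and S as two successive temporal downsampling factors, the example values P = 256 and S = 16 (picked only so that the product matches the 4096x figure in the abstract), the 44.1 kHz sample rate, and the round-up padding behaviour are all assumptions made for illustration, not values taken from the paper.

```python
import math


def total_compression(p: int, s: int) -> int:
    """Total temporal compression ratio, following the caption: (P * S)x."""
    return p * s


def latent_frames(num_samples: int, p: int, s: int) -> int:
    """Number of latent frames for a clip of num_samples audio samples.

    Assumes the clip is padded up to a whole number of frames (hence ceil);
    the real model may pad or crop differently.
    """
    return math.ceil(num_samples / total_compression(p, s))


def latent_rate_hz(sample_rate: int, p: int, s: int) -> float:
    """Latent frames per second of audio at the given sample rate."""
    return sample_rate / total_compression(p, s)


# Hypothetical factorisation of 4096; the actual P and S are not stated here.
P, S = 256, 16
SAMPLE_RATE = 44_100  # assumed sample rate, in Hz

ratio = total_compression(P, S)
frames = latent_frames(10 * SAMPLE_RATE, P, S)  # a 10-second clip
rate = latent_rate_hz(SAMPLE_RATE, P, S)

print(f"compression: {ratio}x, frames for 10 s: {frames}, latent rate: {rate:.2f} Hz")
```

Under these assumed values, each second of audio maps to only about eleven latent frames, which is where the computational savings mentioned in the abstract would come from for a downstream generative model.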
Results
Audio Reconstruction Comparison
| Original | SAME-L | SAME-S | SAO | CoDiCodec | ACE-Step-1.5 | ɛAR-VAE | 64kbps MP3 |
|---|---|---|---|---|---|---|---|