SAME: A Semantically-Aligned Music Autoencoder

Julian D Parker, Zach Evans, Matthew Rice, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons


Stability AI

arXiv | Code (Inference) | Code (Research) | 🤗 SAME-L | 🤗 SAME-S

Abstract

Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically Aligned Music autoEncoder), a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard) while maintaining excellent reconstruction quality and strong downstream generative performance. We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses. The architecture also delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
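To give a feel for what a 4096x temporal compression ratio means in practice, the short sketch below computes the resulting latent frame rate and the number of latent frames for a clip. The 44.1 kHz sample rate, the 30-second clip length, and the choice to round a partial final frame up are illustrative assumptions, not details stated on this page; the helper names are hypothetical.

```python
import math

COMPRESSION = 4096  # temporal compression ratio quoted in the abstract


def latent_rate_hz(sample_rate: int, compression: int = COMPRESSION) -> float:
    """Latent frames per second for a given audio sample rate."""
    return sample_rate / compression


def latent_frames(num_samples: int, compression: int = COMPRESSION) -> int:
    """Latent frames for a clip, assuming the tail is padded to a whole frame."""
    return math.ceil(num_samples / compression)


sample_rate = 44_100  # assumed sample rate, for illustration only
clip_seconds = 30     # assumed clip length, for illustration only

print(f"latent rate: {latent_rate_hz(sample_rate):.2f} Hz")
print(f"latent frames for {clip_seconds} s: {latent_frames(clip_seconds * sample_rate)}")
```

Under these assumptions each second of audio maps to roughly eleven latent frames, which is why a downstream generative model operating on the latents has far fewer time steps to process than one working at a lower compression ratio.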

Architecture and Training Losses

Total compression: (P * S)x. Dashed boxes indicate loss components.
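The caption's relation can be checked with a few lines of arithmetic. The sketch below treats P and S as two positive integer downsampling factors whose product is the total compression; the caption does not define them here, so that reading, the function names, and the example values P = 256, S = 16 are all assumptions chosen only to illustrate how a 4096x total could arise.

```python
def total_compression(p: int, s: int) -> int:
    """Total temporal compression as the product P * S from the caption."""
    if p < 1 or s < 1:
        raise ValueError("P and S must be positive integers")
    return p * s


def factor_pairs(target: int) -> list[tuple[int, int]]:
    """All (P, S) pairs of positive integers with P * S == target."""
    return [(p, target // p) for p in range(1, target + 1) if target % p == 0]


# Hypothetical example values, not taken from the paper.
print(total_compression(256, 16))

# Every integer split of a 4096x total into two factors.
for p, s in factor_pairs(4096):
    print(p, s)
```

Because 4096 is a power of two, every valid pair consists of two powers of two, so the same total can be reached with many different splits between the two factors.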

Results

Audio Reconstruction Comparison

Original	SAME-L	SAME-S	SAO	CoDiCodec	ACE-Step-1.5	ɛAR-VAE	64kbps MP3