Scaling Transformers for Low-Bitrate High-Quality Speech Coding



Julian D Parker, Anton Smirnov, Jordi Pons, CJ Carr, Zack Zukowski, Zach Evans, Xubo Liu

Stability AI



Abstract

The tokenization of speech with neural audio codec models is a vital part of modern AI pipelines for the generation or understanding of speech, alone or in a multimodal context. Traditionally, such tokenization models have concentrated on low parameter-count architectures using only components with strong inductive biases. In this work we show that by scaling a transformer architecture with a large parameter count to this problem, and applying a flexible Finite Scalar Quantization (FSQ) based bottleneck, it is possible to reach state-of-the-art speech quality at extremely low bit-rates of $400$ or $700$ bits-per-second. The trained models strongly outperform existing baselines in both objective and subjective tests.
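For intuition, below is a minimal sketch of an FSQ-style bottleneck, assuming PyTorch; the number of latent dimensions and the per-dimension level counts are illustrative placeholders rather than the configuration used in the paper.

import torch
import torch.nn as nn

class FSQ(nn.Module):
    """Finite Scalar Quantization sketch: bound each latent dimension and round
    it to a small, fixed number of levels (hypothetical level choice below)."""
    def __init__(self, levels=(7, 7, 7, 5, 5)):
        super().__init__()
        self.register_buffer("levels", torch.tensor(levels, dtype=torch.float32))

    def forward(self, z):
        # z: (..., len(levels)) continuous latents from the encoder.
        half = (self.levels - 1) / 2
        z_bounded = torch.tanh(z) * half                      # squash each dim to [-half, half]
        z_quant = torch.round(z_bounded)                      # snap to the integer grid
        z_quant = z_bounded + (z_quant - z_bounded).detach()  # straight-through gradients
        return z_quant / half                                 # rescale to [-1, 1] for the decoder

Each quantized frame carries log2 of the product of the level counts in bits, so the overall bitrate is the latent frame rate multiplied by that quantity; the frame rates and level configurations that yield $400$ and $700$ bits-per-second are detailed in the paper.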

Model Architecture

Architecture of the proposed model TAAE (Transformer Audio AutoEncoder). Detail is shown for the encoder block and sub-blocks. The decoder block is configured identically to the encoder block, with the exception of the strided convolution, which is replaced with its transposed equivalent and moved to the end of the $T_m$ blocks.

Figure: Framework of the proposed TAAE.
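A minimal sketch of this block layout, assuming PyTorch and generic pre-norm transformer sub-blocks, is shown below; channel counts, strides, head counts, and the number of $T_m$ sub-blocks are placeholders rather than the paper's hyper-parameters.

import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, c_in, c_out, stride, n_blocks=4, n_heads=8):
        super().__init__()
        # Strided convolution downsamples in time before the T_m transformer sub-blocks.
        self.downsample = nn.Conv1d(c_in, c_out, kernel_size=2 * stride,
                                    stride=stride, padding=stride // 2)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=c_out, nhead=n_heads,
                                       batch_first=True, norm_first=True)
            for _ in range(n_blocks)])

    def forward(self, x):                       # x: (batch, channels, time)
        x = self.downsample(x).transpose(1, 2)  # -> (batch, time', channels)
        for blk in self.blocks:
            x = blk(x)
        return x.transpose(1, 2)                # back to (batch, channels, time')

class DecoderBlock(nn.Module):
    def __init__(self, c_in, c_out, stride, n_blocks=4, n_heads=8):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=c_in, nhead=n_heads,
                                       batch_first=True, norm_first=True)
            for _ in range(n_blocks)])
        # The strided convolution is replaced by its transposed equivalent and
        # moved to the end of the T_m sub-blocks, mirroring the encoder.
        self.upsample = nn.ConvTranspose1d(c_in, c_out, kernel_size=2 * stride,
                                           stride=stride, padding=stride // 2)

    def forward(self, x):                       # x: (batch, channels, time)
        x = x.transpose(1, 2)
        for blk in self.blocks:
            x = blk(x)
        return self.upsample(x.transpose(1, 2))

Stacking blocks of this kind yields the encoder's overall temporal downsampling; the decoder mirrors it, with the FSQ bottleneck sitting between the two.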

Performance Comparison

The MUSHRA subjective test indicates that TAAE achieves state-of-the-art quality, outperforming recently published speech codecs by a significant margin. Importantly, the proposed model scores close to the ground truth. These results point to the potential of scaling transformer-based codec architectures to set new benchmarks in speech quality and compression.

Figure: Comparison with state-of-the-art neural audio codecs.

Audio Examples

Speech Samples (16 kHz)

Systems: Ground Truth, TAAE@0.4kbps, TAAE@0.7kbps, Mimi@0.55kbps, Mimi@1.1kbps, SemantiCodec@0.34kbps, SemantiCodec@0.68kbps, Hubert-HiFi-GAN@0.49kbps.

Samples (one clip per system): 260-123286-0018, 1320-122612-0008, 1995-1837-0024, 4077-13754-0003, 4446-2273-0035, 4970-29093-0019, 5683-32865-0013, 7021-79730-0005.

Multilingual Speech Samples (16 kHz)

Systems: Ground Truth, TAAE@0.7kbps, Mimi@1.1kbps, SemantiCodec@0.68kbps, SpeechTokenizer@1.5kbps.

Samples (one clip per system):
German: 3503_3039_000356, 136_82_000142
French: 123_394_000304, 9804_10527_000675
Polish: 6439_5541_000240, 6892_6779_000015

Causal TAAE Samples (16 kHz)

Systems: Ground Truth, TAAE, Causal TAAE.

Samples (one clip per system): 908-31957-0024, 8463-294828-0031, 61-70970-0024, 2830-3980-0050.