MARS6-turbo

A Small and Robust Hierarchical-Codec TTS Model

Matthew Baas, Pieter Scholtz, Arnav Mehta, Elliott Dyson, Akshat Prakash, Herman Kamper
(IEEE - ICASSP 2025)

Code (coming soon) | Paper | Demo


Abstract

Codec-based text-to-speech (TTS) models have shown impressive quality with zero-shot voice cloning abilities. However, they often struggle with more expressive references or complex text inputs. We present MARS6, a robust encoder-decoder transformer for rapid, expressive TTS. MARS6 is built on recent improvements in spoken language modelling. Using a hierarchical setup for its decoder, MARS6 processes new speech tokens at a rate of only 12 Hz, enabling efficient modelling of long-form text while retaining reconstruction quality. We combine several recent training and inference techniques to reduce repetitive generation and improve output stability and quality. This enables the 70M-parameter MARS6 to achieve performance similar to that of models many times larger. We show this in objective and subjective evaluations, comparing TTS output quality and reference speaker cloning ability.
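
To make the efficiency claim concrete, here is a back-of-the-envelope comparison of autoregressive step counts. Only the 12 Hz figure comes from the abstract; the 50 Hz comparison rate and the 60-second utterance length are illustrative assumptions, not numbers from the paper.

```python
# Rough sequence-length arithmetic for the low-rate global decoder.
# Only the 12 Hz figure is from the abstract; the 50 Hz comparison rate
# and the 60 s duration are illustrative assumptions.
global_rate_hz = 12        # MARS6 global-decoder step rate
flat_rate_hz = 50          # assumed rate of a typical flat codec language model
duration_s = 60            # one minute of long-form speech

print(global_rate_hz * duration_s)  # 720 global autoregressive steps
print(flat_rate_hz * duration_s)    # 3000 steps for the assumed flat baseline
```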


Model Overview

Figure: MARS6 hierarchical TTS architecture.

MARS6 is an encoder-decoder transformer. The encoder converts a speaker embedding and a sequence of text embeddings into latent vectors, which the global decoder attends to via cross-attention. The hierarchical autoregressive decoder has two parts: a global decoder that produces new latent vectors at a low sample rate, and a smaller local decoder that autoregressively decodes each latent vector into a patch of acoustic tokens. The completed patch of acoustic tokens is then folded back through a patch embedding to form the next input vector to the global decoder.
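
As a concrete illustration of this loop, here is a minimal PyTorch-style sketch of the kind of global/local hierarchical generation described above. It is not the released implementation: the module choices (a single transformer decoder layer for the global decoder, a GRU for the local decoder), the patch size, hidden width, codebook size, and greedy decoding without KV caching are all assumptions made for brevity.

```python
# Minimal sketch of a hierarchical global/local decoding loop.
# All module types, sizes, and the patch_size/codebook_size values are
# illustrative assumptions, not the official MARS6 implementation.
import torch
import torch.nn as nn


class HierarchicalDecoder(nn.Module):
    def __init__(self, d_model=512, patch_size=4, codebook_size=1024):
        super().__init__()
        self.patch_size = patch_size  # acoustic tokens per 12 Hz step (assumed)
        # Global decoder: one autoregressive step per low-rate latent vector,
        # cross-attending to the encoder output.
        self.global_decoder = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        # Local decoder: autoregressively emits the acoustic tokens of one patch.
        self.local_decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.token_head = nn.Linear(d_model, codebook_size)
        self.token_emb = nn.Embedding(codebook_size, d_model)
        # Patch embedding: folds a whole patch of tokens back into one global input.
        self.patch_emb = nn.Linear(patch_size * d_model, d_model)

    @torch.no_grad()
    def generate(self, enc_out, n_steps=12):
        """enc_out: encoder latents (B, T_text, d) used via cross-attention."""
        B, _, d = enc_out.shape
        global_in = torch.zeros(B, 1, d)        # start vector for the global decoder
        patches = []
        for _ in range(n_steps):                # one iteration per low-rate (12 Hz) frame
            # 1) Global decoder produces the next low-rate latent vector.
            latent = self.global_decoder(global_in, enc_out)[:, -1:]   # (B, 1, d)
            # 2) Local decoder unrolls that latent into a patch of acoustic tokens.
            tok_in, hidden, toks = latent, None, []
            for _ in range(self.patch_size):
                out, hidden = self.local_decoder(tok_in, hidden)
                tok = self.token_head(out).argmax(-1)                  # greedy, for brevity
                toks.append(tok)
                tok_in = self.token_emb(tok)
            patches.append(torch.cat(toks, dim=1))                     # (B, patch_size)
            # 3) The whole patch is re-embedded as the next global-decoder input.
            patch_vec = self.patch_emb(torch.cat([self.token_emb(t) for t in toks], dim=-1))
            global_in = torch.cat([global_in, patch_vec], dim=1)
        return torch.cat(patches, dim=1)        # (B, n_steps * patch_size) acoustic tokens


if __name__ == "__main__":
    dec = HierarchicalDecoder()
    enc_out = torch.randn(1, 32, 512)                 # stand-in for text+speaker encoder output
    print(dec.generate(enc_out, n_steps=24).shape)    # (1, 96): 2 s at 12 Hz with patch_size=4
```

The design point the sketch highlights is that the expensive outer loop runs only once per low-rate frame, while the cheap local decoder is the only component that runs at the full acoustic-token rate.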

Audio Showcase

Below are reference speech samples paired with MARS6-turbo (70M) outputs (Deep clone vs. Shallow clone) and other baselines (MetaVoice-1B, StyleTTS2, XTTSv2). Each sample includes transcripts for the reference and the target generation text. Samples are selected from the EARS test set to represent a wide variation in references.