PALLE
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
Abstract. Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically require complex designs. In this paper, we introduce a novel pseudo-autoregressive (PAR) codec language modeling approach that unifies AR and NAR modeling. Combining explicit temporal modeling from AR with parallel generation from NAR, PAR generates dynamic-length spans at fixed time steps. Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement. In the first stage, PAR progressively generates speech tokens along the time dimension, with each step predicting all positions in parallel but only retaining the left-most span. In the second stage, low-confidence tokens are iteratively refined in parallel leveraging the global contextual information. Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data, including F5-TTS, E2-TTS, and MaskGCT, on the LibriSpeech test-clean set in terms of speech quality, speaker similarity, and intelligibility, while achieving up to ten times faster inference speed.
Contents
Model Overview
Figure. Illustration of the two-stage PALLE framework. (a) The shared architecture: a bidirectional masked generative transformer. (b) Stage one: the model predicts all token positions in parallel but retains only the leftmost span at each step. (c) Stage two: the model refines the initial speech tokens, where low-confidence tokens are re-masked and regenerated using contextual information.
Zero-Shot Text-to-Speech for Cross-Sentence Task
Samples are from LibriSpeech dataset.
English Text | Speaker Prompt | MaskGCT | E2 TTS | F5-TTS | PALLE (estimated duration) | PALLE (groud-truth duration) |
---|---|---|---|---|---|---|
then she gave rosalie back her magic ring thanking the kind witch for all she had done for them | ||||||
the ideas also remain but they have become types in nature forms of men animals birds fishes | ||||||
the only cheerful conversation was the conversation across the table between naomi and me | ||||||
they informed the english parliament of this unexpected incident and assured them that they had entered into no private treaty with the king | ||||||
the others having been in operation too short a time to show definite results although they also went quickly to a dividend basis | ||||||
she wanted a glance of the new books and periodicals and talk of great philanthropies and reforms | ||||||
all my danger and sufferings were needed to strike a spark of human feeling out of him but now that i am well his nature has resumed its sway | ||||||
they do not go where the enemies of the gospel predominate they go where the christians are | ||||||
you have received us with all that courtesy and hospitality for which your character in england stands so high | ||||||
even so i had just returned from an arduous journey exhausted and badly needing a rest |
Ethics Statement
PALLE is purely a research project. PALLE could synthesize speech that maintains speaker identity and could be used for education, entertainment, journalistic, self-authored content, accessibility features, interactive voice response systems, translation, chat-bot, and so on. While PALLE can speak in a voice like the voice talent, the similarity, and naturalness depend on the length and quality of the speech prompt, the background noise, as well as other factors. It may carry potential risks in the misuse of the model, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model.
This page is for research demonstration purposes only.