
StyleTTS 2 - Closes gap further on TTS quality + Voice generation from samples (alien.top)

submitted 2 years ago by super-helper@alien.top to c/localllama@poweruser.forum


StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS synthesis on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.

Paper: https://arxiv.org/abs/2306.07691

Audio samples: https://styletts2.github.io/


[–] _throawayplop_@alien.top 1 points 2 years ago

As a note, I ran 2 samples through Adobe's Podcast Enhance tool, and it was very effective at removing the slight metallic artifacts.

[–] Lirezh@alien.top 1 points 2 years ago

This would be the future if it supported European languages.

[–] xadiant@alien.top 1 points 2 years ago (1 children)

Goddammit, I just fine-tuned Tortoise with a custom voice. Can't wait for web UIs for StyleTTS. Hope it's easy to fine-tune.

[–] AWAS666@alien.top 1 points 2 years ago (2 children)

Yep it is, takes around 4 hours on a 3090.

[–] xadiant@alien.top 1 points 2 years ago (1 children)

That's acceptable. Did you do a full training run or a fine-tune, though? And how much data did you use?

[–] AWAS666@alien.top 1 points 2 years ago

Fine-tune, with around an hour's worth of data.

[–] Traditional-Ice-5790@alien.top 1 points 2 years ago

How do you fine-tune or do a full training run? I wish there were a step-by-step guide. I've been trying for hours, but I can't figure out what I'm supposed to do. The README doesn't explain much.

[–] FPham@alien.top 1 points 2 years ago

Wow!

[–] yahma@alien.top 1 points 2 years ago (1 children)

How fast is the generation? Can it be used real-time?

[–] AWAS666@alien.top 1 points 2 years ago

Very fast: a real-time factor (RTF) below 0.1, so synthesis takes less than a tenth of the audio's duration, i.e. more than 10x faster than real time.

On CPU, btw.
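For anyone unfamiliar with the term, here is a minimal sketch of how RTF relates to the "10x faster" figure. The function names and the timing numbers are made up for illustration; they are not from StyleTTS 2 or any measurement in this thread.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """How many times faster than playback speed the synthesis runs."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical example: 0.8 s of compute to produce 10 s of speech.
rtf = real_time_factor(processing_seconds=0.8, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}, {speedup_over_real_time(rtf):.1f}x faster than real time")
```

An RTF below 1.0 means audio is produced faster than it plays back, which is the usual baseline requirement for real-time use; an RTF of exactly 0.1 corresponds to a 10x speedup.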