Any alternatives to Coqui for TTS? (alien.top)

submitted 2 years ago by enterguild@alien.top to c/localllama@poweruser.forum

Hey guys,

So TLDR: ElevenLabs / Play.ht are WAY too expensive for a realtime chat app, and we need an alternative. I'm guessing this is why Character.AI is rolling their own voice model, and obviously most apps can't do that, so what are the alternatives here?

I've read that zero-shot prompting for TTS (inserting a voice sample at runtime) is part of the reason ElevenLabs / Play.ht are so expensive, whereas finetuning on individual voices, like Character.AI / OpenAI did, and hosting those as their own models would be way faster and cheaper.

But Coqui seems really slow in our finetune testing, even on an H100, and on top of that it's not really... good. Does anyone know why, or are there alternatives that chat apps are using? Is anyone working on better open source TTS? This seems totally overlooked compared to text, where there's so much competition right now, but it's almost as important. Shocked more people aren't working on this! Thanks
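
To put a number on "slow", a common measure is the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where values below 1.0 mean faster than real time. Here is a minimal sketch, not tied to any particular engine. `synthesize` is a hypothetical callable standing in for whatever TTS model is under test, and it is assumed to return a sequence of raw audio samples.

```python
import time
from typing import Callable, Sequence


def real_time_factor(
    synthesize: Callable[[str], Sequence[float]],  # hypothetical: text -> audio samples
    text: str,
    sample_rate: int,
) -> float:
    """Return synthesis time divided by audio duration (lower is faster)."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start

    audio_seconds = len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesize() returned no audio")
    return elapsed / audio_seconds
```

For a chat app, time-to-first-audio probably matters as much as RTF, so it may be worth timing a short sentence separately from a long paragraph.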

[–] kn4-@alien.top 1 points 2 years ago

StyleTTS 2 sounds pretty good.

Github

Samples

Localllama Post

[–] UnoriginalScreenName@alien.top 1 points 2 years ago

I just started down this rabbit hole and have a lot of questions. If you don't care about real-time inference and just want a high-quality voice clone, what's the best option? I'm looking to do semi-dynamic narration over video.

[–] Kimononono@alien.top 1 points 2 years ago

TortoiseTTS using the ai-voice-cloning repository. I had a 20-minute dataset, 5 minutes of footage, and about an hour of tweaking the hyperparameters, and now I have a voice that sounds pretty damn human. I tried training for a long time, but it just sounds worse after the first few epochs.

[–] a_beautiful_rhind@alien.top 1 points 2 years ago

https://huggingface.co/spaces/Plachta/VALL-E-X

[–] LyPreto@alien.top 1 points 2 years ago

the licensing on this blows but they have a very unique model IMO: StyleTTS

it picks up the appropriate voice/intonation according to the text which i personally haven’t seen being done yet!

[–] Regular_Instruction@alien.top 1 points 2 years ago

StyleTTS2. Also, why not use Coqui TTS with XTTS2?
VITS voices are pretty good too.

[–] altoidsjedi@alien.top 1 points 2 years ago

VITS2 was published recently by the authors of VITS. If I understand correctly, it uses transformers, runs more efficiently than VITS, and is capable of better voices too, given the right dataset. Some folks made an open source implementation of it with the help of the paper's authors. See the GitHub repo

[–] charactr_Pat@alien.top 1 points 2 years ago

not open source but ResembleAI and GemeloAI are good real-time TTS options via API, although not free

[–] harrro@alien.top 1 points 2 years ago (1 children)

Which coqui model did you use? The new xtts2 model is excellent IMO.

[–] PacmanIncarnate@alien.top 1 points 2 years ago

And fast. Not sure they’ll find something better.

[–] Scary-Knowledgable@alien.top 1 points 2 years ago

Nvidia Riva - https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-overview.html

[–] Temsirolimus555@alien.top 1 points 2 years ago

XTTS2 sounds acceptably good to me, even comparable to elevenlabs in some respects.
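
For anyone who wants to try XTTS2 in a chat setting, here is a rough sketch. It assumes the Coqui `TTS` Python package is installed and a CUDA GPU is available, and it uses the model name and `tts_to_file` call shown in Coqui's README. The sentence splitter, the output file naming, and the reference clip path are illustrative choices, not something from this thread. Rendering one sentence at a time lets playback start before the whole reply has been synthesized.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def load_xtts(device: str = "cuda"):
    """Load XTTS v2 once; reuse the returned object for every reply."""
    # Imported lazily so split_sentences() works without the TTS package.
    from TTS.api import TTS

    return TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)


def synthesize_reply(tts, text: str, speaker_wav: str, out_prefix: str = "reply") -> List[str]:
    """Render each sentence to its own WAV file and return the file paths."""
    paths = []
    for i, sentence in enumerate(split_sentences(text)):
        path = f"{out_prefix}_{i}.wav"  # hypothetical naming scheme
        tts.tts_to_file(
            text=sentence,
            speaker_wav=speaker_wav,  # short reference clip of the target voice
            language="en",
            file_path=path,
        )
        paths.append(path)
    return paths
```

Typical use would be `tts = load_xtts()` at startup, then `synthesize_reply(tts, reply_text, "reference.wav")` per message, where `reference.wav` is a placeholder for your own voice sample.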

[–] reallmconnoisseur@alien.top 1 points 2 years ago

I haven't tested it myself yet, but this was shared here a couple of days ago as well: EmotiVoice

[–] LuluViBritannia@alien.top 1 points 2 years ago

Silero TTS is extremely fast, and combined with RVC you can clone any voice from any person/character. It's a bit monotonous, but it's the best available for free imo.

And if you want the best quality, use the 10,000 free characters per month on your ElevenLabs account. Once you run out, switch to Silero TTS. In both cases, plug the audio output into the input of a real-time RVC app.
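
The fallback described above can be sketched as a small router that tracks the free allowance and picks a backend per message. This is only an illustration: the class name, the backend labels, and the choice to count characters are all assumptions, and the actual calls to ElevenLabs, Silero, and RVC are left out.

```python
from dataclasses import dataclass


@dataclass
class QuotaRouter:
    """Pick a TTS backend based on a monthly free allowance."""

    monthly_quota: int = 10_000  # free allowance mentioned in the comment above
    used: int = 0

    def choose(self, text: str) -> str:
        """Return "elevenlabs" while the text fits in the quota, else "silero"."""
        cost = len(text)  # assumption: usage is counted per character
        if self.used + cost <= self.monthly_quota:
            self.used += cost
            return "elevenlabs"
        return "silero"

    def reset(self) -> None:
        """Call at the start of each billing month."""
        self.used = 0
```

Whichever backend is chosen, its audio would then be passed to the RVC step, so the cloned voice stays the same when the router switches over.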