this post was submitted on 14 Nov 2023
1 points (100.0% liked)
LocalLLaMA
3 readers
1 users here now
Community to discuss about Llama, the family of large language models created by Meta AI.
founded 1 year ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
The thing is, as far as I'm aware, "sound generation" is always a separate TTS thing cobbled together, and even "vision" is a separate thing that describes the image for the AI.
This 13b model is probably still state of the art in the vision department for open models, a few crop up now and again but they didn't surprise me much.
https://llava-vl.github.io/
If you need to recognize audio, check Whisper, or Faster-Whisper, or anything developed from that. If you need to generate voice, check Bark, maybe Silero, RVC, etc.
You probably won't find it all wrapped into one neat package like ChatGPT+ right now, but I'd love to be proven wrong.