LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.

Best LVLM and LM designed for sound generation (alien.top)

submitted 2 years ago by platapus100@alien.top to c/localllama@poweruser.forum

I'm pretty new here, so apologies in advance if this request comes off as green.

I'm looking to see what the best options are for running an LVLM (any LLM with visual recognition capabilities, such as accepting an image as input) locally. Bonus points for anything that can also help with video / GIF generation.

Also, are there any LMs (if they exist at all) that work with sound / voice recognition and can be run locally?

Dead_Internet_Theory@alien.top 1 point 2 years ago

The thing is, as far as I'm aware, "sound generation" is always a separate TTS thing cobbled together, and even "vision" is a separate thing that describes the image for the AI.
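
To make that "cobbled together" shape concrete, here is a minimal sketch of how such a setup is typically wired, assuming three separate components: something that describes an image as text, a text-only LLM, and a TTS step. Every name here (`Pipeline`, `fake_captioner`, `fake_llm`, `fake_tts`) is hypothetical, and the stand-in functions only exist so the sketch runs without any model installed; you would swap in real models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Hypothetical glue object: three independent models chained by plain text."""
    caption_image: Callable[[bytes], str]  # vision model: image bytes -> text description
    chat: Callable[[str], str]             # text-only LLM: prompt -> reply
    speak: Callable[[str], bytes]          # TTS model: text -> audio bytes

    def answer(self, image: bytes, question: str) -> bytes:
        # The LLM never sees the image, only the text description of it.
        description = self.caption_image(image)
        prompt = f"Image description: {description}\nQuestion: {question}"
        reply = self.chat(prompt)
        return self.speak(reply)


# Stand-in components (assumptions, not real models).
def fake_captioner(image: bytes) -> str:
    return f"an image of {len(image)} bytes"


def fake_llm(prompt: str) -> str:
    # Echo the last line of the prompt, which is the question line.
    return "Reply to: " + prompt.splitlines()[-1]


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = Pipeline(fake_captioner, fake_llm, fake_tts)
audio = pipeline.answer(b"\x00" * 16, "What is in the picture?")
```

The point of the sketch is only that each stage can be replaced independently, which is why there is no single package to recommend.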

This 13B model (LLaVA) is probably still state of the art in the vision department for open models. A few others crop up now and again, but none of them have surprised me much.
https://llava-vl.github.io/

If you need to recognize audio, check Whisper, or Faster-Whisper, or anything developed from that. If you need to generate voice, check Bark, maybe Silero, RVC, etc.
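
For the recognition side, a rough sketch of local transcription with Faster-Whisper could look like the following. It assumes the `faster-whisper` package is installed; the model size `"small"`, CPU device, and `int8` compute type are just example choices, and `format_segments` / `transcribe` are helper names made up for this sketch. The import is done inside `transcribe` so the formatting helper can be tried without the package.

```python
from types import SimpleNamespace
from typing import Iterable


def format_segments(segments: Iterable) -> str:
    """Join segments (objects with .start, .end, .text) into timestamped lines."""
    return "\n".join(
        f"[{s.start:.2f}s -> {s.end:.2f}s] {s.text.strip()}" for s in segments
    )


def transcribe(path: str, model_size: str = "small") -> str:
    """Transcribe an audio file locally (requires the faster-whisper package)."""
    # Lazy import: only needed when actually transcribing.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path)  # segments is a lazy generator
    return format_segments(segments)


# Dry run of the formatter with a made-up segment, no model required.
demo = [SimpleNamespace(start=0.0, end=1.5, text=" hello ")]
preview = format_segments(demo)
```

Calling `transcribe("some_audio.wav")` (the file name is a placeholder) would download the chosen model on first use and return one timestamped line per segment.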

You probably won't find it all wrapped into one neat package like ChatGPT+ right now, but I'd love to be proven wrong.