this post was submitted on 14 Nov 2023

LocalLLaMA

I'm pretty new here, so apologies ahead of time if I'm coming off green with this request.

I'm looking to see what the best options are for running an LVLM (any LLM with visual recognition capabilities, like being able to supply it an image, etc.) locally. Bonus points for anything that can also help with video / GIF generation.

Also, are there any LMs (if any at all) that work with sound / voice recognition and can be run locally?

Dead_Internet_Theory@alien.top 10 months ago

The thing is, as far as I'm aware, "sound generation" is always a separate TTS thing cobbled together, and even "vision" is a separate thing that describes the image for the AI.
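To make the "cobbled together" point concrete, here's a rough sketch of how that kind of pipeline tends to be wired up. Everything in it is hypothetical: `Pipeline`, `fake_captioner`, `fake_llm`, and `fake_tts` are placeholder names I made up, not calls from any real library. In practice you'd swap them for an actual image captioner, a local LLM, and a TTS engine.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Pipeline:
    """Glue three independent models together (all three are placeholders)."""
    describe_image: Callable[[str], str]  # image path -> text caption
    chat: Callable[[str], str]            # text prompt -> text reply
    speak: Callable[[str], bytes]         # text -> audio bytes

    def ask_about_image(self, image_path: str, question: str) -> Tuple[str, bytes]:
        # In this caption-style setup the LLM only ever sees the caption text.
        caption = self.describe_image(image_path)
        prompt = f"Image description: {caption}\nQuestion: {question}"
        reply = self.chat(prompt)
        # Speech comes from a separate TTS step applied to the reply text.
        return reply, self.speak(reply)


# Hypothetical stand-ins so the sketch runs without any models installed.
def fake_captioner(path: str) -> str:
    return f"a photo loaded from {path}"


def fake_llm(prompt: str) -> str:
    return f"(reply to a prompt of {len(prompt)} characters)"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = Pipeline(fake_captioner, fake_llm, fake_tts)
reply, audio = pipeline.ask_about_image("cat.png", "What animal is this?")
print(reply)
```

The point of keeping the three stages as plain callables is that each one can be replaced on its own, which is roughly what you end up doing when no single local model covers everything.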

This 13B model (LLaVA) is probably still state of the art in the vision department for open models. A few others crop up now and again, but they didn't surprise me much.
https://llava-vl.github.io/

If you need to recognize audio, check Whisper, or Faster-Whisper, or anything developed from that. If you need to generate voice, check Bark, maybe Silero, RVC, etc.
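For the audio recognition side, a minimal sketch with Faster-Whisper could look something like this. It assumes you've installed the `faster-whisper` package; the `"small"` model size and the CPU / int8 settings are just my guesses at a sensible default, and `format_segments` and `transcribe` are helper names I made up for the example.

```python
def format_segments(segments) -> str:
    """Join transcribed segments into timestamped lines."""
    return "\n".join(
        f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text.strip()}"
        for seg in segments
    )


def transcribe(audio_path: str, model_size: str = "small") -> str:
    # Imported lazily so format_segments works without the package installed.
    from faster_whisper import WhisperModel

    # Assumption: CPU with int8 quantization; change device/compute_type for a GPU.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy iterator of segments plus an info object.
    segments, _info = model.transcribe(audio_path)
    return format_segments(segments)
```

Calling `print(transcribe("recording.wav"))` on a local file (the filename is a placeholder) should print one timestamped line per segment. The model weights get downloaded on first use, so the first run will be slower.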

You probably won't find it all wrapped into one neat package like ChatGPT+ right now, but I'd love to be proven wrong.