this post was submitted on 14 Nov 2023

LocalLLaMA


I'm pretty new here, so apologies in advance if this request comes off as green.

I'm looking to see what the best options are for running an LVLM (any LLM with visual recognition capabilities, like being able to take an image as input) locally. Bonus points for anything that can also help with video / GIF generation.

Also, are there any LMs (if they exist at all) that work with sound / voice recognition and can be run locally?

top 1 comments
[–] Dead_Internet_Theory@alien.top 1 points 11 months ago

The thing is, as far as I'm aware, "sound generation" is always a separate TTS thing cobbled together, and even "vision" is a separate thing that describes the image for the AI.
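To make the "separate thing that describes the image" idea concrete, here's a rough sketch of that kind of glue code. Everything in it is hypothetical: `answer_with_image`, `fake_captioner` and `fake_llm` are made-up names, and the two fake functions are stand-ins for whatever captioning model and text LLM you actually run. Some models wire the vision part in differently than a plain text hand-off, so treat this as an illustration of the cobbled-together approach only.

```python
from typing import Callable, Optional


def answer_with_image(
    question: str,
    image_path: Optional[str],
    describe_image: Callable[[str], str],
    ask_llm: Callable[[str], str],
) -> str:
    """Caption the image first, then pass caption + question to a text-only LLM."""
    prompt = question
    if image_path is not None:
        # The "vision" step is a separate component that only produces text.
        caption = describe_image(image_path)
        prompt = f"Image description: {caption}\n\nQuestion: {question}"
    return ask_llm(prompt)


# Hypothetical stand-ins so the sketch runs without any model installed.
def fake_captioner(path: str) -> str:
    return f"a placeholder caption for {path}"


def fake_llm(prompt: str) -> str:
    return f"(model reply to a {len(prompt)}-character prompt)"


if __name__ == "__main__":
    print(answer_with_image("What is in the picture?", "cat.png", fake_captioner, fake_llm))
```

Swapping in real components just means passing different callables; the text LLM never sees the image itself in this setup.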

This 13B model (LLaVA) is probably still state of the art in the vision department for open models. A few others crop up now and again, but they haven't surprised me much.
https://llava-vl.github.io/

If you need to recognize audio, check Whisper, or Faster-Whisper, or anything developed from that. If you need to generate voice, check Bark, maybe Silero, RVC, etc.
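For the audio recognition part, a minimal sketch with the `openai-whisper` Python package looks roughly like this. It assumes you've done `pip install openai-whisper` and have ffmpeg on your PATH. The `transcribe` wrapper and its `model` parameter are my own additions (the parameter is only there so a preloaded or fake model can be passed in); `whisper.load_model` and `model.transcribe` are the package's calls.

```python
def transcribe(audio_path, model=None, model_name="base"):
    """Transcribe an audio file locally and return the recognized text."""
    if model is None:
        # Imported lazily so this file still loads without whisper installed.
        import whisper

        # "base" is one of the smaller checkpoints; larger ones are slower.
        model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    # The result is a dict; the full transcript lives under "text".
    return result["text"].strip()
```

Usage would be something like `print(transcribe("recording.mp3"))`, where `recording.mp3` is a placeholder for your own file. The first call downloads the model weights, so expect a wait.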

You probably won't find it all wrapped into one neat package like ChatGPT+ right now, but I'd love to be proven wrong.