LocalLLaMA

A community for discussing Llama, the family of large language models created by Meta AI.

Is there any ongoing effort to "bake in" vision capabilities on top of base models or fine-tunes? (alien.top)

submitted 2 years ago by LyPreto@alien.top to c/localllama@poweruser.forum

I've been thinking about this for a while. Does anyone know how feasible this is? Basically, just applying some sort of "LoRA" on top of models to give them vision capabilities, making them multimodal.

mcmoose1900@alien.top 1 point 2 years ago

There's more than one image ingestion model already. Several for Llama/Mistral.

If you are talking about generating images, I dunno about that. Some people hook up LLMs to prompt Stable Diffusion, but that's not really the same thing.

Glat0s@alien.top 1 point 2 years ago

https://www.reddit.com/r/LocalLLaMA/search/?q=LLaVA

https://www.reddit.com/r/LocalLLaMA/search/?q=vision
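
For anyone skimming: the LLaVA-style approach those searches turn up is roughly "keep a pretrained vision encoder, learn a small projection into the LLM's embedding space, and feed the projected image features in as extra tokens". Below is a toy, pure-Python sketch of just that data flow. The dimensions, the random "weights", and the helper names (`make_matrix`, `project`, `build_multimodal_input`) are all made up for illustration; this is not LLaVA's actual code, and a real setup would train the projector (and possibly a LoRA on the LLM) with a tensor library.

```python
import random

def make_matrix(rows, cols, rng):
    """Random rows x cols matrix as nested lists (stand-in for real weights/features)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def project(features, weight):
    """Map each feature vector (length d_in) to length d_out via weight (d_in x d_out)."""
    d_out = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(d_out)]
        for f in features
    ]

def build_multimodal_input(image_features, text_embeddings, projector):
    """Prepend projected image 'tokens' to the text token embeddings."""
    image_tokens = project(image_features, projector)
    return image_tokens + text_embeddings

rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 16          # toy sizes (assumption); real models are far larger
NUM_PATCHES, NUM_TEXT_TOKENS = 4, 3  # toy counts (assumption)

# Stand-in for the output of a frozen vision encoder: one vector per image patch.
image_features = make_matrix(NUM_PATCHES, VISION_DIM, rng)
# Stand-in for the LLM's own token embeddings of the text prompt.
text_embeddings = make_matrix(NUM_TEXT_TOKENS, LLM_DIM, rng)
# The projector is the only new component in this sketch.
projector = make_matrix(VISION_DIM, LLM_DIM, rng)

sequence = build_multimodal_input(image_features, text_embeddings, projector)
print(len(sequence), len(sequence[0]))  # 7 16
```

The point is that the LLM only ever sees a longer sequence of embedding-sized vectors, so the base model's architecture does not have to change; the new piece is the small projection between the two models.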