this post was submitted on 22 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

top 17 comments
[–] M0ULINIER@alien.top 1 points 10 months ago (2 children)
[–] lakolda@alien.top 1 points 10 months ago

This isn't comparing with the 13B version of LLaVA. I'd be curious to see that.

[–] justletmefuckinggo@alien.top 1 points 10 months ago

I'm new here, but is this true multimodality, or is it the LLM communicating with a vision model?

And what exactly are those 4 models being benchmarked on here?

[–] yahma@alien.top 1 points 10 months ago (1 children)

Would love to use this for handling remote security camera footage.

Tried LLaVA with little success. Has anyone successfully applied any of the open vision models to the problem of security?

[–] fallingdowndizzyvr@alien.top 1 points 10 months ago

I just think you have to set proper expectations. I use LLaVA with my security cameras and it does what I want, which is to know when something interesting is happening, like when it sees someone. LLaVA gave me this from one of my security cameras earlier this morning:

The image features a person walking on a street, captured through a fisheye lens, which distorts the perspective of the scene. The person appears to be carrying a bag, possibly a backpack, while walking down the sidewalk.

Which IMO is very useful.
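
For anyone who wants to try something similar, below is a minimal sketch of captioning a camera snapshot with a local LLaVA checkpoint through Hugging Face transformers. The model ID, prompt template, and file path are assumptions for illustration, not details from the comment above.

```python
# Minimal sketch: caption a security-camera snapshot with a local LLaVA model.
# The checkpoint ID and "snapshot.jpg" path are assumptions; adapt to your setup.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed LLaVA-1.5 7B checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("snapshot.jpg")  # hypothetical frame grabbed from a camera
prompt = "USER: <image>\nDescribe anything noteworthy in this frame. ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=120)
print(processor.decode(output[0], skip_special_tokens=True))
```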

[–] metalman123@alien.top 1 points 10 months ago

This style of captioning could be amazing for text-to-image datasets, and I wouldn't be surprised to see them take a jump in quality as well.

[–] 9wR8xO@alien.top 1 points 10 months ago

Okay, what front-end can I use to run these types of multimodal models?

[–] StraightChemistry629@alien.top 1 points 10 months ago

This looks good. Imagine this thing quantized. Pretty please u/The-Bloke make it possible.
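
Until dedicated quants appear, one stopgap is loading the weights in 4-bit on the fly with bitsandbytes, sketched below. Whether ShareGPT4V-7B loads through the stock LLaVA classes in transformers is an assumption here, so treat the model ID and class as placeholders.

```python
# Sketch: on-the-fly 4-bit loading with bitsandbytes.
# The model ID and LlavaForConditionalGeneration are assumptions; a checkpoint
# that follows the original LLaVA-1.5 code may need its own loading path.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "llava-hf/llava-1.5-7b-hf"  # placeholder until the ShareGPT4V weights are supported
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```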

[–] pseudonerv@alien.top 1 points 10 months ago

Ha, they used data generated by GPT-4V. It's not a surprise that it got better than LLaVA 7B and is comparable to, or slightly better than, LLaVA 13B.

No innovation needed otherwise!

The ShareGPT4V-7B model follows the design of LLaVA-1.5 [30], including three integral components: (1) A vision encoder utilizing the CLIP-Large model [45], with a resolution of 336×336 and a patch size of 14, converting input images into 576 tokens. (2) A projector, which is a two-layer multi-layer perceptron (MLP), is introduced to connect the vision and language modalities. (3) An LLM, based on the open-source Vicuna-v1.5 [8], derived from LLaMA-2 [53].
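
To make those numbers concrete, here is a rough sketch of where the 576 tokens come from and what a two-layer MLP projector looks like. The hidden sizes (1024 for CLIP-Large, 4096 for a 7B Vicuna/LLaMA-2) are the usual values for those models, not figures quoted above.

```python
# Sketch of the LLaVA-1.5 / ShareGPT4V-7B plumbing described in the quote.
# Hidden sizes are assumptions based on standard CLIP-Large and 7B LLaMA configs.
import torch
import torch.nn as nn

resolution, patch_size = 336, 14
num_image_tokens = (resolution // patch_size) ** 2  # 24 x 24 = 576 tokens per image

clip_hidden, llm_hidden = 1024, 4096

# Two-layer MLP projector mapping vision features into the LLM embedding space.
projector = nn.Sequential(
    nn.Linear(clip_hidden, llm_hidden),
    nn.GELU(),
    nn.Linear(llm_hidden, llm_hidden),
)

vision_features = torch.randn(1, num_image_tokens, clip_hidden)  # dummy CLIP output
image_embeds = projector(vision_features)
print(image_embeds.shape)  # torch.Size([1, 576, 4096])
```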

[–] LoSboccacc@alien.top 1 points 10 months ago

hope, test, wait

...the cycle continues

[–] GeraltOfRiga@alien.top 1 points 10 months ago

This is kinda nuts (first time I've tried an LLM + vision).

Tried it with a first-person shooter screenshot with an enemy on screen. Asked it to give me the 2D coordinates of the enemy, and it did, precisely.

[–] Lup0Grigi0@alien.top 1 points 10 months ago (2 children)

Has anyone gotten this model working with oobabooga? If so, what loader did you use?

[–] Then_Command_5222@alien.top 1 points 10 months ago

Oh please do tell!

[–] beans_fotos_@alien.top 1 points 10 months ago

"If so, what loader did you use?"

Tried and did not succeed... waiting on more help to be available. I have a HORSE of a system and would love to try running this locally!

[–] durden111111@alien.top 1 points 10 months ago (1 children)

Hopefully we get GGUFs soon
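
If and when GGUF conversions show up, one way to run a LLaVA-1.5-style GGUF locally is llama-cpp-python's LLaVA chat handler, sketched below. The file names and image URL are placeholders, and whether ShareGPT4V converts cleanly is an open question.

```python
# Sketch: running a LLaVA-1.5-style GGUF with llama-cpp-python.
# Paths and the image URL are placeholders; assumes a converted model plus mmproj file.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="sharegpt4v-7b.Q4_K_M.gguf",  # hypothetical quant name
    chat_handler=chat_handler,
    n_ctx=4096,       # leave room for the 576 image tokens
    logits_all=True,  # required by the LLaVA chat handler
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/snapshot.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])
```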

[–] Cradawx@alien.top 1 points 10 months ago (1 children)
[–] durden111111@alien.top 1 points 10 months ago

Nice. From my tests it seems to be about the same as LLaVA v1.5 13B and BakLLaVA. I'm starting to suspect that the CLIP-Large model all of these multimodal LLMs are using is holding them back.