this post was submitted on 22 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

top 17 comments
[–] M0ULINIER@alien.top 1 points 10 months ago (2 children)
[–] lakolda@alien.top 1 points 10 months ago

This isn't comparing with the 13B version of LLaVA. I'd be curious to see that.

[–] justletmefuckinggo@alien.top 1 points 10 months ago

I'm new here, but is this true multimodality, or is it the LLM communicating with a vision model?

And what exactly are those 4 models being benchmarked on here?

[–] yahma@alien.top 1 points 10 months ago (1 children)

Would love to use this for handling remote security camera footage.

Tried LLaVA with little success. Has anyone successfully applied any of the open vision models to the problem of security?

[–] fallingdowndizzyvr@alien.top 1 points 10 months ago

I just think you have to set proper expectations. I use LLaVA with my security cameras and it does what I want, which is to know when something interesting is happening, like when it sees someone. LLaVA gave me this from one of my security cameras earlier this morning:

The image features a person walking on a street, captured through a fisheye lens, which distorts the perspective of the scene. The person appears to be carrying a bag, possibly a backpack, while walking down the sidewalk.

Which IMO is very useful.
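
For anyone who wants to try something similar, below is a minimal sketch of captioning a camera snapshot with a local LLaVA checkpoint through Hugging Face transformers. The model ID, prompt template, and file path are assumptions for illustration, not details from the comment above.

```python
# Minimal sketch: caption a security-camera snapshot with a local LLaVA model.
# The checkpoint ID and "snapshot.jpg" path are assumptions; adapt to your setup.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed LLaVA-1.5 7B checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("snapshot.jpg")  # hypothetical frame grabbed from a camera
prompt = "USER: <image>\nDescribe anything noteworthy in this frame. ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=120)
print(processor.decode(output[0], skip_special_tokens=True))
```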

[–] metalman123@alien.top 1 points 10 months ago

This style of captioning could be amazing for text-to-image datasets, and I wouldn't be surprised to see them take a jump in quality as well.

[–] 9wR8xO@alien.top 1 points 10 months ago

Okay, what front-end can I use to run these types of multimodal models?

[–] StraightChemistry629@alien.top 1 points 10 months ago

This looks good. Imagine this thing quantized. Pretty please u/The-Bloke make it possible.
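
Until dedicated quants appear, one stopgap is loading the weights in 4-bit on the fly with bitsandbytes, sketched below. Whether ShareGPT4V-7B loads through the stock LLaVA classes in transformers is an assumption here, so treat the model ID and class as placeholders.

```python
# Sketch: on-the-fly 4-bit loading with bitsandbytes.
# The model ID and LlavaForConditionalGeneration are assumptions; a checkpoint
# that follows the original LLaVA-1.5 code may need its own loading path.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "llava-hf/llava-1.5-7b-hf"  # placeholder until the ShareGPT4V weights are supported
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```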

[–] pseudonerv@alien.top 1 points 10 months ago

Ha, they used data generated by GPT-4V. It's not a surprise that it got better than LLaVA 7B and is comparable to, or slightly better than, LLaVA 13B.

No innovation needed otherwise!

The ShareGPT4V-7B model follows the design of LLaVA-1.5 [30], including three integral components: (1) A vision encoder utilizing the CLIP-Large model [45], with a resolution of 336×336 and a patch size of 14, converting input images into 576 tokens. (2) A projector, which is a two-layer multi-layer perceptron (MLP), is introduced to connect the vision and language modalities. (3) An LLM, based on the open-source Vicuna-v1.5 [8], derived from LLaMA-2 [53].
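
To make those numbers concrete, here is a rough sketch of where the 576 tokens come from and what a two-layer MLP projector looks like. The hidden sizes (1024 for CLIP-Large, 4096 for a 7B Vicuna/LLaMA-2) are the usual values for those models, not figures quoted above.

```python
# Sketch of the LLaVA-1.5 / ShareGPT4V-7B plumbing described in the quote.
# Hidden sizes are assumptions based on standard CLIP-Large and 7B LLaMA configs.
import torch
import torch.nn as nn

resolution, patch_size = 336, 14
num_image_tokens = (resolution // patch_size) ** 2  # 24 x 24 = 576 tokens per image

clip_hidden, llm_hidden = 1024, 4096

# Two-layer MLP projector mapping vision features into the LLM embedding space.
projector = nn.Sequential(
    nn.Linear(clip_hidden, llm_hidden),
    nn.GELU(),
    nn.Linear(llm_hidden, llm_hidden),
)

vision_features = torch.randn(1, num_image_tokens, clip_hidden)  # dummy CLIP output
image_embeds = projector(vision_features)
print(image_embeds.shape)  # torch.Size([1, 576, 4096])
```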

[–] LoSboccacc@alien.top 1 points 10 months ago

hope, test, wait

...the cycle continues

[–] GeraltOfRiga@alien.top 1 points 10 months ago

This is kinda nuts (first time I've tried an LLM + vision).

Tried it with a first-person shooter screenshot with an enemy on screen. Asked it to give me the 2D coordinates of the enemy, and it did, precisely.

[–] Lup0Grigi0@alien.top 1 points 10 months ago (2 children)

Has anyone gotten this model working with oobabooga? If so, what loader did you use?

[–] Then_Command_5222@alien.top 1 points 10 months ago

Oh please do tell!

[–] beans_fotos_@alien.top 1 points 10 months ago

"If so, what loader did you use?"

Tried and did not succeed... waiting on more help to be available. I have a HORSE of a system and would love to try running this locally!

[–] durden111111@alien.top 1 points 10 months ago (1 children)

Hopefully we get GGUFs soon
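
If and when GGUF conversions show up, one way to run a LLaVA-1.5-style GGUF locally is llama-cpp-python's LLaVA chat handler, sketched below. The file names and image URL are placeholders, and whether ShareGPT4V converts cleanly is an open question.

```python
# Sketch: running a LLaVA-1.5-style GGUF with llama-cpp-python.
# Paths and the image URL are placeholders; assumes a converted model plus mmproj file.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="sharegpt4v-7b.Q4_K_M.gguf",  # hypothetical quant name
    chat_handler=chat_handler,
    n_ctx=4096,       # leave room for the 576 image tokens
    logits_all=True,  # required by the LLaVA chat handler
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/snapshot.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])
```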

[–] Cradawx@alien.top 1 points 10 months ago (1 children)
[–] durden111111@alien.top 1 points 10 months ago

Nice. From my tests it seems to be about the same as LLaVA v1.5 13B and BakLLaVA. I'm starting to suspect that the CLIP-Large model all of these multimodal LLMs are using is holding them back.