this post was submitted on 25 Nov 2023

Hello,

I'm looking for an alternative to Google Vision AI (LABEL_DETECTION, OBJECT_LOCALIZATION) and Amazon Rekognition (DetectLabels).
Any ideas?

Thanks!

top 6 comments
[–] Specialist_Ice_5715@alien.top 1 points 10 months ago (1 children)

You'll have to go multi-modal. The best option right now is Fuyu, but it's not commercially usable.

[–] takezo07@alien.top 1 points 10 months ago (1 children)

I found BLIP: https://replicate.com/salesforce/blip?input=form&output=preview
But that's not exactly what I'm looking for. It does image captioning very well,
like in their example: "a woman sitting on the beach with a dog".
But I need a list of objects and "things", like: dog, woman, beach, wave, shirt, etc.

[–] Specialist_Ice_5715@alien.top 1 points 10 months ago

Interesting. Is BLIP commercially usable? I read that it is, but does that apply to the weights in their entirety?

[–] hurrytewer@alien.top 1 points 10 months ago

CogVLM is supposed to support this with prompts like "Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?"

However, I couldn't get it to work properly; it would just hallucinate.
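For anyone trying this anyway, here is a minimal sketch of turning that kind of answer into a label list with boxes. It assumes the model writes each object word immediately followed by its `[[x0,y0,x1,y1]]` coordinates. The function name `parse_grounded_objects` and the sample answer are made up for illustration; they are not part of CogVLM. Boxes with zero or negative width or height are dropped as a cheap sanity check, which will not catch every hallucination.

```python
import re

# Assumed output shape: "<word> [[x0,y0,x1,y1]]", with integer coordinates.
BOX_PATTERN = re.compile(
    r"(\w+)\s*\[\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\]"
)


def parse_grounded_objects(answer: str) -> list[dict]:
    """Extract (label, box) pairs from a grounded description.

    The label is simply the single word directly before each box,
    so multi-word objects are reduced to their last word.
    """
    objects = []
    for match in BOX_PATTERN.finditer(answer):
        label = match.group(1).lower()
        x0, y0, x1, y1 = (int(v) for v in match.groups()[1:])
        # Skip degenerate boxes: a box must have positive width and height.
        if x1 <= x0 or y1 <= y0:
            continue
        objects.append({"label": label, "box": (x0, y0, x1, y1)})
    return objects


if __name__ == "__main__":
    # Hypothetical model answer, not real CogVLM output.
    sample = (
        "A woman [[120,80,400,900]] sits on the beach "
        "with a dog [[450,500,700,880]] near a wave [[300,300,100,350]]."
    )
    for obj in parse_grounded_objects(sample):
        print(obj["label"], obj["box"])
```

The labels alone (`[o["label"] for o in parse_grounded_objects(sample)]`) give the flat "dog, woman, beach" style list asked for in the original post, as long as the model's answer follows the assumed format.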

If you want to give it a shot, here are the official visual QA prompts

[–] Scary-Knowledgable@alien.top 1 points 10 months ago
[–] takezo07@alien.top 1 points 9 months ago

I'm surprised there aren't more options,
given that there are LLMs for almost everything!