Something like https://huggingface.co/spaces/Lin-Chen/ShareGPT4V-7B but that understands audio instead.
Thanks!
This is kinda nuts (first time I've tried an LLM + vision).
Tried it with a first-person shooter screenshot with an enemy on screen. Asked it to give me the 2D coordinates of the enemy and it did, precisely.