I found out about this model while browsing the LLaMA-Adapter repo; it was released a few days ago.
- Model page
- Weights (40 GB)
- Paper
- Demo
It seems able to handle a range of image tasks, such as object detection with bounding boxes and text extraction. On benchmarks it reports slightly lower numbers than CogVLM, so I tested how well it can reason compared to CogVLM. I consistently got good results with SPHINX, even at higher temperatures, while CogVLM missed the point regardless of configuration:
- CogVLM
- SPHINX
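For context on the temperature remark: a minimal sketch of what raising the temperature does to the sampling distribution (plain softmax math, no SPHINX-specific code; the logit values are made up for illustration):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw next-token logits into sampling probabilities at a given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]  # toy scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# Low temperature concentrates probability mass on the top token;
# higher temperature flattens the distribution, so sampling explores more.
```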
Thanks a lot for converting and quantizing these. I have a couple of questions:

1. How does it compare to ALMA (13B)?
2. Is it capable of translating more than one sentence at a time?
3. Is there a way to specify the source language, or does it always detect it on its own? (See the sketch below for the kind of prompt I have in mind.)
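On (2) and (3), assuming these are GGUF files run with llama-cpp-python (my assumption, as is the file name and the prompt template below; check the model card for the expected format): a minimal sketch of passing several sentences in one request while naming the source language explicitly in the prompt:

```python
from llama_cpp import Llama

# Hypothetical path to one of the quantized files; adjust to your download.
llm = Llama(model_path="./translation-model.Q4_K_M.gguf", n_ctx=2048)

# Two sentences in a single request, with the source language stated up front
# (the "Translate this from X to Y" template is a guess, not a confirmed format).
prompt = (
    "Translate this from German to English:\n"
    "German: Der Zug hat Verspätung. Wir nehmen stattdessen den Bus.\n"
    "English:"
)

out = llm(prompt, max_tokens=128, temperature=0.0, stop=["\n"])
print(out["choices"][0]["text"].strip())
```

If the model was trained to auto-detect the source language, the language name in the prompt may simply be ignored; that's part of what I'm asking.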