I'm new here, but is this true multimodality, or is it the LLM communicating with a vision model?
And what exactly are those 4 models being benchmarked for here?