Is there a reason you aren't just using a CLIP-based model, and why you're trying to use OpenAI's GPT instead?
I'm also in charge of a text-image model (text-image, not multimodal, in my case) that my company is trying to build a search product with. Higher-ups have talked about using "ChatGPT", but I just don't see why we'd have to. I figured a simple NER model or something similar would work just as well. I mean, how many people shop online expecting textual responses from the website?