As I was asking above, I've been looking at the Fuyu-8B model, and I've been able to break it down to:
- the model takes in text the regular way: text -> tokens -> embeddings
- it also takes in images: image -> patches -> embeddings
- it has a vanilla decoder, so only text comes out; they add special tokens around the image patches, so I'm assuming the decoder just ignores images on the output side (rough sketch of how I picture this below)
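Here's my mental model of that flow in PyTorch, just so it's clear what I think is happening (all the names and sizes below are made up by me for illustration, not pulled from the actual Fuyu code):

```python
import torch
import torch.nn as nn

# toy sizes I made up for illustration; not Fuyu's real config
vocab_size, hidden_size = 1000, 64
patch_dim = 30 * 30 * 3          # one flattened 30x30 RGB patch

text_embed = nn.Embedding(vocab_size, hidden_size)  # text -> tokens -> embeddings
patch_proj = nn.Linear(patch_dim, hidden_size)      # image patches -> embeddings

token_ids = torch.randint(0, vocab_size, (1, 10))   # pretend tokenizer output
patches = torch.randn(1, 16, patch_dim)             # pretend 16 flattened patches

# both modalities end up as the same kind of embedding, concatenated into
# one sequence that the vanilla decoder-only transformer then runs over
seq = torch.cat([text_embed(token_ids), patch_proj(patches)], dim=1)
print(seq.shape)  # torch.Size([1, 26, 64])
```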
So, from what I know, nn.Linear just projects a tensor to an embedding of whatever size you choose. I'm not really sure about everything else, though.
- Since the linear layer just makes embeddings, does the image encoder even need training?
- nn.Linear takes tensors as input, and they split an image into patches, so I'm assuming those patches are made into tensors. How do you turn an image into a tensor? A code snippet of image -> embedding -> image would be nice if possible (I've put my attempt at the first half after this list)
- While Fuyu does not output images, wouldn't the model's hidden states be making image or image-like embeddings? Could you generate images if you had an image decoder?
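For reference, here's as far as I got on the image -> tensor -> embedding half myself. The patching is my own guess at what a Fuyu-style pipeline looks like, and `cat.png`, the 30x30 patch size, and the embedding size are all just placeholders I picked:

```python
import torch
import torch.nn as nn
from torchvision.io import read_image  # image file -> uint8 tensor [C, H, W]

patch = 30
img = read_image("cat.png").float() / 255.0     # [3, H, W], values in [0, 1]
c, h, w = img.shape
img = img[:, : h - h % patch, : w - w % patch]  # crop so H and W divide evenly

# split into non-overlapping patch x patch squares, then flatten each one
patches = (
    img.unfold(1, patch, patch)                 # [3, H/p, W, p]
       .unfold(2, patch, patch)                 # [3, H/p, W/p, p, p]
       .permute(1, 2, 0, 3, 4)                  # [H/p, W/p, 3, p, p]
       .reshape(-1, 3 * patch * patch)          # [num_patches, 2700]
)

proj = nn.Linear(3 * patch * patch, 64)         # 64 is an arbitrary embed size
embeddings = proj(patches)                      # [num_patches, 64]
print(embeddings.shape)
```

The embedding -> image direction is the part I can't work out, which is basically my last question above.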