this post was submitted on 09 Nov 2023

Machine Learning

As I was asking above, I've been looking at the Fuyu-8B model, and I've been able to break it down to the following:

  • the model takes in text the regular way: text -> tokens -> embeddings
  • it also takes images directly to embeddings: image -> embeddings
  • it has a vanilla decoder, so only text comes out; they add special tokens around images, so I'm assuming the decoder ignores the image positions in the output (see the sketch below)
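
For what it's worth, here's a minimal sketch of that layout in PyTorch. All of the names and sizes are illustrative assumptions on my part, not Fuyu's actual code:

```python
import torch
import torch.nn as nn

# Illustrative sizes only -- not Fuyu's real config
vocab_size = 32000        # assumed vocabulary size
hidden_size = 4096        # decoder embedding width
patch_dim = 30 * 30 * 3   # one flattened 30x30 RGB patch

text_embed = nn.Embedding(vocab_size, hidden_size)  # text -> tokens -> embeddings
patch_proj = nn.Linear(patch_dim, hidden_size)      # image patches -> embeddings

def build_decoder_input(token_ids, image_patches):
    """token_ids: (seq_len,) int64; image_patches: (num_patches, patch_dim) float."""
    text_emb = text_embed(token_ids)       # (seq_len, hidden_size)
    img_emb = patch_proj(image_patches)    # (num_patches, hidden_size)
    # Fuyu splices patch embeddings into the text sequence at positions marked
    # by special image tokens; plain concatenation stands in for that here.
    return torch.cat([img_emb, text_emb], dim=0)
```

The point being: both modalities end up as rows of width hidden_size that the decoder treats identically.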

So, from what I know, nn.Linear takes in a tensor and produces embeddings of whatever size you choose. I'm not really sure about everything else, though.
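
For example, a quick shape check (the sizes are made up; 2700 = 30 x 30 x 3, one flattened RGB patch):

```python
import torch
import torch.nn as nn

layer = nn.Linear(2700, 4096)   # in_features -> out_features of your choosing
x = torch.randn(10, 2700)       # 10 flattened 30x30 RGB patches
print(layer(x).shape)           # torch.Size([10, 4096])
```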

  • Since the linear layer just makes embeddings, does something like this even need training for the image encoder?
  • nn.Linear takes tensors as input, and they split an image into patches, so I'm assuming those patches are made into tensors. How do you turn an image into a tensor? A code snippet of image -> embedding -> image would be nice if possible (see the sketch after this list)
  • While Fuyu does not output images, wouldn't the model's hidden states be making image or image-like embeddings? Could you generate images if you had an image decoder?
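
On the second bullet, here's a hedged sketch of the image -> tensor -> embedding half using PIL and torchvision (the file name, patch size, and output width are all assumptions); the embedding -> image direction would additionally need a trained image decoder, which Fuyu doesn't have:

```python
import torch.nn as nn
from PIL import Image
from torchvision import transforms

# 1. Image -> tensor: (3, H, W) floats in [0, 1]
image = Image.open("example.jpg").convert("RGB")   # hypothetical file
pixels = transforms.ToTensor()(image)

# 2. Tensor -> non-overlapping flattened patches
patch = 30                                         # Fuyu-style 30x30 patches
c, h, w = pixels.shape
pixels = pixels[:, : h - h % patch, : w - w % patch]  # crop to a patch multiple
patches = pixels.unfold(1, patch, patch).unfold(2, patch, patch)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)

# 3. Patches -> embeddings via a single linear projection
proj = nn.Linear(c * patch * patch, 4096)          # 4096 is illustrative
embeddings = proj(patches)                         # (num_patches, 4096)
```

And on the first bullet: proj starts with random weights, so these embeddings are meaningless until the projection is trained jointly with the rest of the model.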
1 comment
sshh12@alien.top 1 point 10 months ago

Hey! I wrote a blog post recently on how these types of vision LLMs work: https://blog.sshh.io/p/large-multimodal-models-lmms

It focuses specifically on LLaVA, but it's generally the same high-level idea.
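
For contrast with Fuyu's raw-patch projection, a rough sketch of the LLaVA-style design that post covers: a pretrained vision encoder whose features get projected into the LLM's embedding space. Every name and size here is illustrative:

```python
import torch.nn as nn

class VisionProjector(nn.Module):
    """LLaVA-style bridge: pretrained vision encoder -> LLM embedding space."""
    def __init__(self, vision_encoder, vision_dim=1024, hidden_size=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a frozen CLIP ViT
        self.proj = nn.Linear(vision_dim, hidden_size)

    def forward(self, pixel_values):
        feats = self.vision_encoder(pixel_values)  # (batch, n_tokens, vision_dim)
        return self.proj(feats)                    # (batch, n_tokens, hidden_size)
```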