this post was submitted on 09 Nov 2023

Machine Learning

As I was asking above, I've been looking at the Fuyu-8B model, and I've been able to break it down to the following:

  • the model takes in text the regular way: text -> tokens -> embeddings
  • it also takes images directly to embeddings: image -> embeddings
  • it has a vanilla decoder, so only text comes out; they add special tokens around images, so I'm assuming the decoder ignores the image positions in the output (see the sketch below)
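
For what it's worth, here's a minimal sketch of that layout in PyTorch. All of the names and sizes are illustrative assumptions on my part, not Fuyu's actual code:

```python
import torch
import torch.nn as nn

# Illustrative sizes only -- not Fuyu's real config
vocab_size = 32000        # assumed vocabulary size
hidden_size = 4096        # decoder embedding width
patch_dim = 30 * 30 * 3   # one flattened 30x30 RGB patch

text_embed = nn.Embedding(vocab_size, hidden_size)  # text -> tokens -> embeddings
patch_proj = nn.Linear(patch_dim, hidden_size)      # image patches -> embeddings

def build_decoder_input(token_ids, image_patches):
    """token_ids: (seq_len,) int64; image_patches: (num_patches, patch_dim) float."""
    text_emb = text_embed(token_ids)       # (seq_len, hidden_size)
    img_emb = patch_proj(image_patches)    # (num_patches, hidden_size)
    # Fuyu splices patch embeddings into the text sequence at positions marked
    # by special image tokens; plain concatenation stands in for that here.
    return torch.cat([img_emb, text_emb], dim=0)
```

The point being: both modalities end up as rows of width hidden_size that the decoder treats identically.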

So, from what I know, nn.Linear takes in a tensor and produces embeddings of whatever size you choose. I'm not really sure about everything else, though.
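
For example, a quick shape check (the sizes are made up; 2700 = 30 x 30 x 3, one flattened RGB patch):

```python
import torch
import torch.nn as nn

layer = nn.Linear(2700, 4096)   # in_features -> out_features of your choosing
x = torch.randn(10, 2700)       # 10 flattened 30x30 RGB patches
print(layer(x).shape)           # torch.Size([10, 4096])
```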

  • Since the linear layer just makes embeddings, does something like this even need training for the image encoder?
  • nn.Linear takes tensors as input, and they split an image into patches, so I'm assuming those patches are made into tensors. How do you turn an image into a tensor? A code snippet of image -> embedding -> image would be nice if possible (see the sketch after this list)
  • While Fuyu does not output images, wouldn't the model's hidden states be making image or image-like embeddings? Could you generate images if you had an image decoder?
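
On the second bullet, here's a hedged sketch of the image -> tensor -> embedding half using PIL and torchvision (the file name, patch size, and output width are all assumptions); the embedding -> image direction would additionally need a trained image decoder, which Fuyu doesn't have:

```python
import torch.nn as nn
from PIL import Image
from torchvision import transforms

# 1. Image -> tensor: (3, H, W) floats in [0, 1]
image = Image.open("example.jpg").convert("RGB")   # hypothetical file
pixels = transforms.ToTensor()(image)

# 2. Tensor -> non-overlapping flattened patches
patch = 30                                         # Fuyu-style 30x30 patches
c, h, w = pixels.shape
pixels = pixels[:, : h - h % patch, : w - w % patch]  # crop to a patch multiple
patches = pixels.unfold(1, patch, patch).unfold(2, patch, patch)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)

# 3. Patches -> embeddings via a single linear projection
proj = nn.Linear(c * patch * patch, 4096)          # 4096 is illustrative
embeddings = proj(patches)                         # (num_patches, 4096)
```

And on the first bullet: proj starts with random weights, so these embeddings are meaningless until the projection is trained jointly with the rest of the model.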
1 comment
sshh12@alien.top 1 point 10 months ago

Hey! I wrote a blog post recently on how these types of vision LLMs work: https://blog.sshh.io/p/large-multimodal-models-lmms

It focuses specifically on LLaVA, but it's generally the same high-level idea.
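
For contrast with Fuyu's raw-patch projection, a rough sketch of the LLaVA-style design that post covers: a pretrained vision encoder whose features get projected into the LLM's embedding space. Every name and size here is illustrative:

```python
import torch.nn as nn

class VisionProjector(nn.Module):
    """LLaVA-style bridge: pretrained vision encoder -> LLM embedding space."""
    def __init__(self, vision_encoder, vision_dim=1024, hidden_size=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a frozen CLIP ViT
        self.proj = nn.Linear(vision_dim, hidden_size)

    def forward(self, pixel_values):
        feats = self.vision_encoder(pixel_values)  # (batch, n_tokens, vision_dim)
        return self.proj(feats)                    # (batch, n_tokens, hidden_size)
```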