Posted to Machine Learning on 28 Oct 2023

I am currently in my last year of undergrad and about to begin a direct PhD in multi-modal AI next year. I have been in the deep learning & NLP community for about two years, and in that time I have watched Transformers develop from the early GPT and BERT into today's billion-parameter giants wearing the 'gold crown' of the deep learning world.

I have spent a lot of time with the T5 model and its paper (Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, which I love very much!), trying to find efficient ways to use LLMs. I have hand-written Adapter layers and LoRA to fine-tune T5 on GLUE & SuperGLUE, and I have also tried instruction fine-tuning several popular LLMs such as LLaMA and Qwen.
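
(For reference, the LoRA layer I wrote amounts to something like the minimal PyTorch sketch below; the rank `r` and scaling `alpha` are illustrative defaults, not the exact values I used.)

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight
        self.scale = alpha / r
        # A projects down to rank r, B projects back up; B starts at zero,
        # so the wrapped layer initially behaves exactly like the original.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)
```

(One would typically wrap the attention query/value projections in T5 with this, so only the low-rank matrices receive gradients during fine-tuning.)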

Earlier this year, I noticed the wonders of multi-modality and quickly fell in love with it; it has now become my PhD focus. I have followed recent years' multi-modal developments, especially CLIP and its follow-up works, and LLMs play quite an important role in today's vision-language models, e.g., BLIP-2 and LLaVA.
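
(What hooked me is how simple CLIP's core idea is: scoring images against free-form text in a shared embedding space. A tiny sketch with Hugging Face transformers, where the checkpoint and image path are just examples:)

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot image-text matching with CLIP (any CLIP checkpoint would do).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder path to a local image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))  # probabilities over the prompts
```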

Given the computational gap between universities and huge companies, I believe the focus of my PhD should be on efficient learning. I am also trying to enhance VLMs through retrieval augmentation. The target dataset may be Encyclopedic VQA, since even large VLMs fail to perform well on it, and a retrieval-augmented VLM could potentially close that gap.
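
(Concretely, the retrieval side of what I have in mind looks roughly like the sketch below; `embed_image` and `vlm_answer` are hypothetical placeholders, and FAISS is just one possible choice of index.)

```python
import numpy as np
import faiss  # dense nearest-neighbour index over the knowledge base

def retrieve_context(query_emb: np.ndarray, index: faiss.Index,
                     passages: list[str], k: int = 3) -> list[str]:
    """Return the k knowledge passages closest to the query embedding."""
    _, ids = index.search(query_emb.reshape(1, -1).astype("float32"), k)
    return [passages[i] for i in ids[0]]

# Illustrative flow (embed_image / vlm_answer are stand-ins, not real APIs):
# 1. emb = embed_image(image)                      # e.g., a CLIP image embedding
# 2. ctx = retrieve_context(emb, index, passages)  # fetch encyclopedia snippets
# 3. answer = vlm_answer(image, question, ctx)     # condition the VLM on them
```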

I would like to hear any suggestions from you, on work-life balance, the direction of my academic focus, and so on; I would treasure them very much at this new stage of life. (I am currently building a RAG question-answering chatbot for a company as my internship, and I would welcome suggestions on that too!)

And are there other subreddits like this one? (I am also a member of LocalLLaMA; both subreddits benefit me a lot!)

damhack@alien.top · 1 point · 1 year ago

How about exploring a framework for synchronizing time-variant multi-modal inputs, so that multiple “senses” can be associated with an “event” that can then be treated as a single inference object? Extend this to simulating synesthesia, and you have a powerful approach for feeding cognition in AGI and robotics.
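
To make that concrete, such an “event” could start life as simple time-window binning across modalities (a toy sketch; the names and the fixed-window policy are purely illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class SensorReading:
    modality: str           # e.g., "vision", "audio", "proprioception"
    timestamp: float        # seconds since the stream started
    embedding: list[float]  # per-modality feature vector

@dataclass
class Event:
    """Time-aligned readings from several modalities, treated as one inference object."""
    start: float
    end: float
    readings: list[SensorReading] = field(default_factory=list)

def bin_into_events(readings: list[SensorReading], window: float = 0.5) -> list[Event]:
    """Associate readings whose timestamps fall into the same fixed time window."""
    events: dict[int, Event] = {}
    for r in sorted(readings, key=lambda r: r.timestamp):
        k = int(r.timestamp // window)
        ev = events.setdefault(k, Event(start=k * window, end=(k + 1) * window))
        ev.readings.append(r)
    return [events[k] for k in sorted(events)]
```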