this post was submitted on 09 Nov 2023
Machine Learning
Have you seen this article by Google?
https://arxiv.org/abs/2305.02301
https://blog.research.google/2023/09/distilling-step-by-step-outperforming.html
They claim they were able to distill PaLM into T5 (roughly a 2,000× size difference) for reasoning tasks, and the distilled T5 outperformed PaLM.
Code is here:
https://github.com/google-research/distilling-step-by-step
They seem to have distilled knowledge from a larger, general model into a smaller, specialised model and outperformed the larger model on a single task. Thanks for the paper. I wonder if I can specialise it to a subset of the original tasks and then try to outperform the original model.
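To make the idea concrete, here is a rough sketch of the step-by-step distillation objective as I understand it from the paper: the small student model (e.g. T5) is trained with a multi-task loss to produce both the task label and the rationale that the large teacher LLM generated. The task prefixes ("predict:", "explain:"), the helper `step_by_step_loss`, and `rationale_weight` are my own illustrative names, not code from the official repo.

```python
# Minimal sketch (not the repo's exact code) of the distilling step-by-step
# objective: a small student is trained to output both the label and the
# teacher-generated rationale, distinguished by a task prefix on the input.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
student = T5ForConditionalGeneration.from_pretrained("t5-base")

def step_by_step_loss(question, label, rationale, rationale_weight=1.0):
    """Combine the label-prediction loss and the rationale-generation loss."""
    def seq2seq_loss(source, target):
        enc = tokenizer(source, return_tensors="pt", truncation=True)
        tgt = tokenizer(target, return_tensors="pt", truncation=True)
        out = student(input_ids=enc.input_ids,
                      attention_mask=enc.attention_mask,
                      labels=tgt.input_ids)
        return out.loss

    # The prefix tells the shared student which output is expected.
    label_loss = seq2seq_loss("predict: " + question, label)
    rationale_loss = seq2seq_loss("explain: " + question, rationale)
    return label_loss + rationale_weight * rationale_loss

# One training step on a (question, label, rationale) triple, where the
# rationale was generated by the large teacher LLM (e.g. PaLM).
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)
loss = step_by_step_loss(
    question="If there are 3 cars and each car has 4 wheels, how many wheels are there?",
    label="12",
    rationale="Each of the 3 cars has 4 wheels, so 3 * 4 = 12 wheels.",
)
loss.backward()
optimizer.step()
```

At inference time you would only use the "predict:" prefix, so the rationale branch adds no extra cost; it just acts as an auxiliary supervision signal during training.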
I think you can try a similar approach for another task; in my view, the approach can be generalized to different tasks.
Can you elaborate?