Machine Learning

[D] Best method of knowledge distillation available? (alien.top)

submitted 1 year ago by Xanta_Kross@alien.top to c/machinelearning@academy.garden

Best practical method of knowledge distillation available?

TL;DR: From what I've seen online, knowledge distillation generally performs worse than training a model from scratch on the data. Is there a KD method where this doesn't happen, so that I get close to the performance of a model trained from scratch?

So I've recently been interested in making DL models more useful for everyday tasks. Considering their size, I'm trying to run these models on consumer devices without much loss in quality, but rn, from what I've seen, this just feels like trying to fit an elephant into a pair of pants.

Basically it tears every time I try. I found quantization to be cool, but I need to reduce the size even more tbh. So I found knowledge distillation. But from what I've seen, though it is fantastic in theory, in practice knowledge distillation sucks, and is probably worse than just straight up training the model from scratch on the dataset.

So is there a widely used and proven method of knowledge distillation that I can use, one that will give me accuracy at least very close to a model trained from scratch on the dataset?

[–] NoIdeaAbaout@alien.top 1 points 1 year ago (1 children)

Have you seen this article by Google?

https://arxiv.org/abs/2305.02301

https://blog.research.google/2023/09/distilling-step-by-step-outperforming.html

They claim they were able to distill PaLM into T5 for reasoning tasks (a 2000 times difference in size), and that the distilled T5 outperformed PaLM.

code is here:

https://github.com/google-research/distilling-step-by-step
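As far as I understand the paper, the student is trained on two tasks at once: predicting the label and generating the rationale that the large model produced, with the two losses added together. Below is a minimal sketch of that data layout and loss combination. The helper names, the exact prefix strings and the default weight are assumptions made for illustration and are not taken from the linked repository.

```python
# Sketch of the multi-task setup described in the paper: each training
# question yields one "label" example and one "rationale" example.
# Function names and prefix strings are hypothetical.

def build_multitask_examples(question, label, rationale):
    """Turn one annotated question into two seq2seq training examples."""
    return [
        {"input": f"[label] {question}", "target": label},
        {"input": f"[rationale] {question}", "target": rationale},
    ]


def combined_loss(label_loss, rationale_loss, rationale_weight=1.0):
    """Total loss = label loss + weight * rationale-generation loss."""
    return label_loss + rationale_weight * rationale_loss


examples = build_multitask_examples(
    question="Can a rectangle fit through a square opening?",
    label="depends on the sizes",
    rationale="A rectangle fits if both of its sides are at most the side of the square.",
)
for ex in examples:
    print(ex["input"], "->", ex["target"])
```

At inference time only the label task would be used, so the rationale supervision adds no cost when the small model is deployed.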

[–] Xanta_Kross@alien.top 1 points 1 year ago (1 children)

They seem to have distilled knowledge from a larger, general model into a smaller, specialised model that outperforms the larger model on a single task. Thanks for the paper. I wonder if I can specialise it to a subset of the original tasks and then try to outperform the original model.

[–] NoIdeaAbaout@alien.top 1 points 1 year ago (1 children)

I think you could try something similar for another task; for me, the approach can be generalized to different tasks.

[–] Xanta_Kross@alien.top 1 points 1 year ago

"for me the approach can be generalized to different tasks"

Can you elaborate?

[–] PaganPasta@alien.top 1 points 1 year ago (1 children)

I've used KD and it has always performed better than training from scratch for me.

Could it be a data-related issue (scarcity, quality)? Or maybe you need to find good hyperparameters.

Which domain are you applying it to?

[–] Xanta_Kross@alien.top 1 points 1 year ago (1 children)

NLP. I'm trying to take Llama 2 Chat and compress it down so that it can be run on a mid-to-high-end CPU without losing too much accuracy.
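For a rough sense of scale, the arithmetic below estimates how much memory the weights alone would take at different precisions. It is a back-of-the-envelope sketch: the 7e9 parameter count is a nominal figure for the smallest Llama 2 variant, and activations, KV cache and runtime overhead are ignored.

```python
# Rough estimate of weight storage only; real memory use is higher.

def weight_memory_gb(num_params, bits_per_weight):
    """Gigabytes (1e9 bytes) needed to store the weights."""
    return num_params * bits_per_weight / 8 / 1e9


NOMINAL_PARAMS = 7e9  # assumption: nominal size of a 7B model

for bits in (16, 8, 4):
    gb = weight_memory_gb(NOMINAL_PARAMS, bits)
    print(f"{bits}-bit weights: {gb:.1f} GB")
```

Quantization only shrinks the bits per weight, while distillation shrinks the parameter count itself, which is why the two can be combined.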

[–] PaganPasta@alien.top 1 points 1 year ago (1 children)

What are your losses/objectives? I'm inclined to say you need a huge amount of data for it as well.

[–] Xanta_Kross@alien.top 1 points 1 year ago

I haven't chosen a specific loss function yet. From what people have told me, looking up iBOT's loss and DINOv2's loss, as well as a loss from a paper by Google, might be helpful. But I might just end up summing multiple loss functions if they're useful and then checking whether they work.
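As a point of reference for summing losses, here is a minimal NumPy sketch of the classic soft-target distillation objective: a temperature-scaled KL term between teacher and student plus ordinary cross-entropy on the labels. This is the standard textbook formulation, not the iBOT or DINOv2 loss mentioned above, and the `temperature` and `alpha` defaults are arbitrary assumptions.

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """alpha * soft-target KL term + (1 - alpha) * hard-label cross-entropy."""
    # Soft term: KL(teacher || student) at temperature T, scaled by T^2
    # so its gradient magnitude stays comparable to the hard term.
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    p_teacher = np.exp(log_p_teacher)
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1).mean()
    soft = kl * temperature ** 2

    # Hard term: cross-entropy of the student against the true labels.
    log_p = log_softmax(student_logits)
    hard = -log_p[np.arange(len(labels)), labels].mean()

    return alpha * soft + (1.0 - alpha) * hard


teacher = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
student = np.zeros_like(teacher)
labels = np.array([0, 2])
print(distillation_loss(student, teacher, labels))
```

Any extra terms would be added the same way, each with its own weight, which makes it easy to switch one off and check whether it actually helps.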

As for my objective, I don't really have a specific application in mind rn other than a chatbot of sorts (with moderate-to-high capability for logic/reasoning) that runs on my CPU.

Currently this is a rough idea of how I want it to work tbh:

Write a query for what it needs to find on the web in order to answer the given question.
Look up the first link and read its contents to decide whether it has obtained the info; otherwise discard it, change the query and try again.
After finding the content, answer the common sense/logical reasoning question that was asked.

E.g.: Q. How should I take a rectangular door outside if all I have is a square window?

Possible queries:

Can a rectangle fit into a square?
Rectangle's shape
Square's shape
Standard window size
Standard door size
Etc.

Possible/acceptable answers:

Sorry, from what I've seen I couldn't find the answer. (This option would be chosen if the model doesn't find the answer within a limit of n queries.)
Rectangles are more general than squares, and windows are generally smaller than doors, so depending on the exact sizes you might just be able to fit it through. But if the door and window are anything like standard sizes, I don't think you'll be able to fit it through.
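The query, check and answer loop described above could be sketched roughly as follows. Every callable passed in (`propose_queries`, `fetch_first_result`, `is_relevant`, `compose_answer`) is a hypothetical placeholder for a model or search call, and the default limit of five queries is an arbitrary assumption.

```python
from itertools import islice

FALLBACK = "Sorry, from what I've seen I couldn't find the answer."


def answer_with_search(question, propose_queries, fetch_first_result,
                       is_relevant, compose_answer, max_queries=5):
    """Try up to max_queries web queries, then answer or give up."""
    for query in islice(propose_queries(question), max_queries):
        page = fetch_first_result(query)  # contents of the first link, or None
        if page is not None and is_relevant(question, page):
            return compose_answer(question, page)
        # Otherwise discard this result and move on to the next query.
    return FALLBACK


# Toy stand-ins so the sketch runs without a model or network access.
pages = {"Standard door size": "A standard door is about 80 inches tall."}
print(answer_with_search(
    "How should I take a rectangular door outside through a square window?",
    propose_queries=lambda q: ["Rectangle's shape", "Standard door size"],
    fetch_first_result=pages.get,
    is_relevant=lambda q, page: "door" in page,
    compose_answer=lambda q, page: "Based on the web: " + page,
))
```

Keeping the four steps as separate callables means each one can be backed by a different (possibly distilled) model and evaluated on its own.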

[–] fordat1@alien.top 1 points 1 year ago

It depends on your use case and the teacher, I think, i.e. is there anything to distill?

[–] phree_radical@alien.top 1 points 1 year ago

I liked the SquareHead paper https://arxiv.org/abs/2310.06927