this post was submitted on 29 Nov 2023
Machine Learning

[–] zalperst@alien.top 1 points 9 months ago (1 children)

The sample efficiency you mention is an empirical observation; that doesn't make it unsurprising. Why should a single small, noisy step of gradient descent allow you to immediately memorize the data? I think that's fundamentally surprising.

[–] gwern@alien.top 1 points 9 months ago (3 children)

No, I still think it's not that surprising even taking it as a whole. Humans memorize things all the time after a single look. (Consider, for example, image recognition memory.) If a NN can memorize entire datasets after a few epochs using 'a single small noisy step of gradient descent over 1-4 million tokens' on each datapoint once per epoch, why is saying that some of this memorization happens in the first epoch so surprising? (If it's good enough to memorize given a few steps, then you're just haggling over the price, and 1 step is well within reason.) And there is usually not that much intrinsic information in any of these samples, so if an LLM has done a good job of learning generalizable representations of things like names or phone numbers, it doesn't take up much 'space' inside the LLM to encode yet another slight variation on a human name. (If the representation is good, a 'small' step covers a huge amount of data.)

Plus, you are overegging the description: it's not like it's memorizing 100% of the data on sight, nor is the memorization permanent. (The estimates from earlier papers are more like 1% get memorized at the first epoch, and OP estimates they could extract 1GB of text from GPT-3/4, which sounds roughly consistent.) So it's more like, 'every once in a great while, particularly if a datapoint was very recently seen or simple or stereotypical, the model can mostly recall having seen it before'.
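
To make the 'one small step' point a bit more concrete, here is a toy sketch. It is not an LLM and not taken from any of the papers mentioned: it is just a linear softmax classifier sitting on top of a fixed feature vector, which stands in for an already-learned representation. All names (`predict`, `sgd_step`, `VOCAB`, `DIM`, the feature values, the learning rate) are made up for illustration. It takes one SGD step on one example and compares the probability assigned to the target before and after.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, x):
    # Linear "output head": one row of weights per vocabulary item.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return softmax(logits)

def sgd_step(W, x, target, lr):
    # One plain SGD step on cross-entropy loss for a single example.
    # The gradient w.r.t. W[k][j] is (p[k] - 1[k == target]) * x[j].
    p = predict(W, x)
    new_W = []
    for k, row in enumerate(W):
        err = p[k] - (1.0 if k == target else 0.0)
        new_W.append([w - lr * err * xi for w, xi in zip(row, x)])
    return new_W

VOCAB, DIM = 5, 4                          # hypothetical toy sizes
W0 = [[0.0] * DIM for _ in range(VOCAB)]   # untrained head: uniform predictions
x = [1.0, -0.5, 0.25, 2.0]                 # stand-in for a learned representation
target = 3                                 # the "token" seen exactly once

p_before = predict(W0, x)[target]
W1 = sgd_step(W0, x, target, lr=0.5)
p_after = predict(W1, x)[target]
print(f"p(target) before: {p_before:.3f}, after one step: {p_after:.3f}")
```

In this toy setup the logit shift from a single step scales with the squared norm of the feature vector, so how far one update moves the prediction depends on the representation it acts on as well as on the learning rate. Whether anything like this carries over to a real LLM is exactly what is being argued about here.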

[–] zalperst@alien.top 1 points 9 months ago

I appreciate that it's possible to find a not-illogical explanation (logical would entail a real proof), but it remains surprising to me.

[–] zalperst@alien.top 1 points 9 months ago

I appreciate your position, but I don't think your intuition holds here. For instance, biological neural nets very likely use a qualitatively different learning algorithm than backpropagation.

[–] ThirdMover@alien.top 1 points 9 months ago

"Humans memorize things all the time after a single look."

I think what's going on in humans there is a lot more complex than something like a single SGD step updating some weights. Generally, if you do memorize something, you consciously replay it in your head several times.