this post was submitted on 29 Nov 2023

Machine Learning

[–] UnknownEssence@alien.top 1 points 9 months ago

If it is truly memorizing its ENTIRE set of training data, then is that not lossless data compression far more efficient than any known compression algorithm?

It has to be lossy compression, i.e. it doesn't remember its ENTIRE training set word for word.

[–] Zondartul@alien.top 1 points 9 months ago (4 children)

The point of the paper is that LLMs memorize an insane amount of training data and, with some massaging, can be made to output it verbatim. If that training data has PII (personally identifiable information), you're in trouble.

Another big takeaway is that training for more epochs leads to more memorization.
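
As a rough illustration of what such a probe can look like: the sketch below prompts a causal LM with the first half of a snippet it has plausibly seen in training and counts how many tokens of the true continuation come back verbatim. GPT-2 is a stand-in, the Dickens line is an arbitrary example, and this prefix-continuation test is a generic memorization check, not necessarily the paper's exact protocol.

```python
# Sketch: probe a causal LM for verbatim memorization by prompting it with the
# prefix of a (suspected) training snippet and checking how much of the true
# continuation it reproduces. GPT-2 and the snippet are illustrative stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

snippet = ("It was the best of times, it was the worst of times, "
           "it was the age of wisdom, it was the age of foolishness")
ids = tok(snippet, return_tensors="pt").input_ids[0]
prefix_len = len(ids) // 2
prefix, true_continuation = ids[:prefix_len], ids[prefix_len:]

with torch.no_grad():
    out = model.generate(
        prefix.unsqueeze(0),
        max_new_tokens=len(true_continuation),
        do_sample=False,                 # greedy: the model's "default" continuation
        pad_token_id=tok.eos_token_id,
    )
generated = out[0][prefix_len:]

# Count how many leading tokens of the true continuation come back verbatim.
matched = 0
for g, t in zip(generated.tolist(), true_continuation.tolist()):
    if g != t:
        break
    matched += 1
print(f"{matched}/{len(true_continuation)} continuation tokens reproduced verbatim")
```

Scaled up to many candidate prefixes and checked against a web-scale corpus, this is the general shape of a training-data extraction experiment.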

[–] oldjar7@alien.top 1 points 9 months ago (1 children)

How is that a problem? The entire point of training is to memorize and generalize the training data.

[–] narex456@alien.top 1 points 9 months ago

Learning English is not simply memorizing a billion sample sentences.

The problem is that we want it to learn to string words together for itself, not regurgitate words which already appear in the training set in that order.

This paper attempts to tackle the difficult problem of detecting how much of an LLM's success is due to rote memorization.

Maybe more importantly: how much parameter space and how many training resources are wasted on this?

[–] Mandelmus100@alien.top 1 points 9 months ago (4 children)

Another big takeaway is that training for more epochs leads to more memorization.

Should be expected. It's overfitting.

[–] FaceDeer@alien.top 1 points 9 months ago

Indeed. Just like when training humans to be smart, rote memorization sometimes happens, but it is generally not the goal. Research like this helps us avoid it better in the future.

[–] Hostilis_@alien.top 1 points 9 months ago

That's not overfitting. That's just fitting.

[–] n_girard@alien.top 1 points 9 months ago

Hopefully I'm not being off-topic here, but a recent paper suggested that repeating a requirement several times within the same instructions leads the model to be more compliant with it.

Do you know whether this is true or well grounded?

Thanks in advance.

[–] we_are_mammals@alien.top 1 points 9 months ago (1 children)

It's overfitting.

Overfitting, by definition, happens when your generalization error goes up.
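
In the textbook sense, that means the train/validation gap: a minimal sketch with made-up loss curves, flagging the epoch where validation error starts climbing while training error keeps falling.

```python
# Toy illustration of the textbook definition: overfitting starts where
# validation (generalization) error begins rising even though training error
# keeps falling. The loss curves are made-up numbers.
train_loss = [2.1, 1.6, 1.2, 0.9, 0.7, 0.55, 0.45, 0.38]
val_loss   = [2.2, 1.7, 1.4, 1.2, 1.1, 1.15, 1.25, 1.40]

best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
for epoch, (tr, va) in enumerate(zip(train_loss, val_loss)):
    marker = "  <- overfitting from here on" if epoch > best_epoch else ""
    print(f"epoch {epoch}: train {tr:.2f}  val {va:.2f}{marker}")
```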

[–] DigThatData@alien.top 1 points 9 months ago (2 children)

It's possible to "overfit" to a subset of the data. Generalization error going up is a symptom of "overfitting" to the entire dataset. Memorization is functionally equivalent to locally overfitting, i.e. generalization error going up in a specific neighborhood of the data. You can have a global reduction in generalization error while also having neighborhoods where generalization gets worse.
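
To make "neighborhood" concrete, here is a toy sketch (synthetic data, synthetic per-example losses, and k-means clusters as the notion of neighborhood are all assumptions) of global error going down while error in one region of input space goes up:

```python
# Sketch: global vs. per-neighborhood generalization error. A model can improve
# on average while getting worse in a specific region of the data. The data,
# the per-example losses, and the KMeans "neighborhoods" are all synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_val = rng.normal(size=(2000, 16))      # held-out inputs (toy)
err_before = rng.uniform(size=2000)      # per-example loss at checkpoint A (toy)
err_after = err_before - 0.05            # checkpoint B: better on average...
hot_region = X_val[:, 0] > 1.5           # ...but worse in one part of input space
err_after[hot_region] += 0.3

labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_val)

print(f"global error: {err_before.mean():.3f} -> {err_after.mean():.3f}")
for k in range(10):
    m = labels == k
    delta = err_after[m].mean() - err_before[m].mean()
    if delta > 0:
        print(f"cluster {k}: error got worse by {delta:.3f} ({m.sum()} examples)")
```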

[–] Hostilis_@alien.top 1 points 9 months ago

Memorization is functionally equivalent to locally overfitting.

Uh, no it is not. Memorization and overfitting are not the same thing. You are certainly capable of memorizing things without degrading your generalization performance (I hope).

[–] seraphius@alien.top 1 points 9 months ago

On most tasks, memorization would be overfitting, but I think one would see that “overfitting” is task/generalization dependent. As long as accurate predictions are being made for new data, it doesn’t matter that it can cough up the old.

[–] HateRedditCantQuitit@alien.top 1 points 9 months ago

The point isn’t just that they memorize a ton. It’s also that current alignment efforts that purport to prevent regurgitation fail.

[–] Seankala@alien.top 1 points 9 months ago

Nothing about this is novel, though; the fact that language models can reveal sensitive training information has been known for a while now.

[–] blimpyway@alien.top 1 points 9 months ago (2 children)

It is not about being able to search for relevant data when prompted with a question.

The amazing thing is that they seem to understand the question well enough that the answer is both concise and meaningful.

That's what folks downplaying it as "a glorified autocomplete" are missing.

PS: those philosophising that it can't actually understand the question are also missing the point: nobody cares, as long as its answers are as correct and meaningful as if it did understand the question.

It mimics understanding well enough.

[–] squareOfTwo@alien.top 1 points 9 months ago (1 children)

These things don't "understand". Ask it something that is "too much OOD" and you get wrong answers, even when a human would give the correct answer according to the training set.

[–] blimpyway@alien.top 1 points 9 months ago

I said they mimic understanding well enough, that wasn't a claim LLMs actually understand.

Sure, training dataset limits apply.

And sure, they very likely fail when the question is OOD, but figuring out that a question is OOD isn't that hard, so an honest "Sorry, your question is way too OOD" answer (instead of hallucinating) shouldn't be too difficult to implement.
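
One simple heuristic in that spirit would be to score a prompt by its perplexity under the model and refuse above a threshold. The sketch below assumes GPT-2 as a stand-in and an uncalibrated threshold; in practice, reliable OOD detection for LLMs is a good deal harder than this makes it look.

```python
# Sketch of one simple OOD heuristic: score a prompt by its perplexity under the
# model and refuse to answer above a threshold. GPT-2 is a stand-in and the
# threshold is an uncalibrated assumption; this is a heuristic, not a solution.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

PPL_THRESHOLD = 200.0  # assumed; would be calibrated on in-distribution prompts

def prompt_perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean per-token cross-entropy
    return float(torch.exp(loss))

def answer_or_refuse(prompt: str) -> str:
    if prompt_perplexity(prompt) > PPL_THRESHOLD:
        return "Sorry, this question looks too far out of distribution to answer reliably."
    return "(generate an answer as usual)"

print(answer_or_refuse("What is the capital of France?"))
print(answer_or_refuse("zqx vlorp ngggh 77#@! blarfendoogle"))
```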

[–] zalperst@alien.top 1 points 9 months ago (2 children)

It's extremely surprising, given that many instances of data are seen only once or a very few times by the model during training.

[–] cegras@alien.top 1 points 9 months ago (2 children)

What is the size of ChatGPT or the biggest LLMs compared to the dataset? (Not being rhetorical, genuinely curious)

[–] StartledWatermelon@alien.top 1 points 9 months ago

GPT-4: 1.76 trillion parameters, about 6.5* trillion tokens in the dataset.

* Could be twice that; the leaks weren't crystal clear. The above number is more likely, though.
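
Taking those rumored figures at face value, a quick back-of-the-envelope comparison (fp16 weights and roughly 4 bytes of raw text per token are assumptions) shows how the weights stack up against the raw corpus, which also bears on the lossless-compression question raised above:

```python
# Back-of-the-envelope: rumored GPT-4 parameter count vs. rumored training tokens.
# fp16 weights and ~4 bytes of raw text per token are assumptions.
params = 1.76e12          # rumored parameter count
tokens = 6.5e12           # rumored training tokens (could be ~2x higher)
bytes_per_param = 2       # fp16
bytes_per_token = 4       # rough average bytes of raw text per token

weight_bytes = params * bytes_per_param
text_bytes = tokens * bytes_per_token

print(f"tokens per parameter: {tokens / params:.1f}")
print(f"weights: {weight_bytes / 1e12:.1f} TB vs. raw text: {text_bytes / 1e12:.1f} TB")
print(f"weights are ~{100 * weight_bytes / text_bytes:.0f}% of the raw text size")
```

Under these rough assumptions the weights amount to only a fraction of the raw text, which is one way to see why any memorization has to be partial and lossy.
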
[–] zalperst@alien.top 1 points 9 months ago

Trillions of tokens, billions of parameters

[–] gwern@alien.top 1 points 9 months ago (1 children)

It's not surprising at all. The more sample-efficient a model is, the more it can learn a datapoint in a single shot. And that they are often that sample-efficient has been established by tons of previous work.

The value of this work is that it shows that what looked like memorized data from a secret training corpus really is memorized data, by checking it against an Internet-wide corpus. Otherwise, it's very hard to tell whether it's simply a confabulation.

People have been posting screenshots of this stuff on Twitter for ages, but it's usually been impossible to tell if it was real data or just made-up. Similar issues with extracting prompts: you can 'extract a prompt' all you like, but is it the actual prompt? Without some detail like the 'current date' timestamp always being correct, it's hard to tell if what you are getting has anything to do with the actual hidden prompts. (In some cases, it obviously didn't because it was telling the model to do impossible things or describing commands/functionality it didn't have.)
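
A toy sketch of that verification step: an output only counts as memorized if a long enough span of it appears verbatim in a reference corpus. The in-memory word-level "corpus" and the 8-word window are scaled-down stand-ins; at real scale this requires a proper index (e.g. a suffix array) over web-scale data.

```python
# Sketch: an output only counts as "memorized" if a long enough span of it
# appears verbatim in a reference corpus. Word-level tokens and an 8-word
# window are scaled-down stand-ins for the real token-level check.

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_memorized(output_tokens, corpus_ngrams, n):
    """True if any n-token window of the output occurs verbatim in the corpus."""
    return any(g in corpus_ngrams for g in ngrams(output_tokens, n))

corpus_text = ("it was the best of times it was the worst of times "
               "it was the age of wisdom it was the age of foolishness").split()
corpus_index = ngrams(corpus_text, n=8)

model_output = "the model said it was the best of times it was the worst of times".split()
confabulation = "a completely novel sentence about llamas and gradients".split()

print(is_memorized(model_output, corpus_index, n=8))   # True: an 8-word span matches
print(is_memorized(confabulation, corpus_index, n=8))  # False: nothing matches verbatim
```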

[–] zalperst@alien.top 1 points 9 months ago (1 children)

The sample efficiency you mention is an empirical observation; that doesn't make it unsurprising. Why should a single small, noisy step of gradient descent allow you to immediately memorize the data? I think that's fundamentally surprising.

[–] gwern@alien.top 1 points 9 months ago (3 children)

No, I still think it's not that surprising even taking it as a whole. Humans memorize things all the time after a single look. (Consider, for example, image recognition memory.) If a NN can memorize entire datasets after a few epochs using 'a single small noisy step of gradient descent over 1-4 million tokens' on each datapoint once per epoch, why is saying that some of this memorization happens in the first epoch so surprising? (If it's good enough to memorize given a few steps, then you're just haggling over the price, and 1 step is well within reason.) And there is usually not that much intrinsic information in any of these samples, so if a LLM has done a good job of learning generalizable representations of things like names or phone numbers, it doesn't take up much 'space' inside the LLM to encode yet another slight variation on a human name. (If the representation is good, a 'small' step covers a huge amount of data.)

Plus, you are overegging the description: it's not like it's memorizing 100% of the data on sight, nor is the memorization permanent. (The estimates from earlier papers are more like 1% getting memorized at the first epoch, and OP estimates they could extract 1GB of text from GPT-3/4, which sounds roughly consistent.) So it's more like, 'every great once in a while, particularly if a datapoint was very recently seen or simple or stereotypical, the model can mostly recall having seen it before'.

[–] zalperst@alien.top 1 points 9 months ago

I appreciate your position, but I don't think your intuition holds here; for instance, biological neural nets very likely use a qualitatively different learning algorithm than backpropagation.

[–] zalperst@alien.top 1 points 9 months ago

I appreciate that it's possible to find a not-illogical explanation (logical would entail a real proof), but it remains surprising to me.

[–] ThirdMover@alien.top 1 points 9 months ago

Humans memorize things all the time after a single look.

I think what's going on in humans there is a lot more complex than something like a single SGD step updating some weights. Generally if you do memorize something you replay it in your head consciously several times.

[–] MuonManLaserJab@alien.top 1 points 9 months ago

I think the keyword is "just".

[–] exomni@alien.top 1 points 9 months ago

The operative word here is "just". The models are so large and the training is such that of course one of the things they are likely doing is memorizing the corpus; but they aren't "just" memorizing the corpus: there is some amount of regularization in place to allow the system to exhibit more generative outputs and behaviors as well.