I appreciate your position, but I don't think your intuition holds here; for instance, biological neural nets very likely use a qualitatively different learning algorithm than backpropagation.
I appreciate that it's possible to find a not-illogical explanation (a logical one would entail a real proof), but it remains surprising to me.
Trillions of tokens, billions of parameters
The sample efficiency you mention is an empirical observation; that doesn't make it unsurprising. Why should a single small, noisy step of gradient descent allow you to immediately memorize the data? I think that's fundamentally surprising.
It's extremely surprising given that many instances of the data are seen only once, or very few times, by the model during training.
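To make that concrete, here's a minimal PyTorch sketch (a hypothetical toy setup, not anything from an actual large-scale training run): it takes a single SGD step on one example the model sees exactly once and checks how much the loss on that same example drops. The tiny model, learning rate, and token pair are all made up for illustration.

```python
# Toy illustration: what one small, noisy SGD step on a single example
# does to the loss on that same example.
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, dim = 100, 32
# Hypothetical tiny "language model": embedding followed by a linear readout.
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# One "fact" the model sees exactly once: token 3 should predict token 7.
x = torch.tensor([3])
y = torch.tensor([7])

def loss_on_example():
    with torch.no_grad():
        return loss_fn(model(x), y).item()

before = loss_on_example()

# A single gradient step on this one example.
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()

after = loss_on_example()
print(f"loss before: {before:.3f}, after one step: {after:.3f}")
```

One step does lower the loss on that example, but nothing here explains why, at trillion-token scale, that single noisy step should be enough for the model to retain the fact, which is the part that stays surprising.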
The field is new, work from a year ago is already outdated, and consequently so are most textbooks. It's hard to put together a fundamentals course when the fundamentals themselves keep changing. The best thing to do is read the crap out of the literature and learn PyTorch or JAX.