It's overfitting.
Overfitting, by definition, happens when your generalization error goes up.
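A minimal sketch of what that looks like in practice (the loss curves below are made up for illustration): the training error keeps falling while the validation error, our proxy for generalization error, turns around and rises.

```python
# Toy illustration: overfitting = validation (generalization) error starts rising
# even though training error keeps falling. The loss values below are made up.
train_loss = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18, 0.12, 0.08]
val_loss   = [1.05, 0.80, 0.62, 0.55, 0.53, 0.56, 0.61, 0.68]

best = min(range(len(val_loss)), key=val_loss.__getitem__)  # best validation epoch
for epoch in range(best + 1, len(val_loss)):
    if val_loss[epoch] > val_loss[best] and train_loss[epoch] < train_loss[best]:
        print(f"epoch {epoch}: training error still falling, generalization error going up")
```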
It's a new conjecture, all right. But it's clearly false.
Consider n=4. Then p=5, q=3, k=1. But 5+1 and 3+1 are not primes.
All you number theorists out there, I think your jobs are safe for the time being.
"I myself have posted here constantly."

But the point you were trying to prove was that the discussions were "constant". How does picking your own threads spanning 2 months support it at all?

The OP didn't say that the discussions were completely gone. Yes, there are some, but pretty thin and usually glib. I don't count "Wow! This is exciting. I'll have to take a look at this awesome new paper!" as discussion. A bot harvesting upvotes could post this.
Fortnightly. Finally got a chance to use this word :-) 4 links spanning 2 months.
But even in these picks, take a look at the first one, for example. 10 comments. Only one of them suggests that the commentator looked at the paper itself.
According to the scaling laws, the loss/error is approximated as
w0 + w1 * pow(num_params, -w2) + w3 * pow(num_tokens, -w4)
Bill wrote before that he'd been meeting with the OpenAI team since 2016, so he's probably pretty knowledgeable about these things. He might be referring to the fact that, after a while, you see sharply diminishing returns from increasing num_params: in the limit, the corresponding term disappears, but the others do not.
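A quick toy version of that formula (the constants below are arbitrary placeholders, not fitted values from any paper), just to show that the num_params term vanishes in the limit while the num_tokens term and the constant remain:

```python
# Toy version of the scaling-law approximation quoted above.
# w0..w4 are arbitrary placeholders, not fitted constants from any paper.
def approx_loss(num_params, num_tokens, w0=1.7, w1=400.0, w2=0.3, w3=400.0, w4=0.3):
    return w0 + w1 * num_params ** -w2 + w3 * num_tokens ** -w4

for n in (1e9, 1e11, 1e13, 1e15):
    # Data size held fixed: only the parameter term keeps shrinking, so the
    # loss flattens out at roughly w0 + w3 * num_tokens ** -w4.
    print(f"{n:.0e} params -> approx loss {approx_loss(n, num_tokens=1e12):.3f}")
```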
"a messed-up experiment or a poorly written/plainly incorrect paper that slips through the review system could be your end"
Is that true? If your paper is totally wrong, publish a retraction, do not include the paper in your "list of publications", and move on.
Technical discussion seems to be dead in r/MachineLearning, but I'll ask anyway: Isn't it strange that in Figure 3 of the first paper, layer 1 has a blurry diagonal, while the rest of them are sharp? I would have expected the opposite: the lowest layer to be very local, and higher layers to be more global.
The claimed 117.83x speedup might be somewhat misleading.
If you compare the best implementation of FFF on CUDA to the best implementation of FF on CUDA, then the speed-up they got is 3.15x:
See page 5, "Further comparisons": "On GPU, the PyTorch BMM implementation of FFF delivers a 3.15x speedup over the fastest (Native fused) implementation of FF."
The 40x that u/lexected mentioned seems to apply only when comparing to an apparently much slower FF version.
It's a pretty cool paper regardless, as far as I can tell from skimming it. But it could benefit from stating more clearly what has been achieved.
"has 4095 neurons but selectively uses only 12 (0.03%) for inference"

There's an extra 0 in there: 12/4095 is about 0.3%, not 0.03%.
So the implication here is that the CEO knew about the breakthrough, but hid it from the board?
MSFT did experience a 20% climb over the last month. Maybe it was due to this news leaking out?
I think DistilBERT needs to be in Table 2, since it's their most direct competitor: it trades off accuracy for speed, and requires extra training effort, like their approach.
Still, if they are about 20x faster than DistilBERT using cuBLAS, that's pretty amazing.
Can't OpenAI simply check the output for sharing long substrings with the training data (perhaps probabilistically)?
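One way that could work, sketched below (this is just an illustration of the idea, not anything OpenAI is known to do; the 20-word window and the exact in-memory index are arbitrary choices, and a Bloom filter would make the check probabilistic and much cheaper in memory): fingerprint every fixed-length word window of the training data, then flag generated text whose windows collide with the index.

```python
# Sketch: flag model output that shares long word n-grams with the training data.
# WINDOW is an arbitrary choice; a real system would tune it and likely use a
# Bloom filter (probabilistic, fixed memory) instead of an exact Python set.
import hashlib

WINDOW = 20  # flag any run of 20 consecutive words seen verbatim in training data

def windows(text, n=WINDOW):
    words = text.split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def fingerprint(chunk):
    return hashlib.sha1(chunk.encode("utf-8")).digest()[:8]  # 8-byte hash per window

def build_index(training_texts):
    return {fingerprint(w) for doc in training_texts for w in windows(doc)}

def overlaps(output, index):
    return [w for w in windows(output) if fingerprint(w) in index]

# Usage: index = build_index(corpus); print(overlaps(model_output, index))
```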