we_are_mammals

joined 10 months ago
[–] we_are_mammals@alien.top 1 points 9 months ago

Can't OpenAI simply check whether the output shares long substrings with the training data (perhaps probabilistically)?
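Something like this minimal sketch is what I have in mind (the n-gram length, the run threshold, and all function names are hypothetical choices, not anything OpenAI has described): hash every n-gram of the training data once, then flag any output containing a long run of n-grams that are all present in that index.

```python
import hashlib

N = 8          # n-gram length in tokens (hypothetical choice)
RUN_FLAG = 20  # flag outputs containing this many consecutive known n-grams

def ngram_hashes(tokens, n=N):
    """Yield a compact hash of every length-n token window."""
    for i in range(len(tokens) - n + 1):
        gram = "\x00".join(tokens[i:i + n])
        yield hashlib.blake2b(gram.encode(), digest_size=8).digest()

def build_index(training_texts):
    """One pass over the training data. In practice a Bloom filter would keep
    memory bounded, at the cost of a small false-positive rate
    (hence 'probabilistically')."""
    index = set()
    for text in training_texts:
        index.update(ngram_hashes(text.split()))
    return index

def looks_memorized(output_text, index):
    """True if the output contains a long run of n-grams all seen in training."""
    run = 0
    for h in ngram_hashes(output_text.split()):
        run = run + 1 if h in index else 0
        if run >= RUN_FLAG:
            return True
    return False
```

A run of 20 consecutive 8-grams corresponds to roughly 27 tokens of verbatim overlap. Individual n-grams could come from different places in the corpus, so this over-flags a bit, which is probably fine for a filter.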

[–] we_are_mammals@alien.top 1 points 9 months ago (3 children)

It's overfitting.

Overfitting, by definition, happens when your generalization error goes up (even as your training error keeps going down).
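To make that concrete, here's a minimal sketch with made-up numbers: training error keeps falling, but validation (generalization) error turns back up, and that turning point is where overfitting starts.

```python
# Made-up training curves, just to illustrate the definition:
# training error keeps falling, but generalization (validation) error turns back up.
train_err = [0.40, 0.30, 0.22, 0.15, 0.10, 0.06, 0.03]
val_err   = [0.42, 0.33, 0.27, 0.24, 0.25, 0.28, 0.33]

def overfitting_onset(val_errors):
    """Return the first epoch at which validation error starts rising."""
    for epoch in range(1, len(val_errors)):
        if val_errors[epoch] > val_errors[epoch - 1]:
            return epoch
    return None

print(overfitting_onset(val_err))  # -> 4: generalization error rises from here on
```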

[–] we_are_mammals@alien.top 1 points 9 months ago (5 children)

It's a new conjecture, all right. But it's clearly false.

Consider n=4. Then p=5, q=3, k=1. But 5+1 = 6 and 3+1 = 4 are not prime.

All you number theorists out there, I think your jobs are safe for the time being.

[–] we_are_mammals@alien.top 1 points 9 months ago

I myself have posted

But the point you were trying to prove was that the discussions were "constant". How does picking your own threads spanning 2 months support it at all?

The OP didn't say that the discussions were completely gone. Yes, there are some, but pretty thin and usually glib. I don't count "Wow! This is exciting. I'll have to take a look at this awesome new paper!" as discussion. A bot harvesting upvotes could post this.

[–] we_are_mammals@alien.top 1 points 9 months ago (2 children)

here constantly.

Fortnightly. Finally got a chance to use this word :-) 4 links spanning 2 months.

But even in these picks, take a look at the first one, for example. 10 comments. Only one of them suggests that the commenter looked at the paper itself.

 

Technical discussions of new research seem to have mostly disappeared from this subreddit, because researchers have become a small fraction of its immense readership of 3e6 members.

So I created a subreddit to host such discussions. A "safe space" for researchers, if you will, with strict standards for content^1. I seeded it with posts about a few recent papers I thought were interesting and my own takes on them, to get the discussion started.

But then I said to myself: "You don't have time to manage a subreddit. WTF are you doing?" and deleted it all. Nevertheless, I'd like to see someone else, perhaps someone with more time, try to do it.


^1: Its main rule was: "No low-effort or low-expertise posts or comments: If your average ML PhD student, or someone with a higher level of expertise, wouldn't have posted something, then it does not belong here." Other rules dealt with the format of the posts.

[–] we_are_mammals@alien.top 1 points 9 months ago (1 children)

According to the scaling laws, the loss/error is approximated as

w0 + w1 * pow(num_params, -w2) + w3 * pow(num_tokens, -w4)

Bill wrote before that he'd been meeting with the OpenAI team since 2016, so he's probably pretty knowledgeable about these things. He might be referring to the fact that, after a while, you see sharply diminishing returns from increasing num_params. In the limit, the corresponding term disappears, but the others do not.
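A quick way to see that point numerically (the coefficients below are placeholders chosen only to illustrate the shape of the curve, not fitted values from any paper): as num_params grows, its term shrinks toward zero, but w0 and the data term remain.

```python
# Placeholder coefficients, only to illustrate the shape of the scaling law;
# these are not fitted values from any paper.
w0, w1, w2, w3, w4 = 1.7, 400.0, 0.34, 600.0, 0.28

def approx_loss(num_params, num_tokens):
    return w0 + w1 * num_params ** -w2 + w3 * num_tokens ** -w4

num_tokens = 1e12  # hold the data budget fixed
for num_params in (1e9, 1e10, 1e11, 1e12, 1e13):
    print(f"{num_params:.0e} params -> loss ~ {approx_loss(num_params, num_tokens):.3f}")

# The num_params term goes to 0 as the model grows, but w0 and the
# num_tokens term stay, so scaling the model alone hits diminishing returns.
```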

[–] we_are_mammals@alien.top 1 points 9 months ago

a messed-up experiment or a poorly written/plainly incorrect paper that slips through the review system could be your end

Is that true? If your paper is totally wrong, publish a retraction, do not include the paper in your "list of publications", and move on.

[–] we_are_mammals@alien.top 1 points 10 months ago

Technical discussion seems to be dead in r/MachineLearning, but I'll ask anyway: Isn't it strange that in Figure 3 of the first paper, layer 1 has a blurry diagonal, while the rest of them are sharp? I would have expected the opposite: the lowest layer to be very local, and higher layers to be more global.

[–] we_are_mammals@alien.top 1 points 10 months ago

the claimed 117.83x speedup, might be somewhat misleading

If you compare the best implementation of FFF on CUDA to the best implementation of FF on CUDA, then the speed-up they got is 3.15x:

See "Further comparisons" on page 5: "On GPU, the PyTorch BMM implementation of FFF delivers a 3.15x speedup over the fastest (Native fused) implementation of FF"

The 40x that u/lexected mentioned seems to apply only when comparing to an apparently much slower FF version.

It's a pretty cool paper regardless, as far as I can tell from skimming it. But it could benefit from stating more clearly what has been achieved.

[–] we_are_mammals@alien.top 1 points 10 months ago

has 4095 neurons but selectively uses only 12 (0.03%) for inference

There's an extra 0 in there: 12/4095 ≈ 0.3%, not 0.03%.

[–] we_are_mammals@alien.top 1 points 10 months ago

So the implication here is that the CEO knew about the breakthrough, but hid it from the board?

MSFT did experience a 20% climb over the last month. Maybe it was due to this news leaking out?

[–] we_are_mammals@alien.top 1 points 10 months ago

I think DistilBERT needs to be in Table 2, since it's their most direct competitor: it trades off accuracy for speed, and requires extra training effort, like their approach.

Still, if they are about 20x faster than DistilBERT using cuBLAS, that's pretty amazing.

 

OpenAI announcement:

"We have reached an agreement in principle for Sam to return to OpenAI as CEO with a new initial board of Bret Taylor (Chair), Larry Summers, and Adam D'Angelo.

We are collaborating to figure out the details. Thank you so much for your patience through this."

https://twitter.com/OpenAI/status/1727205556136579362

 

Stability AI is releasing Stable Video Diffusion, their first foundation model for generative video based on the image model Stable Diffusion:

https://stability.ai/news/stable-video-diffusion-open-ai-video-model

 

https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/

It looks like content will have to be labeled, showing whether it's AI-generated or not.

And special rules will apply to:

any model that was trained using a quantity of computing power greater than 10^26 integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 10^23 integer or floating-point operations; and

any computing cluster that has a set of machines physically co-located in a single datacenter, transitively connected by data center networking of over 100 Gbit/s, and having a theoretical maximum computing capacity of 10^20 integer or floating-point operations per second for training AI.

Also, easier visas for "AI talent".
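For a sense of scale on that 10^26 threshold, here's a rough back-of-the-envelope sketch using the common ~6 * params * tokens estimate of training FLOPs for dense transformers (that estimate and the example model sizes are my assumptions, not anything in the order):

```python
# Rough back-of-the-envelope: training FLOPs ~ 6 * params * tokens
# (a common approximation for dense transformers; not from the executive order).
THRESHOLD = 1e26  # the order's reporting threshold, in total operations

def training_flops(num_params, num_tokens):
    return 6 * num_params * num_tokens

# Hypothetical model sizes and token counts, purely for illustration.
for params, tokens in [(7e10, 2e12), (5e11, 5e12), (2e12, 1e13)]:
    flops = training_flops(params, tokens)
    side = "over" if flops > THRESHOLD else "under"
    print(f"{params:.0e} params x {tokens:.0e} tokens -> {flops:.1e} FLOPs ({side} 1e26)")
```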
