this post was submitted on 27 Oct 2023

Machine Learning


PAPER: https://arxiv.org/abs/2310.16764

SUMMARY

The paper "ConvNets Match Vision Transformers at Scale" from Google DeepMind aims to debunk the prevalent notion that Vision Transformers (ViTs) are inherently superior to ConvNets for large-scale image classification. Using the NFNet model family as a representative ConvNet architecture, the authors pre-train various models on the extensive JFT-4B dataset under different compute budgets, ranging from 0.4k to 110k TPU-v4 core hours. Through this empirical analysis, they observe a log-log scaling law between held-out loss and compute budget. Importantly, when these NFNets are fine-tuned on ImageNet, they match the performance metrics of ViTs trained under comparable computational constraints. Their most resource-intensive model even achieves a Top-1 ImageNet accuracy of 90.4%.

The crux of the paper's argument is that the supposed performance gap between ConvNets and ViTs largely vanishes under a fair comparison, which accounts for compute and data scale. In other words, the efficacy of a machine learning model in large-scale image classification is more dependent on the available data and computational resources than on the choice between ConvNet and Vision Transformer architectures. This challenges the community's leaning towards ViTs and emphasizes the importance of equitable benchmarking when evaluating different neural network architectures.
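The log-log scaling law the summary mentions is just a power law, loss = a · C^(−b), which is linear in log-log space. A minimal sketch of fitting one with NumPy, using made-up loss/compute numbers rather than the paper's actual measurements:

```python
import numpy as np

# Hypothetical (compute, loss) points lying exactly on a power law
# loss = a * C^(-b); NOT the paper's measured values.
compute = np.array([0.4e3, 1.6e3, 6.4e3, 25.6e3, 110e3])  # TPU-v4 core hours
loss = 5.0 * compute ** -0.1

# A power law is linear in log-log space:
# log(loss) = log(a) - b * log(compute)
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
print(f"fitted exponent b  = {-slope:.3f}")            # recovers 0.1
print(f"fitted prefactor a = {np.exp(intercept):.3f}")  # recovers 5.0
```

On real data the fit would of course be noisy; the exponent and prefactor would come from the measured held-out losses at each compute budget.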

top 21 comments
[–] Smallpaul@alien.top 1 points 1 year ago (1 children)

The “it” in AI models is the dataset.

... trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point. Sufficiently large diffusion conv-unets produce the same images as ViT generators. AR sampling produces the same images as diffusion.

[–] currentscurrents@alien.top 1 points 1 year ago (3 children)

Maybe it's less about having as many parameters as the human brain, and more about having datasets as rich and diverse as the real world.

[–] TheCrazyAcademic@alien.top 1 points 1 year ago

Well, people with mutations like megacephaly (an enlarged brain) aren't any smarter, and are sometimes even less intelligent because it disrupts neuronal density, so we know brain size doesn't correlate with intelligence at all. Animals with bigger brains, and therefore more neurons than humans, aren't smarter either, at least in theory; scientists could just be using bad benchmarks.

[–] TikiTDO@alien.top 1 points 1 year ago (1 children)

People talk a lot about datasets being "rich" and "diverse," but I wish they would also mention "not full of crap" in the same breath. Whether it be AI or humans, garbage in, garbage out still applies. You can have a rich and diverse dataset that teaches an AI horrific, terrible ideas and practices.

We know with humans you get a very different effect based on the quality of the teacher and the teaching material, and we know that a bad teacher teaching bad lessons can be even worse than nothing at all. AI isn't really that different.

[–] shanereid1@alien.top 1 points 1 year ago

Was at a big data industry conference yesterday, and one of the big takeaways was that data quality is going to be critical in the age of genAI.

[–] hoppyJonas@alien.top 1 points 11 months ago

It's probably both. In the Chinchilla paper, they showed that for compute-optimal training, the model size and the training dataset size should grow in proportion to each other.
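That proportionality can be sketched with the common approximation C ≈ 6·N·D (training FLOPs ≈ 6 × parameters × tokens): if D is held proportional to N, both scale as √C. The constants below, including the oft-quoted ~20 tokens per parameter, are illustrative rather than Chinchilla's fitted coefficients:

```python
# Chinchilla-style compute-optimal allocation under C ≈ 6 * N * D
# (C = training FLOPs, N = parameters, D = training tokens).
# With D = tokens_per_param * N, both N and D grow as sqrt(C).

def optimal_allocation(flops, tokens_per_param=20.0):
    n_params = (flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for c in (1e21, 1e23):
    n, d = optimal_allocation(c)
    print(f"C={c:.0e}: params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Note that a 100x compute budget buys only a 10x larger model and 10x more tokens under this rule, which is why data scale matters as much as parameter count.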

[–] ewanmcrobert@alien.top 1 points 1 year ago (2 children)

I was going to say vision transformers still have the advantage, as they are often pre-trained on unlabelled images. But now that I think of it, I don't see any reason why you couldn't pre-train a convolutional neural network in the same manner. I just seem to read about it more with vision transformers than with CNNs.

[–] qalis@alien.top 1 points 1 year ago

That's exactly what ConvNeXt V2 does

[–] mileseverett@alien.top 1 points 1 year ago

Masked Image Modelling objectives are just harder with CNNs compared to ViTs
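One way to see why: an MAE-style objective masks random patches, and a ViT encoder can simply omit the masked tokens from its input sequence, whereas a dense convolution still slides over the zeroed regions and leaks mask boundaries into its features (which is why ConvNeXt V2 resorts to sparse convolutions). A hypothetical masking sketch in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_patch_mask(img, patch=4, mask_ratio=0.6):
    """Zero out a random subset of non-overlapping patches (MAE-style)."""
    h, w = img.shape
    gh, gw = h // patch, w // patch
    n_patches = gh * gw
    n_masked = int(mask_ratio * n_patches)
    idx = rng.permutation(n_patches)[:n_masked]
    masked = img.copy()
    for i in idx:
        r, c = divmod(i, gw)
        masked[r*patch:(r+1)*patch, c*patch:(c+1)*patch] = 0.0
    return masked

img = rng.standard_normal((16, 16))
out = random_patch_mask(img)
# A ViT encoder could drop the 60% masked patches from its token
# sequence entirely; a standard dense convolution still computes
# over the zeroed pixels, so masking is less clean for CNNs.
```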

[–] linearmodality@alien.top 1 points 1 year ago (3 children)

Wasn't this already known? I thought the ConvNeXt paper already showed this a year and a half ago.

[–] RobbinDeBank@alien.top 1 points 1 year ago

This group might have too many TPU credits and not know what to do with them.

[–] qalis@alien.top 1 points 1 year ago

Yes and no. In my opinion, ConvNeXt is less about data and more about careful architecture design and smart training. But yeah, CNNs are better than ViTs if done well, that's true.

[–] That_Flamingo_4114@alien.top 1 points 1 year ago

Not necessarily; a maxed-out system under perfect conditions can match the newest developing technology. The paper's whole point was that how you use a technique can matter as much as the algorithm itself. Another paper making the same point came out of Google in the world of recommender systems.

[–] Dankmemexplorer@alien.top 1 points 1 year ago (1 children)

Isn't the biggest advantage of ViTs that they're easier to distribute training for?

[–] currentscurrents@alien.top 1 points 1 year ago

The other advantage is multimodality, you can tokenize anything.
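"Tokenize anything" just means flattening a modality into a sequence of vectors. A minimal NumPy sketch of ViT-style image patchification, with a randomly initialized projection standing in for the learned patch embedding:

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    patches = img[:gh*patch, :gw*patch].reshape(gh, patch, gw, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh*gw, patch*patch*c)
    return patches

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
tokens = patchify(img)                  # (196, 768): a ViT-style sequence
proj = rng.standard_normal((768, 512))  # learned embedding in a real model
embedded = tokens @ proj                # (196, 512) token embeddings
print(tokens.shape, embedded.shape)
```

The same recipe works for audio frames, video clips, or text: anything that can be chopped into fixed-size chunks becomes a token sequence for the same transformer.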

[–] ReasonablyBadass@alien.top 1 points 1 year ago (1 children)

The abstract says they trained on a labeled dataset. ViTs work on unlabeled ones, right?

[–] currentscurrents@alien.top 1 points 1 year ago

You can train CNNs on unlabeled data too. Unsupervised learning works with any model type, and diffusion models or VAEs are often CNN-based.

[–] GFrings@alien.top 1 points 1 year ago (1 children)

Has there been a study that performed a deep dive into the opposite end of the spectrum? There are myriad edge applications out there which cannot rely on training a large model and pruning it down for deployment. I wonder which architectures are most suited to learning at small scales.

[–] currentscurrents@alien.top 1 points 1 year ago

Generally, models with stronger inductive biases (like CNNs) work better at small scales - as long as those biases are correct for the kind of data you're working with.
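Weight sharing is one concrete example of such a bias: a conv layer reuses one small kernel across all spatial positions, so it has orders of magnitude fewer parameters than a fully-connected layer over the same input. A back-of-the-envelope comparison (shapes illustrative):

```python
# Parameter count of a 3x3 conv layer vs. a fully-connected layer
# mapping a 32x32x3 input to a 32x32x64 output. Weight sharing is
# the CNN inductive bias that makes the small-data regime tractable.

h, w, c_in, c_out, k = 32, 32, 3, 64, 3

conv_params = c_out * (c_in * k * k + 1)             # shared kernels + biases
dense_params = (h * w * c_out) * (h * w * c_in + 1)  # one weight per in/out pair

print(f"conv:  {conv_params:,} params")    # 1,792
print(f"dense: {dense_params:,} params")   # 201,392,128
```

That factor of ~10^5 fewer parameters is exactly the kind of prior that helps when data and compute are scarce, and that transformers have to learn from data at scale.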

[–] neu_jose@alien.top 1 points 1 year ago (1 children)

I'm saving my excitement for the "fully-connected is all you need" paper, 2026.

[–] Miss-Quiz-Mis@alien.top 1 points 1 year ago

Aren't transformers a sort of fully connected network with weights that are dynamic based on the specific input?