Machine Learning

[R] ConvNets Match Vision Transformers at Scale (alien.top)

submitted 2 years ago by psyyduck@alien.top to c/machinelearning@academy.garden

21 comments

PAPER: https://arxiv.org/abs/2310.16764

SUMMARY

The paper "ConvNets Match Vision Transformers at Scale" from Google DeepMind aims to debunk the prevalent notion that Vision Transformers (ViTs) are inherently superior to ConvNets for large-scale image classification. Using the NFNet model family as a representative ConvNet architecture, the authors pre-train various models on the extensive JFT-4B dataset under different compute budgets, ranging from 0.4k to 110k TPU-v4 core hours. Through this empirical analysis, they observe a log-log scaling law between held-out loss and compute budget. Importantly, when these NFNets are fine-tuned on ImageNet, they match the performance metrics of ViTs trained under comparable computational constraints. Their most resource-intensive model even achieves a Top-1 ImageNet accuracy of 90.4%.

The crux of the paper's argument is that the supposed performance gap between ConvNets and ViTs largely vanishes under a fair comparison, which accounts for compute and data scale. In other words, the efficacy of a machine learning model in large-scale image classification is more dependent on the available data and computational resources than on the choice between ConvNet and Vision Transformer architectures. This challenges the community's leaning towards ViTs and emphasizes the importance of equitable benchmarking when evaluating different neural network architectures.
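For anyone wondering what a "log-log scaling law" looks like in practice: it means held-out loss versus compute is roughly a straight line once both axes are on a log scale, i.e. loss ≈ a * compute^b with b negative. Below is a minimal sketch of fitting such a line. The function name `fit_power_law` and the data points are made up for illustration and are not taken from the paper; the paper's actual fitting procedure may differ.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss ~= a * compute**b via least squares in log-log space.

    Hypothetical helper for illustration only; not the paper's procedure.
    """
    # A power law is a straight line after taking logs of both axes:
    # log(loss) = log(a) + b * log(compute)
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(slope)


# Synthetic points (NOT results from the paper). The compute values are
# chosen to span the 0.4k to 110k TPU-v4 core hour range in the summary.
compute = np.array([0.4e3, 1.6e3, 6.4e3, 25.6e3, 110e3])
loss = 3.0 * compute ** -0.1  # assumed toy relationship

a, b = fit_power_law(compute, loss)
print(f"loss ~= {a:.3f} * compute^{b:.3f}")
```

A negative exponent `b` means loss keeps dropping as compute grows, with diminishing returns: each doubling of compute multiplies the loss by the same constant factor `2**b`.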

[–] Smallpaul@alien.top 1 points 2 years ago (5 children)

The “it” in AI models is the dataset.

... trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point. Sufficiently large diffusion conv-unets produce the same images as ViT generators. AR sampling produces the same images as diffusion.

[–] currentscurrents@alien.top 1 points 2 years ago (4 children)

Maybe it's less about having as many parameters as the human brain, and more about having datasets as rich and diverse as the real world.

[–] TikiTDO@alien.top 1 points 2 years ago (1 children)

People talk a lot about datasets being "rich" and "diverse," but I wish they would also mention "not full of crap" in the same breath. Whether it be AI or humans, garbage-in, garbage-out still applies. You can have a rich and diverse dataset that teaches AI horrific, terrible ideas and practices.

We know with humans you get a very different effect based on the quality of the teacher and the teaching material, and we know that a bad teacher teaching bad lessons can be even worse than nothing at all. AI isn't really that different.

[–] shanereid1@alien.top 1 points 2 years ago

Was at a big data industry conference yesterday, and one of the big takeaways was that data quality is going to be critical in the age of genAI.
