this post was submitted on 27 Oct 2023

Machine Learning



PAPER: https://arxiv.org/abs/2310.16764

SUMMARY

The paper "ConvNets Match Vision Transformers at Scale" from Google DeepMind aims to debunk the prevalent notion that Vision Transformers (ViTs) are inherently superior to ConvNets for large-scale image classification. Using the NFNet model family as a representative ConvNet architecture, the authors pre-train various models on the extensive JFT-4B dataset under different compute budgets, ranging from 0.4k to 110k TPU-v4 core hours. Through this empirical analysis, they observe a log-log scaling law between held-out loss and compute budget. Importantly, when these NFNets are fine-tuned on ImageNet, they match the performance metrics of ViTs trained under comparable computational constraints. Their most resource-intensive model even achieves a Top-1 ImageNet accuracy of 90.4%.
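For anyone wondering what a "log-log scaling law" means in practice: held-out loss plotted against compute looks like a straight line when both axes are logarithmic, i.e. loss ≈ a * compute^b with b < 0. Below is a minimal sketch of fitting such a line with NumPy. The function name `fit_power_law` and all data points are made up for illustration; they are not the paper's measurements, and this is not necessarily how the authors performed their fit.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss ~= a * compute**b by ordinary least squares in log-log space.

    Returns (a, b). A negative b means loss falls as compute grows.
    """
    # A power law is a straight line after taking logs of both sides:
    # log(loss) = log(a) + b * log(compute)
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(slope)


# Synthetic compute budgets spanning the range quoted in the summary
# (0.4k to 110k core hours). The loss values are generated from an
# assumed law, NOT taken from the paper.
compute = np.array([0.4e3, 1.6e3, 6.4e3, 25.6e3, 110e3])
loss = 5.0 * compute ** -0.1

a, b = fit_power_law(compute, loss)
print(f"loss ~= {a:.3f} * compute^{b:.3f}")
```

With real measurements the points would scatter around the line, and the fitted exponent b is the number people usually quote when comparing how well different architectures scale.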

The crux of the paper's argument is that the supposed performance gap between ConvNets and ViTs largely vanishes under a fair comparison, which accounts for compute and data scale. In other words, the efficacy of a machine learning model in large-scale image classification is more dependent on the available data and computational resources than on the choice between ConvNet and Vision Transformer architectures. This challenges the community's leaning towards ViTs and emphasizes the importance of equitable benchmarking when evaluating different neural network architectures.

[–] linearmodality@alien.top 1 points 1 year ago (3 children)

Wasn't this already known? I thought the ConvNeXt paper already showed this a year and a half ago.

[–] RobbinDeBank@alien.top 1 points 1 year ago

This group might have too many TPU credits and not know what to do with them.

[–] That_Flamingo_4114@alien.top 1 points 1 year ago

Not necessarily. A maxed-out system under ideal conditions can match the newest technology still in development. The paper's whole point is that how you use a technique can matter as much as the algorithm itself. A paper from Google made the same point about recommender systems.

[–] qalis@alien.top 1 points 1 year ago

Yes and no. In my opinion, ConvNeXt is less about data and more about careful architecture design and smart training. But yeah, CNNs are better than ViTs if done well, that's true.