this post was submitted on 23 Nov 2023

Machine Learning
Paper: https://arxiv.org/abs/2311.05908

Code: https://github.com/HazyResearch/flash-fft-conv

Blog post: https://hazyresearch.stanford.edu/blog/2023-11-13-flashfftconv

Abstract:

Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks but lag behind the most optimized Transformers in wall-clock time. A major bottleneck is the Fast Fourier Transform (FFT)--which allows long convolutions to run in O(N log N) time in sequence length N but has poor hardware utilization. In this paper, we study how to optimize the FFT convolution. We find two key bottlenecks: the FFT does not effectively use specialized matrix multiply units, and it incurs expensive I/O between layers of the memory hierarchy. In response, we propose FlashFFTConv. FlashFFTConv uses a matrix decomposition that computes the FFT using matrix multiply units and enables kernel fusion for long sequences, reducing I/O. We also present two sparse convolution algorithms--1) partial convolutions and 2) frequency-sparse convolutions--which can be implemented simply by skipping blocks in the matrix decomposition, enabling further opportunities for memory and compute savings. FlashFFTConv speeds up exact FFT convolutions by up to 7.93× over PyTorch and achieves up to 4.4× speedup end-to-end. Given the same compute budget, FlashFFTConv allows Hyena-GPT-s to achieve 2.3 points better perplexity on the PILE and M2-BERT-base to achieve 3.3 points higher GLUE score--matching models with twice the parameter count. FlashFFTConv also achieves 96.1% accuracy on Path-512, a high-resolution vision task where no model had previously achieved better than 50%. Furthermore, partial convolutions enable longer-sequence models--yielding the first DNA model that can process the longest human genes (2.3M base pairs)--and frequency-sparse convolutions speed up pretrained models while maintaining or improving model quality.
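For readers mapping the abstract onto code: the operation being accelerated is the standard FFT convolution, sketched below in plain PyTorch. This is a minimal illustration of the baseline (roughly what the "up to 7.93× over PyTorch" figure is measured against), not the FlashFFTConv kernel itself; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def fft_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Linear convolution of a length-N input with a length-N filter,
    computed in O(N log N) via the FFT. u: (batch, d, N), k: (d, N)."""
    N = u.shape[-1]
    # Zero-pad to 2N so the circular convolution equals the linear one.
    u_f = torch.fft.rfft(u, n=2 * N)
    k_f = torch.fft.rfft(k, n=2 * N)
    y = torch.fft.irfft(u_f * k_f, n=2 * N)  # pointwise multiply in frequency
    return y[..., :N]  # keep the first N (causal) outputs

u, k = torch.randn(4, 64, 1024), torch.randn(64, 1024)
y = fft_conv(u, k)  # shape (4, 64, 1024)
```

The "matrix decomposition" in the abstract is, at its core, the Cooley-Tukey factorization: a length-N DFT splits into two batches of smaller dense matmuls plus a pointwise twiddle correction, which is what lets the FFT run on matrix multiply units. Below is a toy sketch of that identity only; the actual kernel additionally fuses the filter multiply and schedules I/O across the memory hierarchy, which this omits.

```python
import torch

def dft_matrix(n: int) -> torch.Tensor:
    # Dense DFT matrix: F[j, k] = exp(-2*pi*i*j*k / n).
    idx = torch.arange(n, dtype=torch.float64)
    return torch.exp(-2j * torch.pi * idx[:, None] * idx[None, :] / n)

def fft_as_matmuls(x: torch.Tensor, n1: int, n2: int) -> torch.Tensor:
    """Length-(n1*n2) DFT as two dense matmuls plus pointwise twiddles."""
    N = n1 * n2
    A = x.reshape(n1, n2)                  # A[a, b] = x[n2 * a + b]
    twiddle = torch.exp(
        -2j * torch.pi
        * torch.arange(n1, dtype=torch.float64)[:, None]
        * torch.arange(n2, dtype=torch.float64)[None, :] / N)
    B = dft_matrix(n1) @ A                 # matmul #1: n1-point DFTs
    C = twiddle * B                        # pointwise twiddle correction
    D = C @ dft_matrix(n2)                 # matmul #2: n2-point DFTs
    return D.T.reshape(N)                  # output index k = k1 + n1 * k2

x = torch.randn(1024, dtype=torch.complex128)
assert torch.allclose(fft_as_matmuls(x, 32, 32), torch.fft.fft(x))
```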


top 6 comments
[–] currentscurrents@alien.top 1 points 10 months ago (1 children)

Just built this to try on my CNN and then realized it was only for 1D convolutions. Whoops.

[–] President_Xi_@alien.top 1 points 10 months ago (1 children)
[–] currentscurrents@alien.top 1 points 10 months ago (1 children)

Then you lose the 2D grid structure of the image, which is why you want to use a CNN in the first place.

I think it's possible to apply many of these optimizations to 2D convs as well though. This group is just more interested in language modeling than images.
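For what it's worth, the frequency-domain trick itself does extend to 2D without flattening the grid. Here's a minimal sketch in plain PyTorch; this is illustrative only and not part of the FlashFFTConv repo:

```python
import torch

def fft_conv2d(x: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """2D linear convolution via the 2D FFT, preserving the image grid.
    x: (batch, channels, H, W), k: (channels, H, W)."""
    H, W = x.shape[-2:]
    # Pad both spatial dims so circular convolution equals linear convolution.
    x_f = torch.fft.rfft2(x, s=(2 * H, 2 * W))
    k_f = torch.fft.rfft2(k, s=(2 * H, 2 * W))
    y = torch.fft.irfft2(x_f * k_f, s=(2 * H, 2 * W))
    return y[..., :H, :W]

x, k = torch.randn(2, 3, 128, 128), torch.randn(3, 128, 128)
y = fft_conv2d(x, k)  # shape (2, 3, 128, 128)
```

The missing piece relative to FlashFFTConv would be the tensor-core matmul decomposition and kernel fusion for the 2D case, which is presumably the nontrivial part.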

[–] Raion17@alien.top 1 points 10 months ago

Yeah, they built this project because they have previous work that uses FFT convolutions for sequence modeling.

[–] artsybashev@alien.top 1 points 10 months ago

How was it used in the Path-512 task?

[–] Fit-Recognition9795@alien.top 1 points 10 months ago

Wow! Amazing research and results.