this post was submitted on 23 Nov 2023

Machine Learning
Paper: https://arxiv.org/abs/2311.05908

Code: https://github.com/HazyResearch/flash-fft-conv

Blog post: https://hazyresearch.stanford.edu/blog/2023-11-13-flashfftconv

Abstract:

Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks but lag behind the most optimized Transformers in wall-clock time. A major bottleneck is the Fast Fourier Transform (FFT)--which allows long convolutions to run in O(N log N) time in sequence length N but has poor hardware utilization. In this paper, we study how to optimize the FFT convolution. We find two key bottlenecks: the FFT does not effectively use specialized matrix multiply units, and it incurs expensive I/O between layers of the memory hierarchy. In response, we propose FlashFFTConv. FlashFFTConv uses a matrix decomposition that computes the FFT using matrix multiply units and enables kernel fusion for long sequences, reducing I/O. We also present two sparse convolution algorithms--1) partial convolutions and 2) frequency-sparse convolutions--which can be implemented simply by skipping blocks in the matrix decomposition, enabling further opportunities for memory and compute savings. FlashFFTConv speeds up exact FFT convolutions by up to 7.93× over PyTorch and achieves up to 4.4× speedup end-to-end. Given the same compute budget, FlashFFTConv allows Hyena-GPT-s to achieve 2.3 points better perplexity on the PILE and M2-BERT-base to achieve 3.3 points higher GLUE score--matching models with twice the parameter count. FlashFFTConv also achieves 96.1% accuracy on Path-512, a high-resolution vision task where no model had previously achieved better than 50%. Furthermore, partial convolutions enable longer-sequence models--yielding the first DNA model that can process the longest human genes (2.3M base pairs)--and frequency-sparse convolutions speed up pretrained models while maintaining or improving model quality.
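For readers mapping the abstract onto code: the operation being accelerated is the standard FFT convolution, sketched below in plain PyTorch. This is a minimal illustration of the baseline (roughly what the "up to 7.93× over PyTorch" figure is measured against), not the FlashFFTConv kernel itself; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def fft_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Linear convolution of a length-N input with a length-N filter,
    computed in O(N log N) via the FFT. u: (batch, d, N), k: (d, N)."""
    N = u.shape[-1]
    # Zero-pad to 2N so the circular convolution equals the linear one.
    u_f = torch.fft.rfft(u, n=2 * N)
    k_f = torch.fft.rfft(k, n=2 * N)
    y = torch.fft.irfft(u_f * k_f, n=2 * N)  # pointwise multiply in frequency
    return y[..., :N]  # keep the first N (causal) outputs

u, k = torch.randn(4, 64, 1024), torch.randn(64, 1024)
y = fft_conv(u, k)  # shape (4, 64, 1024)
```

The "matrix decomposition" in the abstract is, at its core, the Cooley-Tukey factorization: a length-N DFT splits into two batches of smaller dense matmuls plus a pointwise twiddle correction, which is what lets the FFT run on matrix multiply units. Below is a toy sketch of that identity only; the actual kernel additionally fuses the filter multiply and schedules I/O across the memory hierarchy, which this omits.

```python
import torch

def dft_matrix(n: int) -> torch.Tensor:
    # Dense DFT matrix: F[j, k] = exp(-2*pi*i*j*k / n).
    idx = torch.arange(n, dtype=torch.float64)
    return torch.exp(-2j * torch.pi * idx[:, None] * idx[None, :] / n)

def fft_as_matmuls(x: torch.Tensor, n1: int, n2: int) -> torch.Tensor:
    """Length-(n1*n2) DFT as two dense matmuls plus pointwise twiddles."""
    N = n1 * n2
    A = x.reshape(n1, n2)                  # A[a, b] = x[n2 * a + b]
    twiddle = torch.exp(
        -2j * torch.pi
        * torch.arange(n1, dtype=torch.float64)[:, None]
        * torch.arange(n2, dtype=torch.float64)[None, :] / N)
    B = dft_matrix(n1) @ A                 # matmul #1: n1-point DFTs
    C = twiddle * B                        # pointwise twiddle correction
    D = C @ dft_matrix(n2)                 # matmul #2: n2-point DFTs
    return D.T.reshape(N)                  # output index k = k1 + n1 * k2

x = torch.randn(1024, dtype=torch.complex128)
assert torch.allclose(fft_as_matmuls(x, 32, 32), torch.fft.fft(x))
```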


top 6 comments
[–] currentscurrents@alien.top 1 points 10 months ago (1 children)

Just built this to try on my CNN and then realized it was only for 1D convolutions. Whoops.

[–] President_Xi_@alien.top 1 points 10 months ago (1 children)
[–] currentscurrents@alien.top 1 points 10 months ago (1 children)

Then you lose the 2D grid structure of the image, which is why you want to use a CNN in the first place.

I think it's possible to apply many of these optimizations to 2D convs as well though. This group is just more interested in language modeling than images.
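For what it's worth, the frequency-domain trick itself does extend to 2D without flattening the grid. Here's a minimal sketch in plain PyTorch; this is illustrative only and not part of the FlashFFTConv repo:

```python
import torch

def fft_conv2d(x: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """2D linear convolution via the 2D FFT, preserving the image grid.
    x: (batch, channels, H, W), k: (channels, H, W)."""
    H, W = x.shape[-2:]
    # Pad both spatial dims so circular convolution equals linear convolution.
    x_f = torch.fft.rfft2(x, s=(2 * H, 2 * W))
    k_f = torch.fft.rfft2(k, s=(2 * H, 2 * W))
    y = torch.fft.irfft2(x_f * k_f, s=(2 * H, 2 * W))
    return y[..., :H, :W]

x, k = torch.randn(2, 3, 128, 128), torch.randn(3, 128, 128)
y = fft_conv2d(x, k)  # shape (2, 3, 128, 128)
```

The missing piece relative to FlashFFTConv would be the tensor-core matmul decomposition and kernel fusion for the 2D case, which is presumably the nontrivial part.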

[–] Raion17@alien.top 1 points 10 months ago

Yeah, they built this project because they have previous work that uses FFT convolutions for sequence modeling.

[–] artsybashev@alien.top 1 points 10 months ago

How was it used in the Path-512 task?

[–] Fit-Recognition9795@alien.top 1 points 10 months ago

Wow! Amazing research and results.