Because there's no obvious way to parallelize the causal self-attention with a feedforward (FF) layer.
You can just use triangular matrices; autoregressive language modelling can be done even with linear-only layers. See page 12 of https://arxiv.org/abs/2309.06979
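For anyone curious what that looks like, here's a minimal numpy sketch of the causal-masking idea: a lower-triangular mixing matrix over the time axis keeps a purely linear layer autoregressive, since output t is a linear combination of inputs 0..t only. This is a toy illustration (dimensions, weights, and the mixing layer are all made up here), not the construction from the linked paper; see its page 12 for the actual details.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                            # sequence length, embedding dim

x = rng.normal(size=(T, d))            # token embeddings
W = np.tril(rng.normal(size=(T, T)))   # causal (lower-triangular) mixing weights

y = W @ x                              # linear, attention-free token mixing

# Causality check: perturbing a future token must not change earlier outputs.
x2 = x.copy()
x2[3] += 1.0                           # edit token 3
y2 = W @ x2
assert np.allclose(y[:3], y2[:3])      # outputs at positions 0..2 are unchanged
print("positions before 3 unaffected by editing token 3")
```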
Previous discussion of Fast Feed Forward, an earlier paper by the same author that this one builds on.