[–] mgostIH@alien.top 1 points 11 months ago

Previous discussion of Fast Feed Forward, the earlier paper by the same author that this one builds on.

[–] mgostIH@alien.top 1 points 11 months ago

> because there's no obvious way to parallelize the causal self-attention with a FF

You can just use triangular matrices; autoregressive language modelling can be done even with linear-only layers. See page 12 of https://arxiv.org/abs/2309.06979
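
As a rough illustration (not the exact construction from that paper), here's a minimal PyTorch sketch of a linear token-mixing layer whose weight matrix is masked to be lower-triangular, so each output position only reads from positions at or before it and causality is preserved. The name `CausalLinearMix` and the shapes are just illustrative assumptions:

```python
import torch
import torch.nn as nn


class CausalLinearMix(nn.Module):
    """Linear token-mixing layer with a lower-triangular weight matrix,
    so position t only depends on positions s <= t (autoregressive-safe)."""

    def __init__(self, seq_len: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(seq_len, seq_len) / seq_len**0.5)
        # Lower-triangular mask: entry (t, s) is kept only when s <= t.
        self.register_buffer("mask", torch.tril(torch.ones(seq_len, seq_len)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); mix linearly over the sequence dimension.
        w = self.weight * self.mask
        return torch.einsum("ts,bsd->btd", w, x)


# Usage: each output position is a linear combination of earlier positions only,
# so stacking this with per-token feed-forward layers keeps the model causal.
x = torch.randn(2, 16, 64)          # (batch, seq_len, d_model)
layer = CausalLinearMix(seq_len=16)
y = layer(x)                        # same shape, still causal
```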