Because there's no obvious way to parallelize the causal self-attention with a feedforward (FF) layer.
You can just use triangular matrices; autoregressive language modelling can be done even with linear-only layers. See page 12 of https://arxiv.org/abs/2309.06979
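For anyone curious what that looks like, here's a minimal numpy sketch of the causal-masking idea: a lower-triangular mixing matrix over the time axis keeps a purely linear layer autoregressive, since output t is a linear combination of inputs 0..t only. This is a toy illustration (dimensions, weights, and the mixing layer are all made up here), not the construction from the linked paper; see its page 12 for the actual details.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                            # sequence length, embedding dim

x = rng.normal(size=(T, d))            # token embeddings
W = np.tril(rng.normal(size=(T, T)))   # causal (lower-triangular) mixing weights

y = W @ x                              # linear, attention-free token mixing

# Causality check: perturbing a future token must not change earlier outputs.
x2 = x.copy()
x2[3] += 1.0                           # edit token 3
y2 = W @ x2
assert np.allclose(y[:3], y2[:3])      # outputs at positions 0..2 are unchanged
print("positions before 3 unaffected by editing token 3")
```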
Previous discussion of Fast Feed Forward, an earlier paper by the same author that this one builds on.