Machine Learning

[R] Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers (alien.top)

submitted 2 years ago by APaperADay@alien.top to c/machinelearning@academy.garden

Paper: https://arxiv.org/abs/2311.10642

Code: https://github.com/vulus98/Rethinking-attention

Abstract:

This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks. We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained using the original components via knowledge distillation. Our experiments, conducted on the IWSLT2017 dataset, reveal the capacity of these "attentionless Transformers" to rival the performance of the original architecture. Through rigorous ablation studies, and experimenting with various replacement network types and sizes, we offer insights that support the viability of our approach. This not only sheds light on the adaptability of shallow feed-forward networks in emulating attention mechanisms but also underscores their potential to streamline complex architectures for sequence-to-sequence tasks.

[–] ganzzahl@alien.top 1 points 2 years ago (1 children)

The reasons this isn't done:

Fixed max sequence length (shorter sequences aren't less computation)
Very short max sequence length (50 tokens in this paper!)
Very inefficient training (for a target sequence with N tokens, this requires N forward passes for the decoder, as opposed to 1 with attention, because there's no obvious way to parallelize the causal self-attention with a FF)
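
To make the first two points concrete, here is a rough sketch of why a feed-forward replacement that reads a padded, flattened sequence does the same work for short and long inputs. It is not the paper's code: `D_MODEL`, `HIDDEN`, and all function names are hypothetical, only the maximum length of 50 comes from the point above, and the attention count covers only the score matrix.

```python
MAX_LEN = 50   # fixed maximum length mentioned in the list above
D_MODEL = 8    # hypothetical feature size, kept small for illustration
HIDDEN = 128   # hypothetical hidden width of the replacement layer


def pad_and_flatten(tokens, max_len=MAX_LEN, d_model=D_MODEL):
    """Pad a list of d_model-sized vectors to max_len and flatten it."""
    if len(tokens) > max_len:
        raise ValueError("sequence longer than the fixed maximum length")
    padded = tokens + [[0.0] * d_model] * (max_len - len(tokens))
    return [value for token in padded for value in token]


def ff_mult_adds(flat_input_size, hidden_size=HIDDEN):
    """Multiply-adds of one dense layer: every input feeds every hidden unit."""
    return flat_input_size * hidden_size


def attention_score_mult_adds(seq_len, d_model=D_MODEL):
    """Multiply-adds for the query-key score matrix alone."""
    return seq_len * seq_len * d_model


short_seq = [[1.0] * D_MODEL for _ in range(5)]
long_seq = [[1.0] * D_MODEL for _ in range(50)]

for name, seq in (("short", short_seq), ("long", long_seq)):
    flat = pad_and_flatten(seq)
    print(name, len(flat), ff_mult_adds(len(flat)),
          attention_score_mult_adds(len(seq)))
```

The dense layer always sees `MAX_LEN * D_MODEL` inputs, so its cost does not shrink for the 5-token sequence, while the attention score count grows with the square of the actual length.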

[–] mgostIH@alien.top 1 points 2 years ago

> because there's no obvious way to parallelize the causal self-attention with a FF

You can just use triangular matrices; autoregressive language modelling can be done even with linear-only layers. See page 12 of https://arxiv.org/abs/2309.06979
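
A minimal sketch of the triangular-matrix idea, assuming a single linear layer that mixes information across positions. This only illustrates the general trick and is not taken from the linked paper; `causal_mix`, `W`, and `X` are hypothetical names. Because only the lower triangle of the position-mixing matrix is used, one pass over the whole sequence gives the same per-position outputs as N separate passes over prefixes.

```python
import random


def causal_mix(weights, xs):
    """Mix token features across positions using only the lower triangle.

    weights: N x N list of lists (position-mixing matrix)
    xs:      n x D list of token feature vectors, with n <= N
    Output row t depends only on xs[0..t], since entries with j > t are skipped.
    """
    n, d = len(xs), len(xs[0])
    out = []
    for t in range(n):
        row = [0.0] * d
        for j in range(t + 1):  # j <= t: lower triangle only
            for k in range(d):
                row[k] += weights[t][j] * xs[j][k]
        out.append(row)
    return out


random.seed(0)
N, D = 5, 3
W = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(N)]
X = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(N)]

# One pass over the full sequence yields outputs for every position at once.
full = causal_mix(W, X)

# Reference: N separate passes, each seeing only a prefix of the sequence.
for t in range(N):
    prefix_out = causal_mix(W, X[: t + 1])
    assert prefix_out[t] == full[t]
```

In a real model the same effect is usually obtained by multiplying the weight matrix elementwise with a lower-triangular mask, so the whole sequence is handled by one matrix product during training.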