tdgros@alien.top 1 points 11 months ago (1 children)

It's the same mathematically, but not computationally: the tokens are projected to a smaller dimension d. The complexity is 2Nd, whereas it would be N² if you fused the weight matrices.
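
A rough sketch of that count in plain Python. It assumes N is the token (input) dimension and d is the smaller projected dimension, and it writes the two weight matrices as A (N×d) and B (d×N). Those names and the sizes are hypothetical, picked only for illustration, and the costs are multiply-adds per token.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def factored_cost(n, d):
    """Multiply-adds per token for x @ A then @ B, with A: n x d and B: d x n."""
    return 2 * n * d

def fused_cost(n):
    """Multiply-adds per token for x @ M, with M = A @ B of shape n x n."""
    return n * n

random.seed(0)
N, d = 8, 2  # hypothetical sizes, not from the thread
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
B = [[random.gauss(0, 1) for _ in range(N)] for _ in range(d)]
x = [[random.gauss(0, 1) for _ in range(N)]]  # one token as a 1 x N row

M = matmul(A, B)                      # fused N x N weight matrix
y_factored = matmul(matmul(x, A), B)  # project down to d, then apply B
y_fused = matmul(x, M)                # single multiply by the fused matrix

# Same result up to floating-point rounding, different amount of work.
max_diff = max(abs(p - q) for p, q in zip(y_factored[0], y_fused[0]))
print("outputs match:", max_diff < 1e-9)
print("factored cost:", factored_cost(N, d), "fused cost:", fused_cost(N))
```

Under these assumptions the factored form is cheaper whenever d < N/2, which is the usual case when d is much smaller than N.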