[–] eigenham@alien.top 1 points 9 months ago

I would suggest looking into the math a little more. The query, key, and value matrices in a self-attention layer are each a (linear) function of the input sequence: Q = XW_Q, K = XW_K, V = XW_V. So the attention weights are the softmax of a quadratic function of the input, softmax(QK^T / sqrt(d_k)), and the layer's output then multiplies those weights by yet another linear function of the input (V), iirc.
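
A minimal numpy sketch of what I mean (toy dimensions, random weights, and the names Wq/Wk/Wv are all made up here, just to show where the linear and quadratic pieces live):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_head = 4, 8, 8
rng = np.random.default_rng(0)

X = rng.standard_normal((seq_len, d_model))   # input sequence
Wq = rng.standard_normal((d_model, d_head))   # learned projection matrices
Wk = rng.standard_normal((d_model, d_head))
Wv = rng.standard_normal((d_model, d_head))

Q, K, V = X @ Wq, X @ Wk, X @ Wv              # each is linear in X

scores = Q @ K.T / np.sqrt(d_head)            # quadratic in X
weights = softmax(scores)                     # softmax of that quadratic
out = weights @ V                             # weights times another linear map of X
```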