I would suggest looking into the math a little more. I think the Q, K, and V matrices in the attention layer are each a (linear) function of the input sequence, so the attention weights are the softmax of a quadratic form in the input iirc
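A minimal numpy sketch of that claim (single head, no value projection or scaling details beyond 1/√d; the weight matrices here are random stand-ins, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 3, 4                               # sequence length, model dimension
X = rng.standard_normal((n, d))           # input sequence (rows are tokens)
W_q = rng.standard_normal((d, d))         # query projection (stand-in weights)
W_k = rng.standard_normal((d, d))         # key projection (stand-in weights)

# Q and K are linear functions of the input X ...
Q = X @ W_q
K = X @ W_k

# ... so the pre-softmax scores QK^T are a quadratic form in X:
scores = Q @ K.T / np.sqrt(d)
quadratic_form = X @ (W_q @ W_k.T) @ X.T / np.sqrt(d)
assert np.allclose(scores, quadratic_form)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Attention weights: the softmax of that quadratic form, one row per token.
A = softmax(scores)
```

Each row of `A` sums to 1, and the whole map from `X` to `scores` is degree 2 in `X`, which is the sense in which attention is "softmax of a quadratic of the input".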