dammit all of the answers are fkin terrible. Looks like the ai bots took over or everyone in this subreddit has become braindead since the blackout.
You obviously don't do W_q @ W_k. That's totally stupid.
What transformers actually compute is (x_i@W_q) @ (x_j@W_k), where x_i and x_j are two tokens in the sequence. That's an interaction between tokens, so it can't be precomputed before you see the input. What the papers write as Q and K is just Q = x_i @ W_q and K = x_j @ W_k.
(Transposes omitted for notational clarity, work that out yourself)
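To make this concrete, here's a minimal NumPy sketch (toy sizes and random weights are my own assumptions, not from any paper). It computes the score matrix the standard way via Q and K, then checks that even if you algebraically fold the two projections into a single matrix W_q @ W_k.T, the result still depends on the tokens x, so there's nothing token-independent to precompute:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed for illustration only)
d_model, d_k, seq_len = 8, 4, 3

W_q = rng.normal(size=(d_model, d_k))    # learned query projection
W_k = rng.normal(size=(d_model, d_k))    # learned key projection
x = rng.normal(size=(seq_len, d_model))  # token embeddings x_1 .. x_n

Q = x @ W_q  # row i is x_i @ W_q
K = x @ W_k  # row j is x_j @ W_k

# scores[i, j] = (x_i @ W_q) @ (x_j @ W_k).T  -- the token interaction
scores = Q @ K.T / np.sqrt(d_k)

# You *can* merge the projections into one matrix, but the product
# still wraps around x on both sides, so it only tells you the score
# once the sequence is known.
merged = x @ (W_q @ W_k.T) @ x.T / np.sqrt(d_k)
assert np.allclose(scores, merged)
```

The assert passes because (x_i W_q)(x_j W_k)^T = x_i (W_q W_k^T) x_j^T by associativity; the only thing precomputable offline is W_q @ W_k.T itself, never the scores.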