this post was submitted on 25 Nov 2023
LocalLLaMA
If I understood the original explanation on GitHub for RWKV correctly, BlinkDL agrees that softmax attention is very capable in theory, but he thinks Transformers are not using it to its full potential, so theoretically less capable architectures can beat them.
This might be true, but I kind of doubt it. I played a bit with the 3B RWKV with a prompt like
(note the question-before-data ordering, which RWKV prefers; I tested the other way around too), and unless the query word is very early in the string, it gives me a random word. Even 1.3B transformer models seem to answer this correctly much more often (though not always).
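For anyone who wants to try this kind of check themselves, here is a rough sketch of how a position-dependent lookup test could be scripted. To be clear, this is not the exact prompt I used: the word list, the "key: value" layout, the question wording, and the function names (`build_prompt`, `make_case`, `accuracy_by_position`) are all made up for illustration, and `generate` is a placeholder for whatever function sends a prompt to your model and returns its text reply.

```python
import random

# Hypothetical vocabulary; any list of at least 2 * n_pairs distinct words works.
WORDS = [
    "apple", "river", "stone", "cloud", "tiger", "piano", "lemon", "chair",
    "ocean", "candle", "forest", "silver", "rocket", "garden", "violin",
    "marble", "window", "pepper", "harbor", "meadow",
]


def build_prompt(pairs, query_key, question_first=True):
    """Build a lookup prompt; question_first=True puts the question before the data."""
    question = f"Which word is paired with '{query_key}' in the list?"
    data = "\n".join(f"{k}: {v}" for k, v in pairs)
    parts = [question, data] if question_first else [data, question]
    return "\n\n".join(parts) + "\n\nAnswer:"


def make_case(rng, n_pairs, query_index):
    """Draw distinct words, pair them up, and pick the pair at query_index as the target."""
    words = rng.sample(WORDS, 2 * n_pairs)
    pairs = list(zip(words[:n_pairs], words[n_pairs:]))
    key, value = pairs[query_index]
    return pairs, key, value


def accuracy_by_position(generate, n_pairs=8, trials=20, seed=0, question_first=True):
    """Return the fraction of correct answers for each position of the queried pair."""
    rng = random.Random(seed)
    hits = [0] * n_pairs
    for _ in range(trials):
        for pos in range(n_pairs):
            pairs, key, value = make_case(rng, n_pairs, pos)
            reply = generate(build_prompt(pairs, key, question_first))
            tokens = reply.strip().split()
            # Count a hit only if the first word of the reply is the expected value.
            if tokens and tokens[0].strip(".,'\"").lower() == value:
                hits[pos] += 1
    return [h / trials for h in hits]
```

If a model only retrieves reliably when the queried pair is near the start, the first entry of the returned list should be high and the later ones should drop toward chance. Running it once with `question_first=True` and once with `False` covers both orderings.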