this post was submitted on 25 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

So RWKV 7B v5 is 60% trained now. I saw that the multilingual parts are better than Mistral now, and the English capabilities are close to Mistral, except for HellaSwag and ARC, where it's a little behind. All the benchmarks are on the RWKV Discord, and you can google the pros/cons of RWKV, though most of what you'll find is about v4.

Thoughts?

[–] MichalO19@alien.top 1 points 11 months ago (6 children)

If I am reading RWKV_v5_demo.py right, this is essentially a Retentive Network (so a linear Transformer), but without the positional encoding, with the token shifts from previous RWKV versions, and with trainable matrix-valued decay factors (instead of fixed decay factors like in RetNet).
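
To make that concrete, here is a minimal sketch of the kind of decayed linear-attention recurrence I mean. The function name, shapes, and per-channel decay vector are my own simplification for illustration, not the actual code in RWKV_v5_demo.py:

```python
import torch

def linear_attention_step(S, q_t, k_t, v_t, decay):
    """One recurrent step of decayed linear attention (toy version).

    S:     (d_k, d_v) running state matrix (the "memory")
    q_t:   (d_k,)     query/receptance for this token
    k_t:   (d_k,)     key for this token
    v_t:   (d_v,)     value for this token
    decay: (d_k,)     per-channel decay factors (trainable in RWKV v5,
                      fixed scalars per head in RetNet)
    """
    # Decay the old memory, then write the new key->value association.
    S = decay.unsqueeze(-1) * S + torch.outer(k_t, v_t)
    # Read out with the query: a single matrix-vector product per step.
    y_t = q_t @ S
    return S, y_t

# Toy usage: run a short random sequence through the recurrence.
d_k, d_v, T = 8, 8, 5
S = torch.zeros(d_k, d_v)
decay = torch.sigmoid(torch.randn(d_k))  # stand-in for a trained decay vector
for t in range(T):
    q_t, k_t, v_t = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    S, y_t = linear_attention_step(S, q_t, k_t, v_t, decay)
```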

Gotta say it's a pretty clean architecture, but I'll believe it surpasses Mistral when I see it. I don't think a linear transformer has a serious chance of beating a standard transformer with the same number of parameters.

It might have a chance for general zero-shot question answering, but I expect it to be much worse at in-context learning/memory tasks, simply because softmax attention is a far more capable learning algorithm than linear attention: theoretically it can learn any key -> value mapping in context, while linear attention by definition can only learn linear key -> value mappings (whatever that means in the embedding space), and it also risks double-writing into memory things it already knows (toy example below).
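
Here is a rough toy illustration of that double-writing point, not taken from either model's code. Linear attention stores everything in one matrix M = sum_t k_t v_t^T, so retrieval q @ M is a fixed linear map and writing the same key twice just adds its value again; softmax attention keeps all (k, v) pairs and re-weights them per query, so duplicates only re-normalize:

```python
import torch

d = 4
k = torch.randn(d)
v = torch.randn(d)

# Linear-attention memory: write the same association twice.
M = torch.outer(k, v) + torch.outer(k, v)
linear_readout = k @ M            # roughly 2 * |k|^2 * v, i.e. the value is over-counted

# Softmax attention over the stored pairs: duplicates just share the weight.
K = torch.stack([k, k])           # the same key stored twice
V = torch.stack([v, v])
weights = torch.softmax(K @ k, dim=0)
softmax_readout = weights @ V     # still proportional to v, no over-counting
```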

But hey, let's see.

[–] nderstand2grow@alien.top 1 points 11 months ago

Your comment is so insightful, thank you. If there are resources I can read/watch to learn about this stuff, I'd be happy if you could share them.
