So RWKV 7B v5 is 60% trained now. I saw that the multilingual parts are better than Mistral now, and the English capabilities are close to Mistral, except for HellaSwag and ARC, where it's a little behind. All the benchmarks are on the RWKV Discord, and you can google the pros/cons of RWKV, though most of them are about v4.

Thoughts?

[–] MichalO19@alien.top 1 points 10 months ago (5 children)

If I am reading this RWKV_v5_demo.py right, this is essentially a Retentive Network (so a linear transformer) but without the positional encoding, with the token shifts from previous RWKVs, and with trainable matrix-valued decay factors (instead of the fixed decay factors in RetNet).
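
Roughly what I mean, as a minimal sketch (my paraphrase of the recurrence, not the actual demo code; shapes are simplified and the token shift is omitted):

```python
import torch

def linear_attention_with_decay(r, k, v, w, u):
    """One head of an RWKV-v5-style recurrence (a paraphrase, not the demo code).

    r, k, v: (T, D) receptance / key / value vectors for T tokens
    w:       (D,)   trainable per-channel decay in (0, 1) -- cf. RetNet's fixed scalar decay
    u:       (D,)   "bonus" weight applied to the current token before it enters the state
    """
    T, D = k.shape
    state = torch.zeros(D, D)                # matrix-valued memory: an accumulated key->value map
    out = []
    for t in range(T):
        kv = torch.outer(k[t], v[t])         # write the current key/value pair as an outer product
        out.append(r[t] @ (state + u[:, None] * kv))  # read: past from the state, current token boosted by u
        state = w[:, None] * state + kv      # decay the old memory per key channel, then add the new pair
    return torch.stack(out)

T, D = 8, 16
r, k, v = torch.randn(T, D), torch.randn(T, D), torch.randn(T, D)
w, u = torch.rand(D), torch.randn(D)
y = linear_attention_with_decay(r, k, v, w, u)   # (T, D), with no softmax and no positional encoding
```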

Gotta say it's a pretty clean architecture but I will believe it surpasses Mistral when I see it. I don't think a linear transformer has a serious chance to beat a standard transformer with the same number of parameters.

It might have a chance for general zero-shot question answering, but I expect it to be much worse in particular at in-context learning/memory tasks, simply because softmax attention is a far more capable learning algorithm than linear attention: theoretically it can learn in-context any key -> value mapping, while linear attention by definition can only learn linear key -> value mappings (whatever that means in the embedding space), and it also risks double-writing into memory things it already knows.
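
To make that concrete, a toy recall test (just a sketch I made up, nothing from the RWKV code; the sizes are arbitrary):

```python
import torch

def softmax_read(q, K, V):
    # full attention keeps every (key, value) pair around and can sharply pick out the best match
    return torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1) @ V

def linear_read(q, K, V):
    # linear attention compresses all pairs into one matrix S = sum_i k_i v_i^T,
    # so the readout is a fixed linear map of the query, unrelated keys interfere,
    # and re-writing a key it has already stored just adds to S again ("double-writing")
    return q @ (K.T @ V)

D = 64
K, V = torch.randn(256, D), torch.randn(256, D)
q = K[3]                                                             # query exactly one stored key
print(torch.cosine_similarity(softmax_read(q, K, V), V[3], dim=0))   # typically close to 1.0
print(torch.cosine_similarity(linear_read(q, K, V), V[3], dim=0))    # typically much lower
```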

But hey, let's see.

[–] Maykey@alien.top 1 points 9 months ago

> I don't think a linear transformer has a serious chance to beat a standard transformer with the same number of parameters.

I do. Transformers are not good on long-range tasks. They perform well there only when they are backed by better architectures, as in the case of MEGA.

[–] cztomsik@alien.top 1 points 9 months ago

I have my doubts too. RWKV-4 was great, but in practice it was always worse than any LLaMA. I think it might be because it's way more sensitive to sampling: every token destroys the previous state completely, so once generation goes the wrong way, it will never recover. This happens with other architectures too, but there all the data is still in the context and the model can still recover; RWKV does not keep any (previous) context, so it can't.

That said, RWKV is awesome and I am super excited about it. Either we can solve this problem with sampling, or we can just slap a small attention block on top of it and then fine-tune them together. Either way, the future is bright in my opinion.

Also, if you think about it, it's a miracle that such an architecture even works and manages to learn instruction following.

RWKV is also great because you can "freeze" the state, save it, and then always just restore it and continue the conversation (or whatever). Together with the small memory requirements, that makes it very compelling for serving multiple users without occupying a lot of GPU memory, and instead of "engineering the prompt" you are really engineering the initial state. Obviously it's also way more sensitive to fine-tuning; it will "revert" to its mood sooner.
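
A sketch of that "engineering the initial state" idea, assuming the `rwkv` pip package's `forward(tokens, state)` interface (I'm writing the names from memory, so treat the exact identifiers, vocab file, and model path as placeholders):

```python
import copy
from rwkv.model import RWKV        # pip install rwkv -- names here are from memory, double-check them
from rwkv.utils import PIPELINE

model = RWKV(model="path/to/rwkv-5-checkpoint", strategy="cuda fp16")   # placeholder path
pipeline = PIPELINE(model, "rwkv_vocab_v20230424")

# Run the (possibly very long) system prompt once and freeze the resulting state.
_, base_state = model.forward(pipeline.encode("You are a helpful assistant.\n"), None)

def continue_from(frozen_state, user_text):
    """Resume a conversation from a saved state instead of re-reading the whole prompt."""
    state = copy.deepcopy(frozen_state)                    # each user gets their own cheap copy
    logits, state = model.forward(pipeline.encode(user_text), state)
    # ...sample a token from `logits`, then keep calling model.forward([token], state) to generate
    return logits, state

# Two users share the same frozen prompt state; only the small per-user states differ.
logits_a, state_a = continue_from(base_state, "User: hi!\nAssistant:")
logits_b, state_b = continue_from(base_state, "User: hello there!\nAssistant:")
```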

[–] nderstand2grow@alien.top 1 points 9 months ago

Your comment is so insightful, thank you. If there are resources I can read/watch to learn about this stuff, I'd be happy if you could share them.

[–] vatsadev@alien.top 1 points 10 months ago (1 children)

Hmm, I will have to check this stuff with the people on the RWKV Discord server.

V5 is stable at context usage, and V6 is trying to get better at using the context, so we might see improvement on this.

[–] MichalO19@alien.top 1 points 9 months ago

If I understood the original explanation for RWKV on GitHub correctly, BlinkDL agrees that softmax attention is very capable in theory, but he thinks Transformers are not using it to its full potential, so theoretically less capable architectures can beat them.

This might be true, but I kind of doubt it. I played a bit with the 3B RWKV with a prompt like

User: What is the word directly after "bread" in the following string "[like 20 random words]" 
Assistant: The word directly after "bread" is "

(Note the question-before-data ordering preferred for RWKV, though I tested the other way around too.) Unless the query word is very early in the string, it gives me a random word. Even 1.3B transformer models seem to answer this correctly much more often (though not always).
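
If anyone wants to reproduce this, a tiny harness for the probe; `generate(prompt) -> str` stands in for whichever model wrapper you are testing (hypothetical, and the word list is just a placeholder):

```python
import random

WORDS = "apple river stone cloud tiger lamp ocean maple crystal violin".split()

def needle_probe(generate, n_words=20, seed=0):
    """Ask for the word directly after "bread" in a string of random words."""
    rng = random.Random(seed)
    words = [rng.choice(WORDS) for _ in range(n_words)]   # WORDS deliberately excludes "bread"
    pos = rng.randrange(n_words - 1)
    words[pos] = "bread"                                  # plant the needle somewhere in the string
    expected = words[pos + 1]
    prompt = (
        f'User: What is the word directly after "bread" in the following string '
        f'"{" ".join(words)}"\n'
        f'Assistant: The word directly after "bread" is "'
    )
    answer = generate(prompt).split('"')[0].strip()       # take the completion up to the closing quote
    return expected, answer

# e.g.: sum(e == a for e, a in (needle_probe(my_generate, seed=s) for s in range(50)))
```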

[–] Hey_You_Asked@alien.top 1 points 10 months ago

What a fantastic comment, thank you.