From the issue about this in the exllamav2 repo, QuIP was using more memory and was slower than exl. How much context can you fit?
a_beautiful_rhind
I'm not getting a super huge jump with the bigger models yet, just a mild bump. I got a P100 so I can load models in the low 100s and still have exllama work. That's 64GB of VRAM that can do FP16.
For bigger models I can use FP32 and put the 2 P40s back in. That's 120GB of VRAM. Also 6 vidya cards :P
It required building toward this type of system from the start. I'm not made of money either; I just upgrade it over time.
It really is Christmas.
I got a P100 for like $150 to see how well it will work with exllama + 3090s and if it is any faster at SD.
These guys are all gone already.
Would be cool to see this in a 34b and 70b.
Aren't there people selling such services to companies here? Implementing RAG, etc.
Heh, 72b with 32k and GQA seems reasonable. Will make for interesting tunes if it's not super restricted.
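For a rough sense of why the GQA part is what makes 32k workable, here's a back-of-the-envelope KV-cache calculation. The layer/head numbers are assumptions for a generic ~72b, not any actual released config:

```python
# Rough KV-cache sizing; n_layers / n_kv_heads / head_dim are assumed
# values for a generic ~72B model, not a specific release.
def kv_cache_gib(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2x for K and V, FP16 elements
    return per_token * seq_len / 2**30

print(kv_cache_gib(32768))                 # GQA with 8 KV heads: ~10 GiB
print(kv_cache_gib(32768, n_kv_heads=64))  # full MHA with 64 KV heads: ~80 GiB
```

Same 32k context, roughly 8x less cache with GQA under those assumptions, which is the difference between fitting alongside the weights and not.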
That's a good sign if anything.
One is not enough.
Does it give refusals on the base model? 67B sounds like a full foundation train.
Something is wrong with your environment. Even P40s give more than that.
The other possibility is that you're not generating enough tokens to get a proper t/s reading. What was the total inference time?
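To illustrate what I mean, a minimal timing sketch; `generate` here is a stand-in for whatever call your backend exposes, not a real API:

```python
import time

# `generate` is a stand-in for your backend's generation call and is
# assumed to return the newly generated token ids.
def measure_tps(generate, prompt, max_new_tokens=512):
    start = time.perf_counter()
    new_tokens = generate(prompt, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    tps = len(new_tokens) / elapsed
    print(f"{len(new_tokens)} tokens in {elapsed:.2f}s -> {tps:.1f} t/s")
    return tps
```

Runs of only a few dozen tokens are dominated by prompt processing and startup overhead, so time a few hundred new tokens before trusting the number.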
Good luck. Centrism is not allowed; you would have to skip the last decade of internet data. Social engineering works on people and on language models much the same.