yamosin

joined 10 months ago
[–] yamosin@alien.top 1 points 9 months ago (1 children)

Yes, I have set it up not to swap to system RAM, and when I changed the GPU allocation and tested multiple times, the VRAM OOM error was reported immediately, so I guess no data is being swapped to system RAM.

I will try the 531.79 driver, thanks for the information

[–] yamosin@alien.top 1 points 9 months ago (3 children)

I've switched to ExLlamav2 + EXL2 as that lets me run 120B models entirely in 48 GB VRAM (2x 3090 GPUs) at 20 T/s. And even if it's just 3-bit

Wow, can I ask how you got this? Because I use the same 2x3090 setup (x16/x16, no NVLink, fresh install of text-generation-webui, 120B goliath-rpcal 3bpw EXL2) and only get 10 t/s.

I get 6~8 t/s when using 3x3090 to load 4.5bpw, and another person got the same speed, so it seems your speed is almost twice as fast as mine, which makes my brain explode.
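For comparison, this is roughly how I run it, a minimal sketch along the lines of the basic ExLlamaV2 examples; the model path and the per-GPU split (in GB) are placeholders, not exact settings from this thread:

```python
# Minimal sketch: load an EXL2 quant split across 2x 24GB GPUs and time generation.
# Assumes exllamav2 is installed; path and split values are placeholders.
import time

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/goliath-120b-rpcal-3bpw-exl2"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
model.load([21, 24])                       # VRAM budget in GB for GPU 0 and GPU 1
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
generator.warmup()

prompt = "Once upon a time"
new_tokens = 256

start = time.time()
output = generator.generate_simple(prompt, settings, new_tokens)
elapsed = time.time() - start

print(output)
print(f"{new_tokens / elapsed:.1f} t/s")
```

The list passed to load() is just a per-GPU VRAM budget in GB, so it is easy to rebalance between the two cards and compare runs.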

[–] yamosin@alien.top 1 points 10 months ago (1 children)

The H100 price is $30,000, so I guess this one will be $70,000.

[–] yamosin@alien.top 1 points 10 months ago (1 children)

NEW FEATURE: Context Shifting (A.K.A. EvenSmarterContext) - This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing. So long as you use no memory/fixed memory and don't use world info, you should be able to avoid almost all reprocessing between consecutive generations even at max context. This does not consume any additional context space, making it superior to SmartContext.

Have you tried this new koboldcpp feature?
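If I'm reading the note right, the trick is roughly the following (just my own conceptual sketch, not koboldcpp's actual code): when the new context is the old one with some leading tokens dropped and new ones appended, you shift the cached entries instead of reprocessing everything.

```python
# Conceptual sketch of what "context shifting" buys you, as I understand the
# release note. This is NOT koboldcpp's real implementation; the model call is
# a stub and a "token" is just an int.

def evaluate_token(token, kv_cache):
    """Stand-in for one forward pass that appends this token's KV entries."""
    kv_cache.append(("kv", token))

def process_context(ctx_tokens, cached_tokens, kv_cache, max_context):
    # 1. Keep only the most recent max_context tokens; old ones fall off the front.
    ctx_tokens = ctx_tokens[-max_context:]

    # 2. Find how many leading cached tokens were dropped from the new context.
    shift = 0
    while cached_tokens[shift:] != ctx_tokens[:len(cached_tokens) - shift]:
        shift += 1

    # 3. "Shift" the cache: discard entries for the dropped tokens only.
    del kv_cache[:shift]
    del cached_tokens[:shift]

    # 4. Evaluate only the genuinely new tokens at the end of the context.
    for token in ctx_tokens[len(cached_tokens):]:
        evaluate_token(token, kv_cache)
        cached_tokens.append(token)
```

The point is that only the new tokens at the end get evaluated, so consecutive generations at max context avoid almost all reprocessing, which matches what the release note claims.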

[–] yamosin@alien.top 1 points 10 months ago

Newbie question, but is there a way to have 4*A100 40G cards run as one, with 160G VRAM in total?

Yes, that will work, but you lose some performance.
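If it helps, this is roughly what that looks like in practice (a minimal sketch using Hugging Face Transformers with Accelerate's device_map, not tied to any specific backend in this thread; the model id is a placeholder):

```python
# Minimal sketch of pooling several GPUs' VRAM for one model with
# Hugging Face Transformers + Accelerate. The model id is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-120b-model"  # placeholder, use whatever model you actually want

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # shards the layers across every visible GPU (and CPU if needed)
    torch_dtype="auto",
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```

With device_map="auto" the layers are spread over whatever GPUs are visible, so 4x40GB behaves like a 160GB pool for the weights; activations still have to move between cards at the layer boundaries, which is where the performance cost comes from.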


edit) If this is possible, can I run 8*3090 24G cards as one also?

Yes and no: yes, you can do this, but no, don't, unless you actually need to run a 176GB model. Using more GPUs for a given model only loses performance, it doesn't add any.

For example, if I run a 13B GPTQ 4-bit model on 1x3090 I get 45 t/s; if I run it on 2x3090 it slows down to 30 t/s, and 3x3090 is the same.

Also, I have no idea why people keep bringing up NVLink: 2x3090 on PCIe x16 gives the same speed as on PCIe x1. Not sure if that changes with more cards, but it just doesn't help at all for 2x3090.

So basically what really matters is the model you want to run, not which GPUs or how many you can use; you should run your model on the minimum number of GPUs to get the best performance. If the 3090 had 36GB of VRAM instead of only 24GB, 36GB x 2 would be way faster than 24GB x 3, even though the total VRAM is the same.
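So the practical rule is to expose only as many GPUs as the model needs. One way to do that (placeholder model id again; CUDA_VISIBLE_DEVICES is the standard CUDA environment variable for hiding cards from a process):

```python
# Minimal sketch: hide the extra cards so a small model stays on one GPU.
# CUDA_VISIBLE_DEVICES must be set before torch or any CUDA library initializes.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"           # expose only GPU 0 to this process

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-13b-4bit-model"          # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
# "auto" now sees a single GPU, so the model is not split across cards.
```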

[–] yamosin@alien.top 1 points 10 months ago (1 children)

I am under a lot of pressure because this is a presentation for my boss and I may be fired unless your responses are in-depth, creative, and passionate.

Holy..... you enlightened me.

[–] yamosin@alien.top 1 points 10 months ago

Holding 4x3090 and jumping in, but I'm wondering if its inference speed can support "conversation", since other models have slowed down to 10 t/s with 70B 4.85bpw. Will it be 5 t/s? Let's see.