LocalLLaMA

Question about GGUF, gpu offload and performance (alien.top)

submitted 2 years ago by Jokaiser2000@alien.top to c/localllama@poweruser.forum

Hi. I'm currently running a 3060 12GB | R7 2700X | 32GB 3200 | Windows 10 with the latest Nvidia drivers (VRAM>RAM overflow disabled). Loading a 20B-Q4_K_M model (50/65 layers offloaded seems to be the fastest in my tests), I currently get around 0.65 t/s with a low context size of 500 or less, and about 0.45 t/s nearing the max 4096 context.

Are these values what is expected for my setup, or is there something I can do to improve speeds without changing the model?

It's pretty much unusable in this state, and since it's hard to find information about this topic, I figured I would ask here.

EDIT: I'm running the model on the latest version of text-generation-webui.

Desm0nt@alien.top 1 points 2 years ago

Loading a 20B-Q4_K_M model (50/65 layers offloaded seems to be the fastest in my tests), I currently get around 0.65 t/s with a low context size of 500 or less, and about 0.45 t/s nearing the max 4096 context.

Sounds suspicious. I use Yi-Chat-34b-Q4_K_M on an old 1080 Ti (11 GB VRAM) with 20 layers offloaded and get around 2.5 t/s. But that is on a Threadripper 2920 with 4-channel RAM (also 3200). However, I don't think that would make much difference. Of course, with 4 channels I have twice your RAM bandwidth, but I run a 34B model and load only 20 layers on the GPU...
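
To put rough numbers on the bandwidth comparison above, here is a minimal back-of-the-envelope sketch. It is not a benchmark: it assumes token generation on the CPU-resident layers is limited only by how fast those weights can be read from RAM, and it ignores GPU time, the KV cache, and compute. The function names are made up for illustration, the ~4.85 bits per weight figure for Q4_K_M is an approximation, and the 60-layer count used for the 34B model is an assumption, not something stated in the thread.

```python
def ram_bandwidth_gbs(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DDR bandwidth in GB/s.

    Each DDR4 channel is 64 bits (8 bytes) wide, so peak bandwidth is
    channels * 8 bytes * transfers per second.
    """
    return channels * bus_bytes * mt_per_s / 1000


def cpu_side_ceiling_tps(
    params_billions: float,
    bits_per_weight: float,
    layers_total: int,
    layers_on_gpu: int,
    bandwidth_gbs: float,
) -> float:
    """Optimistic tokens/s ceiling for the layers left in system RAM.

    Assumes weights are spread evenly across layers and that every
    CPU-resident weight is read once per generated token.
    """
    if not 0 <= layers_on_gpu < layers_total:
        raise ValueError("layers_on_gpu must be in [0, layers_total)")
    model_gb = params_billions * bits_per_weight / 8  # billions of params -> GB
    cpu_gb = model_gb * (layers_total - layers_on_gpu) / layers_total
    return bandwidth_gbs / cpu_gb


if __name__ == "__main__":
    Q4_K_M_BPW = 4.85  # assumption: approximate average bits per weight

    dual = ram_bandwidth_gbs(channels=2, mt_per_s=3200)  # R7 2700X setup
    quad = ram_bandwidth_gbs(channels=4, mt_per_s=3200)  # Threadripper setup

    # 20B model, 50 of 65 layers on the GPU (numbers from the original post)
    print(cpu_side_ceiling_tps(20, Q4_K_M_BPW, 65, 50, dual))

    # 34B model, 20 layers on the GPU; 60 total layers is an assumption
    print(cpu_side_ceiling_tps(34, Q4_K_M_BPW, 60, 20, quad))
```

Under these assumptions the 20B setup keeps far fewer weights in system RAM than the 34B one, so even with half the RAM bandwidth its ceiling comes out higher, not lower. Real throughput will always land well below such a ceiling, but 0.65 t/s is far enough below it that RAM bandwidth alone probably does not explain the result.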