Posted to LocalLLaMA (a community to discuss Llama, the family of large language models created by Meta AI) on 28 Nov 2023.

Hi. I'm currently running a 3060 12 GB | R7 2700X | 32 GB DDR4-3200 | Windows 10 with the latest NVIDIA drivers (VRAM-to-RAM overflow disabled). Loading a 20B Q4_K_M model (50/65 layers offloaded seems to be the fastest in my tests), I currently get around 0.65 t/s with a small context of 500 tokens or less, and about 0.45 t/s near the maximum 4096 context.

Are these values what is expected of my setup? Or is there something I can do to improve speeds without changing the model?

It's pretty much unusable in this state, and since it's hard to find information about this topic, I figured I would try asking here.

EDIT: I'm running the model on the latest version of text-generation-webui.
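
For reference, those settings map onto the llama-cpp-python bindings roughly like this (a minimal sketch, not the webui's actual code; the model path is a placeholder and the values just mirror the setup described above):

```python
from llama_cpp import Llama

# Rough equivalent of the setup above, using llama-cpp-python directly
# instead of text-generation-webui. The model path is a placeholder.
llm = Llama(
    model_path="models/20b.Q4_K_M.gguf",  # hypothetical 20B Q4_K_M GGUF
    n_gpu_layers=50,   # 50 of 65 layers offloaded to the 3060 12 GB
    n_ctx=4096,        # the maximum context used in the tests above
    n_threads=8,       # R7 2700X has 8 physical cores
)

out = llm("Hello,", max_tokens=64)
print(out["choices"][0]["text"])
```

Lowering n_gpu_layers frees VRAM for a bigger context; raising it speeds up generation as long as the offloaded layers still fit.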

Comment by vikarti_anatra@alien.top:

Some of my results:

System:

- 2× Xeon E5-2680 v4 (28 cores / 56 threads total), 128 GB RAM
- RTX 2060 6 GB on PCIe 3.0 x16
- RTX 4060 Ti 16 GB on PCIe 4.0 x8
- Windows 11 Pro

OpenHermes-2.5-AshhLimaRP-Mistral-7B (llama.cpp in text-generation-webui):

- Q4_K_M, RTX 2060 6 GB, all 35 layers offloaded, 8k context - approx. 3 t/s
- Q5_K_M, RTX 4060 Ti 16 GB, all 35 layers offloaded, 32k context - approx. 25 t/s
- Q5_K_M, CPU only, 8 threads, 32k context - approx. 2.5-3.5 t/s
- Q5_K_M, CPU only, 16 threads, 32k context - approx. 3-3.5 t/s
- Q5_K_M, CPU only, 32 threads, 32k context - approx. 3-3.6 t/s

euryale-1.3-l2-70b (llama.cpp in text-generation-webui):

- Q4_K_M, RTX 2060 + RTX 4060 Ti, 35 layers offloaded, 4k context - 0.6-0.8 t/s

goliath-120b (llama.cpp in text-generation-webui):

- Q2_K, CPU only, 32 threads - 0.4-0.5 t/s
- Q2_K, CPU only, 8 threads - 0.25-0.3 t/s

Noromaid-20b-v0.1.1 (llama.cpp in text-generation-webui):

- Q5_K_M, RTX 2060 + RTX 4060 Ti, 65 layers offloaded, 4k context - approx. 5 t/s

Noromaid-20b-v0.1.1 (exllamav2 in text-generation-webui):

- 3bpw-h8-exl2, RTX 2060 + RTX 4060 Ti, 8-bit cache, 4k context - approx. 15 t/s (looks like it fits entirely on the 4060 Ti)
- 6bpw-h8-exl2, RTX 2060 + RTX 4060 Ti, 8-bit cache, 4k context, no flash attention, GPU split 12,6 - approx. 10 t/s
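
For comparison, splitting a GGUF model across two cards with the llama-cpp-python bindings looks roughly like this (a sketch under assumptions: the model path is a placeholder and the split ratio simply mirrors the 6 GB + 16 GB pair above, not the exact webui settings used for these numbers):

```python
from llama_cpp import Llama

# Sketch: offload all layers and split them across the two GPUs.
# tensor_split takes relative proportions per device; 6:16 roughly
# matches an RTX 2060 6 GB + RTX 4060 Ti 16 GB pair (an assumption).
llm = Llama(
    model_path="models/noromaid-20b.Q5_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,        # -1 offloads every layer
    tensor_split=[6, 16],   # relative share of layers per GPU
    n_ctx=4096,
)
```

If I recall correctly, text-generation-webui exposes the same setting as a tensor_split field on its llama.cpp loader, so the webui runs above can be tuned the same way.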

Observations:

- the number of CPU cores matters very little in CPU-only mode (see the timing sketch at the end of this comment)

- NUMA does matter (I have 2 CPU sockets)

I would say: try to get an additional card?
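
On the core-count observation, here is a rough way to measure it yourself with the llama-cpp-python bindings (a sketch; the model path is a placeholder, and it assumes the library's OpenAI-style completion dict for the token counts):

```python
import time
from llama_cpp import Llama

MODEL = "models/goliath-120b.Q2_K.gguf"  # placeholder path
PROMPT = "Write a short story about a dragon."

# Sweep thread counts to see how little they matter past a point.
for n_threads in (8, 16, 32):
    llm = Llama(model_path=MODEL, n_gpu_layers=0, n_ctx=2048,
                n_threads=n_threads, verbose=False)
    start = time.time()
    out = llm(PROMPT, max_tokens=128)
    elapsed = time.time() - start
    generated = out["usage"]["completion_tokens"]
    print(f"{n_threads} threads: {generated / elapsed:.2f} t/s")
    del llm  # free the model before the next run
```

Token generation on CPU is mostly memory-bandwidth bound, which is consistent with the 8, 16 and 32 thread runs above all landing in roughly the same 2.5-3.6 t/s band.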