this post was submitted on 28 Nov 2023

LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.

Hi. I'm currently running a 3060 12GB | R7 2700X | 32 GB DDR4-3200 | Windows 10 with the latest Nvidia drivers (VRAM-to-RAM overflow disabled). Loading a 20B Q4_K_M model (50 of 65 layers offloaded seems to be the fastest from my tests), I currently get around 0.65 t/s with a low context size of 500 or less, and about 0.45 t/s nearing the max 4096 context.

Are these values what is expected of my setup? Or is there something I can do to improve speeds without changing the model?

It's pretty much unusable in this state, and since it's hard to find information about this topic, I figured I would try asking here.

EDIT: I'm running the model on the latest version of text-generation-webui.
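
For reference, text-generation-webui's llama.cpp loader exposes the same layer-offload setting as the llama-cpp-python API. Here is a minimal sketch of loading a quantized GGUF model with partial GPU offload; the model path and layer count are placeholders for illustration, not the poster's exact files or settings:

```python
from llama_cpp import Llama

# Minimal sketch: load a quantized GGUF model with some layers offloaded to the GPU.
# model_path and n_gpu_layers are hypothetical; adjust for your own files and VRAM.
llm = Llama(
    model_path="models/20b-model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=50,  # layers kept on the GPU; the rest stay in system RAM
    n_ctx=4096,       # context window matching the model's maximum
)

output = llm("Write a one-sentence greeting.", max_tokens=64)
print(output["choices"][0]["text"])
```

The trade-off is that every layer left in system RAM is processed by the CPU, so generation speed drops sharply once the model no longer fits entirely in VRAM.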

[–] longtimegoneMTGO@alien.top 1 points 11 months ago (1 children)

I have a 3080 12GB and can run a 20B Q4_K_M with about 50 layers offloaded and 8k context.

It starts off at just under 4 t/s, and once the context is filled it slows to just over 2 t/s.

It might be worth setting up a Linux partition to boot into for this; I was getting much slower speeds under Windows.
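
If you do compare Windows and Linux, timing the same prompt on both gives a concrete t/s figure. A rough sketch, assuming the `llm` object from the earlier example:

```python
import time

prompt = "Summarize the benefits of GPU layer offloading in one paragraph."

start = time.perf_counter()
result = llm(prompt, max_tokens=200)
elapsed = time.perf_counter() - start

# The completion response reports how many tokens were generated.
generated = result["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} t/s")
```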

[–] Jokaiser2000@alien.top 1 points 11 months ago

That might be worth a try, actually. I'll look into it, thanks.