LocalLLaMA · submitted 20 Nov 2023

I want to run a 70B LLM locally with more than 1 T/s. I have a 3090 with 24GB VRAM and 64GB RAM on the system.

What I managed so far:

  • Found instructions for running a 70B entirely in VRAM with a ~2.5-bit quant; it ran fast, but the perplexity was unbearable and the model was barely coherent.
  • Somehow got a 70B running with a mix of RAM/VRAM offloading, but it only ran at 0.1 T/s.

I saw people claiming reasonable T/s speeds. Since I am a newbie, I can barely speak the domain language, and most instructions I found assume implicit knowledge I don't have.

I need explicit instructions: exactly which 70B model to download, which model loader to use, and how to set the parameters that matter in this context.

[–] mrjackspade@alien.top 1 points 10 months ago (2 children)

If you're only getting 0.1 T/s then you've probably overshot your layer offloading.

I can get up to 1.5 t/s with a 3090, at 5_K_M.

Try running llama.cpp from the command line with 30 layers offloaded to the GPU, and make sure your thread count is set to match your (physical) CPU core count.

The other problem you're likely running into is that 64GB of RAM is cutting it pretty close. Make sure your base OS usage is below 8GB if possible, and try memory-locking the model on load. With that amount of system RAM, it's possible that other running applications are causing the OS to page the model data out to disk, which kills performance.
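
For what it's worth, the same settings (GPU layer offload, physical-core thread count, and memory locking) can also be set through the llama-cpp-python bindings instead of the raw command line. The sketch below is only illustrative and assumes a CUDA-enabled build; the model path, 30 offloaded layers, and 8 threads are placeholders to tune for your own hardware.

```python
from llama_cpp import Llama  # pip install llama-cpp-python (CUDA-enabled build)

# Illustrative values, not a verified recipe: lower n_gpu_layers if you run
# out of VRAM, and set n_threads to your number of *physical* CPU cores.
llm = Llama(
    model_path="./models/your-70b-model.Q5_K_M.gguf",  # placeholder path
    n_gpu_layers=30,   # layers offloaded to the 3090
    n_threads=8,       # physical core count (adjust for your CPU)
    use_mlock=True,    # lock the model in RAM so the OS can't page it out
    n_ctx=4096,        # context window
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```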

[–] BlueMetaMind@alien.top 1 points 10 months ago (1 children)

Thank you. What does "at 5_K_M" mean?
Can I use the text-generation web UI with llama.cpp as the model loader, or is that too much overhead?

[–] mrjackspade@alien.top 1 points 10 months ago

I actually don't know how much overhead that adds. I'd start by just kicking it off on the command line first as a proof of concept; it's super easy.

5_K_M is just the quantization I use. There's almost no loss of perplexity with 5_K_M, but it's also larger than the 4-bit quants, which are what most people use. For reference, here are the file sizes from a typical 70B GGUF release:

| Name | Quant method | Bits | Size | Max RAM required | Use case |
|------|--------------|------|------|------------------|----------|
| goat-70b-storytelling.Q2_K.gguf | Q2_K | 2 | 29.28 GB | 31.78 GB | smallest, significant quality loss - not recommended for most purposes |
| goat-70b-storytelling.Q3_K_S.gguf | Q3_K_S | 3 | 29.92 GB | 32.42 GB | very small, high quality loss |
| goat-70b-storytelling.Q3_K_M.gguf | Q3_K_M | 3 | 33.19 GB | 35.69 GB | very small, high quality loss |
| goat-70b-storytelling.Q3_K_L.gguf | Q3_K_L | 3 | 36.15 GB | 38.65 GB | small, substantial quality loss |
| goat-70b-storytelling.Q4_0.gguf | Q4_0 | 4 | 38.87 GB | 41.37 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| goat-70b-storytelling.Q4_K_S.gguf | Q4_K_S | 4 | 39.07 GB | 41.57 GB | small, greater quality loss |
| goat-70b-storytelling.Q4_K_M.gguf | Q4_K_M | 4 | 41.42 GB | 43.92 GB | medium, balanced quality - recommended |
| goat-70b-storytelling.Q5_0.gguf | Q5_0 | 5 | 47.46 GB | 49.96 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| goat-70b-storytelling.Q5_K_S.gguf | Q5_K_S | 5 | 47.46 GB | 49.96 GB | large, low quality loss - recommended |
| goat-70b-storytelling.Q5_K_M.gguf | Q5_K_M | 5 | 48.75 GB | 51.25 GB | large, very low quality loss - recommended |
| goat-70b-storytelling.Q6_K.gguf | Q6_K | 6 | 56.59 GB | 59.09 GB | very large, extremely low quality loss |
| goat-70b-storytelling.Q8_0.gguf | Q8_0 | 8 | 73.29 GB | 75.79 GB | very large, extremely low quality loss - not recommended |
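
As a rough sanity check on those sizes: a GGUF file is approximately (parameter count × bits per weight) / 8 bytes plus some metadata overhead, and Q5_K_M works out to roughly 5.5 effective bits per weight (an approximation, not an exact spec). A quick sketch of that arithmetic:

```python
# Back-of-envelope GGUF size estimate: params * bits-per-weight / 8 bytes,
# ignoring metadata overhead. The 5.5 bpw figure for Q5_K_M is approximate.
params = 70e9   # 70B parameters
bpw = 5.5       # approximate effective bits per weight for Q5_K_M
size_gb = params * bpw / 8 / 1e9
print(f"~{size_gb:.1f} GB")  # ~48.1 GB, in the same ballpark as the 48.75 GB listed above
```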
[–] GoofAckYoorsElf@alien.top 1 points 10 months ago (1 children)

> that 64gb of RAM is cutting it pretty close

Holy crap...

[–] BlueMetaMind@alien.top 1 points 10 months ago

Yeah… I thought I'd at least be "in the room" buying my setup last year, but it turns out I'm outside in the gutter 🫣😢