LocalLLaMA


How to run 70B on 24GB VRAM ? (alien.top)

submitted 2 years ago by BlueMetaMind@alien.top to c/localllama@poweruser.forum


I want to run a 70B LLM locally at more than 1 T/s. I have a 3090 with 24GB VRAM and 64GB of system RAM.

What I managed so far:

I found instructions to run a 70B in VRAM only with a 2.5-bit quant. It ran fast, but the perplexity was unbearable and the LLM was barely coherent.
I somehow got a 70B running with some combination of RAM/VRAM offloading, but it ran at 0.1 T/s.

I saw people claiming reasonable T/s speeds. Since I am a newbie, I can barely speak the domain language, and most instructions I found assume implicit knowledge I don't have.

I need explicit instructions on exactly which 70B model to download, which model loader to use, and how to set the parameters that matter in this context.


[–] BlueMetaMind@alien.top 1 points 2 years ago (1 children)

Thank you. What does "at 5_K_M" mean?
Can I use the text web UI with llama.cpp as the model loader, or is that too much overhead?

[–] mrjackspade@alien.top 1 points 2 years ago

I actually don't know how much overhead that's going to be. I'd start by just kicking it off on the command line first as a proof of concept; it's super easy.
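A minimal sketch of what that command-line test could look like, written as a small Python helper that only assembles the command. The flags (`-m`, `-ngl`, `-c`, `-n`, `-p`) are standard llama.cpp options, but the binary name `./main`, the 35 offloaded layers, and the prompt are assumptions for illustration; newer llama.cpp builds name the binary `llama-cli`, and the right layer count depends on your card.

```python
import shlex


def build_llama_cpp_command(model_path, n_gpu_layers, prompt,
                            binary="./main", ctx_size=4096, n_predict=128):
    """Build an argv list for a one-off llama.cpp run.

    binary, ctx_size and n_predict are illustrative defaults, not values
    taken from the thread.
    """
    return [
        binary,
        "-m", model_path,            # path to the GGUF file
        "-ngl", str(n_gpu_layers),   # number of layers to offload to the GPU
        "-c", str(ctx_size),         # context window in tokens
        "-n", str(n_predict),        # number of tokens to generate
        "-p", prompt,                # prompt text
    ]


# Hypothetical starting point: 35 layers on the GPU, the rest in system RAM.
cmd = build_llama_cpp_command("goat-70b-storytelling.Q5_K_M.gguf", 35, "Hello")
print(shlex.join(cmd))
```

You can paste the printed line into a terminal, or pass `cmd` to `subprocess.run` once the binary and model file are in place. If it runs out of VRAM, lower the `-ngl` value and try again.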

5_K_M is just the quantization I use (Q5_K_M in the table below). There's almost no perplexity loss with 5_K_M, but it's also larger than the 4-bit quants, which are what most people use.

Name	Quant method	Bits	Size	Max RAM required	Use case
goat-70b-storytelling.Q2_K.gguf	Q2_K	2	29.28 GB	31.78 GB	smallest, significant quality loss - not recommended for most purposes
goat-70b-storytelling.Q3_K_S.gguf	Q3_K_S	3	29.92 GB	32.42 GB	very small, high quality loss
goat-70b-storytelling.Q3_K_M.gguf	Q3_K_M	3	33.19 GB	35.69 GB	very small, high quality loss
goat-70b-storytelling.Q3_K_L.gguf	Q3_K_L	3	36.15 GB	38.65 GB	small, substantial quality loss
goat-70b-storytelling.Q4_0.gguf	Q4_0	4	38.87 GB	41.37 GB	legacy; small, very high quality loss - prefer using Q3_K_M
goat-70b-storytelling.Q4_K_S.gguf	Q4_K_S	4	39.07 GB	41.57 GB	small, greater quality loss
goat-70b-storytelling.Q4_K_M.gguf	Q4_K_M	4	41.42 GB	43.92 GB	medium, balanced quality - recommended
goat-70b-storytelling.Q5_0.gguf	Q5_0	5	47.46 GB	49.96 GB	legacy; medium, balanced quality - prefer using Q4_K_M
goat-70b-storytelling.Q5_K_S.gguf	Q5_K_S	5	47.46 GB	49.96 GB	large, low quality loss - recommended
goat-70b-storytelling.Q5_K_M.gguf	Q5_K_M	5	48.75 GB	51.25 GB	large, very low quality loss - recommended
goat-70b-storytelling.Q6_K.gguf	Q6_K	6	56.59 GB	59.09 GB	very large, extremely low quality loss
goat-70b-storytelling.Q8_0.gguf	Q8_0	8	73.29 GB	75.79 GB	very large, extremely low quality loss - not recommended
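
None of these files fits entirely in 24GB of VRAM, so part of the model has to stay in system RAM. Here is a rough sketch for guessing how many layers to offload to the GPU, using the file sizes from the table. The 80-layer count (the Llama 2 70B architecture), the even split of weights across layers, and the 3 GB held back for the KV cache and scratch buffers are all assumptions, so treat the result as a starting point to tune rather than an exact answer.

```python
# Rough estimate of how many layers of a GGUF model fit in VRAM.
# Assumptions (not from the thread): 80 transformer layers, weights spread
# evenly across layers, and ~3 GB reserved for KV cache and buffers.

def layers_that_fit(file_size_gb, vram_gb=24.0, n_layers=80, reserve_gb=3.0):
    """Return how many whole layers fit in the VRAM left after the reserve."""
    per_layer_gb = file_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# File sizes in GB, copied from the table above.
quants = {"Q4_K_M": 41.42, "Q5_K_M": 48.75}

for name, size_gb in quants.items():
    n = layers_that_fit(size_gb)
    print(f"{name}: roughly {n} of 80 layers on the GPU")
```

The smaller the quant, the more layers land on the GPU, which is where most of the speed difference between the quants comes from on a single 24GB card.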