this post was submitted on 20 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


I want to run a 70B LLM locally with more than 1 T/s. I have a 3090 with 24GB VRAM and 64GB RAM on the system.

What I managed so far:

  • Found instructions to make a 70B run on VRAM only with a ~2.5-bit quantization; it ran fast, but the perplexity was unbearable and the LLM was barely coherent.
  • I somehow got a 70B to run with some combination of RAM/VRAM offloading, but it ran at 0.1 T/s.

I saw people claiming reasonable T/s speeds. Since I am a newbie, I can barely speak the domain language, and most instructions I found assume implicit knowledge I don't have.

I need explicit instructions: exactly which 70B model to download, which model loader to use, and how to set the parameters that matter in this context.
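
As a rough sanity check before picking a setup, here is a back-of-the-envelope sketch (in Python) of how many layers of a 70B GGUF might fit in 24GB of VRAM. The file size, layer count, and headroom figures are approximate assumptions on my part, not measured numbers; the real split depends on quantization and context size.

```python
# Rough estimate: how many layers of a 70B GGUF fit in 24 GB of VRAM?
# All numbers are approximations, not measured values.

model_size_gb = 48.8   # approx. file size of a Llama-2-70B Q5_K_M GGUF
n_layers = 80          # Llama-2-70B has 80 transformer layers
vram_gb = 24.0         # RTX 3090
reserve_gb = 4.0       # rough headroom for KV cache, CUDA buffers, desktop use

per_layer_gb = model_size_gb / n_layers
offloadable = int((vram_gb - reserve_gb) / per_layer_gb)
print(f"~{per_layer_gb:.2f} GB per layer -> roughly {offloadable} layers fit on the GPU")
# -> ~0.61 GB per layer -> roughly 32 layers fit on the GPU; the rest stays in system RAM
```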

[–] TuuNo_@alien.top 1 points 11 months ago (5 children)

I would suggest you use Koboldcpp and run a GGUF model. A 70B Q5 model, with around 40 layers loaded onto the GPU, should get more than 1 t/s. At least for me, I got 1.5 t/s with a 4090 and 64GB RAM using Q5_K_M.
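
For reference, a Koboldcpp launch along these lines should do it; the model filename is just a placeholder, and it's worth checking the exact flag names against your version's --help:

```
python koboldcpp.py --model llama2-70b.Q5_K_M.gguf --usecublas --gpulayers 40 --contextsize 4096 --threads 8
```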

[–] silenceimpaired@alien.top 1 points 11 months ago (4 children)

I could never get it up and running on Linux with Nvidia. I used Kobold on Windows, but boy, is it painful on Linux.

[–] TuuNo_@alien.top 1 points 11 months ago (1 children)

Well, I have never used Linux before since the main purpose of my PC is gaming. But I heard running LLMs on Linux is faster overall.

[–] silenceimpaired@alien.top 1 points 11 months ago

It is… but koboldcpp doesn't have an executable for me to run :/
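
Without a prebuilt Linux binary, building from source is supposed to be just a clone and a make; the exact CUDA make flag may differ between versions, so treat the one below as an assumption and check the repo's README:

```
git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
make LLAMA_CUBLAS=1 -j$(nproc)
python koboldcpp.py --model llama2-70b.Q5_K_M.gguf --usecublas --gpulayers 40
```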
