this post was submitted on 27 Nov 2023
LocalLLaMA
Community to discuss Llama, the family of large language models created by Meta AI.
Well, if you use llama.cpp with the https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGUF model and its Q5_K_M quantisation file, it uses 51.25 GB of memory. Your 24 GB card can therefore hold less than half of that file's layers, so if you're offloading fewer than half the layers to the graphics card, it will be less than twice as fast as CPU only. Have you tried a quantised model like that with CPU only?
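For reference, here's a minimal sketch of that partial offloading, using the llama-cpp-python bindings rather than the llama.cpp CLI. The local file name and the n_gpu_layers value are assumptions: Llama-2-70B has 80 layers, and at Q5_K_M something in the region of 35 layers is roughly what fits in 24 GB of VRAM, so you'd tune that number up or down until you stop running out of memory.

```python
from llama_cpp import Llama

# Assumed local path to the Q5_K_M file from TheBloke/Llama-2-70B-Chat-GGUF
MODEL_PATH = "./llama-2-70b-chat.Q5_K_M.gguf"

# ~51 GB spread over 80 layers is roughly 0.6 GB per layer,
# so ~35 layers is a guess at what a 24 GB card can take.
llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=35,  # layers offloaded to the GPU; the rest run on the CPU
    n_ctx=2048,       # context window; larger values need more memory
)

# Simple completion to time against a CPU-only baseline (n_gpu_layers=0)
output = llm("Explain GGUF quantisation in one sentence:", max_tokens=64)
print(output["choices"][0]["text"])
```

Running the same prompt with n_gpu_layers=0 gives you the CPU-only number the comment above is asking about, so you can see directly how much the partial offload actually buys you.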