this post was submitted on 27 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


I have a server with 512 GB of RAM and 2x Intel Xeon 6154. It will have a spare x16 PCIe 3.0 slot once I get rid of my current GPU.

I'd like to add a better GPU so I can generate paper summaries (the responses can take a few minutes to come back) that are significantly better than the quality I get now with 4-bit Llama 2 13B. Does anyone know the minimum GPU I should be looking at with this setup to be able to upgrade to the 70B model? Will hybrid CPU+GPU inference with an RTX 4090 24GB be enough?
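The kind of hybrid setup I'm asking about would look roughly like this (a minimal sketch using llama-cpp-python; the model file, layer count, and prompt are placeholders, not tested settings):

```python
from llama_cpp import Llama

# Placeholder values: the path and n_gpu_layers would need tuning for a 24 GB card.
llm = Llama(
    model_path="models/llama-2-70b.Q4_K_M.gguf",  # hypothetical GGUF file
    n_gpu_layers=35,   # layers offloaded to the GPU; the rest run on the Xeons
    n_ctx=4096,        # context window sized for a paper plus its summary
)

out = llm(
    "Summarize the following paper:\n\n" + open("paper.txt").read(),
    max_tokens=512,
)
print(out["choices"][0]["text"])
```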

[–] Aaaaaaaaaeeeee@alien.top 1 points 11 months ago (1 children)

Last I checked, 38 t/s was the minimum prompt-processing speed with zero layers offloaded on a 3090 for 70B Q4_K_M.

I'm sure it's way higher now. When you offload layers you can do more, but I think you need to know the maximum context length in advance so your GPU doesn't OOM toward the end.

I think you're also supposed to adjust the prompt-processing batch size setting; see the sketch below.
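With llama.cpp's Python bindings, for example, the knobs I mean are roughly these (the numbers are just illustrative, not values I've benchmarked):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-70b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=30,  # how many layers get offloaded; more layers -> more VRAM used
    n_ctx=4096,       # set to the max length you actually expect, since the KV cache grows with it
    n_batch=512,      # prompt-processing batch size; larger is faster but uses more VRAM
)
```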

I highly recommend checking the NVIDIA PRs in llama.cpp for prompt-processing speeds and the differences between GPUs. If one card is double or triple the speed, that will tell you something, and you can calculate how long processing your text would take.
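As a rough back-of-the-envelope example (the token count and speed below are made up for illustration):

```python
prompt_tokens = 8000   # e.g. a long paper
speed = 38             # tokens/s prompt processing, the zero-offload figure above
print(prompt_tokens / speed, "seconds just to ingest the prompt")  # ~210 s
```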

[–] AdOne8437@alien.top 1 points 11 months ago

What model did you use and what model loader?