this post was submitted on 01 Dec 2023

LocalLLaMA

Community for discussing Llama, the family of large language models created by Meta AI.

I'm thinking of upgrading to 64GB of RAM so I can load larger models alongside my RTX 3090.

If I want to run tigerbot-70b-chat-v2.Q5_K_M.gguf, which has a max RAM usage of 51.61GB, and I load 23GB worth of layers into VRAM, that leaves 51.61 - 23 = 28.61GB to hold in system RAM. My operating system already uses up to 9.2GB of RAM, so I'd need about 37.81GB of RAM in total (hence 64GB).
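
Here's the same back-of-the-envelope math as a quick Python sketch (the 51.61GB figure is the listed max RAM usage for that quant; the 9.2GB OS overhead is just what I observe on my machine, not a hard number):

```python
# Rough split: how much system RAM is still needed after offloading
# part of the model onto the RTX 3090's VRAM.
model_ram_gb = 51.61    # listed max RAM usage for tigerbot-70b-chat-v2.Q5_K_M.gguf
vram_offload_gb = 23.0  # layers planned to go into the 3090's VRAM
os_overhead_gb = 9.2    # observed OS usage, varies by system

cpu_side_gb = model_ram_gb - vram_offload_gb        # 28.61 GB stays in system RAM
total_ram_needed_gb = cpu_side_gb + os_overhead_gb  # 37.81 GB -> a 64GB kit covers it

print(f"Model kept in RAM: {cpu_side_gb:.2f} GB")
print(f"Total RAM needed:  {total_ram_needed_gb:.2f} GB")
```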

How many tokens/s can I expect with 23GB out of 51.61GB loaded into VRAM and the remaining 28.61GB held in system RAM on an RTX 3090? I'm mostly curious about the Q5_K_M quant, but I'm also interested in other quants.
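
For reference, this is roughly how I'd set up the partial offload with llama-cpp-python; the n_gpu_layers value below is just a placeholder I'd tune up or down while watching VRAM usage until it sits around 23GB:

```python
from llama_cpp import Llama

# Partial-offload sketch: push as many layers onto the 3090 as fit in ~23GB
# of VRAM and leave the rest on the CPU/RAM side.
llm = Llama(
    model_path="tigerbot-70b-chat-v2.Q5_K_M.gguf",
    n_gpu_layers=35,  # placeholder; tune while watching nvidia-smi
    n_ctx=4096,       # context size uses extra VRAM/RAM on top of the weights
)

out = llm("Write one sentence about llamas.", max_tokens=64)
print(out["choices"][0]["text"])
```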
