this post was submitted on 18 Nov 2023

LocalLLaMA


Looking for any model that can run with 20 GB VRAM. Thanks!

[–] tronathan@alien.top 1 points 10 months ago

I've been out of the loop for a bit, so despite this thread coming back again and again, I'm finding it useful/relevant/timely.

What I'm having a hard time figuring out is whether I'm still SOTA running text-generation-webui with exllama_hf. So far, I ALWAYS use GPTQ on Ubuntu, and like to keep everything in VRAM on 2x3090. (I also run my own custom chat front-end, so all I really need is an API.)

I know exllamav2 is out, the exl2 format is a thing, and GGUF has supplanted GGML. I've also noticed a ton of quants from TheBloke in AWQ format (often *only* AWQ, with no GPTQ available), but I'm not clear on which front-ends support AWQ. (I looked at vLLM, but it seems like more of a library/package than a front-end.)

edit: Just checked, and it looks like text-generation-webui supports AutoAWQ. Guess I should have checked that earlier.

I guess I'm still curious whether others are using something besides text-generation-webui for all-VRAM model loading. My only issue with text-generation-webui (that comes to mind, anyway) is that it's single-threaded; for experimenting with agents, it would be nice to be able to run multi-threaded.
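
For what it's worth, here's a rough sketch of what I mean by running agents in parallel from the client side. It assumes the backend exposes an OpenAI-compatible completions endpoint; the URL, port, and payload fields below are assumptions rather than anything confirmed for a particular server, and `send_prompt` / `run_agents` are just names I made up for illustration.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: an OpenAI-compatible completions endpoint running locally.
# Adjust the host, port, and path to whatever your backend actually serves.
API_URL = "http://127.0.0.1:5000/v1/completions"


def send_prompt(prompt, url=API_URL, max_tokens=200):
    """POST one prompt and return the generated text.

    The request and response shapes follow the OpenAI completions
    convention; that the backend honors them is an assumption.
    """
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["choices"][0]["text"]


def run_agents(prompts, send=send_prompt, workers=4):
    """Send all prompts concurrently; results come back in prompt order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))
```

Calling `run_agents(["plan step 1", "plan step 2"])` fires both requests at once. Against a single-threaded backend they would presumably just queue up server-side, so this only pays off if the server can actually process requests concurrently or batch them.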