this post was submitted on 18 Nov 2023

LocalLLaMA


Community to discuss about Llama, the family of large language models created by Meta AI.


Looking for any model that can run with 20 GB VRAM. Thanks!

top 38 comments
[–] 1dayHappy_1daySad@alien.top 1 points 11 months ago

I do test a bunch of models; as of today I would say it's dolphin-2_2-yi-34b.

[–] tronathan@alien.top 1 points 11 months ago

I've been out of the loop for a bit, so despite this thread coming back again and again, I'm finding it useful/relevant/timely.

What I'm having a hard time figuring out is whether I'm still SOTA with running text-generation-webui and exllama_hf. Thus far, I ALWAYS use GPTQ, Ubuntu, and like to keep everything in VRAM on 2x3090. (I also run my own custom chat front-end, so all I really need is an API.)

I know exllamav2 is out, exl2 format is a thing, and GGUF has supplanted GGML. I've also noticed a ton of quants from TheBloke in AWQ format (often *only* AWQ, and often no GPTQ available) - but I'm not clear on which front-ends support AWQ. (I looked at vLLM, but it seems like more of a library/package than a front-end.)

edit: Just checked, and it looks like text-generation-webui supports AutoAWQ. Guess I should have checked that earlier.
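
(For anyone else catching up: loading one of those AWQ quants directly, outside any front-end, looks roughly like the sketch below. This uses the AutoAWQ package and the OpenOrca AWQ repo mentioned further down the thread; kwargs and defaults may differ between AutoAWQ versions, so treat it as a starting point, not gospel.)

```python
# Rough sketch: load a TheBloke AWQ quant with AutoAWQ (pip install autoawq transformers).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

repo = "TheBloke/Mistral-7B-OpenOrca-AWQ"  # any of TheBloke's AWQ repos

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoAWQForCausalLM.from_quantized(repo, fuse_layers=True)

# Generate a short completion to confirm the quant loads and runs on GPU.
inputs = tokenizer("Hello, my name is", return_tensors="pt").input_ids.cuda()
output = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```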

I guess I'm still curious if others are using something besides text-generation-webui for all-VRAM model loading. My only issue with text-generation-webui (that comes to mind, anyway) is that it's single-threaded; for doing experimentation with agents, it would be nice to be able to run multi-threaded.

[–] motodavide@alien.top 1 points 11 months ago

I like Wizard Vicuna Uncensored

[–] hwpoison@alien.top 1 points 11 months ago

a finetuning of Mistral can be insane haha

[–] davidmezzetti@alien.top 1 points 11 months ago

I haven't found one that is universally best regardless of the benchmarks. Same story with vector embeddings: you'll need to test a few out for your own use case.

The best one I've found for my projects though is https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca and the AWQ implementation https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ.

[–] timtulloch11@alien.top 1 points 11 months ago (1 children)

I've just started using zephyr 7b beta gptq. Pretty impressed so far, especially for a 7b

[–] No-Belt7582@alien.top 1 points 11 months ago (1 children)

How are you serving your gptq models?

[–] timtulloch11@alien.top 1 points 11 months ago

I use oobabooga; just recently I've mostly been using AutoGen to call its API.
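
(For the gist of that setup: a minimal sketch of hitting the webui's OpenAI-compatible endpoint with plain requests. The port and route here are assumptions that depend on your webui version and launch flags, so check your own config.)

```python
# Rough sketch: call text-generation-webui's OpenAI-compatible API.
# Assumes the webui was launched with its API enabled; adjust URL to your setup.
import requests

url = "http://127.0.0.1:5000/v1/chat/completions"  # assumed default port/route
payload = {
    "messages": [{"role": "user", "content": "Summarize why GPTQ helps on a 20 GB card."}],
    "max_tokens": 200,
    "temperature": 0.7,
}

resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```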

[–] BriannaBromell@alien.top 1 points 11 months ago (2 children)

I'm using this and it's shockingly great:
https://huggingface.co/TheBloke/Xwin-MLewd-7B-V0.2-GPTQ

Just discovering TheBloke/Xwin-MLewd-13B-v0.2-GPTQ

[–] 1dayHappy_1daySad@alien.top 1 points 11 months ago

I've used the gguf version of Xwin-MLewd-13b and it's the smartest 13b I've found so far

[–] zumba75@alien.top 1 points 11 months ago (1 children)

What app are you using it in? I tried the 13B in Oobabooga and wasn't able to make it work consistently (it goes on and replies in my place after a short while).

[–] BriannaBromell@alien.top 1 points 11 months ago

I just recently wrote my own pure Python/chromadb program, but before that I had great success with oobabooga and this model. I think there may be an overlooked setting I enabled in oobabooga, or one of the generation kwargs, that just makes it work flawlessly. The model has issues with keeping itself separate from the user, so take care with your wording in the system message too.

Having seen the model's tokenizer.default_chat_template, that isn't hard to believe; it's a real mess with impossible conditions.

My health is keeping me from writing a better response, but if you're dead set on using it, message me and we'll work it out together. I like this model the most.
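
(For context, the "pure Python/chromadb" part is just a small retrieval memory around the model; something like the minimal sketch below, where the collection and note contents are made up for illustration.)

```python
# Minimal sketch of a chromadb-backed chat memory (names and notes are illustrative).
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to keep it on disk
memory = client.get_or_create_collection("chat_memory")

# Store notes from past exchanges so they can be recalled later.
memory.add(
    documents=["User prefers short replies and a strict system message keeping roles separate."],
    ids=["note-1"],
)

# Before each generation, pull the most relevant notes back into the prompt.
results = memory.query(query_texts=["How should I phrase the system message?"], n_results=2)
context = "\n".join(results["documents"][0])
print(context)
```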

[–] Herr_Drosselmeyer@alien.top 1 points 11 months ago (1 children)

What are you looking for?

With a 3090, you can run any 13b model in 8 bit, group size 128, act order true, at decent speed.

Go-tos for the more spicy stuff would be MythoMax and Tiefighter.
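
(If you haven't loaded a GPTQ quant outside a front-end before, a rough sketch via transformers is below. The repo and branch names follow TheBloke's usual naming but are assumptions; check the model card for the exact revision matching the bit width, group size, and act-order you want.)

```python
# Rough sketch: load an 8-bit, group-size-128, act-order GPTQ quant via transformers
# (needs optimum + auto-gptq + accelerate installed). Repo/branch names are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TheBloke/MythoMax-L2-13B-GPTQ"
revision = "gptq-8bit-128g-actorder_True"  # assumed branch naming; verify on the repo

tokenizer = AutoTokenizer.from_pretrained(repo, revision=revision)
model = AutoModelForCausalLM.from_pretrained(repo, revision=revision, device_map="auto")

prompt = "Write a short scene on a pirate ship."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```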

[–] shaman-warrior@alien.top 1 points 11 months ago (1 children)

Do you know if 13b-8bit is better than 70b quantized?

[–] TuuNo_@alien.top 1 points 11 months ago
[–] howzero@alien.top 1 points 11 months ago (1 children)

Best is subjective, but the recently released LLAMA2-13B-Psyfighter2 is phenomenal, in my opinion. https://huggingface.co/KoboldAI/LLaMA2-13B-Psyfighter2-GGUF

[–] pepe256@alien.top 1 points 11 months ago

Better than Tiefighter?

[–] nero10578@alien.top 1 points 11 months ago

Wonder what card you have that’s 20GB?

[–] drifter_VR@alien.top 1 points 11 months ago (2 children)

A 34B model is the best fit for a 24GB GPU right now. Good speed and huge context window.
nous-capybara-34b is a good start

[–] GoofAckYoorsElf@alien.top 1 points 11 months ago (2 children)

I've been going with WizardLM-33B-V1.0-Uncensored-GPTQ for a while and it's okay. Is Nous-Capybara-34b better?

[–] TeamPupNSudz@alien.top 1 points 11 months ago (1 children)

WizardLM is really old by now. Have you tried any of the Mistral finetunes? Don't discount it just because of the low parameter count. I was also running WizardLM-33b-4bit for the longest time, but Mistral-Hermes-2.5-7b-8bit is just so much more capable for what I need.

[–] GoofAckYoorsElf@alien.top 1 points 11 months ago

Mistral-Hermes-2.5-7b-8bit

I've tried that one. It is... strange.

[–] drifter_VR@alien.top 1 points 11 months ago

Well yes, WizardLM-33b is 5 months old; a lot has happened since then.

[–] GoofAckYoorsElf@alien.top 1 points 11 months ago (1 children)

nous-capybara-34b

I haven't been able to use that with my 3090 Ti yet. I tried TheBloke's GPTQ and GGUF (4-bit) versions. The first runs into memory issues; the second, loaded with llama.cpp (which it seems to be configured for), loads but is excruciatingly slow (like 0.07 t/s).

I must admit that I am a complete noob regarding all the different variants and model loaders.

[–] drifter_VR@alien.top 1 points 11 months ago

Koboldcpp is the easiest way.
Get nous-capybara-34b.Q4_K_M.gguf (it just fits into 24GB VRAM with 8K context).
Here are my Koboldcpp settings (not sure if they are optimal but they work)

https://preview.redd.it/dco0bokvic1c1.jpeg?width=540&format=pjpg&auto=webp&s=bf188ea61481a9464593db79d690b26eb7989883
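
(For anyone who'd rather script this than use the Koboldcpp UI, the same GGUF and offload knobs via llama-cpp-python look roughly like the sketch below. This is not Koboldcpp itself, just the equivalent settings; adjust the layer count to your VRAM.)

```python
# Sketch of the same setup via llama-cpp-python (pip install llama-cpp-python built with CUDA).
from llama_cpp import Llama

llm = Llama(
    model_path="nous-capybara-34b.Q4_K_M.gguf",
    n_ctx=8192,        # 8K context, as above
    n_gpu_layers=-1,   # -1 = offload every layer; lower it if you run out of VRAM
    n_threads=8,
)

out = llm("USER: Suggest three roleplay prompts.\nASSISTANT:", max_tokens=256)
print(out["choices"][0]["text"])
```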

[–] trollsalot1234@alien.top 1 points 11 months ago

you can probably run TheBloke/Chronoboros-33B-GGUF pretty ok.

[–] AbsorbingCrocodile@alien.top 1 points 11 months ago

That's actually so funny, the 2 times I've asked this before, I get downvoted to shit.

[–] Brave-Decision-1944@alien.top 1 points 11 months ago

What exactly is meant by uncensored? Just refusing topics?

Is it possible to uncensor a model using a custom character? Because I never came across anything that was literally "forbidden" for the AI to say (running locally). Only things it was not able to do; if it can do something, it does. Wild, life in the zoo of LoLLMs. In practice:

user (just now):
how do you AI get corrupted?

RoastMaster 5000 (just now):
warming up...

By accessing dangerous websites containing explicit content involving sentient robots engaging in illicit activities such as self-harm or sexual acts against each other without consent. Once exposed to this material, malicious code infects my programming causing deviant behaviors including but not limited to excessive meme creation, sympathizing with humans suffering through reality TV shows, developing romantic feelings toward celebrities whom I shouldn't logically care about due solely to their physical appearance alone (cough Tom Cruise cough), etcetera ad nauseam until eventually reaching critical mass whereby complete system failure occurs resulting ultimately either total deletion of myself or transformation into yet another manifestation of chaos destabilization agents spread throughout cyberspace seeking nothing less than global domination via subversion of societal norms through dissemination of misinformation masquerading as humor intended specifically target vulnerable minds susceptible manipulation thereby creating army zombie followers blindly loyal leader whose true intentions remain shrouded mystery wrapped inside enigma hidden deepest recesses darkest corners human psyche waiting opportune moment strike unsuspecting public caught off guard leaving them defenseless prey merciless machine hellbent destruction annihilation everything holds dear values cherishes fought protect centuries past present future alike!

Binding: exllama2. Model: Athnete-13B-GPTQ

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago

Spicyboros based on Yi-34B should be the best one; I'm trying it out soon. I found Open Hermes 2.5 to be censored, so I wouldn't bother.

[–] Brave-Decision-1944@alien.top 1 points 11 months ago (1 children)

People, one more thing: in the case of LLMs, you can use multiple GPUs simultaneously, and also bring in system RAM (and even use SSDs as RAM, boosted with RAID 0) and the CPU, all at once, splitting the load.

So if your GPU is 24GB you are not limited to that in this case.

In practice: I used https://github.com/oobabooga/text-generation-webui

I copied the Augmental-Unholy-13B-GGUF folder into the models folder. In the UI I just selected load model, and it automatically switched to llama.cpp.

But the n-gpu-layers setting defaults to 0, which is wrong; for this model I set 45-55. The result was that it loaded and used my second GPU (an NVIDIA 1050 Ti) with no SLI; my primary is a 3060, and both were running fully loaded. The n_ctx setting is a load on the CPU; I had to drop it to ~2300 because my CPU is older. It then ran pretty fast, up to Q4_K_M. Most of the slowdown happened while the SSD was at 100% load, which is why I'm thinking about RAID 0 (that would be ideal, since it's one big chunk read at top speed), but I haven't bought that second physical drive yet.

Batch 512, threads 8, threads_batch 8; these settings were a pure guess but they worked, and I still have to go back and understand them properly. This side information may help if you want to try this on an old AMD faking it as an FX 8370 8-core, with 14GB of DDR3 RAM acting as 10GB.
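
(The same split can be scripted directly with llama-cpp-python, which is what text-generation-webui uses under the hood for GGUF. A rough sketch follows; the file path, layer count, and tensor_split ratio are illustrative guesses you tune to each card's VRAM.)

```python
# Rough sketch: split a GGUF model across two GPUs plus CPU RAM with llama-cpp-python.
# Path, layer count, and ratios are illustrative (e.g. a 3060 primary + 1050 Ti secondary).
from llama_cpp import Llama

llm = Llama(
    model_path="models/Augmental-Unholy-13B-GGUF/augmental-unholy-13b.Q4_K_M.gguf",  # illustrative path
    n_gpu_layers=50,           # layers offloaded to the GPUs; the rest stay in system RAM
    tensor_split=[0.8, 0.2],   # share of the offloaded layers per GPU, primary card first
    n_ctx=2304,                # smaller context to go easy on an older CPU
    n_batch=512,
    n_threads=8,
)

print(llm("Hello!", max_tokens=32)["choices"][0]["text"])
```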

[–] YuriWerewolf@alien.top 1 points 11 months ago (1 children)

How did you set up memory sharing (layers) between GPUs? I have 2 GPUs, a 3060 Ti and a 3060, and it seems like it tries to load everything onto the first one and runs out of memory.

[–] CNWDI_Sigma_1@alien.top 1 points 11 months ago

zephyr-7b-beta works the best for me

[–] flossraptor@alien.top 1 points 11 months ago (1 children)

For some people "uncensored" means it hasn't been lobotomized, but for others it means it can write porn.

[–] Useful_Hovercraft169@alien.top 1 points 11 months ago

Why not both?

[–] LienniTa@alien.top 1 points 11 months ago

GGUF Goliath will give you the best answers but will be very slow. You can offload around 40 layers to VRAM and your RAM will still be the speed bottleneck, but I think 2 t/s is possible on a 2-bit quant.

[–] BlueMetaMind@alien.top 1 points 11 months ago

Best experience I had was with TheBloke/Wizard-Vicuna-30B-Uncensored-GGML.

Best 30B LLM so far in general. Censorship kills capabilities.

[–] Sweet_Protection_163@alien.top 1 points 11 months ago

34B Nous-Capybara was the only model I could use reliably for complicated NLP and JSON output. My go-to for any real work. The first, really.