WizardLM (WizardLM-70b-v1.0.Q8_0 when quality is needed, WizardLM-30B Q5_K_M when speed is needed).
LocalLLaMA
Community to discuss Llama, the family of large language models created by Meta AI.
If you can run Q8_0 but use Q5_K_M for speed, any reason you don't just run an exl2 at 8bpw?
Using Yi-34b Dolphin right now.
Can't wait to try Qwen 14b and 72b.
I use SOLAR-v0-70b, one of the best models out there. The main thing I like is that the creators of this model ("Upstage") run inference themselves, so you can just connect to it via API. It's the best quality for the best price imo.
They run their inference on together.ai if you are interested.
Mostly I'm still using slightly older models, with a few slightly newer ones now:
- marx-3b-v3.Q4_K_M.gguf for "fast" RAG inference,
- medalpaca-13B.ggmlv3.q4_1.bin for medical research,
- mistral-7b-openorca.Q4_K_M.gguf for creative writing,
- NousResearch-Nous-Capybara-3B-V1.9-Q4_K_M.gguf for creative writing, and probably for giving my IRC bots conversational capabilities (a work in progress),
- puddlejumper-13b-v2.Q4_K_M.gguf for physics research, questions about society and philosophy, "slow" RAG inference, and translating between English and German,
- refact-1_6b-Q4_K_M.gguf as a coding copilot, for fill-in-the-middle,
- rift-coder-v0-7b-gguf.git as a coding copilot when I'm writing Python or trying to figure out my coworkers' Python,
- scarlett-33b.ggmlv3.q4_1.bin for creative writing, though less than I used to.
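For the fill-in-the-middle copilot use mentioned above, the prompt has to be assembled with special FIM tokens rather than sent as plain text. A minimal sketch of what that looks like, assuming StarCoder-style token names (which, as far as I can tell, Refact-family models also use; check the model card before relying on them):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the code before and after the cursor in FIM control tokens.

    The model is then expected to generate the "middle" span that fits
    between prefix and suffix. Token names are an assumption here.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# Example: ask the model to complete the body of a function.
prompt = build_fim_prompt("def add(a, b):\n    return ", "\n")
print(prompt)
```

You'd pass `prompt` to whatever runner you use (llama.cpp, etc.) with the model's own tokenizer so the control tokens are mapped to their special IDs instead of being split into plain text.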
I also have several models which I've downloaded but not yet had time to evaluate, and I'm downloading more as we speak (though even more slowly than usual; a couple of weeks ago my download rates from HF dropped by roughly a third, and I don't know why).
Some which seem particularly promising:
- yi-34b-200k-llamafied.Q4_K_M.gguf
- rocket-3b.Q4_K_M.gguf
- llmware's "bling" and "dragon" models. I'm downloading them all, though so far there are only GGUFs available for three of them. I'm particularly intrigued at the prospect of llmware-dragon-falcon-7b-v0-gguf, which is tuned specifically for RAG and is supposedly "hallucination-proofed", and llmware-bling-stable-lm-3b-4e1t-v0-gguf, which might be a better IRC-bot conversational model.
Of all of these, the one I use most frequently is PuddleJumper-13B-v2.