What would have happened if ChatGPT was invented in the 17th century? MonadGPT is a possible answer.
TheBloke/MonadGPT-GGUF
Goliath 120B
With a system-limited machine (2017 i5 iMac, CPU only) I am getting very pleasing results with:
OpenHermes 2 Mistral (7B, 4-bit K_M quant) for general chat, desktop assistant, and some coding assistance - Ollama backend with my own front-end UI and a llama-index implementation. Haven't tried 2.5 but may.
Synatra 7B Mistral fine-tune (4-bit K_M quant) seems to produce longer, spicier responses with the same system prompt (same use case as above)
Deepseek-coder 6.7B (4-bit quant) as a coding assistant alternative to GPT-3.5 - just trying it out in the last week or so and building a personalized coding assistant front-end UI for fun
OrcaMini-3B - for chat when I just want something smaller and faster to run on my machine - the 7B quants are about max for the old iMac. But OrcaMini sometimes doesn’t give great stuff for me.
IIUC, for coding you suggest deepseek-coder-6.7b-instruct.Q4_K_M.gguf, right? Can I run it with 16 GB? I'm on an i5 Windows machine, using LM Studio.
Yes that’s the one from The Bloke. I imagine you could, but try it! I can run it on an old i5 3.4 GHz chip with 8GB RAM and it seems to run as long as I’m not trying to keep a bunch of stuff open and using up RAM. I haven’t really used it a lot so can’t tell fully yet.
openhermes 2.5 as an assistant
tiefighter for other use
Openhermes seems pretty capable of "other use", no?
Llama2-70B for generating the plan, then CodeLlama-34B for coding, or Llama-13B for executing the instructions from Llama2-70B.
Currently in the process of exploring what other models to add once Llama2-70B generates the plan for what needs to get done.
What do you mean by generating the plan? Can you describe your workflow ?
Let's say you've got a task like "write a blog post". Instead of issuing a single command, have a GPT model plan it out. Something akin to:
System: You are a planning AI. You will come up with a plan that will assist the user in any task they need help with as best you can. You will lay out a clear, well-structured plan.
User: Hello Planner AI, I need your help with coming up with a plan for the following task: {user_prompt}
So now Llama2-70B generates a plan with numbered steps. Next, you can regex on the numbers and then pass each step along to a worker model that executes it. Because LLMs write more than humans do and add extra details that other LLMs can follow, the downstream models will do a better job of executing the task than if you simply asked a smaller model to "write me a blog post about 3D printing D&D minis". Now replace the blog-post task with whatever it is you're doing and you'll be getting results.
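A minimal sketch of that plan-then-execute loop in Python, assuming a local OpenAI-compatible chat endpoint; the URL, model names, prompts, and helper function here are placeholders for illustration, not the commenter's actual setup:

```python
import re
import requests

API_URL = "http://localhost:5000/v1/chat/completions"  # placeholder local endpoint

def chat(model, system, user):
    """Send one chat request to a local OpenAI-compatible server (assumed setup)."""
    resp = requests.post(API_URL, json={
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

user_prompt = "write a blog post about 3D printing D&D minis"

# 1) Ask the big model for a numbered plan.
plan = chat(
    "llama2-70b",  # planner model name is a placeholder
    "You are a planning AI. Lay out a clear, numbered plan for the user's task.",
    f"Hello Planner AI, I need your help with coming up with a plan for the following task: {user_prompt}",
)

# 2) Regex on the numbering ("1.", "2.", ...) to split the plan into steps.
steps = re.findall(r"^\s*\d+[.)]\s*(.+)$", plan, flags=re.MULTILINE)

# 3) Hand each step to a smaller worker model to execute.
results = []
for step in steps:
    results.append(chat(
        "llama2-13b",  # worker model name is a placeholder
        "You are a worker AI. Carry out the single step you are given, in detail.",
        f"Overall task: {user_prompt}\nYour step: {step}",
    ))

print("\n\n".join(results))
```

The regex only catches lines that start with "1.", "2.", and so on, so it depends on the planner actually numbering its steps; a stricter output format would be more robust.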
Wow. Thank you so much for this explanation !!! ❤️
A few folks mentioning EXL2 here. Is this now the preferred Exllama format over GPTQ?
I won't use anything else for GPU processing.
The quality bump I've seen for my 4090 is very noticeable in speed, coherence and context.
Wild to me that TheBloke doesn't ever use it.
Easy enough to find quants though if you just go to models and search "exl2" and sort by whatever.
In addition to what others said, exl2 is very sensitive to the quantization dataset, which it uses to choose where to assign those "variable" bits.
Most online quants use wikitext. But I believe if you quantize models yourself on your own chats, you can get better results, especially below 4bpw.
EXL2 provides more options and, as far as I know, has a smaller quality decrease.
EXL2 runs fast and the quantization process implements some fancy logic behind the scenes to do something similar to k_m quants for GGUF models. Instead of quantizing every slice of the model to the same bits per weight (bpw), it determines which slices are more important and uses a higher bpw for those slices and a lower bpw for the less-important slices where the effects of quantization won't matter as much. The result is the average bits per weight across all the layers works out to be what you specified, say 4.0 bits per weight, but the performance hit to the model is less severe than its level of quantization would suggest because the important layers are maybe 5.0 bpw or 5.5 bpw, something like that.
In short, EXL2 quants tend to punch above their weight class due to some fancy logic going on behind the scenes.
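To make that concrete, here is a toy illustration (not EXL2's actual quantizer) of how per-layer bit widths can differ while the size-weighted average still lands on the requested bpw. The layer names, sizes, and bit assignments are made up for the example:

```python
# Toy illustration only: important layers get more bits, less important layers fewer,
# and the size-weighted average still lands on the target bpw.

layers = [
    # (name, number of weights, assigned bits per weight) -- invented values
    ("attn.q_proj",    4_000_000, 5.5),
    ("attn.k_proj",    4_000_000, 5.0),
    ("mlp.up_proj",   11_000_000, 3.6),
    ("mlp.down_proj", 11_000_000, 3.5),
]

total_bits    = sum(n * bpw for _, n, bpw in layers)
total_weights = sum(n for _, n, _ in layers)

print(f"average bpw: {total_bits / total_weights:.2f}")  # ~4.0, despite layers ranging 3.5-5.5
```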
Thank you! I'm reminded of variable bit rate encoding used in various audio and video formats, this sounds not dissimilar.
Yi-34B-Chat
It's not the most uncensored, and probably not the best, but I really like its prose and coherence.
And the Q4_K_M GGUF runs on my 32 GB RAM laptop.
(and yes it's slow)
What kind of stuff do you use it for?
stupid stuff and silly scenarios. My latest:
Jane, Marc's bratty, over-energetic sister, really wants to borrow Marc's shiny new convertible. Marc is not so sure...
Write their over-the-top bickering. Jane is relentless and stops at nothing, to Marc's exasperation.
For all serious stuff I use GPT-4, of course.
13B and 20B Noromaid for RP/ERP.
I am experimenting with comparing GGUF to EXL2 as well as stretching context. So far, Noromaid 13b at GGUF Q5_K_M stretches to 12k context on a 3090 without issues. Noromaid 20B at Q3_K_M stretches to 8k without issues and is in my opinion superior to the 13B. I have recently stretched Noromaid 20B to 10k using 4bpw EXL2 and it is giving coherent responses. I haven't used it enough to assess the quality however.
All this is to say, if you enjoy roleplay you should be giving Noromaid a look.
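For reference, this kind of context stretching is usually done with RoPE scaling, and the arithmetic is simple; a rough sketch assuming a Llama-2-based model with a 4096-token native context (the exact option names vary by loader, so treat these as illustrative, not the commenter's settings):

```python
# Rough arithmetic for linear RoPE (position interpolation) scaling.
# Assumptions: Llama-2-based model, 4096-token native context, stretched to 12k.
# Loaders expose this either as a compression factor or as a frequency scale below 1.

native_ctx = 4096
target_ctx = 12288

compress_factor = target_ctx / native_ctx   # positions are squeezed by this factor
freq_scale      = native_ctx / target_ctx   # same thing expressed as a scale < 1

print(f"compression factor: {compress_factor:.1f}")   # 3.0
print(f"frequency scale:    {freq_scale:.3f}")        # 0.333
```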
TheBloke/mistral-7B-finetuned-orca-dpo-v2-GGUF
It makes most 13B models bite the dust. I use it for a local application, so inference is CPU-only using llama.cpp with CLBlast support compiled in. It generates about 10 tokens/sec on a Dell laptop with an Intel i7.
Deepseek coder 34b for code
OpenHermes 2.5 for general chat
Yi-34b chat is ok too, but I am a bit underwhelmed when I use it vs Hermes. Hermes seems to be more consistent and hallucinate less.
It’s amazing that I am still using 7b when there are finally decent 34b models.
Did you notice a big difference between Deepseek coder 34B and its 7B version? What are the system requirements for 34B? It looks to be around 70 GB in size...
I honestly haven’t tried the 6.7b version of Deepseek yet, but I’ve heard great things about it!
You can run 34B models in a Q4_K_M quant because it's only ~21 GB. I run it with one 3090.
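The sizes quoted above follow from back-of-the-envelope arithmetic: parameters times bits per weight, divided by 8. A rough estimate, assuming roughly 33B parameters and ~4.8 bits/weight on average for Q4_K_M (real files carry some extra overhead):

```python
# Rough file-size estimate: parameters * bits-per-weight / 8.

params = 33e9   # "34B"-class model

fp16_gb = params * 16  / 8 / 1e9   # ~66 GB, which is where the ~70 GB figure comes from
q4km_gb = params * 4.8 / 8 / 1e9   # ~20 GB, close to the ~21 GB quoted above

print(f"fp16:   ~{fp16_gb:.0f} GB")
print(f"Q4_K_M: ~{q4km_gb:.0f} GB")
```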
deepseek & phind for code
Goliath-120B (specifically the 4.85 BPW quant) is the only model I use now, I don't think I can go back to using a 70B model after trying this.
Mistral-7B-Instruct 4_K quant and openhermes2.5-7B-mistral 4_K quant. Still testing the waters but starting with these two first.
What kind of use cases you using them for?
I'm one of those weirdos merging 70b models together for fun. I mostly use my own merges now as they've become quite good. (Link to my Hugging Face page where I share my merges.) I'm mostly interested in roleplaying and storytelling with local LLMs.
What method do you use to merge them? Mixture of experts?
There are several popular methods, all supported by the lovely mergekit project at https://github.com/cg123/mergekit.
The ties merge method is the newest and most advanced method. It works well because it implements some logic to minimize how much the models step on each other's toes when you merge them together. Mergekit also makes it easy to do "frankenmerges" using the passthrough method where you interleave layers from different models in a way that extends the resultant model's size beyond the normal limits. For example, that's how goliath-120b was made from two 70b models merged together.
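To illustrate the passthrough idea, here is a conceptual sketch of a frankenmerge in Python. This is not mergekit's config format or code, and the layer ranges are invented, not goliath-120b's actual recipe:

```python
# Conceptual sketch of a passthrough "frankenmerge": take overlapping slices of layers
# from two donor models and stack them into a taller model.

model_a_layers = [f"A.layer.{i}" for i in range(80)]  # a 70B-class model has 80 layers
model_b_layers = [f"B.layer.{i}" for i in range(80)]

# (source, start, end) slices, applied in order; the overlaps are what make the merge "grow".
slices = [
    (model_a_layers,  0, 40),
    (model_b_layers, 20, 60),
    (model_a_layers, 40, 80),
]

merged = [layer for source, start, end in slices for layer in source[start:end]]
print(f"merged model has {len(merged)} layers")  # 120 layers here vs. 80 in each donor
```

Mergekit expresses the same idea declaratively in its configuration files; the overlapping slices are why the result ends up larger than either donor model.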
I'm really late on this one, but Dolphin 2.0 Mistral 7B. I did a little extra training on it for some automation and the thing's ridiculously solid, fast, and light on resource usage. I'm still cleaning up the output a bit after it chugs away at night, but only to a pretty minor degree.
Though if failures count, then Yi 34B is up there in terms of usage this week too, as I fail a million times over just to train a simple, single, usable LoRA for it.
34b CapyB for production work.
Has anyone tried out TheBloke's quants for the 7B OpenHermes 2.5 Neural Chat v3-1?
7b OpenHermes 2.5 was really good by itself, but the merge with neural chat seems REALLY good so far based on my limited chats with it.
https://huggingface.co/TheBloke/OpenHermes-2.5-neural-chat-7B-v3-1-7B-GGUF
This seems like it would be pretty good. Downloading now to try it, thanks!
Any chance of posting your settings? :D
After seeing your comment I tried the OpenHermes-2.5-neural-chat-7B-v3-1-7B-GGUF model you mention.
Unfortunately, set up the way I am, it didn't respond very well for me.
Honestly, I don't think the concept of that merge is too good, to be frank.
OpenHermes is fantastic. If I had to state its flaws, I'd say its prose is a bit dry and the dialogue tends to speak past you rather than clearly responding to you. Those are only issues for roleplay, really.
From all I've read, NeuralChat is much the same (though to be honest I've not gotten NeuralChat to work particularly well for me at all), so I would expect any merge of those two models to be a bit lacking in the roleplay department.
That said if you are wanting a model for more professional purposes it might be worth further testing.
For roleplay Misted-7B is leagues better. At least in my testing in my setup.
Because a model can be divine or crap depending on the settings, I think it's important to specify that I use:
Deepseek 33B Q8 GGUF with the Min-p setting (I love it very much; see the sketch of what Min-p does below)
Source of my Min-p settings: the "Your settings are (probably) hurting your model - Why sampler settings matter" post on r/LocalLLaMA
70B Storytelling Q5_K_M
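For anyone new to the sampler: Min-p keeps only the tokens whose probability is at least some fraction of the top token's probability, then renormalizes. A minimal sketch of that filtering step (illustrative only, not any particular backend's implementation):

```python
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float = 0.05) -> np.ndarray:
    """Keep tokens with probability >= min_p * p(top token), zero the rest, renormalize."""
    threshold = min_p * probs.max()
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()

# A peaked distribution keeps few candidates; a flatter one keeps many.
probs = np.array([0.60, 0.20, 0.10, 0.05, 0.03, 0.02])
print(min_p_filter(probs, min_p=0.1))  # drops everything below 0.1 * 0.60 = 0.06
```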
I’m late to the party on this one.
I’ve been loving the 2.4BPW EXL2 quants from Lone Striker recently, specifically using Euryale 1.3 70B and LZLV 70B.
Even at the smaller quant, they’re very capable, and leagues ahead of smaller models in terms of comprehension and reasoning. Min-P sampling parameters have been a big step forward, as well.
The only downside I can see is the limitation to context length on a single 24GB VRAM card. Perhaps further testing of Nous-Capybara 34B at 4.65BPW on EXL2 is in order.
Remember to try the 8-bit cache if you haven't yet; it should get you to 5.5k tokens of context.
You can get around 10-20k context length with 4bpw Yi-34B 200K quants on a single 24GB card.
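The reason the 8-bit cache helps is plain arithmetic: for a fixed VRAM budget, halving the bytes per KV-cache element roughly doubles the context that fits. A rough sizing sketch, assuming a Llama-2-70B-style architecture (80 layers, 8 KV heads after GQA, head dim 128); real overheads will differ:

```python
# Rough KV-cache sizing under the stated architectural assumptions.

n_layers, n_kv_heads, head_dim = 80, 8, 128

def kv_cache_gb(ctx_len, bytes_per_elem):
    # 2 tensors (K and V) per layer, per token
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

print(f"fp16 cache @ 4k ctx: {kv_cache_gb(4096, 2):.2f} GB")
print(f"8-bit cache @ 8k ctx: {kv_cache_gb(8192, 1):.2f} GB")  # same VRAM, twice the context
```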
I'm really digging https://huggingface.co/TheBloke/PsyMedRP-v1-20B-GGUF for storytelling. I wish I could use a higher GGUF but it's all that I can manage atm.
Exclusively 70B models. Current favorite is:
Although ask me again a week from now and my answer will probably change. That's how quickly improvements are coming.
It’s only been a day but have you changed? I find this model misspells a lot with the gguf i downloaded.
At the current moment I have not changed, but Wolfram released a good rankings list that makes me want to test Tess-XL-v1.0-120b and Venus-120b.
I'm using lzlv GPTQ via ST's Default + Alpaca prompt and didn't have misspelling issues. Wolfram did notice misspelling issues when using the Amy preset (e.g. "sacrficial"), so maybe switch the preset?
Yi 34b
Only using it because I'm in the middle of an upgrade and so far all I've added is an extra stick of RAM, which lets me just barely run Yi 34B. Waiting on another stick of RAM plus a second GPU to run LZLV 70B.