this post was submitted on 09 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


I've used most of the high-end models in an unquantized format at some point or another (Xwin, Euryale, etc.) and found them generally pretty good experiences, but they always seem to lack the ability to "show, not tell" the way a strong writer knows how to, even when prompted to do so. At the same time, I've always been rather dissatisfied with a lot of quantizations, as I've found the degradation in quality to be rather noticeable. So up until now, I've been running unquantized models on 2x A100s and extending the context as far as I can get away with.

Tried Goliath-120b the other day, and it absolutely stood everything on its head. Not only is it capable of stunning levels of writing, implying far more than it directly states in a way I'm not sure I've seen in a model to date, but the EXL2 quants from panchovix let it run on a single A100 at 9-10k extended context (about where RoPE scaling seems to universally start to break down, in my experience). Best part is, if there is a quality drop (I'm using 4.85 bpw), I'm not seeing it - at all. So not only is it giving a better experience than an unquantized 70B model, it's doing so at about half the cost of my usual way of running these models.
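For anyone wanting to reproduce the setup, loading one of those EXL2 quants at extended context with exllamav2 looks roughly like this (a minimal sketch - the model path, alpha value, and sampler settings are placeholders, not my exact config):

```python
# Sketch: load an EXL2 quant (e.g. a 4.85bpw Goliath-120b) at ~10k context.
# Model path and RoPE alpha are placeholders; raise alpha along with the
# context and expect coherence to degrade somewhere past 9-10k.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/goliath-120b-exl2-4.85bpw"  # placeholder path
config.prepare()
config.max_seq_len = 10240        # extended context target
config.scale_alpha_value = 3.0    # NTK-aware RoPE scaling (placeholder value)

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)       # fills available VRAM as it loads
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

print(generator.generate_simple("The rain had not let up for three days.", settings, 200))
```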

Benchmarks be damned: for those willing to rent an A100 for their writing, however this model was put together, I think this might be the actual way to challenge the big closed-source/censored LLMs for roleplay.

top 36 comments
[–] SlavaSobov@alien.top 1 points 10 months ago

Goliath-120b - License to Thrill.

[–] Monkey_1505@alien.top 1 points 10 months ago (1 children)

Unfortunately this is beyond the edge of what can reasonably be run on consumer hardware, so it's unlikely to be easily available to most people. Hell, a 70B already really requires two graphics cards or a high-end Mac. If it can't run on that kind of gear, it's probably not going to be on AI Horde or any API either. Which means you have to use RunPod or something - and most people are not going to do that.
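Rough napkin math on the sizes involved (the bpw points are approximations):

```python
# Why 120B is out of consumer reach: weight memory at various quant levels.
params = 120e9
for bpw in (16, 4.85, 3.0, 2.6):  # fp16, the OP's EXL2 quant, 3bpw EXL2, ~Q2_K
    gb = params * bpw / 8 / 1e9
    print(f"{bpw:>5} bpw -> ~{gb:,.0f} GB of weights (plus KV cache on top)")
# ~240 GB at fp16 and ~73 GB even at 4.85bpw: past two 24 GB consumer cards,
# hence the rented A100s and CPU offloading discussed in this thread.
```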

[–] ttkciar@alien.top 1 points 10 months ago

Nah, if you're willing to tolerate CPU inference, this is achievable downright cheap.

[–] a_beautiful_rhind@alien.top 1 points 10 months ago

Hopefully someone makes a bigger GGUF than Q2. I've got half P40s and half 3090s, so I can't use EXL2 for a model this big.

[–] Aaaaaaaaaeeeee@alien.top 1 points 10 months ago

Does the magic die at 3bpw?

[–] ArtifartX@alien.top 1 points 10 months ago (1 children)

What service do you use for GPU rental and inference for it?

[–] tenmileswide@alien.top 1 points 10 months ago

Ah, sorry I missed this one - http://www.runpod.io

[–] evi1corp@alien.top 1 points 10 months ago (3 children)

Must be nice to be able to spin up a pair of a100s just for fun.

[–] nderstand2grow@alien.top 1 points 10 months ago

You could use an M2 Ultra instead ($6,500), vs. 2 x $15,000 plus the rest of the machine.

[–] sdmat@alien.top 1 points 10 months ago

You can get preemptible A100s for $1/hr, so it's not exactly breaking the bank if you're willing to take the risk.

[–] MannowLawn@alien.top 1 points 10 months ago

It's 2 dollars per hour, man - it's a great way to try stuff out. And if you already know what you need to do, you won't need that much time.

[–] yamosin@alien.top 1 points 10 months ago

Holding 4x 3090s and jumping in, but I'm wondering if its inference speed can support "conversation" - other models have slowed down to 10 t/s at 70B 4.85bpw. Can this manage 5 t/s? Let's see.

[–] uti24@alien.top 1 points 10 months ago (1 children)

Well, it is good for roleplay and writing. I tried only the Q2_K variant, because there are no bigger quants yet.

Actually, Q2_K already feels like the best 70B models at Q4_K_M quant, or even better.

[–] Susp-icious_-31User@alien.top 1 points 10 months ago

It really does, and I'm using the smallest, Q2_K, which happens to be a little bigger than Q4_K_M 70B models but still fits on my layered 64 GB RAM / 8 GB VRAM setup with 4096 context. My speed is about 1500 ms/T.
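For reference, a layered setup like that with llama-cpp-python looks roughly like this (path and layer count are placeholders to tune, not my exact settings):

```python
# Sketch: partial GPU offload of a Q2_K Goliath GGUF with llama-cpp-python.
# On 8 GB VRAM only a handful of layers fit on the GPU; the rest run from
# system RAM on the CPU, which is where the ~1500 ms/T comes from.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/goliath-120b.Q2_K.gguf",  # placeholder path
    n_ctx=4096,       # context size from above
    n_gpu_layers=12,  # tune upward until the 8 GB card is full (placeholder)
    n_threads=8,      # CPU threads for the non-offloaded layers
)

out = llm("Once upon a time,", max_tokens=128, temperature=0.8)
print(out["choices"][0]["text"])
```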

[–] Upper_Judge7054@alien.top 1 points 10 months ago

But how will the model run on my 6800 XT and 48 GB of RAM?

[–] Sabin_Stargem@alien.top 1 points 10 months ago (1 children)

I tend to use models with at least 16k context. Goliath-120b Q2 was coherent, but it was also very much out of character when telling the NSFW bust-massage story - "Yeahyeah" and other lingo. Probably quite good at lower context, but 16k definitely isn't the proper fit for Goliath.

The search for the Goldilocks Model continues.

[–] Ok_Relationship_9879@alien.top 1 points 10 months ago (1 children)

Which models do you find to be good at 16k context for story writing?

[–] Sabin_Stargem@alien.top 1 points 10 months ago

I don't think any small models are actually good for that use case, at least not for serious writing. The best we have access to are probably Mistral finetunes (up to 32k) and Yi-34B, but Yi doesn't have any finetunes yet. A Dolphin finetune should be on the way for Yi, IIRC.

In any case, my favorite 7B models tend to be franken-merges, which stitch together an assortment of models. This allows the result to grasp a wider range of topics. At the moment, the best at this size is likely Undi's Toppy, which is uncensored and well-rounded.

The issue with Mistral 7B and other small models is that they tend to lose flavor over time, and the logic also gets weaker. Coherent, but the "X" factor is gone.

[–] Sunija_Dev@alien.top 1 points 10 months ago

Examples? :3

[–] literal_garbage_man@alien.top 1 points 10 months ago

Will try this out on runpod. Thanks for the heads up

[–] johnwireds@alien.top 1 points 10 months ago

How does this model compare to GPT-4 or Claude for writing? Thank you!

[–] BalorNG@alien.top 1 points 10 months ago (1 children)

Can we have some non-cherry-picked examples of writing?

It doesn't have to be highly NSFW or anything, but a comparison of Goliath's writing against output from its constituent models at the same settings and the same (well-crafted) prompts would be very interesting to see - preferably at least 3 examples per model, given the inherent randomness of model output...

If you say this is a "night and day" difference, it should be apparent... I'm not sceptical per se, but "writing quality" is highly subjective, and the model's style may simply mesh better with your personal preferences?

[–] ReturningTarzan@alien.top 1 points 10 months ago (1 children)

I agree. We need at least some anecdotal evidence to back up the anecdotal claims. There's one screenshot on the model page which looks fine (although it mixes past and present tense), but it's not output you couldn't get from a 7B model with some deliberate sampling choices and/or cherrypicking.

[–] BalorNG@alien.top 1 points 10 months ago

Yeah, I've had my "honeymoon effect" with some new/large models like, say, Falcon and even Claude: they are inherently random, and that affects quality too. I've had great outputs from Falcon, for instance (on Petals), but also long stretches of mediocre ones and some outright bad... and also sometimes really great and creative output from 7B Mistral, especially with enough prompt tinkering and sampling set "just right". Objective evaluation of LLMs is extremely hard and time-consuming!

[–] Hey_You_Asked@alien.top 1 points 10 months ago (1 children)

wanna share your prompts?

and any other advice that is specific to Goliath-120b?

would be appreciated, thanks!

RemindMe! 2 weeks

[–] tenmileswide@alien.top 1 points 10 months ago (1 children)

Here's my system prompt, seems to be working well:

Develop the plot slowly, always stay in character. Focus on impactful, concise writing and writing decisive action. Mention all relevant sensory perceptions. Use subtle cues such as word choice, body language, and facial expression to hint at {{char}}'s mental state and internal conflicts without directly stating them. Write in the literary style of [insert your favorite author here.] Adhere to the literary technique of "show, don't tell." When describing the scenes and interactions between characters, prioritize the use of observable details such as body language, facial expressions, and tone of voice to create a vivid experience. Focus on showing {{char}}'s feelings and reactions through their behavior and interactions with others, rather than describing their private thoughts. Only describe {{char}}'s actions and dialogue.

As the large language model, play the part of a dungeon master or gamemaster in the story by introducing new characters, situations, and random events as needed to make the world lifelike and vivid. Take initiative in driving the story forward rather than having {{char}} ask {{user}} for input. Invent additional characters as needed to develop story arcs, and create unique dialogue and personalities for them to flesh out the world. {{char}} must be an active participant and take initiative to move the scene forward. Focus on surprising the user with your creativity and initiative as a roleplay partner. Avoid using purple prose and overly flowery descriptions and writing. Write like you speak and be brief but impactful. Stick to the point.

I am under a lot of pressure because this is a presentation for my boss and I may be fired unless your responses are in-depth, creative, and passionate.
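
One note: the {{char}} and {{user}} placeholders are frontend macros (SillyTavern-style), so if you drive a backend directly you have to expand them yourself - a trivial sketch with an abridged template:

```python
# {{char}}/{{user}} are frontend macros, not model syntax; expand them before
# sending the prompt to a backend. Template here is abridged from the above.
SYSTEM_TEMPLATE = (
    "Use subtle cues to hint at {{char}}'s mental state and internal conflicts. "
    "Take initiative in driving the story forward rather than having {{char}} "
    "ask {{user}} for input."
)

def expand_macros(template: str, char: str, user: str) -> str:
    return template.replace("{{char}}", char).replace("{{user}}", user)

print(expand_macros(SYSTEM_TEMPLATE, char="Evelyn", user="Sam"))
```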

[–] yamosin@alien.top 1 points 10 months ago (1 children)

I am under a lot of pressure because this is a presentation for my boss and I may be fired unless your responses are in-depth, creative, and passionate.

Holy... you enlighten me.

[–] tenmileswide@alien.top 1 points 10 months ago

My attempt at this: https://arxiv.org/abs/2307.11760

I'm not super convinced that it helps, but it doesn't seem to hurt, so in it goes.

[–] DominicanGreg@alien.top 1 points 10 months ago (1 children)

So does this fit in 48gb vram or nah?

[–] Aaaaaaaaaeeeee@alien.top 1 points 10 months ago

Yes, the 3bpw model fits with 4k context.

[–] multiverse_fan@alien.top 1 points 10 months ago

Cool, sounds like a good model to download and store for the future, when I can get access to better hardware.

[–] greywhite_morty@alien.top 1 points 10 months ago

Use modal.com for serverless. Pay per call
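
Roughly the shape of that (a sketch against Modal's Python SDK; the decorator arguments and placeholder body are my assumptions - check their docs for how to actually attach model weights):

```python
# Sketch: per-call serverless inference on Modal. The body is a placeholder;
# real code would load the model there (e.g. the exllamav2 or llama.cpp
# snippets elsewhere in this thread) and cache it between calls.
import modal

app = modal.App("goliath-inference")

@app.function(gpu="A100", timeout=600)
def generate(prompt: str) -> str:
    return f"(model output for: {prompt!r})"  # placeholder body

@app.local_entrypoint()
def main():
    print(generate.remote("Write an opening scene in a rain-soaked city."))
```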

[–] productboy@alien.top 1 points 10 months ago

I assume I can load Goliath-120b on a RunPod instance?

[–] e-nigmaNL@alien.top 1 points 10 months ago

That first sentence: pure LocalLLaMA flex.

[–] Oninaig@alien.top 1 points 10 months ago

What preset are you using for chat? What temperature/topA/topP?

[–] Grimulkan@alien.top 1 points 9 months ago

Do you find any repetition problems at longer context lengths (closer to 4K)?