SomeOddCodeGuy

joined 10 months ago
[–] SomeOddCodeGuy@alien.top 1 points 9 months ago

Wow, I've never seen an fp16 gguf before. Holy crap, I wish there were more of those out there; I'd love to get my hands on some for 70b models or the like. I didn't realize unquantized gguf was even an option.

[–] SomeOddCodeGuy@alien.top 1 points 9 months ago

I have a mac studio as my main inference machine.

My opinion? RAM and bandwidth > all. Personally, I would pick A, as it's the perfect in-between. At 64GB of RAM you should have around 48GB or so of usable VRAM without any kernel/sudo shenanigans (I'm excited to try some of the recommendations folks have given here lately to change that), and you get the 400GB/s bandwidth.

My Mac Studio has 800GB/s bandwidth, and I can run 70b q8 models... but at full context, it requires a bit of patience. I imagine a 70b would be beyond frustrating at 300GB/s bandwidth. While the 96GB model could run a 70b q8... I don't really know that I'd want to, if I'm being honest.

My personal view is that on a laptop like that, I'd want to max out at the 34b models, as those are very powerful and would still run at a decent speed on the laptop's bandwidth. So if all I was planning to run was 34b models, a 34b q8 with 16k context would fit cleanly into 48GB (rough math below), and I'd gain an extra 100GB/s of bandwidth for the choice.
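For anyone who wants to sanity-check that, here's the kind of back-of-envelope math I'm doing. The layer/head counts are approximate Yi-34B-style numbers and the ~8.5 bits/weight is a rough q8_0 figure, so treat the result as an estimate rather than gospel:

```python
# Rough back-of-envelope VRAM estimate for a 34b q8 model at 16k context.
# Assumptions: approximate Yi-34B-style architecture (GQA with 8 KV heads),
# fp16 KV cache, and roughly 8.5 bits per weight for q8_0.
params = 34.4e9
bits_per_weight = 8.5
n_layers = 60
n_kv_heads = 8
head_dim = 128
context = 16384
kv_bytes = 2  # fp16 KV cache entries

weights_gb = params * bits_per_weight / 8 / 1e9
# K and V caches: 2 tensors x layers x kv_heads x head_dim x bytes x tokens
kv_cache_gb = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context / 1e9

print(f"weights ~{weights_gb:.1f} GB, kv cache ~{kv_cache_gb:.1f} GB, "
      f"total ~{weights_gb + kv_cache_gb:.1f} GB")
# Roughly 36.5 GB of weights + ~4 GB of KV cache, so around 40-41 GB total.
# llama.cpp also needs a compute buffer on top of that, hence the headroom
# you want inside a 48GB VRAM budget.
```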

[–] SomeOddCodeGuy@alien.top 1 points 9 months ago

I've seen a couple of YaRN models, but I honestly have no idea how to use them lol. Same with the Mistral models; they always want to load up at 32k tokens, but then the model's coherence just dies after 5k. I can't find really clear instructions on what it takes to get the maximum context out of either, so I tend to just ignore both at high context.
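For what it's worth, llama.cpp does expose YaRN-style rope scaling flags. This is a sketch, not a recipe: the flag names are from builds around the time YaRN support landed, and the model filename is just a placeholder, so check --help on your build:

```bash
# Placeholder model name; -c is the context you want, --yarn-orig-ctx is the
# context the base model was actually trained at (the finetune's card should
# say what to use). Flag names may differ on older llama.cpp builds.
./main -m some-yarn-model.Q8_0.gguf \
    -c 32768 \
    --rope-scaling yarn \
    --yarn-orig-ctx 4096
```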

[–] SomeOddCodeGuy@alien.top 1 points 9 months ago (1 children)

Awesome! I think I remember us talking about this at some point, but I didn't have the courage to try it on my own machine. You're the first person I've seen actually do the deed, and now I want to as well =D The 192GB Mac Studio stops at 147GB of usable VRAM by default... I also run headless, so I can't fathom that this stupid brick really needs 45GB of RAM reserved just to do normal stuff lol.

I am inspired. I'll give it a go this weekend! Great work =D
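For anyone else curious, the trick as I understand it is bumping the GPU wired-memory limit with sysctl. The key name here is the macOS Sonoma one (older versions reportedly used debug.iogpu.wired_limit instead), the value is in MB, and it resets on reboot, so it's safe to experiment with:

```bash
# Raise the Metal wired-memory cap to ~170GB on a 192GB Mac Studio (Sonoma).
# Does not persist across reboots.
sudo sysctl iogpu.wired_limit_mb=170000
```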

[–] SomeOddCodeGuy@alien.top 1 points 9 months ago

This little bit right here is very important if you want to work with an AI regularly:

Specifying a role when prompting can effectively improve the performance of LLMs by at least 20% compared with the control prompt, where no context is given. Such a result suggests that adding a social role in the prompt could benefit LLMs by a large margin.

I remembered seeing an article about this a few months back, which led to my working on an Assistant prompt, and it's been hugely helpful.

I imagine this comes down to how generative AI works under the hood. It ingested tons of books, tutorials, posts, etc. from people who identified as certain things. Telling it to also identify as one of those things could surface a lot of information it wouldn't otherwise be drawing on.

I always recommend that folks set up roles for their AI when working with it, because the results I've personally seen have been miles better when you do.
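As a made-up example of what that looks like in practice (the wording here is mine, not from the paper), the only change is prepending a persona to the task:

```python
# Same task, with and without a role; the role version is the kind of prompt
# that the paper above says performs meaningfully better.
control_prompt = "Summarize the following incident report in three bullet points."

role_prompt = (
    "You are a senior site reliability engineer writing a post-incident summary "
    "for your team. Summarize the following incident report in three bullet points."
)
```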

[–] SomeOddCodeGuy@alien.top 1 points 9 months ago (3 children)

Right. This part right here is very suspicious to me, and I'm taking their claims with a grain of salt.

No! The model is not going to be available publically. APOLOGIES. The model like this can be misused very easily. The model is only going to be provided to already selected organisations.

[–] SomeOddCodeGuy@alien.top 1 points 9 months ago (4 children)

The results for the 120b continue to absolutely floor me. Not only is it performing that well at 3bpw, but it's an exl2 as well, which your own tests have shown perform worse than gguf. So imagine what a q4 gguf does if a q3 equivalent exl2 can do this.

[–] SomeOddCodeGuy@alien.top 1 points 9 months ago (1 children)

I imagine it would take a lot of work, but I can't imagine it's impossible. Probably just not something folks are working on.

I don't particularly mind too much, because the quality difference between exl2 and gguf is hard for me to look past. Just last night I was trying to run this NeuralChat 7b everyone is talking about on my Windows machine in 8bpw exl2, and it was SUPER fast, but the model was so easily confused. Before giving up on it, I grabbed the q8 gguf and swapped to it (with no other changes) and suddenly saw why everyone was saying that model is so good.

I don't mind speed loss if I get quality, but I can't handle quality loss to get speed. So for now, I really don't mind only using gguf, because it's perfect for me.

[–] SomeOddCodeGuy@alien.top 1 points 9 months ago (4 children)

M2 Ultra user here. I threw some numbers up for token counts: https://www.reddit.com/r/LocalLLaMA/comments/183bqei/comment/kaqf2j0/?context=3

Does a big memory let you increase the context length with smaller models where the parameters don't fill the memory?

With the 147GB of VRAM I have available, I'm pretty sure I could use all 200k tokens available in a Yi 34b model, but I'd be waiting half an hour for a result. I've done up to 50k in CodeLlama, and it took a solid 10 minutes to get a response.

The M2 Ultra's big draw is its big RAM; it's not worth it unless you get the 128GB model or higher. You have to understand that the speed of the M2 Ultra doesn't remotely compare to something like a 4090; CUDA cards are gonna leave us in the dust.

Another thing to consider is that we can only use ggufs via Llamacpp; there's no support for anything else. In that regard, I've seen people put together 3x or more Tesla P40 builds that have the exact same limitation (can only use Llamacpp) but cost half the price or less.

I chose the M2 Ultra because it was easy. Big VRAM, and it took me less than 30 minutes from the moment I got the box to be chatting with a 70b q8 on it. But if speed or price is a bigger consideration than the level of effort to set up? In that case the M2 Ultra would not be the answer.

[–] SomeOddCodeGuy@alien.top 1 points 9 months ago

TheBloke just quantized his newest version of this model. I'm downloading it right now =D

But I'm with you- Capybara-Tess-Yi is amazing. I don't RP, so I can't speak to that, but for a conversational model that does basic ChatGPT tasks? It's amazing.

[–] SomeOddCodeGuy@alien.top 1 points 10 months ago

Hmm... I'm afraid I personally am not sure of the answer to that, though I do recommend checking out these tests, as Wolfram has the models work back and forth between German and English in them.

https://www.reddit.com/r/LocalLLaMA/comments/17vcr9d/llm_comparisontest_2x_34b_yi_dolphin_nous/

[–] SomeOddCodeGuy@alien.top 1 points 10 months ago (2 children)

The most multi-lingual capable model I'm aware of is OpenBuddy 70b. I use it as a foreign language tutor, and it does an ok job. I constantly check it against google translate, and it hasn't let me down yet, but ymmv. I don't use it a ton.

I think the problem is that, in general, technology hasn't been the best at foreign language translations. Google Translate is SOTA in that realm, and it's not perfect. I'm not sure I'd trust it for doing this in a real production sense, but I do trust it enough to help me learn just enough to get by.

So with that said, you could likely get pretty far by mixing any LLM with a handful of tools. For example, I believe SillyTavern has a Google Translate module built in, so you could use Google to do the translations. Then, having multiple speech-to-text/text-to-speech modules, one for each language, might give you that flexibility of input and output.

Essentially, I would imagine that 90% of the work will be developing the tooling around any decent LLM, regardless of its native language abilities, and letting external tools handle the translation. I could be wrong, though.
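Something like this is the shape I have in mind. The translate_text and run_llm helpers are purely hypothetical placeholders for whatever translation API and local LLM backend you end up wiring in:

```python
# Hypothetical pipeline: translate user input to English, run the LLM,
# translate the reply back into the user's language.

def translate_text(text: str, source: str, target: str) -> str:
    """Placeholder for a call to Google Translate or a similar service."""
    raise NotImplementedError("plug in your translation API here")

def run_llm(prompt: str) -> str:
    """Placeholder for a call to your local model (llama.cpp, Oobabooga, etc.)."""
    raise NotImplementedError("plug in your LLM backend here")

def chat_in_language(user_text: str, user_lang: str) -> str:
    english_in = translate_text(user_text, source=user_lang, target="en")
    english_out = run_llm(english_in)
    return translate_text(english_out, source="en", target=user_lang)
```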

 

UPDATE: I forgot to mention that I used q8 of all models.

So I've had a lot of interest in the non-code 34b finetunes, whether they're CodeLlama base or Yi base. From the old Samantha-34b and Synthia-34b to the new Dolphin-Yi and Nous-Capybara 34b models, I've been excited for each one, because they fill a gap that needs filling.

My problem is that I can't seem to wrangle these fine-tunes into working right for me. I use Oobabooga (text-gen-ui), and always try to choose the correct instruction template either specified on the card or on TheBloke's page, but the models never seem to be happy with the result, and either get confused very easily or output odd gibberish from time to time.

For both Yi models, I am using the newest ggufs that TheBloke put out... yesterday? Give or take. But I've tried the past 2-3 different ggufs for the same model he's updated with when they came out.

The best luck I've had with the new Yi models was doing just plain chat mode with my AI Assistant's character prompt as the only thing being sent in, but even then, both Yi fine-tunes that I tried eventually broke down after a few thousand tokens of context.

For example, after a bit of chattering with the models I tried a very simple little test on both: "Please write me two paragraphs. The content of the paragraphs is irrelevant, just please write two separate paragraphs about anything at all." I did that because previous versions of these two struggled to make a new line, so I just wanted to see what would happen. This absolutely confused the models, and the results were wild.

Has anyone had luck getting them to work? They appear to have so much potential, especially Nous Capybara which went toe to toe with GPT-4 in this benchmark, but I'm failing miserably at unlocking its full potential lol. If you have gotten it to work, could you please specify what settings/instructions you're using?
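For reference, these are the formats I've been attempting, pulled (as best I can tell) from the model cards, in case I'm getting one of them wrong:

```
Nous-Capybara-34b (Vicuna-style):
USER: {prompt} ASSISTANT:

Dolphin-Yi (ChatML):
<|im_start|>system
{system prompt}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```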

 
 

So one thing that had really bothered me was that recent arXiv paper claiming that, despite GPT 3 being 175B and GPT 4 being around 1.7T, 3.5 Turbo was somehow 20b.

This had been on my mind for the past couple of days because it just made no sense to me, so this evening I went to check out the paper again and noticed that I could not download the PDF or PostScript. Then I saw this update comment on the arXiv page, added yesterday:

Contains inappropriately sourced conjecture of OpenAI's ChatGPT parameter count from this http URL, a citation which was omitted. The authors do not have direct knowledge or verification of this information, and relied solely on this article, which may lead to public confusion

That link leads to a Forbes article, from before GPT 4 even released, that claims that ChatGPT in general is 20b parameters.

It seems like the chatbot application was one of the most popular ones, so ChatGPT came out first. ChatGPT is not just smaller (20 billion vs. 175 billion parameters) and therefore faster than GPT-3, but it is also more accurate than GPT-3 when solving conversational tasks—a perfect business case for a lower cost/better quality AI product.

So it would appear that they sourced that knowledge from Forbes, and after everyone got really confused they realized that it might not actually be correct, and the paper got modified.

So, before some wild urban legend forms that GPT 3.5 is 20b, just thought I'd mention that lol.

 

I keep seeing people posting about how the new Phind is the most amazing thing on the planet, and I kept thinking "We already have Phind... see, I have the gguf right here!"

I finally looked at Phind's blog post on their newest model, and it says that their current model is v7.

https://www.phind.com/blog/phind-model-beats-gpt4-fast

:O Huggingface only goes up to v2.

I can't tell if Phind is a proprietary model that just happened to give us an older version, if there will be newer versions coming out, or what. Does anyone happen to know?
