Desm0nt

joined 10 months ago
[–] Desm0nt@alien.top 1 points 9 months ago

Hm. I just load the GGUF Yi-34B-Chat Q4_K_M in oobabooga via llama.cpp with default params and 8k context and it just works like a charm. Better (more lively language) than any 70B from OpenRouter (my local machine can't handle a 70B).
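For reference, the equivalent standalone setup (outside the oobabooga UI) would look roughly like this with llama-cpp-python, which wraps the same llama.cpp backend. The file name and layer count here are assumptions; adjust them to your download and VRAM:

```python
# Minimal sketch, not the oobabooga UI itself: loading the same GGUF directly
# with llama-cpp-python (the llama.cpp backend oobabooga uses under the hood).
from llama_cpp import Llama

llm = Llama(
    model_path="models/yi-34b-chat.Q4_K_M.gguf",  # assumed filename
    n_ctx=8192,        # 8k context, as in the comment
    n_gpu_layers=20,   # offload whatever fits in VRAM; 0 = CPU only
)

out = llm("Write a short scene in a lively, natural style.", max_tokens=128)
print(out["choices"][0]["text"])
```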

[–] Desm0nt@alien.top 1 points 9 months ago

By loading a 20B Q4_K_M model (50/65 layers offloaded seems to be the fastest from my tests) I currently get around 0.65 t/s with a low context size of 500 or less, and about 0.45 t/s nearing the max 4096 context.

Sounds suspicious. I use Yi-Chat-34B-Q4_K_M on an old 1080 Ti (11 GB VRAM) with 20 layers offloaded and get around 2.5 t/s. But that's on a Threadripper 2920 with 4-channel RAM (also 3200). Still, I don't think it should make that much difference. Of course, with 4 channels I have twice your RAM bandwidth, but I'm running a 34B and only loading 20 layers on the GPU...
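Rough napkin math (an assumption-laden sketch, not a benchmark) shows why 0.65 t/s looks low: for memory-bound CPU offload, tokens/s is roughly RAM bandwidth divided by the bytes of weights left on the CPU per token. Model sizes, layer counts, and effective bandwidths below are assumed ballpark figures:

```python
# Crude upper-bound estimate for CPU-bound token generation with partial GPU offload.
def est_tokens_per_s(model_gb, layers_total, layers_on_gpu, ram_gb_per_s):
    cpu_fraction = (layers_total - layers_on_gpu) / layers_total  # share of weights on CPU
    cpu_bytes_gb = model_gb * cpu_fraction                        # GB read from RAM per token
    return ram_gb_per_s / cpu_bytes_gb

# ~20 GB Q4_K_M 34B, ~60 layers, 20 on GPU, 4-channel DDR4-3200 (~80 GB/s effective, assumed)
print(est_tokens_per_s(20, 60, 20, 80))   # ~6 t/s ceiling; 2.5 t/s observed is plausible
# ~12 GB 20B quant, 65 layers, 50 on GPU, 2-channel DDR4-3200 (~40 GB/s effective, assumed)
print(est_tokens_per_s(12, 65, 50, 40))   # ~14 t/s ceiling; 0.65 t/s observed is far below it
```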


Hi. I have LLaMA2-13B-Tiefighter-exl2_5bpw and (probably the same model as) LLaMA2-13B-Tiefighter.Q5_K_M.

I run them on a 1080 Ti and an old Threadripper with 64 GB of 4-channel DDR4-3466. I use oobabooga (for GGUF and exl2) and LM Studio. I'm on Nvidia driver 531.68 (so I get OOM instead of RAM swapping when VRAM overflows).

1st question: I've read that exl2 consumes less VRAM and runs faster than GGUF. I tried loading it in oobabooga (ExLlamaV2_HF) and it fits in my 11 GB of VRAM (consumes ~10 GB), but it only produces 2.5 t/s, while the GGUF (llama.cpp backend) with 35 layers offloaded to the GPU gives 4.5 t/s. Why? Am I missing some important setting?

2nd question: In LM Studio (llama.cpp backend?) with the same settings and the same 35 layers offloaded to the GPU I get only 2.3 t/s. Why? Same backend, same GGUF, same sampling and context settings.
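One way to make the comparison apples-to-apples is to time generation the same way against both backends. This sketch assumes both oobabooga's llama.cpp loader and LM Studio are serving their OpenAI-compatible local APIs (both can); the ports and prompt below are assumptions, not your actual setup:

```python
# Measure tokens/s identically across backends via their OpenAI-compatible endpoints.
import time
import requests

def measure_tps(base_url, prompt, max_tokens=200):
    t0 = time.time()
    r = requests.post(
        f"{base_url}/v1/completions",
        json={"prompt": prompt, "max_tokens": max_tokens, "temperature": 0.7},
        timeout=600,
    )
    elapsed = time.time() - t0
    n_generated = r.json()["usage"]["completion_tokens"]
    return n_generated / elapsed

print(measure_tps("http://localhost:5000", "Once upon a time"))   # oobabooga API (assumed port)
print(measure_tps("http://localhost:1234", "Once upon a time"))   # LM Studio server (assumed port)
```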

[–] Desm0nt@alien.top 1 points 10 months ago

Answers like this ("I can do no harm") to questions like this clearly show how dumb LLMs really are and how far away we are from AGI. They basically have no idea what they are being asked or what their own answer means. Just a big fancy T9 =)

In light of this, the drama at OpenAI, with their arguments about the danger of AI capable of destroying humanity, looks especially funny.

[–] Desm0nt@alien.top 1 points 10 months ago

Mind if we use this as a default chain response on Anthropic's twitter account along with that "we can't write stories about minorities writing about their experiences being oppressed" response?

Now tell the model that the process had child processes and ask its opinion about it =)

[–] Desm0nt@alien.top 1 points 10 months ago (1 children)

Is it still only 4k context size?

I hope one day someone somehow finds a way to extend Tiefighter's context to at least 8k.
Because it's the perfect model for real-time RP and stories even on weak PCs. It's smarter than all the 7B and 13B models and smarter than many 30B models, but the modest 4k-token context is eaten up faster than you can enjoy its potential...
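One thing people commonly try for stretching a 4k model toward 8k is RoPE scaling (an NTK-style bump of rope_freq_base, or linear rope_freq_scale). This is only a hedged sketch with llama-cpp-python, using an assumed file path and an assumed tuning value; quality beyond the trained 4k is not guaranteed, so treat it as an experiment rather than a fix:

```python
# Experimental RoPE-scaling sketch for pushing a 4k-trained model to an 8k window.
from llama_cpp import Llama

llm = Llama(
    model_path="models/LLaMA2-13B-Tiefighter.Q5_K_M.gguf",  # assumed path
    n_ctx=8192,
    n_gpu_layers=35,
    rope_freq_base=20000,   # NTK-ish bump from the Llama 2 default of 10000 (tune this)
    # rope_freq_scale=0.5,  # alternative: linear scaling (often degrades quality more)
)
```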