DrVonSinistro

joined 1 year ago
 

That's part of what it answered when I asked if it knew who created it. I was wondering if the researchers had baked their names into the data.

I realised that these inabilities are my main hopes for the future of LLMs!

[–] DrVonSinistro@alien.top 1 points 11 months ago

Because a model can be divine or crap depending on the settings, I think it's important that I specify what I use:

Deepseek 33b q8 gguf with the Min-p setting (I love it very much)

Source of my Min-p settings: Your settings are (probably) hurting your model - Why sampler settings matter : LocalLLaMA (reddit.com)
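For anyone who hasn't met Min-p before, here is a minimal sketch of the idea in plain Python: keep only the tokens whose probability is at least `min_p` times the probability of the most likely token, then renormalize and sample. The function names and the `min_p=0.1` value are illustrative assumptions, not the settings from the linked post and not the code of any particular backend. Real backends also combine this with temperature and other samplers, which this sketch leaves out.

```python
import math
import random


def min_p_filter(logits, min_p=0.1):
    """Return {token_index: probability} after min-p filtering.

    min_p=0.1 is an illustrative value, not a recommended setting.
    """
    # Softmax with the usual max-subtraction for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Cutoff scales with the confidence of the most likely token.
    threshold = min_p * max(probs)
    kept = {i: p for i, p in enumerate(probs) if p >= threshold}

    # Renormalize the survivors so they sum to 1 again.
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}


def sample_min_p(logits, min_p=0.1, rng=random):
    """Sample one token index from the min-p filtered distribution."""
    dist = min_p_filter(logits, min_p)
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical logits for a 4-token vocabulary.
example_logits = [5.0, 4.0, 0.0, -3.0]
kept = min_p_filter(example_logits, min_p=0.1)
token = sample_min_p(example_logits, min_p=0.1)
```

Because the cutoff follows the top token, the filter is strict when the model is confident and loose when the distribution is flat, which is the whole appeal of the setting.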

 

In text-generation-webui, the chat styles detect code badly. Compared to ChatGPT, with the colors and all, we're far from that sophistication. Koboldcpp is even worse.

Are there better chat styles we can import? Or extensions?

[–] DrVonSinistro@alien.top 1 points 11 months ago

I run 2x P40s with a 70b chat model and 8k ctx. I get 7-8 T/s and I'm very happy with that. Anything above 5 is awesome for me.

 

I have a Dell PowerEdge R730 with 2 CPUs and 2 1100 W PSUs, and I got the proper 8-pin EPS GPU cable, which I verified with a multimeter. It is plugged in the right way and I do get good power on the yellow wires. I installed every driver I found and did the registry shenanigans to put it in WDDM mode, etc., but it always fails with:

This device cannot start. (Code 10)
Insufficient system resources exist to complete the API.

That server is running Windows Server 2022. I used to run a GTX 1070 Ti in there (with the correct cable, which is different) and it worked A1.

Please HALP !

 

I just wanted to leave this out there: tonight I tested what happens when you try to run oobabooga with 8x GTX 1060 on a 13B model.

First of all, it works perfectly. No load on the CPU and 100% equal load on all GPUs.

But sadly, those USB cables for the risers don't have the bandwidth to make it a viable option.

I get 0.47 tokens/s.

So for anyone who Googles this shenanigan, here's the answer.

*EDIT

I'd add that CUDA compute is shared equally across the cards, but the VRAM usage is not. A LOT of VRAM is wasted in the process of sending data to the other cards for compute.