panchovix

[–] panchovix@alien.top 1 points 9 months ago (1 children)

You can use alpha scaling to get more context; you'll lose a bit of perplexity (ppl) as you increase the context length. It's roughly 1.75 alpha for 1.5x context and 2.5 alpha for 2x context, if I'm not wrong. You can experiment freely since you're on the cloud.
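
For reference, a minimal sketch of how the alpha value maps to the RoPE base frequency, using the usual NTK-aware formula (which is what the alpha setting does in exllama, as far as I know). head_dim=128 matches Llama-2-70B; adjust if your model differs.

```python
# Minimal sketch: how an "alpha" value stretches the RoPE base frequency
# (NTK-aware scaling). head_dim=128 matches Llama-2-70B; adjust as needed.

def ntk_rope_base(alpha: float, base: float = 10000.0, head_dim: int = 128) -> float:
    """Return the adjusted RoPE base for a given alpha value."""
    return base * alpha ** (head_dim / (head_dim - 2))

for alpha in (1.0, 1.75, 2.5):
    print(f"alpha={alpha}: rope base ~ {ntk_rope_base(alpha):,.0f}")
```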

I guess you're trying the 4.85bpw one? A single 80GB GPU can fit a bit more context, but not that much more. Now, if it's 2x48GB, then you have more headroom.

[–] panchovix@alien.top 1 points 9 months ago

I've posted a link to the calibration dataset and the measurement file on the goliath-calrp quant page, in case you want to do another quant at different sizes.
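
Reusing the measurement file to quant at another size looks roughly like this with exllamav2's convert script; it's a sketch, the flag names are from memory, and the paths and bitrate are placeholders, so check the exllamav2 repo before running it.

```python
# Sketch: re-quantizing at a different bitrate while reusing an existing
# measurement.json, so the (slow) measurement pass is skipped.
# Flag names are from memory of exllamav2's convert.py -- verify against the repo.
import subprocess

subprocess.run([
    "python", "convert.py",
    "-i", "/models/goliath-120b",          # source fp16 model -- placeholder path
    "-o", "/tmp/exl2-work",                # working directory
    "-cf", "/models/goliath-120b-4.85bpw", # output directory for the finished quant
    "-c", "calibration_rp.parquet",        # the posted calibration dataset
    "-m", "measurement.json",              # reuse the posted measurement
    "-b", "4.85",                          # target bits per weight
], check=True)
```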

[–] panchovix@alien.top 1 points 9 months ago

Hi there, nice work with Venus. For your next version and exl2 quants, you may want to use the calibration dataset from https://huggingface.co/Panchovix/goliath-120b-exl2-rpcal

(It's linked in the description.)

I checked the one you used first, and it's basically the same dataset, but without any fixes or formatting (so it has weird symbols, etc.).
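
If you want to sanity-check a calibration .parquet for leftover artifacts before quantizing, something like this works. It assumes the data sits in a single "text" column, which is an assumption; adjust the column name to match the file.

```python
# Sketch: inspect a calibration .parquet for encoding artifacts / weird symbols
# before feeding it to the quantizer. Column name "text" is an assumption.
import pandas as pd

df = pd.read_parquet("calibration_rp.parquet")
print(df.head())

# Flag rows containing the Unicode replacement char or low control characters.
suspicious = df[df["text"].str.contains(r"[�\x00-\x08]", regex=True, na=False)]
print(f"{len(suspicious)} rows with suspicious characters")
```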

[–] panchovix@alien.top 1 points 9 months ago (1 children)

Venus has 139 layers instead of Goliath's 137, so it weighs a bit more.
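
You can check the layer count straight from each model's config.json; a quick sketch (folder paths are placeholders, and the field name follows the standard Llama config):

```python
# Sketch: compare layer counts from the models' config.json files.
import json

for name in ("goliath-120b", "venus-120b"):      # local folders -- placeholder names
    with open(f"/models/{name}/config.json") as f:
        cfg = json.load(f)
    print(name, cfg["num_hidden_layers"])
```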

[–] panchovix@alien.top 1 points 9 months ago (1 children)

Great post, glad you enjoyed both of my Goliath quants :)

[–] panchovix@alien.top 1 points 9 months ago

Models in ooba without "exl" in the folder name will default to the transformers loader, so that may be why he got that.
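
Illustrative only (not text-generation-webui's actual code): the loader auto-detection is essentially a name-based guess along these lines, which is why the folder name matters.

```python
# Illustrative sketch of name-based loader guessing -- NOT ooba's real
# implementation, just the idea behind why folder names matter.
def guess_loader(folder_name: str) -> str:
    name = folder_name.lower()
    if "exl2" in name or "exl" in name:
        return "ExLlamav2"
    if "gguf" in name:
        return "llama.cpp"
    if "gptq" in name:
        return "AutoGPTQ"
    return "Transformers"  # fallback when nothing in the name matches

print(guess_loader("goliath-120b-exl2-rpcal"))  # ExLlamav2
print(guess_loader("goliath-120b"))             # Transformers
```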

[–] panchovix@alien.top 1 points 10 months ago

Try FP16, or 8-bit at most; a 13B model probably suffers too much at 4 bits.

[–] panchovix@alien.top 1 points 10 months ago

It will work, but you will be limited to the 1070's speed if using all 3 GPUs.

[–] panchovix@alien.top 1 points 10 months ago

Thanks to the hard work of kingbri, Splice86 and turboderp, we have a new API server for LLMs built on the exllamav2 loader! It's in a very alpha state, so if you want to test it, expect things to change.
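
A quick test call might look roughly like this, assuming TabbyAPI exposes an OpenAI-style completions endpoint on localhost; the route, port and auth header here are assumptions, so check the repo's docs for the real values.

```python
# Rough test call against a local TabbyAPI instance. Endpoint, port and the
# x-api-key header are assumptions -- check TabbyAPI's docs for the real values.
import requests

resp = requests.post(
    "http://127.0.0.1:5000/v1/completions",
    headers={"x-api-key": "your-key-here"},   # placeholder key
    json={"prompt": "Hello, ", "max_tokens": 32, "temperature": 0.7},
    timeout=60,
)
print(resp.json())
```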

TabbyAPI also works with SillyTavern! With some extra configuration, it can be hooked up as well.

As a reminder, exllamav2 recently added mirostat, TFS and min-p sampling, so if you were using exllama_hf/exllamav2_hf in ooba just for those samplers, those HF wrapper loaders aren't needed anymore.
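
For reference, setting those samplers natively in exllamav2 looks roughly like this; attribute names are from memory of ExLlamaV2Sampler.Settings, so verify against the repo.

```python
# Sketch: native exllamav2 sampler settings (no HF wrapper needed).
# Attribute names are from memory -- check exllamav2's ExLlamaV2Sampler.Settings.
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.min_p = 0.05       # min-p sampling
settings.tfs = 0.95         # tail-free sampling
settings.mirostat = True    # mirostat
settings.mirostat_tau = 5.0
settings.mirostat_eta = 0.1
```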

Enjoy!

[–] panchovix@alien.top 1 points 10 months ago (4 children)

The major reason I use exl2 is speed: on 2x4090 I get 15-20 t/s at 70B depending on the quant size, while with GGUF I get 4-5 t/s at most.

When using 3 GPUs (2x4090 + 1x3090), it's 11-12 t/s at 6.55bpw, versus GGUF Q6_K, which runs at 2-3 t/s.

Though I agree with you: for model comparisons and such, you need deterministic results and the best possible quality.

If you can at some point, try 70B at 6bpw or more; IMO it's pretty consistent and doesn't have the issues that 5bpw/5-bit does.

The performance hit on multi-GPU systems is too much with GGUF. If the speed reaches the same level in the future, I guess I would use it most of the time.

[–] panchovix@alien.top 1 points 10 months ago

Great work!

Will upload some exl2 quants in about 4-5 hours here: https://huggingface.co/Panchovix/opus-v0-70b-exl2 (thinking 2.5, 4.65 and 6bpw for now; I use the latter).

Also uploaded a safetensors conversion here, if you don't mind: https://huggingface.co/Panchovix/opus-v0-70b-safetensors

If you don't want the safetensors up, I can remove it.
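
For reference, one common way to do a safetensors conversion is just reloading the checkpoint and re-saving it with safe serialization; the paths below are placeholders, and it needs enough system RAM to hold the full model.

```python
# Sketch: convert a pytorch_model-*.bin checkpoint to safetensors by reloading
# and re-saving with safe serialization. Paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "/models/opus-v0-70b"
dst = "/models/opus-v0-70b-safetensors"

model = AutoModelForCausalLM.from_pretrained(
    src, torch_dtype=torch.float16, low_cpu_mem_usage=True
)
model.save_pretrained(dst, safe_serialization=True)

AutoTokenizer.from_pretrained(src).save_pretrained(dst)
```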

[–] panchovix@alien.top 1 points 10 months ago (1 children)

I tested 4K context and it worked fine at 4.5bpw. The max will probably be about 6K. I didn't use the 8-bit cache.

That said, 4.5bpw is kind of overkill; ~4.12bpw is about the same as 4-bit 128g GPTQ, and that would let you use a lot more context.
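
If you want to try the 8-bit cache to squeeze in more context, loading in exllamav2 looks roughly like this; class and attribute names are from memory, and the path is a placeholder, so check the exllamav2 repo.

```python
# Sketch: load an exl2 quant with the 8-bit KV cache and an extended context.
# Class/attribute names are from memory of exllamav2 -- verify against the repo.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit

config = ExLlamaV2Config()
config.model_dir = "/models/goliath-120b-4.5bpw"  # placeholder path
config.prepare()
config.max_seq_len = 6144        # ~6K context
config.scale_alpha_value = 1.75  # NTK alpha scaling, as discussed above

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)  # 8-bit cache roughly halves KV memory
model.load_autosplit(cache)                    # split weights across available GPUs
```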
