panchovix

[–] panchovix@alien.top 1 points 9 months ago (1 children)

You can use alpha scaling to get more context; you'll lose a bit of perplexity (ppl) as you increase the context length. It's roughly 1.75 alpha for 1.5x context and 2.5 alpha for 2x context, if I'm not wrong. You can experiment freely since you're on the cloud.
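
For reference, a minimal sketch of how the alpha value maps to the RoPE base frequency, using the usual NTK-aware formula (which is what the alpha setting does in exllama, as far as I know). head_dim=128 matches Llama-2-70B; adjust if your model differs.

```python
# Minimal sketch: how an "alpha" value stretches the RoPE base frequency
# (NTK-aware scaling). head_dim=128 matches Llama-2-70B; adjust as needed.

def ntk_rope_base(alpha: float, base: float = 10000.0, head_dim: int = 128) -> float:
    """Return the adjusted RoPE base for a given alpha value."""
    return base * alpha ** (head_dim / (head_dim - 2))

for alpha in (1.0, 1.75, 2.5):
    print(f"alpha={alpha}: rope base ~ {ntk_rope_base(alpha):,.0f}")
```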

I guess you're trying the 4.85bpw one? A single 80GB GPU can fit a bit more context, but not that much more. Now, if it's 2x48GB, then you have more headroom.

[–] panchovix@alien.top 1 points 9 months ago

I've posted a link to the calibration dataset and the measurement file on the goliath-calrp quant page, in case you want to do another quant at different sizes.
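
Reusing the measurement file to quant at another size looks roughly like this with exllamav2's convert script; it's a sketch, the flag names are from memory, and the paths and bitrate are placeholders, so check the exllamav2 repo before running it.

```python
# Sketch: re-quantizing at a different bitrate while reusing an existing
# measurement.json, so the (slow) measurement pass is skipped.
# Flag names are from memory of exllamav2's convert.py -- verify against the repo.
import subprocess

subprocess.run([
    "python", "convert.py",
    "-i", "/models/goliath-120b",          # source fp16 model -- placeholder path
    "-o", "/tmp/exl2-work",                # working directory
    "-cf", "/models/goliath-120b-4.85bpw", # output directory for the finished quant
    "-c", "calibration_rp.parquet",        # the posted calibration dataset
    "-m", "measurement.json",              # reuse the posted measurement
    "-b", "4.85",                          # target bits per weight
], check=True)
```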

[–] panchovix@alien.top 1 points 9 months ago

Hi there, nice work with Venus. For your next version and exl2 quants, you may want to use the calibration dataset from https://huggingface.co/Panchovix/goliath-120b-exl2-rpcal

(It's linked in the description.)

I checked the one you used first, and it's basically the same dataset, but without any fixes or formatting (so it has weird symbols, etc.).
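
If you want to sanity-check a calibration .parquet for leftover artifacts before quantizing, something like this works. It assumes the data sits in a single "text" column, which is an assumption; adjust the column name to match the file.

```python
# Sketch: inspect a calibration .parquet for encoding artifacts / weird symbols
# before feeding it to the quantizer. Column name "text" is an assumption.
import pandas as pd

df = pd.read_parquet("calibration_rp.parquet")
print(df.head())

# Flag rows containing the Unicode replacement char or low control characters.
suspicious = df[df["text"].str.contains(r"[�\x00-\x08]", regex=True, na=False)]
print(f"{len(suspicious)} rows with suspicious characters")
```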

[–] panchovix@alien.top 1 points 9 months ago (1 children)

Venus has 139 layers instead of Goliath's 137, so it weighs a bit more.
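
You can check the layer count straight from each model's config.json; a quick sketch (folder paths are placeholders, and the field name follows the standard Llama config):

```python
# Sketch: compare layer counts from the models' config.json files.
import json

for name in ("goliath-120b", "venus-120b"):      # local folders -- placeholder names
    with open(f"/models/{name}/config.json") as f:
        cfg = json.load(f)
    print(name, cfg["num_hidden_layers"])
```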

[–] panchovix@alien.top 1 points 9 months ago (1 children)

Great post, glad you enjoyed both of my Goliath quants :)

[–] panchovix@alien.top 1 points 9 months ago

Models in ooba without "exl" in the folder name will default to the transformers loader, so that may be why he got that.
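
Illustrative only (not text-generation-webui's actual code): the loader auto-detection is essentially a name-based guess along these lines, which is why the folder name matters.

```python
# Illustrative sketch of name-based loader guessing -- NOT ooba's real
# implementation, just the idea behind why folder names matter.
def guess_loader(folder_name: str) -> str:
    name = folder_name.lower()
    if "exl2" in name or "exl" in name:
        return "ExLlamav2"
    if "gguf" in name:
        return "llama.cpp"
    if "gptq" in name:
        return "AutoGPTQ"
    return "Transformers"  # fallback when nothing in the name matches

print(guess_loader("goliath-120b-exl2-rpcal"))  # ExLlamav2
print(guess_loader("goliath-120b"))             # Transformers
```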

[–] panchovix@alien.top 1 points 10 months ago

Try FP16, or 8-bit at most; a 13B model probably suffers too much at 4 bits.

[–] panchovix@alien.top 1 points 10 months ago

It will work, but you will be limited to the 1070's speed if using all 3 GPUs.

[–] panchovix@alien.top 1 points 10 months ago

Thanks to the hard work of kingbri, Splice86 and turboderp, we have a new API server for LLMs built on the exllamav2 loader! It's in a very alpha state, so if you want to test it, expect things to change.
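
A quick test call might look roughly like this, assuming TabbyAPI exposes an OpenAI-style completions endpoint on localhost; the route, port and auth header here are assumptions, so check the repo's docs for the real values.

```python
# Rough test call against a local TabbyAPI instance. Endpoint, port and the
# x-api-key header are assumptions -- check TabbyAPI's docs for the real values.
import requests

resp = requests.post(
    "http://127.0.0.1:5000/v1/completions",
    headers={"x-api-key": "your-key-here"},   # placeholder key
    json={"prompt": "Hello, ", "max_tokens": 32, "temperature": 0.7},
    timeout=60,
)
print(resp.json())
```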

TabbyAPI also works with SillyTavern! With some extra configuration, it can be hooked up as well.

As a reminder, exllamav2 recently added mirostat, TFS and min-p sampling, so if you were using exllama_hf/exllamav2_hf in ooba just for those samplers, those HF wrapper loaders aren't needed anymore.
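
For reference, setting those samplers natively in exllamav2 looks roughly like this; attribute names are from memory of ExLlamaV2Sampler.Settings, so verify against the repo.

```python
# Sketch: native exllamav2 sampler settings (no HF wrapper needed).
# Attribute names are from memory -- check exllamav2's ExLlamaV2Sampler.Settings.
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.min_p = 0.05       # min-p sampling
settings.tfs = 0.95         # tail-free sampling
settings.mirostat = True    # mirostat
settings.mirostat_tau = 5.0
settings.mirostat_eta = 0.1
```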

Enjoy!

[–] panchovix@alien.top 1 points 10 months ago (4 children)

The major reason I use exl2 is speed: on 2x4090 I get 15-20 t/s at 70B depending on the quant size, while with GGUF I get 4-5 t/s at most.

When using 3 GPUs (2x4090 + 1x3090), it's 11-12 t/s at 6.55bpw, versus GGUF Q6_K, which runs at 2-3 t/s.

Though I agree with you: for model comparisons and such, you need deterministic results and the best possible quality.

If you can at some point, try 70B at 6bpw or more; IMO it's pretty consistent and doesn't have the issues that 5bpw/5-bit does.

The performance hit on multi-GPU systems is too much with GGUF. If the speed reaches the same level in the future, I guess I would use it most of the time.

[–] panchovix@alien.top 1 points 10 months ago

Great work!

Will upload some exl2 quants in about 4-5 hours here: https://huggingface.co/Panchovix/opus-v0-70b-exl2 (thinking 2.5, 4.65 and 6bpw for now; I use the latter).

Also uploaded a safetensors conversion here, if you don't mind: https://huggingface.co/Panchovix/opus-v0-70b-safetensors

If you don't want the safetensors up, I can remove it.
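
For reference, one common way to do a safetensors conversion is just reloading the checkpoint and re-saving it with safe serialization; the paths below are placeholders, and it needs enough system RAM to hold the full model.

```python
# Sketch: convert a pytorch_model-*.bin checkpoint to safetensors by reloading
# and re-saving with safe serialization. Paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "/models/opus-v0-70b"
dst = "/models/opus-v0-70b-safetensors"

model = AutoModelForCausalLM.from_pretrained(
    src, torch_dtype=torch.float16, low_cpu_mem_usage=True
)
model.save_pretrained(dst, safe_serialization=True)

AutoTokenizer.from_pretrained(src).save_pretrained(dst)
```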

[–] panchovix@alien.top 1 points 10 months ago (1 children)

I tested 4K context and it worked fine at 4.5bpw. The max will probably be about 6K. I didn't use the 8-bit cache.

That said, 4.5bpw is kind of overkill; ~4.12bpw is about the same as 4-bit 128g GPTQ, and that would let you use a lot more context.
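
If you want to try the 8-bit cache to squeeze in more context, loading in exllamav2 looks roughly like this; class and attribute names are from memory, and the path is a placeholder, so check the exllamav2 repo.

```python
# Sketch: load an exl2 quant with the 8-bit KV cache and an extended context.
# Class/attribute names are from memory of exllamav2 -- verify against the repo.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit

config = ExLlamaV2Config()
config.model_dir = "/models/goliath-120b-4.5bpw"  # placeholder path
config.prepare()
config.max_seq_len = 6144        # ~6K context
config.scale_alpha_value = 1.75  # NTK alpha scaling, as discussed above

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)  # 8-bit cache roughly halves KV memory
model.load_autosplit(cache)                    # split weights across available GPUs
```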
