candre23

joined 1 year ago
[–] candre23@alien.top 1 points 11 months ago (2 children)

we have expanded the context window length to 32K

Kinda buried the lede here. This is far and away the biggest feature of this model. Here's hoping it's actually decent as well!

[–] candre23@alien.top 1 points 11 months ago

It's a new foundational model, so some teething pains are to be expected. Yi is heavily based on (directly copied, for the most part) llama2, but there are just enough differences in the training parameters that default llama2 settings don't get good results. KCPP has already addressed the rope scaling, and I'm sure it's only a matter of time before the other issues are hashed out.
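To make the rope difference concrete, here's a minimal sketch of loading a Yi GGUF with explicit rope settings instead of llama2 defaults. It uses the llama-cpp-python bindings (same backend family as KCPP, not KCPP itself), the filename is a placeholder, and the rope base value is my assumption from Yi's published config, so verify against the actual model card:

```python
# Minimal sketch (llama-cpp-python, not KCPP): load a Yi GGUF with explicit
# rope settings rather than relying on llama2 defaults.
# The filename is a placeholder and rope_freq_base is assumed from Yi's config.
from llama_cpp import Llama

llm = Llama(
    model_path="yi-34b.Q4_K_M.gguf",   # placeholder local file
    n_ctx=4096,
    rope_freq_base=5_000_000.0,        # Yi's rope theta; llama2 defaults to 10,000
    rope_freq_scale=1.0,               # no extra linear scaling on top
)

print(llm("Q: What is rope scaling?\nA:", max_tokens=64)["choices"][0]["text"])
```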

[–] candre23@alien.top 1 points 11 months ago (1 children)

70b models will be extremely slow on pure CPU, but you're welcome to try. There's no point in looking on "torrent sites" for LLMs - literally everything is hosted on huggingface.
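If you do want a local copy of something, the normal route is pulling it straight from huggingface. A minimal sketch using the huggingface_hub client, with the repo and filename as placeholders rather than a specific recommendation:

```python
# Minimal sketch: download one quantized GGUF file from Hugging Face.
# repo_id and filename are placeholders, not a specific model recommendation.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="SomeUser/Some-70B-GGUF",   # placeholder repo
    filename="some-70b.Q4_K_M.gguf",    # placeholder quant file
    local_dir="./models",
)
print("saved to", path)
```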

[–] candre23@alien.top 1 points 11 months ago (3 children)

Yes, your GPU is too old to be useful for offloading, but you could at least still use it to accelerate prompt processing.

With your hardware, you want to use koboldCPP. It uses models in GGML/GGUF format. You should have no issue running models up to 120b with that much RAM, but large models will be incredibly slow (like 10+ minutes per response) running on CPU only. I'd recommend sticking to 13b models unless you're extremely patient.
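Once koboldCPP is up with a GGUF loaded, you can also hit it from a script over its local API. A rough sketch, assuming the default port (5001) and the standard KoboldAI generate endpoint; check the koboldCPP docs if either differs on your setup:

```python
# Rough sketch: query a locally running koboldCPP instance.
# Assumes the default port 5001 and the KoboldAI /api/v1/generate endpoint.
import requests

resp = requests.post(
    "http://localhost:5001/api/v1/generate",
    json={"prompt": "Explain GGUF in one sentence.", "max_length": 80},
    timeout=600,  # CPU-only generation on a large model can take minutes
)
print(resp.json()["results"][0]["text"])
```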

[–] candre23@alien.top 1 points 11 months ago (3 children)

All yi models are extremely picky when it comes to things like prompt format, end string, and rope parameters. You'll get gibberish from any of them unless you get everything set up just right, at which point they perform very well.
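For a concrete example of that pickiness: the Yi chat finetunes expect a ChatML-style template and stop on `<|im_end|>` (at least per the chat variants' model cards; confirm against whichever finetune you're actually running). A rough sketch of building a prompt that way:

```python
# Rough sketch of a ChatML-style prompt, which the Yi chat finetunes expect.
# Confirm the exact template and end string against the model card you're using.
def build_yi_prompt(system: str, user: str) -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_yi_prompt("You are a helpful assistant.", "What is rope scaling?")
stop_strings = ["<|im_end|>"]  # without the right end string you get run-on gibberish
```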

[–] candre23@alien.top 1 points 11 months ago (1 children)

It's adorable that you think any 13b model is anywhere close to a 70b llama2 model.

[–] candre23@alien.top 1 points 11 months ago

Anywhere from 1 to several hundred GB. Quantized (compressed), the most popular models are 8-40 GB each. LoRAs are a lot smaller, but full models take up a lot of space.
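The back-of-the-envelope math is just parameter count times bits per weight. A rough sketch, with approximate bits-per-weight figures for common GGUF quant levels (treat the numbers as ballpark, not exact):

```python
# Ballpark file size: parameters (billions) * bits-per-weight / 8 bits per byte.
# The bits-per-weight values are approximations for common GGUF quant levels.
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # billions of bytes ~= GB

for params, quant, bpw in [(13, "Q4_K_M", 4.8), (70, "Q4_K_M", 4.8), (70, "Q8_0", 8.5)]:
    print(f"{params}b @ {quant}: ~{approx_size_gb(params, bpw):.0f} GB")
# 13b @ Q4_K_M: ~8 GB, 70b @ Q4_K_M: ~42 GB, 70b @ Q8_0: ~74 GB
```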

[–] candre23@alien.top 1 points 11 months ago

No idea why you would need ~1800GB vram.

Homeboy's waifu is gonna be THICC.

[–] candre23@alien.top 1 points 11 months ago

Extremely effective and definitely the quietest option, but requires a lot of space: https://www.printables.com/model/484282-nvidia-tesla-p40-120mm-blower-fan-adapter-straight

[–] candre23@alien.top 1 points 11 months ago

The ONLY Pascal card worth bothering with is the P40. It's not fast, but it's the cheapest way to get a whole bunch of usable VRAM. Nothing else from that generation is worth the effort.

[–] candre23@alien.top 1 points 11 months ago

Is this the beginning of the end of CUDA dominance?

Not unless Intel/AMD/MS/whoever ramps up their software API to the level of efficiency and just-works-edness that CUDA provides.

I don't like Nvidia/CUDA any more than the next guy, but it's far and away the best thing going right now. If you have an Nvidia card, you can get the best possible AI performance from it with basically zero effort on either Windows or Linux.

Meanwhile, AMD is either unbearably slow with OpenCL, or an arduous slog to get ROCm working (unless you're using specific cards on specific Linux distros). Intel is limited to OpenCL at best.

Until some other manufacturer provides something that can legitimately compete with CUDA, CUDA ain't going anywhere.
