xinranli

joined 1 year ago
[–] xinranli@alien.top 1 points 11 months ago (3 children)

This seems like a brilliant and almost obvious idea. Is there a reason this method wasn't a thing before, besides the PCIe bandwidth and storage speed requirements?

[–] xinranli@alien.top 1 points 11 months ago (1 children)

Setting things up was straightforward; the process is no different from building a commercial or workstation platform. The 7773X machine runs Windows 10, and I have another 7452 QS machine that runs Ubuntu. Both are mostly pain free. I have EPYC boards from both Supermicro and ASRock. I find the ASRock board more "modern" with a better BIOS, but Supermicro has slightly better community and official support. In the very early Naples era, AMD's BIOS had some GPU compatibility issues, but I think nowadays you can use any GPU you want.

You can get very cheap Genoa engineering samples or qualification samples off eBay, so you can skip the older DDR4 platforms. The sockets are completely different, though; you wouldn't even be able to reuse the heatsink.

One thing to watch out for when buying EPYCs: definitely avoid vendor-locked CPUs. Any EPYC CPU, once installed in a Dell or Lenovo board, is permanently fused (via AMD's Platform Secure Boot) so it can never boot on another vendor's board again. I got one once, and it was a debugging nightmare until I realized the CPU had been intentionally bricked by Dell...

[–] xinranli@alien.top 1 points 11 months ago (3 children)

Are you trying to run Falcon 180B or something? I think it will probably work, just not very well. I'd love to see you give it a try, though.

When running a two-socket setup, you get two NUMA nodes. I am uncertain how well llama.cpp handles NUMA, but if it handles it well, you might actually get up to 2x the performance thanks to the doubled total memory bandwidth. That is, however, quite unlikely.

You can get OK performance out of just a single-socket setup. I have tried Falcon 180B Q4 GGML on my single 7773X with 512GB of 8-channel 3200 RDIMM, and I think I was getting around 2 tokens/s. A Genoa platform gives you 12-channel DDR5-5200 and AVX-512 support, so it could be very usable with just one CPU.
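
Rough napkin math for scale (my model-size figure is an assumption, not a measurement): CPU token generation is mostly memory-bandwidth bound, so peak bandwidth divided by the quantized model size gives an upper bound on tokens/s.

```python
# Rough rule of thumb: CPU token generation is memory-bandwidth bound, so
# tokens/s is at most (memory bandwidth) / (bytes read per token), and the
# bytes read per token are roughly the size of the quantized weights.
# The 100 GB figure for Falcon 180B Q4 is an assumption, not a measurement.

def peak_bandwidth_gb_s(channels: int, mt_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth per socket in GB/s (64-bit channels)."""
    return channels * mt_s * bus_bytes / 1000

MODEL_GB = 100  # ~180B params * ~0.5 bytes/param at Q4, give or take

for name, channels, mt_s in [
    ("EPYC 7773X, 8-channel DDR4-3200", 8, 3200),
    ("Genoa, 12-channel DDR5-5200", 12, 5200),
]:
    bw = peak_bandwidth_gb_s(channels, mt_s)
    print(f"{name}: ~{bw:.0f} GB/s peak -> at most {bw / MODEL_GB:.1f} tokens/s")
```

That ceiling lines up with the ~2 tokens/s I saw on the 7773X, and puts a single Genoa socket somewhere under 5 tokens/s.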

[–] xinranli@alien.top 1 points 11 months ago

Great work! Does anyone happen to have a guide, tutorial, or paper on how to combine or interleave models? I would also love to try frankensteining models myself.
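
For context, this is my naive guess at what "interleaving" Llama-style checkpoints might look like (placeholder model names, completely unverified), which is exactly why I'm hoping for a proper guide:

```python
# Naive sketch of "passthrough" layer interleaving, assuming two donor
# checkpoints with the same architecture and hidden size, and a Llama-style
# layout where the decoder layers live at model.model.layers.
import torch
from torch import nn
from transformers import AutoModelForCausalLM

model_a = AutoModelForCausalLM.from_pretrained("donor-a", torch_dtype=torch.float16)
model_b = AutoModelForCausalLM.from_pretrained("donor-b", torch_dtype=torch.float16)

# Stack a slice of decoder layers from each donor into one deeper model.
layers = list(model_a.model.layers[:20]) + list(model_b.model.layers[10:])

merged = model_a  # reuse donor A's embeddings, final norm, and LM head
merged.model.layers = nn.ModuleList(layers)
merged.config.num_hidden_layers = len(layers)

# Newer transformers versions track a per-layer index for the KV cache; renumber it.
for i, layer in enumerate(merged.model.layers):
    if hasattr(layer, "self_attn") and hasattr(layer.self_attn, "layer_idx"):
        layer.self_attn.layer_idx = i

merged.save_pretrained("frankenmerged")
```

No idea whether that actually produces a coherent model, so pointers to anything that does this properly would be much appreciated.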

[–] xinranli@alien.top 1 points 1 year ago

I recommend following some fine-tuning tutorials and training a history-oriented model yourself. You can get decent results with a few megabytes of good-quality data about the history content you are interested in. It should be a much more interesting activity than testing models all day! If you want the model to recall intricate details, use higher-rank LoRAs or try a full fine-tune rather than a parameter-efficient fine-tune; a rough sketch is below.
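
Something like this is the usual shape of a LoRA run with Hugging Face peft; the model name, dataset path, and hyperparameters are placeholders you would swap for your own, and most of the work is really in preparing the dataset:

```python
# Minimal LoRA fine-tuning sketch with Hugging Face transformers + peft.
# The base model, dataset file, and hyperparameters are placeholders; in
# practice you would quantize the base model (4/8-bit) to fit it in VRAM.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Higher rank r lets the adapter memorize more detail, at the cost of VRAM.
lora = LoraConfig(r=64, lora_alpha=128, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# One JSON-lines file with a "text" field per example (placeholder path).
data = load_dataset("json", data_files="history_corpus.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments("lora-history", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=3,
                           learning_rate=2e-4, logging_steps=10),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("lora-history-adapter")
```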

But like others have said, the open-source models we have today are still far from GPT-4. Fine-tuning a small model also barely adds any new capability; it only "tunes" it to be knowledgeable about something else. These LLMs are pre-trained on trillions of tokens, and a few tens of thousands more will not make them any smarter.