this post was submitted on 26 Nov 2023
LocalLLaMA
Community to discuss about Llama, the family of large language models created by Meta AI.
Well, my issues with the P40 are mainly that it has essentially no usable FP16 (half-precision throughput is a tiny fraction of FP32), memory bandwidth is low at only ~346GB/s, and it needs an aftermarket blower fan (probably off eBay) that adds length to the card. I do agree it's a big brain move for budget builders though.
My issue with 24GB cards mainly stems from transferring data between two cards over PCIe. We know that a 70b on a single 48GB card will consistently outperform the same model split across 2x 24GB. Again, the difference is really negligible if you only have the budget for a dual 3090 build or something.
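To put rough numbers on why a q4 70b wants the full 48GB in one place (back-of-envelope only; the bits-per-weight figure and the KV-cache allowance are assumptions, not measurements):

```python
# Back-of-envelope VRAM estimate for a q4-quantized 70b model.
# bits_per_weight and kv_cache_gb are assumed ballpark values, not specs.
params = 70e9
bits_per_weight = 4.5        # roughly a q4_K-style quant
kv_cache_gb = 3.0            # assumed KV cache + runtime buffers at a few K of context

weights_gb = params * bits_per_weight / 8 / 1e9    # ~39 GB of weights
total_gb = weights_gb + kv_cache_gb                # ~42 GB total

print(f"~{total_gb:.0f} GB")  # fits on one 48GB card; on 2x 24GB the layers get
                              # split, so activations cross PCIe on every token
```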
I do work extensively with OpenVINO and ONNX as a software developer, so I'm not too worried about any issues with the platforms working together (I've managed to make them play nice one way or another for most things). This is actually why I was leaning more toward the dual Xeon Platinums or Golds instead of the Epyc/Threadripper deal. PCIe lanes are plentiful either way though.
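For what I mean by "play nice": the same ONNX export can be loaded by both runtimes, so mixing them is mostly plumbing. A minimal sketch (the file name "model.onnx" and the input name "input" are placeholders; the OpenVINO device string would be "GPU" to target something like an A770 instead of the CPU):

```python
import numpy as np
import onnxruntime as ort
from openvino.runtime import Core

x = np.random.rand(1, 3, 224, 224).astype(np.float32)

# ONNX Runtime path
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
ort_out = sess.run(None, {"input": x})[0]

# OpenVINO path, loading the exact same ONNX file
core = Core()
compiled = core.compile_model(core.read_model("model.onnx"), "CPU")
ov_out = compiled([x])[compiled.output(0)]

# The two runtimes should agree to within float tolerance
print(np.allclose(ort_out, ov_out, atol=1e-4))
```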
For the P920, the goal would mainly be to run a q4 or q5 70b on the 48GB card, while auxiliary models (embedding models, semantic search, QA, etc.) would sit on something like an A770 because of its specs-to-price ratio. I don't really need the RAM, and I figured I wouldn't need more than 64GB dedicated to ML work, since even AVX-512 won't make up for how slow running something larger on CPU would be, imo.
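Concretely, the 48GB side of that would look something like the sketch below with llama-cpp-python (the GGUF filename is a placeholder for whatever q4 quant I end up using):

```python
from llama_cpp import Llama

# Placeholder path; a Q4_K_M 70b GGUF is roughly ~40GB of weights
llm = Llama(
    model_path="llama-2-70b.Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload every layer to the single 48GB card
    n_ctx=4096,
)

out = llm("Q: How much VRAM does a 4-bit 70b need? A:", max_tokens=32)
print(out["choices"][0]["text"])
```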
I can only see myself having more than two cards working together in one machine if I could use NVLink or something.
Eventually most of the things I make will be going to prod, so I also need to keep in mind that I'm more likely to get a good deal on cloud Xeons like Sapphire Rapids and a single big card vs an Epyc with many smaller cards.