I know this may be a lot to ask, but if there are any people interested in seeing how fast this can go, please help me out with your ideas here.
So hardware-wise, what I've got is a Supermicro H11DSi-NT dual-socket SP3 motherboard (BIOS version 2.1) with two 32-core EPYC 7502s. All 16 memory slots are populated with 3200MT/s DDR4 sticks certified for this board by Supermicro, 256GB in total (128GB per socket), so all 8 memory channels on each of the 2 sockets are in use.
For BIOS setup I have read the motherboard manual and AMD's implementation and tuning guides. As suggested for high-memory-bandwidth HPC (subtype CFD) workloads, I disabled SMT (Hyperthreading), set the NUMA nodes per socket to 4 (NPS4), locked the memory speed at 3200 so it doesn't drop to match the NB at 2933, and raised the 4-link xGMI speed from 10 to 16Gbps (max 18).
For the OS I have installed Clear Linux, as it supposedly comes with OOTB tuning for HPC workloads. So far I've had some problems, however. The stateless config isn't very intuitive, which would be alright if the documentation weren't actually wrong in some places, possibly outdated. I've researched some configuration options to improve performance, but none seemed to have a positive effect in my preliminary testing, so I have not applied any permanently. The OS is mostly stock for now.
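To confirm that the NPS4 setting actually reached the OS, the NUMA layout can be read back from sysfs. The following is only a minimal sketch: the helper names `parse_cpulist` and `numa_layout` are made up for illustration, and it assumes the standard Linux `/sys/devices/system/node/node*/cpulist` files are present. With 2 sockets at NPS4, 64 cores and SMT off, one would expect 8 nodes with 8 CPUs each.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8-11' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpulist (memory-only node) yields no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_layout(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} read from sysfs, or {} if the directory is missing."""
    layout = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return layout
    for node_dir in root.glob("node[0-9]*"):
        cpulist = (node_dir / "cpulist").read_text()
        layout[int(node_dir.name[4:])] = parse_cpulist(cpulist)
    return layout


if __name__ == "__main__":
    for node, cpus in sorted(numa_layout().items()):
        print(f"node{node}: {len(cpus)} CPUs -> {cpus}")
```

If this prints only 2 nodes, the NPS setting probably did not take effect or was overridden elsewhere.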
How well does it actually work? Works as expected as far as I can tell. The main reason I got this rig is rendering and it rips, so that's fine. As for memory, after limited tweaking here are the results of my STREAM benchmark:
STREAM version $Revision: 5.10 $
This system uses 8 bytes per array element.
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Number of Threads requested = 64
Number of Threads counted = 64
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 514 microseconds.
(= 514 clock ticks)
Function    Best Rate MB/s   Avg time   Min time   Max time
Copy:            2867900.2   0.000067   0.000056   0.000122
Scale:           2761681.6   0.000065   0.000058   0.000100
Add:             3078388.3   0.000081   0.000078   0.000086
Triad:           3078388.3   0.000094   0.000078   0.000199
Pretty good I think. Better than the 250 promised by the openSUSE HPC guide I read, but not as good as the 350 promised by AMD's technical tuning guide. However, this appears highly variable for me: the result above is merely the best of 6 runs, while the worst of 6 was a pathetic third of it. No idea if that means anything.
Finally, for actual inference I got ggerganov/llama.cpp and LostRuins/koboldcpp. I must say I am not at all familiar with running on CPU, as all of my usage until now was on my GPU system; this one has no GPUs (beyond the motherboard VGA controller).
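A quick sanity check on those numbers, as a sketch rather than a verdict. It assumes the 250 and 350 figures from the guides are in GB/s (the post doesn't give units), that each DDR4 channel is 64 bits (8 bytes) wide, that each EPYC 7502 has 128 MB of L3 as per AMD's spec sheet, and it uses the sizing rule from the comments in stream.c that each array should be at least 4x the total cache size. STREAM reports MB/s with MB = 10^6 bytes.

```python
# Theoretical peak DRAM bandwidth for this box (assumed 8-byte-wide channels).
MT_PER_S = 3200e6          # DDR4-3200 transfers per second
BYTES_PER_TRANSFER = 8     # 64-bit channel (assumption, standard for DDR4)
CHANNELS_PER_SOCKET = 8
SOCKETS = 2

peak_gb_s = MT_PER_S * BYTES_PER_TRANSFER * CHANNELS_PER_SOCKET * SOCKETS / 1e9

# Best Triad rate from the run above, converted from STREAM's MB/s to GB/s.
triad_mb_s = 3078388.3
triad_gb_s = triad_mb_s / 1000

# Cache vs. array size: 128 MB of L3 per socket is an assumption from AMD's specs.
l3_total_bytes = SOCKETS * 128 * 2**20
array_bytes = 10_000_000 * 8                 # "Array size = 10000000", 8 bytes each
min_elements = 4 * l3_total_bytes // 8       # stream.c rule: array >= 4x total cache

print(f"theoretical peak : {peak_gb_s:.1f} GB/s")
print(f"reported Triad   : {triad_gb_s:.1f} GB/s")
print(f"one array        : {array_bytes / 2**20:.1f} MiB vs {l3_total_bytes / 2**20:.0f} MiB of L3")
print(f"suggested minimum: {min_elements} elements per array")
```

Under these assumptions the reported Triad rate is several times the theoretical DRAM peak, and each array fits inside L3, so the run is most likely measuring cache rather than memory. That would also explain the run-to-run variability. Rebuilding STREAM with a larger `-DSTREAM_ARRAY_SIZE` should give a number that can be compared against the guides.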
So far I've tried 2 models, TheBloke/nous-capybara-34b.Q4_K_M.gguf (Yi model), and Sao10K/Euryale-1.4-L2-70B.q5_K_S.gguf (llama2). The speed has been thoroughly disappointing, with the fastest result being:
llama_print_timings: load time = 15986.07 ms
llama_print_timings: sample time = 66.48 ms / 100 runs ( 0.66 ms per token, 1504.28 tokens per second)
llama_print_timings: prompt eval time = 848.76 ms / 20 tokens ( 42.44 ms per token, 23.56 tokens per second)
llama_print_timings: eval time = 38684.95 ms / 99 runs ( 390.76 ms per token, 2.56 tokens per second)
llama_print_timings: total time = 39667.94 ms
Log end
As a result of:
./main -m ../Nous-Capybara-34B-Q4_K_M.gguf -p "An extremely detailed description of the 10 best ethnic dishes will follow, with recipes: " -n 100 -t 32 -tb 64 --mlock --no-mmap --numa
The 70B model runs at about half that, best case.
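One way to put these timings in context: for a dense model generating one token at a time, a common rule of thumb is that every token has to stream roughly the whole set of weights from memory, so tokens/s times model size gives a rough effective bandwidth. The sketch below recomputes the rate from the log line and applies that rule. The function names are made up for illustration, and `ASSUMED_MODEL_GB` is a placeholder guess for the 34B Q4_K_M file size, not a figure from the post; substitute the real size of the .gguf file.

```python
import re

EVAL_LINE = ("llama_print_timings: eval time = 38684.95 ms / 99 runs "
             "( 390.76 ms per token, 2.56 tokens per second)")

ASSUMED_MODEL_GB = 20.7  # hypothetical file size in GB; check the actual .gguf on disk


def tokens_per_second(line):
    """Recompute tokens/s from the 'total ms / N runs' part of a llama_print_timings line."""
    m = re.search(r"=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*(?:runs|tokens)", line)
    if m is None:
        raise ValueError("not a llama_print_timings rate line")
    total_ms, count = float(m.group(1)), int(m.group(2))
    return count / (total_ms / 1000.0)


def effective_bandwidth_gb_s(tok_per_s, model_gb):
    """Rule-of-thumb estimate: each generated token reads about model_gb of weights."""
    return tok_per_s * model_gb


if __name__ == "__main__":
    rate = tokens_per_second(EVAL_LINE)
    bw = effective_bandwidth_gb_s(rate, ASSUMED_MODEL_GB)
    print(f"{rate:.2f} tokens/s -> roughly {bw:.0f} GB/s of weights read (assumed model size)")
```

With the assumed size this lands at roughly 50 GB/s, well below what 16 channels of DDR4-3200 can deliver in theory. If that holds, the bottleneck is more likely thread placement or NUMA locality than raw memory bandwidth.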
I can download other models and software, apply any BIOS tweaks and OS settings, and compile with whatever flags, environment variables and hacks are necessary.
Thank you if you read this far, and apologies if I don't respond until tomorrow; I will be going to sleep any minute now.
Let me know how it goes for you. Sorry if I was too verbose in my OP, mostly just repeating what the AMD guide says.
Lol, no worries, this is the part where I'm out of my league, but I'm pulling it together. I got this one:
https://www.tyan.com/Motherboards=S8026=S8026GM2NRE=description=EN#:~:text=Tyan's%20Tomcat%20SX%20S8026%20is,performance%20within%20a%201P%20footprint.
Mini-ITX form factor, but you get all that juicy PCIe exposed for accelerators. I think we even got the same CPU, but I could only afford 256GB of DDR4. Kitting it out with the full 2TB is going to be exhausting! And I also need more GPUs.