LocalLLaMA


Help me get llama running on a dual-socket, 8-channel system. (2x8) (alien.top)

submitted 2 years ago by MindlessEditor2762@alien.top to c/localllama@poweruser.forum


I know this may be a lot to ask, but if there are any people interested in seeing how fast this can go, please help me out with your ideas here.

So hardware-wise what I got is an H11DSI-NT dual-socket SP3 motherboard (BIOS version 2.1) with two 32-core EPYC 7502s. All 16 memory slots are populated with 3200MT/s DDR4 sticks certified for this board by Supermicro, 256GB in total (128GB per socket), which gives 8 memory channels on each of the 2 sockets.

For BIOS set-up I have read the motherboard manual and AMD's implementation and tuning guides. As suggested for high memory bandwidth HPC (subtype CFD) workloads, I disabled SMT (Hyperthreading), set the NUMA nodes per socket to 4 (NPS4), locked the memory speed at 3200 to prevent it from matching the NB at 2933, and raised the 4-link xGMI speed from 10 to 16Gbps (max 18).
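One way to confirm that the NPS4 setting took effect is to count the NUMA nodes the kernel exposes: with 2 sockets at 4 nodes per socket there should be 8 nodes, each holding 8 cores once SMT is off. A minimal sketch, assuming the usual Linux sysfs layout under /sys/devices/system/node; the helper names `parse_cpulist` and `numa_layout` are made up for illustration and do not come from any tool:

```python
import re
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_layout(root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]}; `root` is the assumed sysfs location."""
    layout = {}
    for node_dir in Path(root).glob("node[0-9]*"):
        match = re.fullmatch(r"node(\d+)", node_dir.name)
        cpulist = node_dir / "cpulist"
        if match and cpulist.is_file():
            layout[int(match.group(1))] = parse_cpulist(cpulist.read_text())
    return layout


if __name__ == "__main__":
    # Print one line per NUMA node with its CPU count and CPU ids.
    for node, cpus in sorted(numa_layout().items()):
        print(f"node {node}: {len(cpus)} CPUs -> {cpus}")
```

If the output shows fewer than 8 nodes, the BIOS setting probably did not apply. `numactl --hardware` reports the same information if numactl is installed.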

For the OS I have installed Clear Linux, as supposedly it comes with OOTB tuning for HPC workloads. So far I've had some problems however. The stateless config isn't so intuitive, which would be alright if not for the documentation being actually wrong in some places, possibly outdated. I've researched some configuration options to improve performance but none seemed to have a positive effect in my preliminary testing, so I have not applied any permanently. The OS is mostly stock for now.

How well does it actually work? Works as expected as far as I can tell. The main reason I got this rig is rendering and it rips, so that's fine. As for memory, after limited tweaking here are the results of my STREAM benchmark:

STREAM version $Revision: 5.10 $
This system uses 8 bytes per array element.
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Number of Threads requested = 64
Number of Threads counted = 64
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 514 microseconds.
(= 514 clock ticks)
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:         2867900.2     0.000067     0.000056     0.000122
Scale:        2761681.6     0.000065     0.000058     0.000100
Add:          3078388.3     0.000081     0.000078     0.000086
Triad:        3078388.3     0.000094     0.000078     0.000199

Pretty good I think. Better than the 250 promised by the openSUSE HPC guide I read, but not as good as the 350 promised by AMD's technical tuning guide. However, this appears highly variable for me: the result above is merely the best of 6 runs, while the worst of the 6 was a pathetic third of it. No idea if that means anything.
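A quick sanity check on the STREAM output, as a rough sketch rather than a verdict. It compares the reported rates against the theoretical DRAM peak for 16 channels of DDR4-3200 and compares the benchmark's footprint against the L3 cache. The 128 MiB of L3 per EPYC 7502 is an assumption not stated in the post, and the function names are made up for illustration:

```python
# All hardware figures below are assumptions, not measurements from this machine.
BYTES_PER_TRANSFER = 8       # one 64-bit DDR4 channel moves 8 bytes per transfer
TRANSFERS_PER_S = 3200e6     # DDR4-3200
CHANNELS_PER_SOCKET = 8
SOCKETS = 2
L3_PER_SOCKET_MIB = 128      # assumed L3 size of one EPYC 7502


def peak_bandwidth_gbs():
    """Theoretical peak DRAM bandwidth across both sockets, in GB/s (decimal)."""
    per_channel = TRANSFERS_PER_S * BYTES_PER_TRANSFER
    return per_channel * CHANNELS_PER_SOCKET * SOCKETS / 1e9


def stream_footprint_mib(elements, bytes_per_element=8, arrays=3):
    """Total memory touched by STREAM's three arrays, in MiB."""
    return elements * bytes_per_element * arrays / 2**20


def min_stream_elements(cache_multiple=4, bytes_per_element=8):
    """Array length that makes each array `cache_multiple` times the total L3."""
    total_l3_bytes = L3_PER_SOCKET_MIB * SOCKETS * 2**20
    return cache_multiple * total_l3_bytes // bytes_per_element


if __name__ == "__main__":
    reported_triad_gbs = 3078388.3 / 1000  # "Best Rate MB/s" from the run above
    print("theoretical DRAM peak (GB/s):", peak_bandwidth_gbs())
    print("reported Triad (GB/s):", reported_triad_gbs)
    print("STREAM footprint (MiB):", stream_footprint_mib(10_000_000))
    print("suggested minimum array elements:", min_stream_elements())
```

Under these assumptions the theoretical peak is 409.6 GB/s, so a Triad rate of roughly 3078 GB/s cannot be coming from DRAM. The 228.9 MiB footprint is smaller than the assumed 256 MiB of combined L3, which suggests the run mostly measured cache. Each test also lasts well under 100 microseconds against a 1 microsecond clock granularity, which may explain the run-to-run variation. The comments in stream.c ask for each array to be at least 4 times the total cache size, so rebuilding with a much larger `-DSTREAM_ARRAY_SIZE` should give a number that is comparable to the guides.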

Finally, for actual inference I got ggerganov/llama.cpp and LostRuins/koboldcpp. I must say I am not at all familiar with running on CPU, as all of my usage until now was done on my GPU system. This one has no GPUs (beyond the motherboard VGA controller).

So far I've tried 2 models, TheBloke/nous-capybara-34b.Q4_K_M.gguf (Yi model), and Sao10K/Euryale-1.4-L2-70B.q5_K_S.gguf (llama2). The speed has been thoroughly disappointing, with the fastest result being:

llama_print_timings:        load time =   15986.07 ms
llama_print_timings:      sample time =      66.48 ms /   100 runs   (    0.66 ms per token,  1504.28 tokens per second)
llama_print_timings: prompt eval time =     848.76 ms /    20 tokens (   42.44 ms per token,    23.56 tokens per second)
llama_print_timings:        eval time =   38684.95 ms /    99 runs   (  390.76 ms per token,     2.56 tokens per second)
llama_print_timings:       total time =   39667.94 ms
Log end

As a result of:

./main -m ../Nous-Capybara-34B-Q4_K_M.gguf -p "An extremely detailed description of the 10 best ethnic dishes will follow, with recipes: " -n 100 -t 32 -tb 64 --mlock --no-mmap --numa

The 70B model runs at about half that, best case.
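To put the generation speed in context, here is a back-of-the-envelope sketch. It recomputes tokens per second from the eval line of the log above, then estimates the memory bandwidth that speed implies, on the common simplification that every generated token reads all model weights once. The 20.7 GB file size for the 34B Q4_K_M model is an assumption that is not in the post and should be checked against the file on disk; the function names are made up for illustration:

```python
import re

# The eval line copied from the log above.
EVAL_LINE = (
    "llama_print_timings:        eval time =   38684.95 ms /    99 runs   "
    "(  390.76 ms per token,     2.56 tokens per second)"
)

# ASSUMPTION: approximate size of the 34B Q4_K_M GGUF file in GB.
ASSUMED_MODEL_GB = 20.7


def tokens_per_second(line):
    """Recompute tokens/s from the 'eval time' line of llama_print_timings."""
    match = re.search(r"eval time =\s*([\d.]+) ms /\s*(\d+) runs", line)
    if match is None:
        raise ValueError("not an eval-time line")
    millis = float(match.group(1))
    runs = int(match.group(2))
    return runs / (millis / 1000.0)


def effective_bandwidth_gbs(tok_per_s, model_gb):
    """Rough estimate: each generated token streams the whole model once."""
    return tok_per_s * model_gb


if __name__ == "__main__":
    rate = tokens_per_second(EVAL_LINE)
    print("tokens per second:", round(rate, 2))
    print("implied bandwidth (GB/s):",
          round(effective_bandwidth_gbs(rate, ASSUMED_MODEL_GB), 1))
```

With the assumed file size this works out to roughly 53 GB/s, a small fraction of what 16 channels of DDR4-3200 could deliver in theory. If the estimate holds, the bottleneck is more likely thread and memory placement across NUMA nodes than raw compute. The same function can be applied to the 70B model by passing its file size.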

I can download other models and software, apply any BIOS tweaks and OS settings, and compile with whatever flags, environment variables and hacks are necessary.
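One cheap experiment along those lines is a sweep over thread counts, with and without interleaving memory across NUMA nodes. The sketch below only builds the command lines. It reuses the `./main` flags from the command above and wraps them in `numactl --interleave=all`, which assumes numactl is installed. Whether interleaving helps on this machine is something to measure, not a given, and `build_commands` is a made-up helper name:

```python
import shlex


def build_commands(model_path, thread_counts, prompt="Hello",
                   n_predict=100, interleave=True):
    """Return one argv list per thread count for a llama.cpp benchmark sweep."""
    commands = []
    for threads in thread_counts:
        argv = []
        if interleave:
            # Spread allocations evenly over all NUMA nodes (needs numactl).
            argv += ["numactl", "--interleave=all"]
        argv += [
            "./main", "-m", model_path,
            "-p", prompt,
            "-n", str(n_predict),
            "-t", str(threads),    # generation threads
            "-tb", str(threads),   # batch / prompt threads
            "--no-mmap",
        ]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for argv in build_commands("../Nous-Capybara-34B-Q4_K_M.gguf", [8, 16, 32, 64]):
        print(shlex.join(argv))
```

Running each printed command and comparing the eval time lines should show where adding threads stops helping.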

Thank you if you read this far. I apologize in advance if I don't respond until tomorrow, as I will be going to sleep any minute now.


[–] Flying_Madlad@alien.top 1 points 2 years ago (2 children)

Bro, lol, I'm building a similar system, but I'm gonna need ChatGPT to translate some of that for me

[–] MindlessEditor2762@alien.top 1 points 2 years ago (1 children)

Let me know how it goes for you. Sorry if I was too verbose in my OP, mostly just repeating what the AMD guide says.

[–] Flying_Madlad@alien.top 1 points 2 years ago

Lol, no worries, this is the part where I'm out of my league, but I'm pulling it together. I got this one:

https://www.tyan.com/Motherboards=S8026=S8026GM2NRE=description=EN#:~:text=Tyan's%20Tomcat%20SX%20S8026%20is,performance%20within%20a%201P%20footprint.

Mini-ITX form factor but you get all that juicy PCIe exposed for accelerators. I think we even got the same CPU, but I could only afford 256GB DDR4. Kitting it out with the full 2TB is going to be exhausting! And also I need more GPUs.