The number one cause of random hard crashes and hangs is RAM. Re-seat it, replace it, down-clock it, run a single stick: do everything you can to either rule it out as the problem or isolate the problem to a particular module or channel.
merkuron
Maybe it’s clearer this way:
For every 2 lanes you want allocated to the PCIe slot (up to 4), you lose two SATA lanes. Since there are 8 lanes total, but 12 possible lane destinations, they pre-made combinations of destinations that they think would be useful:
- All 8 lanes to SATA, 4 onboard and 4 through MiniSAS
- 2 lanes to PCIe and 6 to SATA, 2 onboard and 4 through MiniSAS
- 4 lanes to PCIe and 4 to SATA, either activating the 4 onboard ports or the MiniSAS (but not both)
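The three presets above boil down to simple arithmetic, which can be sketched in Python. The names here (`TOTAL_LANES`, `sata_lanes`, `presets`) are made up for illustration and are not anything the board or its firmware exposes; the sketch only restates that whatever goes to the PCIe slot comes out of the SATA budget.

```python
TOTAL_LANES = 8   # lanes shared between SATA and the PCIe slot (from the post)
MAX_PCIE = 4      # the PCIe slot can take at most 4 lanes
PCIE_STEP = 2     # lanes move to PCIe in pairs


def sata_lanes(pcie_lanes: int) -> int:
    """Return how many lanes are left for SATA after giving some to PCIe."""
    if pcie_lanes not in range(0, MAX_PCIE + 1, PCIE_STEP):
        raise ValueError("PCIe allocation must be 0, 2, or 4 lanes")
    return TOTAL_LANES - pcie_lanes


# Enumerate the pre-made combinations: PCIe lanes -> SATA lanes.
presets = {p: sata_lanes(p) for p in range(0, MAX_PCIE + 1, PCIE_STEP)}

for pcie, sata in presets.items():
    print(f"{pcie} lanes to PCIe -> {sata} lanes to SATA")
```

How the remaining SATA lanes are split between the onboard ports and the MiniSAS connector is fixed by the presets listed above; the sketch does not model that part.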
Commscope/Ruckus/Brocade ICX7650-48ZP will do it. Be prepared for sticker shock.
Make the 10.0.0._ addresses loopback addresses, and do point-to-point connections from each box to every other box. No idea how to do this in ESXi, but it’s straightforward in *nix/BSD.
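To get a feel for how many point-to-point links a full mesh needs, here is a rough Python sketch. The box names and the last octets are hypothetical placeholders, and it only plans the cabling; it does not configure any interfaces or routes.

```python
from itertools import combinations

# Hypothetical boxes and loopback addresses, for illustration only.
boxes = {
    "box-a": "10.0.0.1",
    "box-b": "10.0.0.2",
    "box-c": "10.0.0.3",
}


def mesh_links(names):
    """Return every unordered pair of boxes, i.e. one direct link per pair."""
    return list(combinations(sorted(names), 2))


links = mesh_links(boxes)
ports_per_box = len(boxes) - 1  # each box connects to every other box

for a, b in links:
    print(f"{a} ({boxes[a]}) <-> {b} ({boxes[b]})")
print(f"{len(links)} links total, {ports_per_box} ports needed per box")
```

The link count grows as n(n-1)/2, so this approach stays practical only for a handful of boxes.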
CPU1 handles almost everything about being a normal computer: booting, chipset, most of the I/O, etc. CPU2 is along for the ride and handles its own I/O lanes (PCIe) and whatever work the kernel wants to send to it. The load is not symmetrical, so if you have turbo enabled, CPU1 will consistently boost more than CPU2 because it is handling all of those extra tasks, and so CPU1 runs warmer. This is why “tandem” dual-CPU setups have CPU1 upstream in airflow from CPU2.
2667v2 and the 2697/2696v2 are really tops for this generation.
You could desolder it and solder a new one on, or possibly even solder one on top of the existing LED. Same as replacing any other on-board component.
In that case, just copy someone else’s homework. Look up what Supermicro is using for wattage in 1U non-GPU servers, and use those numbers.
You’ll definitely need something with fast PCIe lanes for NVMe. Something with either PCIe 4.0 x4 coupled with a very fast SSD, or something with a lot of PCIe 3.0 lanes.
Is your RAM on the QVL? Ryzen’s notorious pickiness about RAM carries over to TR and EPYC, too. One of the first things that happens, before POST and the BIOS splash screen, is memory training. If it can’t get past that, something about the memory needs adjustment. Have you tried downclocking it?
How close do you want to get? Budgeting about 200W per socket for “normal”-ish CPUs and 400-450W per socket for latest EPYC should get you in the right range.
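As a quick worked example of that rule of thumb, here is a small Python sketch. The function and constant names are invented for illustration, and it adds no extra headroom beyond the per-socket figures quoted above.

```python
NORMAL_W_PER_SOCKET = 200          # "normal"-ish CPUs, per the estimate above
EPYC_W_PER_SOCKET = (400, 450)     # latest EPYC, low and high end of the range


def power_budget_watts(sockets: int, watts_per_socket: int) -> int:
    """Rough system power budget: sockets times the per-socket figure."""
    return sockets * watts_per_socket


dual_normal = power_budget_watts(2, NORMAL_W_PER_SOCKET)
dual_epyc = tuple(power_budget_watts(2, w) for w in EPYC_W_PER_SOCKET)

print(f"Dual-socket, normal CPUs: about {dual_normal} W")
print(f"Dual-socket, latest EPYC: about {dual_epyc[0]}-{dual_epyc[1]} W")
```

So a dual-socket box lands around 400 W with ordinary CPUs and 800-900 W with current EPYC parts, which is only meant to get you in the right range.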
To use DOS-based flash tools, you must boot in BIOS mode with CSM (and OPROMs too, I think) enabled. If you’re booting without a CSM, use an EFI shell with the EFI flash executables.