The big issue is that once you combine it with a P40, you'll have to disable 16-bit floats for the calculations themselves and do everything in 32-bit floats (this is about the compute, not how the weights are stored). You can still get alright performance out of P40s that way (I'm using 4 of them), but you'll cripple the performance of the 4090. I don't know if any of the libraries for running things will handle the conversion and use different kernels on different cards to avoid that, since it's a completely different set of code.
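To make the FP16/FP32 point concrete, here's a rough sketch of the decision a loader would have to make. Everything in it is my own illustration: `has_fast_fp16` and `pick_compute_dtype` are made-up helper names, not part of any library, and the "compute capability 7.0 or newer means usable FP16" rule is a simplification I'm assuming (it misclassifies a few older cards like the P100).

```python
# Hypothetical helpers, not from any real library.
# Assumption: cards below compute capability 7.0 (e.g. the P40 at 6.1)
# are treated as too slow at FP16 math; 7.0 and newer are treated as fine.

def has_fast_fp16(capability):
    """capability is a (major, minor) CUDA compute capability tuple."""
    return tuple(capability) >= (7, 0)

def pick_compute_dtype(capabilities):
    """Pick one compute dtype for the whole set of cards.

    If a single dtype has to be shared, the slowest card decides:
    one card without fast FP16 pushes everything to float32.
    """
    if all(has_fast_fp16(c) for c in capabilities):
        return "float16"
    return "float32"

RTX_4090 = (8, 9)
RTX_3090 = (8, 6)
TESLA_P40 = (6, 1)

print(pick_compute_dtype([RTX_4090, TESLA_P40]))  # float32
print(pick_compute_dtype([RTX_4090, RTX_3090]))   # float16
```

If you're on PyTorch, `torch.cuda.get_device_capability(i)` should give you the `(major, minor)` tuple for each card, so you could feed those in instead of hardcoding them.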
You'd really be much better off adding a used 3090 from eBay (assuming it works).
Not sure which models you'd want specifically, but take a look at ASRock Rack; they've got a lot of ATX-compatible server motherboards. https://www.asrockrack.com/minisite/EPYC9004/