We recently bought two Titan RTX cards to put into our HP ProLiant 380p server. The GPUs are powered by an external 1300W Platinum PSU, and the server itself is powered by two 1200W HP Gold PSUs, for a total of 2400W of available power in the server. Each GPU is connected to its own x16 PCIe riser and has the latest drivers as of 02/13/2019. The OS is Windows Server 2012 R2.
When the system first started having Kernel-Power failures (Windows logs Event ID 41), we ran a memory diagnostic, then a CPU diagnostic, then a GPU diagnostic. The server passed all of the tests, but in the GPU diagnostic one GPU (GPU-A) behaved oddly. Although GPU-A passed, its results did not match its twin's: it seemed to be overperforming. We have not overclocked either GPU, and all of their stats are exactly the same. Now, whenever the server is on, we get a Kernel-Power failure about every other hour, so we shut it off and tried both GPUs in a different system. The same failures happened there.
When it is just GPU-B, everything is fine, but when GPU-A comes on the scene… Armageddon.
The weird part is that there is no trend to the Kernel-Power failures. The system may fail while sitting idle, or it may fail while we are running an intensive training session. Other times it gets through the intensive training and then fails.
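To look for a trend, one thing we could try is logging each GPU's power draw, temperature, and clocks every few seconds, so the last samples before a failure can be compared between GPU-A and GPU-B. Below is a minimal sketch of that idea, not something we have validated. It assumes Python 3.7+ is installed and that `nvidia-smi` is on the PATH (on Windows it may need the full path to the executable). The function names, the `gpu_log.csv` file name, and the 5-second interval are all placeholders made up for illustration.

```python
import csv
import os
import subprocess
import time
from datetime import datetime

# Fields requested from nvidia-smi's --query-gpu option.
FIELDS = ["index", "name", "power.draw", "temperature.gpu",
          "clocks.sm", "utilization.gpu"]


def parse_sample(line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError("unexpected nvidia-smi output: " + repr(line))
    return dict(zip(FIELDS, values))


def read_gpus():
    """Query every GPU once and return a list of sample dicts."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_sample(l) for l in result.stdout.splitlines() if l.strip()]


def log_forever(path="gpu_log.csv", interval_s=5):
    """Append one timestamped row per GPU to a CSV file every interval_s seconds."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            now = datetime.now().isoformat(timespec="seconds")
            for sample in read_gpus():
                writer.writerow([now] + [sample[k] for k in FIELDS])
            # Push rows to disk right away, since the machine may lose
            # power without any warning.
            f.flush()
            os.fsync(f.fileno())
            time.sleep(interval_s)
```

Calling `log_forever()` starts the logging loop. After a crash, the last rows in the file can be lined up against the timestamp of the Event ID 41 entry in the Windows System event log to see whether GPU-A was doing anything different from GPU-B right before the failure.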
Is anyone else having this issue? Please help; I am lost and have no idea what to do.