Hi,
I am using H100 PCIe GPUs in a Supermicro 4U server. I noticed that nudging the 12VHPWR cables causes the GPUs to fall off the bus, as reported in dmesg. After a reboot, they work fine again.
The cables seem very fragile, and heavy load also seems to cause the GPUs to fall off the bus, either due to high fan speed or some other factor. In every case, a reboot fixes it temporarily.
I would like to ask: is this a cable issue, or is the H100 PCIe very sensitive to power cable disconnects?
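
For anyone comparing logs, here is a minimal sketch for pulling the relevant entries out of dmesg output. It assumes the NVIDIA driver logs a line containing "fallen off the bus" (usually alongside Xid 79); the exact wording can differ between driver versions. The function name `find_bus_drops` and the sample log line are made up for illustration, not taken from my system.

```python
import re

# Assumption: the driver message contains "NVRM" and "fallen off the bus".
FALLEN_OFF = re.compile(r"NVRM:.*fallen off the bus", re.IGNORECASE)
# PCI address such as 0000:3b:00 or 0000:3b:00.0
PCI_ADDR = re.compile(r"[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}(?:\.[0-7])?")


def find_bus_drops(log_text):
    """Return a list of (pci_address_or_None, line) for matching log lines."""
    hits = []
    for line in log_text.splitlines():
        if FALLEN_OFF.search(line):
            match = PCI_ADDR.search(line)
            hits.append((match.group(0) if match else None, line.strip()))
    return hits


# Hypothetical sample input; in practice pass in the text printed by `dmesg`.
sample = (
    "[ 100.000001] usb 1-1: new high-speed USB device\n"
    "[ 200.000002] NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus.\n"
)

for address, line in find_bus_drops(sample):
    print(address, "|", line)
```

Grouping the hits by PCI address would show whether the same slot or cable is involved each time or whether the drops move between GPUs.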