I have written a kernel driver for a third-party device that does RDMA over PCIe to my H100 GPUs. When I set it to route the RDMA traffic through the root complex, it works. When I instead have it go through a PCIe switch to the nearest GPU, I get hardware crashes with no crash logs. The switch is a fairly standard Gen4 PEX-type part. Everything shows up in lspci as it should, and I don’t suspect any errors on the third-party device.
I would appreciate any tips, advice or technical support.