I would like to ask for your help with an issue concerning PCIe traffic between NVIDIA GPUs and an AMD EPYC-based server platform.
My company provides hardware and software solutions for the development of driver assistance systems and control units. One important project is the development of a high-performance computing platform for AI algorithms for autonomous driving.
One major Tier 1 supplier for ADAS applications is using our high-performance computing AMD EPYC platform with a total of five Tesla GPUs.
They are now seeing behavior very similar to what you describe in your technical walkthrough "Benchmarking GPUDirect RDMA on Modern Server Platforms".
Two of the Tesla GPUs are connected to the server mainboard via a PEX8764 PCIe Gen3 switch (upstream: PCIe Gen3 x16; downstream: PCIe Gen3 x8).
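For reference, this is a minimal sketch of how the GPU placement can be verified from software (standard CUDA runtime calls only; the resulting bus IDs are then matched by hand against the switch ports shown by lspci):

#include <cstdio>
#include <cuda_runtime.h>

// Print the PCI bus ID of every visible CUDA device so that the two Tesla
// GPUs can be matched against the PEX8764 downstream ports reported by lspci.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        char busId[32] = {0};
        cudaDeviceGetPCIBusId(busId, sizeof(busId), dev);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("GPU %d: %s at PCI %s\n", dev, prop.name, busId);
    }
    return 0;
}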
When measuring the PCIe GPU-to-host bandwidth, we observe the following:
A single GPU-to-host transfer runs at a stable 6.5 GB/s, which is within the expected range (PCIe Gen3 x8, maximum roughly 7.8 GB/s).
With both GPUs transferring to the host at the same time, the bandwidth is unstable and reaches at most 3.4 GB/s, which is far too low.
When we swap the server platform for an Intel Xeon-based one, the two-GPU-to-host bandwidth is a stable 6.5 GB/s.
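In case it helps, here is a minimal sketch of the kind of measurement behind these numbers (buffer size, iteration count, and the plain cudaMemcpyAsync loop are illustrative, not our exact benchmark code): one pinned host buffer and one stream per GPU, device-to-host copies enqueued on all GPUs at once, and the per-GPU bandwidth derived from the wall-clock time.

#include <cstdio>
#include <vector>
#include <chrono>
#include <cuda_runtime.h>

// Concurrent device-to-host bandwidth sketch: every visible GPU copies a
// large device buffer into its own pinned host buffer at the same time;
// the per-GPU bandwidth is derived from the total wall-clock time.
// Buffer size and iteration count are illustrative values.
int main() {
    const size_t bytes = 256ull << 20;   // 256 MiB per transfer
    const int iterations = 20;

    int count = 0;
    cudaGetDeviceCount(&count);

    std::vector<void*> dbuf(count), hbuf(count);
    std::vector<cudaStream_t> stream(count);
    for (int g = 0; g < count; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&dbuf[g], bytes);
        cudaMallocHost(&hbuf[g], bytes);          // pinned host memory
        cudaStreamCreate(&stream[g]);
    }

    auto t0 = std::chrono::steady_clock::now();
    for (int it = 0; it < iterations; ++it)
        for (int g = 0; g < count; ++g) {         // enqueue copies on all GPUs
            cudaSetDevice(g);
            cudaMemcpyAsync(hbuf[g], dbuf[g], bytes,
                            cudaMemcpyDeviceToHost, stream[g]);
        }
    for (int g = 0; g < count; ++g) {             // wait for every GPU
        cudaSetDevice(g);
        cudaStreamSynchronize(stream[g]);
    }
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    double perGpuGBs = (double)bytes * iterations / seconds / 1e9;
    printf("%d GPU(s): ~%.2f GB/s per GPU (device to host)\n", count, perGpuGBs);
    return 0;
}

The sketch is built with nvcc and can be run once with a single GPU exposed via CUDA_VISIBLE_DEVICES and once with both GPUs to get the single- vs. two-GPU comparison described above.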
I suppose you can imagine why I am contacting you: it looks as if the PCIe bandwidth depends on the server platform architecture.
It appears that the PCIe traffic on the EPYC platform has a problem with the "multiplexed" arrangement behind the PCIe switch.
Do you have any advice or hints on how we can remove this bottleneck? Perhaps you already have experience with the AMD EPYC platform in combination with NVIDIA GPUs?