PCIe traffic between NVIDIA GPUs and AMD EPYC Server Platform

I would like to ask for your help with an issue concerning PCIe traffic between NVIDIA GPUs and an AMD EPYC-based server platform.

My company provides hardware and software solutions for the development of driver assistance systems and control units. One important project is the development of a high-performance computing platform for AI algorithms for autonomous driving.

One major Tier 1 supplier for ADAS applications is using our high-performance computing AMD EPYC platform with a total of five Tesla GPUs.

Now they are experiencing behavior similar to what you describe in your technical walkthrough “Benchmarking GPUDirect RDMA on Modern Server Platforms”.

Two Tesla GPUs are connected to the server mainboard via a PEX8764 PCIe Gen3 switch (upstream: PCIe Gen3 x16; downstream: PCIe Gen3 x8).

When measuring the PCIe GPU-to-host bandwidth, we observe the following phenomenon:

Single GPU-to-host bandwidth = a stable 6.5 GB/s, which is within the expected range (PCIe Gen3 x8 max. ~7.8 GB/s).

Two GPUs-to-host bandwidth = unstable, max. 3.4 GB/s, which is much too low.

But:

When we swap the server platform for an Intel Xeon-based one, the two-GPU-to-host bandwidth is a stable 6.5 GB/s.
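
For reference, here is a minimal sketch of the kind of measurement we run (the buffer size, iteration count, and overall structure are illustrative assumptions, not our exact benchmark). Running it with one device ordinal and then with both reproduces the single-GPU and two-GPU numbers above:

```cpp
// Minimal sketch of the measurement (illustrative: buffer size, iteration
// count, and structure are assumptions, not our exact benchmark).
// Build: nvcc -O2 bw_test.cu -o bw_test
// Run:   ./bw_test 0        (one GPU)
//        ./bw_test 0 1      (both GPUs concurrently)
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

static const size_t BYTES = 256ull << 20;  // 256 MiB per copy
static const int    ITERS = 20;

// Time ITERS device-to-host copies from pinned memory on one GPU.
static void measure(int dev, double* gbs) {
    cudaSetDevice(dev);
    void *hbuf, *dbuf;
    cudaHostAlloc(&hbuf, BYTES, cudaHostAllocDefault);  // pinned host buffer
    cudaMalloc(&dbuf, BYTES);
    cudaStream_t s;
    cudaStreamCreate(&s);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, s);
    for (int i = 0; i < ITERS; ++i)
        cudaMemcpyAsync(hbuf, dbuf, BYTES, cudaMemcpyDeviceToHost, s);
    cudaEventRecord(stop, s);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    *gbs = (double)BYTES * ITERS / (ms * 1e-3) / 1e9;

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaStreamDestroy(s); cudaFree(dbuf); cudaFreeHost(hbuf);
}

int main(int argc, char** argv) {
    std::vector<int> devs;
    for (int i = 1; i < argc; ++i) devs.push_back(atoi(argv[i]));
    if (devs.empty()) devs.push_back(0);

    // One host thread per GPU so the copies overlap in time; the overlap is
    // approximate, which is sufficient to expose uplink contention.
    std::vector<double> gbs(devs.size());
    std::vector<std::thread> threads;
    for (size_t i = 0; i < devs.size(); ++i)
        threads.emplace_back(measure, devs[i], &gbs[i]);
    for (auto& t : threads) t.join();

    for (size_t i = 0; i < devs.size(); ++i)
        printf("GPU %d: %.2f GB/s device-to-host\n", devs[i], gbs[i]);
    return 0;
}
```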

I guess you can now imagine why I am contacting you: it looks like the PCIe bandwidth depends on the server platform architecture.

It looks like PCIe traffic on the EPYC platform has a problem with the “multiplexed” architecture through the PCIe switch.

Do you have any advice or hints on how we can remove this bottleneck? Perhaps you already have experience with the AMD EPYC platform combined with NVIDIA GPUs?

[Note: The CUDA forums are primarily a venue for users assisting other users. NVIDIA personnel monitor them fairly regularly, but you may need to wait a couple of days before someone from NVIDIA responds.]

Can you be more specific about the Intel and AMD platforms, as well as the Tesla GPU model being used here? It would be best to give their exact designations. I assume the EPYC system is a single-socket system, but might the EPYC chip be constructed of two CPU dies internally? Is the Xeon-based system a single-socket or a dual-socket system?
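
One way to answer the die/socket question from software is to map each GPU to its PCIe bus ID and the NUMA node Linux reports for it. A minimal sketch, assuming a Linux system with the standard sysfs layout (the lower-casing is only there because sysfs names bus addresses in lower case):

```cpp
// Sketch: map each GPU to its PCIe bus ID and the NUMA node Linux reports
// for it. Assumes Linux with the standard sysfs layout.
// Build: nvcc -O2 topo.cu -o topo
#include <cctype>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int d = 0; d < n; ++d) {
        char bus[32] = {0};
        cudaDeviceGetPCIBusId(bus, sizeof(bus), d);
        for (char* p = bus; *p; ++p)        // sysfs uses lower-case addresses
            *p = (char)tolower(*p);

        char path[128];
        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", bus);
        int node = -1;
        if (FILE* f = fopen(path, "r")) { fscanf(f, "%d", &node); fclose(f); }

        printf("GPU %d: PCI %s, NUMA node %d\n", d, bus, node);
    }
    return 0;
}
```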

NVIDIA definitely has practical experience with EPYC-based systems, as some of NVIDIA’s DGX products use EPYC CPUs:

https://www.crn.com/slide-shows/components-peripherals/nvidia-s-8-biggest-gtc-2020-product-announcements-you-might-have-missed/3

Each DGX A100 system contains eight A100 Tensor Core GPUs, two 64-core AMD EPYC processors, nine Mellanox ConnectX-6 SmartNICs and a Mellanox HDR InfiniBand 200 interconnect.

Here is more specific information about the systems:

AMD platform:
CPU: AMD EPYC 7H12 (64 cores, 2.6/3.3 GHz), Rome
Mainboard: Supermicro H11SSL-i
GPU: 2x Tesla T4
doesn’t work!

Intel platform:
CPU: Intel Xeon Silver 4209T (8 cores, 2.2 GHz), Cascade Lake
Mainboard: Supermicro X11DPH-T
GPU: 2x GTX 1650
works!

The AMD platform mentioned above is our customer’s platform. In our office we have nearly the same setup for testing, except for the Tesla graphics cards; here we use 2x GTX 1650.

The EPYC system is a single-socket system, whereas the Xeon-based system is dual-socket.
I don’t know whether the EPYC chip is constructed of two CPU dies internally.

So we found that the model of the graphics card doesn’t matter; the problem is whether it is an AMD or an Intel system.

Thanks in advance for your help!

Looking at the H11SSL-i motherboard manual (Figure 1-3, page 17), I don’t see that topology. There is no PEX switch on that motherboard.

My suggestion would be to take up your concerns with Supermicro.

You’re right. There is no such switch on the mainboard. The switch is separate hardware connected to one of the mainboard’s PCIe x16 slots. The two graphics cards are then connected to the switch.

Then you may wish to take up your concerns with whoever is the vendor of that extra piece of hardware. I don’t think the system is qualified by SMC that way, and NVIDIA doesn’t support that configuration. We provide support to Supermicro for the platforms that Supermicro has tested according to our qualification process.

Your report seems consistent with the idea that the upstream link is x8 rather than x16. This might be the case if the external hardware is designed that way. It might also be the case if it happens to be plugged into a x8 slot on the motherboard rather than a x16 slot. But even if none of that is true, I wouldn’t be able to tell you what is wrong.
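
If you want to verify the negotiated width, "sudo lspci -vv" shows it in the LnkSta line of each port, and the kernel also exposes it through sysfs. A minimal sketch, assuming Linux; the device address below is a placeholder you would replace with the switch’s upstream port or one of the GPUs:

```cpp
// Sketch: read the negotiated PCIe link speed/width from sysfs. The device
// address below is a placeholder; point it at the switch upstream port or a
// GPU (addresses can be found with lspci). Assumes Linux.
#include <cstdio>
#include <fstream>
#include <string>

static std::string readLine(const std::string& path) {
    std::ifstream f(path);
    std::string s;
    std::getline(f, s);
    return s;
}

int main() {
    const std::string dev = "/sys/bus/pci/devices/0000:3b:00.0/";  // placeholder
    printf("current_link_speed: %s\n", readLine(dev + "current_link_speed").c_str());
    printf("current_link_width: %s\n", readLine(dev + "current_link_width").c_str());
    printf("max_link_width:     %s\n", readLine(dev + "max_link_width").c_str());
    return 0;
}
```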

As an alternative, plug your GPUs directly into the motherboard. There appear to be 3 dedicated x16 slots and 3 dedicated x8 slots.