Encoding performance degrades when the GPU used is not connected to the first PCIe slot

We sell a product that runs our software on a system based on Supermicro + AMD EPYC boards, with three Quadro RTX 4000 cards.

We decode video, do a lot of things with the individual frames, and then encode some video outputs.

After extensive performance testing, we found that if we encode on the GPU connected to the first slot of the motherboard, we can encode more feeds than if we do it on any other GPU.

The workload for the encoding GPU is always the same in terms of CUDA, OpenGL, decoding, and encoding.

We use the WDDM driver model on Windows 2016 Enterprise LTS.

The Supermicro motherboard is an H11SSL-i, the CPU is an AMD EPYC 7401P, with 64 GB of RAM (all 8 channels populated), configured with per-die memory interleaving so that the entire CPU is seen as a single NUMA node.

Any idea if this is normal behavior?


I may not be able to root cause this issue, but here are a few pointers that may help.

  1. Please check the CPU affinity of each GPU.
  2. Please check the PCIe link generation and supported bus width.
    You may be able to use the `nvidia-smi` tool to get this information. You can also try `bandwidthTest` from the CUDA Toolkit to measure the memory copy bandwidth across PCIe for a given device.

The above may help you isolate the bottleneck.
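For point 2, `nvidia-smi` can report the current and maximum PCIe link generation and width per GPU via its CSV query mode. Below is a minimal sketch of checking all three cards at once: the query fields are real `nvidia-smi` properties, but the helper names and the expected Gen3 x16 baseline are assumptions for this particular system (a slot wired with fewer lanes, or a link training down, would show up here).

```python
import csv
import io

# Capture the raw report with (one line, run from a shell):
#   nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv
# and feed the text to parse_link_report() below.

def parse_link_report(csv_text):
    """Parse `nvidia-smi --format=csv` output into
    (index, name, link_gen, link_width) tuples, one per GPU."""
    reader = csv.reader(io.StringIO(csv_text.strip()))
    next(reader)  # skip the CSV header row
    report = []
    for row in reader:
        idx, name, gen, width = [field.strip() for field in row]
        report.append((int(idx), name, int(gen), int(width)))
    return report

def degraded_links(report, expected_gen=3, expected_width=16):
    """Return the GPUs running below the expected PCIe Gen/width."""
    return [gpu for gpu in report
            if gpu[2] < expected_gen or gpu[3] < expected_width]
```

Running this on each GPU while it is under encode load is also worthwhile, since the link can clock down to a lower generation when idle.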



1. There is no affinity visible to the OS, since the BIOS treats the EPYC 7401P as a single CPU with 24 cores, and that is how Windows sees it. Internally, the CPU has 4 dies, each with 32 PCIe lanes and 2 RAM memory channels. The catch is that with this BIOS setting we are interleaving memory accesses as if it were an 8-channel CPU, so it's quite hard to investigate this.
2. PCIe 3.0