Encoding performance degrades when the GPU used is not connected to the first PCIe slot

We sell a product that runs our software on a system based on Supermicro + AMD EPYC boards, with three Quadro RTX 4000 cards.

We decode video, do a lot of things with the individual frames, and then encode some video outputs.

After extensive performance testing, we found that if we encode on the GPU connected to the first slot of the motherboard, we can encode more feeds than if we do it on any other GPU.

The workload for the encoding GPU is always the same in terms of CUDA, OpenGL, decoding, and encoding.

We use the WDDM driver model on Windows 2016 Enterprise LTS.

The Supermicro motherboard is an H11SSL-i, the CPU is an AMD EPYC 7401P, with 64 GB of RAM (all 8 channels populated), configured with per-die memory interleaving so that the entire CPU is seen as a single NUMA node.

Any idea if this is normal behavior?


I may not be able to root cause this issue, but here are a few pointers that may help.

  1. Please check the CPU affinity of each GPU.
  2. Please check the PCIe link generation and supported bus width.
    You may be able to use the `nvidia-smi` tool to get this information. You can also try `bandwidthTest` from the CUDA Toolkit to measure the memory copy bandwidth across PCIe for a given device.

The above may help you isolate the bottleneck.
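For point 2, `nvidia-smi` can report the current and maximum PCIe link generation and width per GPU via its CSV query mode. Below is a minimal sketch of checking all three cards at once: the query fields are real `nvidia-smi` properties, but the helper names and the expected Gen3 x16 baseline are assumptions for this particular system (a slot wired with fewer lanes, or a link training down, would show up here).

```python
import csv
import io

# Capture the raw report with (one line, run from a shell):
#   nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv
# and feed the text to parse_link_report() below.

def parse_link_report(csv_text):
    """Parse `nvidia-smi --format=csv` output into
    (index, name, link_gen, link_width) tuples, one per GPU."""
    reader = csv.reader(io.StringIO(csv_text.strip()))
    next(reader)  # skip the CSV header row
    report = []
    for row in reader:
        idx, name, gen, width = [field.strip() for field in row]
        report.append((int(idx), name, int(gen), int(width)))
    return report

def degraded_links(report, expected_gen=3, expected_width=16):
    """Return the GPUs running below the expected PCIe Gen/width."""
    return [gpu for gpu in report
            if gpu[2] < expected_gen or gpu[3] < expected_width]
```

Running this on each GPU while it is under encode load is also worthwhile, since the link can clock down to a lower generation when idle.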



1. There is no affinity visible to the OS, since the BIOS treats the EPYC 7401P as a single CPU with 24 cores, and that is how Windows sees it. Internally, the CPU has 4 dies, each with 32 PCIe lanes and 2 RAM memory channels. The catch is that with this BIOS setting we are interleaving memory accesses as if it were an 8-channel CPU, so it's quite hard to investigate this.
2. PCIe 3.0