Quadro RTX 8000 Multi-GPU Performance Issue

Hi,

I have a program that runs identical CUDA code (on identical sets of data
of the same size) on 2 GPUs for multiple iterations.

The hardware is:

  • Intel® Xeon® Silver 4110 CPU @ 2.10GHz (2 processors)
  • 256GB RAM
  • Windows Server 2019
  • 2x Quadro RTX 8000

In each iteration (a rough code sketch follows this list):

  • Copy data from HOST pinned memory to GPU 0 and GPU 1
  • Clear previously used GPU memory with cudaMemsetAsync, then execute a user-written kernel (MapXYScale) to create maps for image remapping (the other kernels are calls to NPP functions, e.g. nppiWarpAffine_8u_C1R)
  • Re-execute step 2
  • Copy data from GPU to HOST (cudaEventSynchronize)
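
Below is a minimal, self-contained sketch of this per-iteration pattern. All names, buffer sizes, and the MapXYScale body are placeholders for illustration, not the actual application code, and the NPP calls are omitted:

    // Sketch of the per-iteration pattern described above (placeholder names/sizes).
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void MapXYScale(float *mapX, float *mapY, int n, float scale)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { mapX[i] = i * scale; mapY[i] = i * scale; }  // stand-in for the real map computation
    }

    int main()
    {
        const int numDev = 2;
        const size_t n = 1 << 20, bytes = n * sizeof(float);
        float *h_buf[numDev], *d_buf[numDev], *d_mapX[numDev], *d_mapY[numDev];
        cudaStream_t stream[numDev];
        cudaEvent_t  done[numDev];

        for (int dev = 0; dev < numDev; ++dev) {
            cudaSetDevice(dev);
            cudaHostAlloc((void **)&h_buf[dev], bytes, cudaHostAllocDefault);  // pinned HOST memory
            cudaMalloc((void **)&d_buf[dev], bytes);
            cudaMalloc((void **)&d_mapX[dev], bytes);
            cudaMalloc((void **)&d_mapY[dev], bytes);
            cudaStreamCreate(&stream[dev]);
            cudaEventCreate(&done[dev]);
        }

        for (int iter = 0; iter < 10; ++iter) {
            for (int dev = 0; dev < numDev; ++dev) {
                cudaSetDevice(dev);
                // 1. HOST (pinned) -> GPU copy on the per-device stream
                cudaMemcpyAsync(d_buf[dev], h_buf[dev], bytes, cudaMemcpyHostToDevice, stream[dev]);
                for (int pass = 0; pass < 2; ++pass) {  // steps 2 and 3: the same work is issued twice
                    // clear previously used GPU memory, then build the remap maps
                    cudaMemsetAsync(d_mapX[dev], 0, bytes, stream[dev]);
                    cudaMemsetAsync(d_mapY[dev], 0, bytes, stream[dev]);
                    MapXYScale<<<(unsigned)((n + 255) / 256), 256, 0, stream[dev]>>>(
                        d_mapX[dev], d_mapY[dev], (int)n, 1.0f);
                    // ... NPP calls on the same stream would follow here ...
                }
                // 4. GPU -> HOST copy, then an event to wait on
                cudaMemcpyAsync(h_buf[dev], d_buf[dev], bytes, cudaMemcpyDeviceToHost, stream[dev]);
                cudaEventRecord(done[dev], stream[dev]);
            }
            for (int dev = 0; dev < numDev; ++dev)
                cudaEventSynchronize(done[dev]);
        }
        printf("done\n");
        return 0;
    }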

Attached is a screenshot of what is happening, taken from NVIDIA Nsight Systems 2020.1.1.

Problem:
When running on the 2 Quadro RTX 8000 cards, performance is good on device 0, but on device 1 it is ~5.5x worse. Also, on device 0, the same kernel is executed twice with the same data, and the execution time of the second run is ~2x that of the first.

The code has been tested with the GPUs in both WDDM and TCC mode, with no difference.

Performance problem details
Starting from the first iterations, most operations on device 1 are slow (a small standalone timing cross-check follows this list), including:

  • cudaMemsetAsync
  • user-written kernel execution
  • NVIDIA kernels (e.g. nppiRemap_8u_C1R)
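
The numbers above come from the Nsight Systems timeline; as a cross-check outside the profiler, the same operation can be timed on every device with CUDA events to see whether the slowdown reproduces in isolation. This is only a sketch, with an arbitrary buffer size:

    // Standalone cross-check: time an identical cudaMemsetAsync on each device.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaSetDevice(dev);
            const size_t bytes = 64 << 20;      // arbitrary 64 MB test buffer
            void *d_buf = nullptr;
            cudaMalloc(&d_buf, bytes);

            cudaEvent_t t0, t1;
            cudaEventCreate(&t0);
            cudaEventCreate(&t1);

            cudaEventRecord(t0);
            cudaMemsetAsync(d_buf, 0, bytes);   // the operation under test (default stream)
            cudaEventRecord(t1);
            cudaEventSynchronize(t1);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, t0, t1);
            printf("device %d: cudaMemsetAsync of %zu MB took %.3f ms\n",
                   dev, bytes >> 20, ms);

            cudaFree(d_buf);
            cudaEventDestroy(t0);
            cudaEventDestroy(t1);
        }
        return 0;
    }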

Attached is a screenshot of nvidia-smi.exe during the execution, in case it helps.

I tried running the same program on 2 Quadro P6000 cards. The performance is
consistent across both GPUs and throughout all the iterations on that system.

The application has been compiled with both CUDA 10.2 and CUDA 11.1, with no difference in performance.

Has anyone had the same problem? Any insights or suggestions on how to investigate this further would be highly appreciated.

Thank you.

Something to look into, although from the cross-check experiment with the two P6000s this does not seem like a likely underlying cause:

PCIe is a point-to-point interconnect. There is a PCIe root complex in each CPU. In multi-socket systems each PCIe slot is connected to a particular PCIe root complex. Also, each CPU has its own memory controller. You would want to make sure each CPU is communicating with the “near” GPU and the “near” system memory, as otherwise data needs to traverse the CPU-to-CPU interconnect, creating a NUMA effect.

You would therefore want to pay particular attention to processor and memory affinity of your program, and control it with a tool like numactl. Sorry, I don’t know what the equivalent tool under Windows would be, as I have never worked on a multi-socket Windows system.
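
As a concrete starting point for that check, the PCIe location of each GPU can be queried from the CUDA runtime and compared against the motherboard documentation for the slot-to-socket mapping. This is just a generic query, not tied to the system above:

    // Print each CUDA device's PCI location and driver model, to help map GPUs
    // to PCIe root complexes / CPU sockets (generic query, nothing system-specific).
    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("device %d: %s  PCI %04x:%02x:%02x  TCC driver: %d\n",
                   dev, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID,
                   prop.tccDriver);
        }
        return 0;
    }

Where supported, nvidia-smi topo -m prints the GPU/CPU affinity matrix directly, without recompiling anything.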
