Quadro RTX 8000 Multi-GPU Performance Issue

Hi,

I have a program that runs identical CUDA code (on identical sets of data
of the same size) on 2 GPUs for multiple iterations.

The hardware is:

  • Intel® Xeon® Silver 4110 CPU @ 2.10GHz (2 processors)
  • 256GB RAM
  • Windows Server 2019
  • 2x Quadro RTX 8000

In each iteration (a rough code sketch follows this list):

  • Copy data from HOST pinned memory to GPU 0 and GPU 1
  • Clear previously used GPU memory with cudaMemsetAsync, then execute a user-written kernel (MapXYScale) to create maps for image remapping (the other kernels are calls to NPP functions, e.g. nppiWarpAffine_8u_C1R)
  • Re-execute step 2
  • Copy data from GPU to HOST (cudaEventSynchronize)
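
Below is a minimal, self-contained sketch of this per-iteration pattern. All names, buffer sizes, and the MapXYScale body are placeholders for illustration, not the actual application code, and the NPP calls are omitted:

    // Sketch of the per-iteration pattern described above (placeholder names/sizes).
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void MapXYScale(float *mapX, float *mapY, int n, float scale)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { mapX[i] = i * scale; mapY[i] = i * scale; }  // stand-in for the real map computation
    }

    int main()
    {
        const int numDev = 2;
        const size_t n = 1 << 20, bytes = n * sizeof(float);
        float *h_buf[numDev], *d_buf[numDev], *d_mapX[numDev], *d_mapY[numDev];
        cudaStream_t stream[numDev];
        cudaEvent_t  done[numDev];

        for (int dev = 0; dev < numDev; ++dev) {
            cudaSetDevice(dev);
            cudaHostAlloc((void **)&h_buf[dev], bytes, cudaHostAllocDefault);  // pinned HOST memory
            cudaMalloc((void **)&d_buf[dev], bytes);
            cudaMalloc((void **)&d_mapX[dev], bytes);
            cudaMalloc((void **)&d_mapY[dev], bytes);
            cudaStreamCreate(&stream[dev]);
            cudaEventCreate(&done[dev]);
        }

        for (int iter = 0; iter < 10; ++iter) {
            for (int dev = 0; dev < numDev; ++dev) {
                cudaSetDevice(dev);
                // 1. HOST (pinned) -> GPU copy on the per-device stream
                cudaMemcpyAsync(d_buf[dev], h_buf[dev], bytes, cudaMemcpyHostToDevice, stream[dev]);
                for (int pass = 0; pass < 2; ++pass) {  // steps 2 and 3: the same work is issued twice
                    // clear previously used GPU memory, then build the remap maps
                    cudaMemsetAsync(d_mapX[dev], 0, bytes, stream[dev]);
                    cudaMemsetAsync(d_mapY[dev], 0, bytes, stream[dev]);
                    MapXYScale<<<(unsigned)((n + 255) / 256), 256, 0, stream[dev]>>>(
                        d_mapX[dev], d_mapY[dev], (int)n, 1.0f);
                    // ... NPP calls on the same stream would follow here ...
                }
                // 4. GPU -> HOST copy, then an event to wait on
                cudaMemcpyAsync(h_buf[dev], d_buf[dev], bytes, cudaMemcpyDeviceToHost, stream[dev]);
                cudaEventRecord(done[dev], stream[dev]);
            }
            for (int dev = 0; dev < numDev; ++dev)
                cudaEventSynchronize(done[dev]);
        }
        printf("done\n");
        return 0;
    }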

Attached is a screenshot of what is happening, taken from NVIDIA Nsight Systems 2020.1.1.

Problem:
When running on the 2 Quadro RTX 8000 cards, performance is good on device 0, but on device 1 it is ~5.5x worse. Also, on device 0, the same kernel is executed twice with the same data, and the execution time of the second run is ~2x that of the first.

The code has been tested with the GPUs in both WDDM and TCC mode, with no difference.

Performance problem details
Starting from the first iterations, most operations on device 1 are slow (a small standalone timing cross-check follows this list), including:

  • cudaMemsetAsync
  • user-written kernel execution
  • NVIDIA kernels (e.g. nppiRemap_8u_C1R)
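
The numbers above come from the Nsight Systems timeline; as a cross-check outside the profiler, the same operation can be timed on every device with CUDA events to see whether the slowdown reproduces in isolation. This is only a sketch, with an arbitrary buffer size:

    // Standalone cross-check: time an identical cudaMemsetAsync on each device.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaSetDevice(dev);
            const size_t bytes = 64 << 20;      // arbitrary 64 MB test buffer
            void *d_buf = nullptr;
            cudaMalloc(&d_buf, bytes);

            cudaEvent_t t0, t1;
            cudaEventCreate(&t0);
            cudaEventCreate(&t1);

            cudaEventRecord(t0);
            cudaMemsetAsync(d_buf, 0, bytes);   // the operation under test (default stream)
            cudaEventRecord(t1);
            cudaEventSynchronize(t1);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, t0, t1);
            printf("device %d: cudaMemsetAsync of %zu MB took %.3f ms\n",
                   dev, bytes >> 20, ms);

            cudaFree(d_buf);
            cudaEventDestroy(t0);
            cudaEventDestroy(t1);
        }
        return 0;
    }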

Attached is a screenshot of nvidia-smi.exe during the execution, in case it helps.

I tried running the same program on 2 Quadro P6000 cards. The performance is
consistent across both GPUs and throughout all the iterations on that system.

The application has been compiled with both CUDA 10.2 and CUDA 11.1, with no difference in performance.

Has anyone had the same problem? Any insights or suggestions on how to investigate this further would be highly appreciated.

Thank you.

Something to look into, although from the cross-check experiment with the two P6000s this does not seem like a likely underlying cause:

PCIe is a point-to-point interconnect. There is a PCIe root complex in each CPU. In multi-socket systems each PCIe slot is connected to a particular PCIe root complex. Also, each CPU has its own memory controller. You would want to make sure each CPU is communicating with the “near” GPU and the “near” system memory, as otherwise data needs to traverse the CPU-to-CPU interconnect, creating a NUMA effect.

You would therefore want to pay particular attention to processor and memory affinity of your program, and control it with a tool like numactl. Sorry, I don’t know what the equivalent tool under Windows would be, as I have never worked on a multi-socket Windows system.
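
As a concrete starting point for that check, the PCIe location of each GPU can be queried from the CUDA runtime and compared against the motherboard documentation for the slot-to-socket mapping. This is just a generic query, not tied to the system above:

    // Print each CUDA device's PCI location and driver model, to help map GPUs
    // to PCIe root complexes / CPU sockets (generic query, nothing system-specific).
    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("device %d: %s  PCI %04x:%02x:%02x  TCC driver: %d\n",
                   dev, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID,
                   prop.tccDriver);
        }
        return 0;
    }

Where supported, nvidia-smi topo -m prints the GPU/CPU affinity matrix directly, without recompiling anything.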
