I have a program that runs identical CUDA code on identical, equally sized
data sets on 2 GPUs for multiple iterations.
The hardware is:
- Intel® Xeon® Silver 4110 CPU @ 2.10GHz (2 processors)
- 256GB RAM
- Windows Server 2019
- 2x Quadro RTX 8000
In each iteration:
- Copy data from HOST pinned memory to GPU 0 and GPU 1
- Clear the previously used GPU memory with cudaMemsetAsync, then execute a user-written kernel (MapXYScale) to create the maps for image remapping (the other kernels are calls to NPP functions, e.g. nppiWarpAffine_8u_C1R)
- Re-execute the previous step
- Copy data from GPU back to host and wait for completion (cudaEventSynchronize)
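For reference, the per-iteration structure can be sketched roughly as follows. This is not the actual code: the buffer names, sizes, launch configuration, and the MapXYScale signature are placeholders; only the sequence of operations matches the steps above.

```cuda
// Sketch of one iteration across both GPUs (placeholders, not the real code).
__global__ void MapXYScale(float* mapX, float* mapY, int w, int h);  // user kernel

void runIteration(unsigned char* h_in[2], unsigned char* h_out[2],   // pinned host buffers
                  unsigned char* d_in[2], unsigned char* d_out[2],   // per-device buffers
                  float* d_mapX[2], float* d_mapY[2],
                  size_t nbytes, int w, int h,
                  cudaStream_t stream[2], cudaEvent_t done[2])
{
    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        // 1. host (pinned) -> device copy
        cudaMemcpyAsync(d_in[dev], h_in[dev], nbytes,
                        cudaMemcpyHostToDevice, stream[dev]);
        // 2. clear previously used memory, then rebuild the remap maps
        cudaMemsetAsync(d_out[dev], 0, nbytes, stream[dev]);
        MapXYScale<<<grid, block, 0, stream[dev]>>>(d_mapX[dev], d_mapY[dev], w, h);
        // ...NPP calls (e.g. nppiWarpAffine_8u_C1R) issued on the same stream...
        // 3. re-execute step 2 with the same data
        cudaMemsetAsync(d_out[dev], 0, nbytes, stream[dev]);
        MapXYScale<<<grid, block, 0, stream[dev]>>>(d_mapX[dev], d_mapY[dev], w, h);
        // 4. device -> host copy; record an event to wait on
        cudaMemcpyAsync(h_out[dev], d_out[dev], nbytes,
                        cudaMemcpyDeviceToHost, stream[dev]);
        cudaEventRecord(done[dev], stream[dev]);
    }
    for (int dev = 0; dev < 2; ++dev)
        cudaEventSynchronize(done[dev]);
}
```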
Attached is a screenshot of what is happening, captured with NVIDIA Nsight Systems 2020.1.1.
When running on the 2 Quadro RTX 8000 cards, performance is good on device 0 but ~5.5x worse on device 1. Moreover, even on device 0, the same kernel is executed twice with the same data (the second and third steps above), and the second execution takes ~2x as long as the first.
The code has been tested with the GPUs in both WDDM and TCC mode, with no difference.
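For completeness, the driver model was switched per GPU with nvidia-smi (this requires administrator rights and a reboot to take effect); the device index below is a placeholder, not necessarily the index used here:

```shell
# Switch GPU 1 to TCC mode (run as administrator, then reboot)
nvidia-smi -i 1 -dm 1
# Switch GPU 1 back to WDDM mode
nvidia-smi -i 1 -dm 0
# Confirm the current driver model of GPU 1
nvidia-smi -q -i 1 | findstr "Driver Model"
```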
Performance problem details
From the first iterations onward, most operations on device 1 are slow, including:
- execution of the user-written kernel (MapXYScale)
- NVIDIA's NPP kernels (e.g. nppiRemap_8u_C1R)
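To check whether the slowdown on device 1 is actual on-GPU execution time rather than launch or synchronization overhead, one option is to time individual kernel launches with CUDA events and compare the two devices without the profiler attached. A sketch, assuming MapXYScale is declared as in the application and the arguments are placeholders:

```cuda
// Time a single MapXYScale launch on one device with CUDA events,
// so device 0 and device 1 can be compared iteration by iteration.
float timeMapXYScaleMs(int dev, cudaStream_t stream,
                       float* d_mapX, float* d_mapY, int w, int h)
{
    cudaSetDevice(dev);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);  // placeholder config
    cudaEventRecord(start, stream);
    MapXYScale<<<grid, block, 0, stream>>>(d_mapX, d_mapY, w, h);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);              // waits only for this launch
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // GPU-side elapsed time in ms
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;  // log per device, per iteration
}
```

If the event-measured times show the same ~5.5x gap, the difference is in kernel execution itself rather than in launch latency on the host side.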
Also attached is a screenshot of nvidia-smi.exe during execution, in case it helps.
I tried running the same program on 2 Quadro P6000 cards. On that node, performance is
consistent across both GPUs and throughout all iterations.
The application has been compiled with both CUDA 10.2 and CUDA 11.1, with no difference in performance.
Has anyone had the same problem? Any insights or suggestions on how to investigate this further would be highly appreciated.