I run the same software and workflow on multiple workstations and laptops. On a Quadro RTX 4000 I see large GPU idle gaps (around 100 ms) in the timeline, but I haven't seen this problem on a workstation with four RTX 3070 GPUs or on a laptop with a single RTX 2060. Attached are the profiling reports from Nsight Systems. I am using CUDA Toolkit v10.2.
My questions are:
Is this issue caused by cuFFT memory allocation?
If so, why is memory allocation so much slower on the Quadro RTX 4000?
How can I work around this problem?
On the Quadro RTX 4000, performance is much worse when I switch to CUDA Toolkit 11.5, while the other GPUs keep working properly. Is there a known reason for this?
I don’t know how to check the CUDA driver version, but I installed the driver that ships with Toolkit 10.2, so I assume the driver shouldn’t be the issue.
I believe the malloc comes from cufftPlanMany. Is it possible to create the plan once outside the loop and reuse it, just like the other memory buffers?
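To make the last question concrete, here is a minimal sketch of the restructuring I have in mind. It assumes a 1D, in-place, batched C2C transform. The sizes, iteration count and variable names are placeholders for illustration and are not taken from my actual application. The idea is that cufftPlanMany is called once before the loop and only cufftExecC2C runs inside it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    // Placeholder sizes (assumptions for this sketch, not my real workload).
    const int kFftLength = 1024;
    const int kBatch = 64;
    const int kIterations = 100;

    // Allocate the data buffer once, as I already do for my other buffers.
    cufftComplex* d_data = nullptr;
    const size_t bytes = sizeof(cufftComplex) * kFftLength * kBatch;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, bytes);

    // Create the plan once, outside the loop. Passing nullptr for
    // inembed/onembed selects the default contiguous layout, in which case
    // the stride and distance arguments are ignored.
    int n[1] = {kFftLength};
    cufftHandle plan;
    if (cufftPlanMany(&plan, 1, n,
                      nullptr, 1, kFftLength,
                      nullptr, 1, kFftLength,
                      CUFFT_C2C, kBatch) != CUFFT_SUCCESS) {
        fprintf(stderr, "cufftPlanMany failed\n");
        cudaFree(d_data);
        return 1;
    }

    // Reuse the same plan on every iteration; no plan creation in here.
    for (int i = 0; i < kIterations; ++i) {
        if (cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD) != CUFFT_SUCCESS) {
            fprintf(stderr, "cufftExecC2C failed at iteration %d\n", i);
            break;
        }
    }
    cudaDeviceSynchronize();

    // Release the plan and the buffer once, after the loop.
    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```

I would build this with something like `nvcc sketch.cu -lcufft` (the file name is arbitrary). If reusing the plan this way is valid, I would expect the per-iteration malloc to disappear from the Nsight Systems timeline, but I have not confirmed that yet.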