I have an app that transfers (HtoD) data to the GPU for processing by a few kernels. There is a very small DtoH transfer of results back to the host, then the process repeats. Here’s a screenshot of one such “processing cycle”:
The green “Memcpy HtoD” takes ~300us (3.7 MB), followed by a delay of ~64us before the first kernel starts (highlighted). What could be causing this delay, and can anything be done about it? I’m still getting to grips with Nsight, but the “CUDA API” row seems to show the host issuing the CUDA commands back to back (more or less), suggesting the delay isn’t originating from there.
(Probably not relevant but just in case you are curious: there is a very small H2D memcpy (tens of bytes) immediately before the first kernel, plus a cudaMemset before the second kernel).
It’s difficult to diagnose anything from such a tiny slice of profiler output. However, if there is an additional op you haven’t shown, namely another cudaMemcpy H2D right before the kernel, that could be an explanation for the gap.
Every CUDA operation that you launch on a GPU (kernel calls, cudaMemcpy operations, etc.) has latencies and overheads. Just because your cudaMemcpy H2D of 3.7 MB takes 300us does not mean that a cudaMemcpy operation of 1 byte will take approximately zero microseconds. It will take roughly 10 microseconds or more. Since you’re asking where a few tens of microseconds went, this stuff matters, and it should not be omitted from your analysis or from the data you provide to others when asking for help.
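To see this fixed per-operation overhead for yourself, you could time a 1-byte copy against the 3.7 MB copy with CUDA events. This is just a sketch to illustrate the point; the buffer sizes and names are placeholders, and the small copy will report far more than “size / bandwidth” would predict:

```cpp
// Sketch: compare the wall time of a tiny H2D copy vs a large one.
// Illustrates that every cudaMemcpy carries a fixed launch/latency cost.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t large = 3'700'000;          // ~3.7 MB, as in the question
    char *dSmall, *dLarge, *hSmall, *hLarge;
    cudaMalloc(&dSmall, 1);
    cudaMalloc(&dLarge, large);
    cudaMallocHost(&hSmall, 1);              // pinned host memory
    cudaMallocHost(&hLarge, large);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    cudaEventRecord(start);
    cudaMemcpy(dSmall, hSmall, 1, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("1-byte copy:  %.1f us\n", ms * 1000.0f);

    cudaEventRecord(start);
    cudaMemcpy(dLarge, hLarge, large, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("3.7 MB copy:  %.1f us\n", ms * 1000.0f);
    return 0;
}
```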
Apologies, I didn’t think the small H2D copy would be relevant as it takes ~350ns to execute. It’s barely visible in the above screenshot, at the end of the 64us “idle” region that is the focus of the question.
I’ve just been looking more closely at the tooltip of that small copy and I can now see that it shows a latency of 46us, so I guess that explains the majority of that larger 64us delay.
I’ve read that calling cudaStreamQuery(0) can help here, but are there any adverse effects in doing this?
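For reference, the pattern I’ve read about just inserts the query into the launch sequence after queuing work. A sketch of what I mean (kernel name, sizes, and launch configuration are placeholders):

```cpp
#include <cuda_runtime.h>

__global__ void myKernel(float* d) { /* placeholder kernel */ }

void launchCycle(float* dBuf, const float* hBuf, size_t n, cudaStream_t stream) {
    cudaMemcpyAsync(dBuf, hBuf, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    myKernel<<<256, 256, 0, stream>>>(dBuf);
    // Non-blocking status query; on WDDM this can prompt the driver to
    // flush its batched command buffer to the GPU sooner than it otherwise would.
    cudaStreamQuery(stream);
}
```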
You can also consider CUDA graphs to reduce launch latency; replacing a sequence of individual launches with a single graph launch can shave off some microseconds per cycle.
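As a sketch of the graph approach, one processing cycle could be captured from a stream once at startup and then replayed each cycle with a single launch. This assumes your per-cycle work is the same every time; the kernel names, buffers, and sizes below are placeholders standing in for your actual sequence:

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(const float* in, float* tmp) { /* placeholder */ }
__global__ void kernelB(const float* tmp, float* out) { /* placeholder */ }

// Capture one processing cycle into an executable graph (done once).
cudaGraphExec_t buildCycleGraph(cudaStream_t stream,
                                float* dIn, const float* hIn, size_t inBytes,
                                float* dTmp, size_t tmpBytes,
                                float* dOut, float* hOut, size_t outBytes) {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cudaMemcpyAsync(dIn, hIn, inBytes, cudaMemcpyHostToDevice, stream);
    kernelA<<<256, 256, 0, stream>>>(dIn, dTmp);
    cudaMemsetAsync(dTmp, 0, tmpBytes, stream);
    kernelB<<<256, 256, 0, stream>>>(dTmp, dOut);
    cudaMemcpyAsync(hOut, dOut, outBytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    cudaGraphDestroy(graph);
    return graphExec;
}

// Each cycle then becomes a single launch:
//   cudaGraphLaunch(graphExec, stream);
//   cudaStreamSynchronize(stream);
```

Note the H2D copies here need pinned host memory for the async calls to actually overlap, and the graph only helps if the work per cycle is identical (same sizes, same kernels).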
Are you on Windows? Information like this matters.
Yes, it’s Windows. The GPU is a Quadro RTX 4000, although we’re looking to upgrade to an RTX 4070.
On Windows, if I were worried about scheduling delays, then rather than the cudaStreamQuery approach you mentioned, I would first try both settings of Windows Hardware-Accelerated GPU Scheduling to see whether one or the other produced a better launch sequence for my test case. I would do that before other approaches.
I’m not suggesting this will make a difference in your test case; I don’t know exactly what the source of the 64us, 46us, or 18us gap is. I’m simply saying that if I were considering going down that road, this is how I would start.