I have an app that transfers (HtoD) data to the GPU for processing by a few kernels. There is a very small DtoH transfer of results back to the host, then the process repeats. Here’s a screenshot of one such “processing cycle”:
The green “Memcpy HtoD” takes ~300us (3.7 MB), followed by a delay of ~64us before the first kernel starts (highlighted). What could be causing this delay, and can anything be done about it? I’m still getting to grips with Nsight, but the “CUDA API” row seems to show the host issuing the CUDA commands back to back (more or less), suggesting the delay isn’t originating from there.
(Probably not relevant but just in case you are curious: there is a very small H2D memcpy (tens of bytes) immediately before the first kernel, plus a cudaMemset before the second kernel).
It’s difficult to diagnose anything from such a tiny slice of profiler output. However, if there is an additional op you haven’t shown, namely another cudaMemcpy H2D right before the kernel, that could be an explanation for the gap.
Every CUDA operation that you launch on a GPU (kernel calls, cudaMemcpy operations, etc.) has latencies and overheads. Just because your cudaMemcpy H2D of 3.7 MB takes 300us does not mean that a cudaMemcpy operation of 1 byte will take approximately zero microseconds. It will take roughly 10 microseconds or more. Since you’re asking where a few tens of microseconds went, this stuff matters, and it should not be omitted from your analysis or from the data you provide to others when asking for help.
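To see this fixed per-operation overhead for yourself, you could time a 1-byte copy against the 3.7 MB copy with CUDA events. This is just a sketch to illustrate the point; the buffer sizes and names are placeholders, and the small copy will report far more than “size / bandwidth” would predict:

```cpp
// Sketch: compare the wall time of a tiny H2D copy vs a large one.
// Illustrates that every cudaMemcpy carries a fixed launch/latency cost.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t large = 3'700'000;          // ~3.7 MB, as in the question
    char *dSmall, *dLarge, *hSmall, *hLarge;
    cudaMalloc(&dSmall, 1);
    cudaMalloc(&dLarge, large);
    cudaMallocHost(&hSmall, 1);              // pinned host memory
    cudaMallocHost(&hLarge, large);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    cudaEventRecord(start);
    cudaMemcpy(dSmall, hSmall, 1, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("1-byte copy:  %.1f us\n", ms * 1000.0f);

    cudaEventRecord(start);
    cudaMemcpy(dLarge, hLarge, large, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("3.7 MB copy:  %.1f us\n", ms * 1000.0f);
    return 0;
}
```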
Apologies, I didn’t think the small H2D copy would be relevant as it takes ~350ns to execute. It’s barely visible in the above screenshot, at the end of the 64us “idle” region that is the focus of the question.
I’ve just been looking more closely at the tooltip of that small copy and I can now see that it shows a latency of 46us, so I guess that explains the majority of that larger 64us delay.
I’ve read that calling cudaStreamQuery(0) can help here, but are there any adverse effects in doing this?
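For reference, the pattern I’ve read about just inserts the query into the launch sequence after queuing work. A sketch of what I mean (kernel name, sizes, and launch configuration are placeholders):

```cpp
#include <cuda_runtime.h>

__global__ void myKernel(float* d) { /* placeholder kernel */ }

void launchCycle(float* dBuf, const float* hBuf, size_t n, cudaStream_t stream) {
    cudaMemcpyAsync(dBuf, hBuf, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    myKernel<<<256, 256, 0, stream>>>(dBuf);
    // Non-blocking status query; on WDDM this can prompt the driver to
    // flush its batched command buffer to the GPU sooner than it otherwise would.
    cudaStreamQuery(stream);
}
```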
You can also consider CUDA graphs to reduce launch latency; replacing a sequence of individual launches with a single graph launch can shave off some microseconds per cycle.
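As a sketch of the graph approach, one processing cycle could be captured from a stream once at startup and then replayed each cycle with a single launch. This assumes your per-cycle work is the same every time; the kernel names, buffers, and sizes below are placeholders standing in for your actual sequence:

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(const float* in, float* tmp) { /* placeholder */ }
__global__ void kernelB(const float* tmp, float* out) { /* placeholder */ }

// Capture one processing cycle into an executable graph (done once).
cudaGraphExec_t buildCycleGraph(cudaStream_t stream,
                                float* dIn, const float* hIn, size_t inBytes,
                                float* dTmp, size_t tmpBytes,
                                float* dOut, float* hOut, size_t outBytes) {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cudaMemcpyAsync(dIn, hIn, inBytes, cudaMemcpyHostToDevice, stream);
    kernelA<<<256, 256, 0, stream>>>(dIn, dTmp);
    cudaMemsetAsync(dTmp, 0, tmpBytes, stream);
    kernelB<<<256, 256, 0, stream>>>(dTmp, dOut);
    cudaMemcpyAsync(hOut, dOut, outBytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    cudaGraphDestroy(graph);
    return graphExec;
}

// Each cycle then becomes a single launch:
//   cudaGraphLaunch(graphExec, stream);
//   cudaStreamSynchronize(stream);
```

Note the H2D copies here need pinned host memory for the async calls to actually overlap, and the graph only helps if the work per cycle is identical (same sizes, same kernels).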
Are you on Windows? Information like this matters.
Yes, it’s Windows. The GPU is a Quadro RTX 4000, although we’re looking to upgrade to an RTX 4070.
On Windows, if I were worried about scheduling delays, then rather than the cudaStreamQuery approach you mentioned, I would first try both settings of Windows Hardware-Accelerated GPU Scheduling to see whether one or the other produced a better launch sequence for my test case. I would do that before other approaches.
I’m not suggesting this will make a difference in your test case; I don’t know exactly what the source of the 64us, 46us, or 18us gap is. I’m simply saying that if I were considering going down that road, this is how I would start.