I’ve attached part of a CUDA profiler time width plot from my application. The plot is taken after a number of iterations in the main loop, so the GPU should be “warmed up”, all kernel code downloaded etc. The kernel launches shown have all been batched together in a stream, without any interleaved H2D memcpys or synchronize() calls, so they should all have been buffered in the stream FIFO.
Trying to optimize my application, I’m worried about all the idle “gaps” between the kernel calls in the attached time width plot. Measuring directly in the plot image, each gap is about 100us, while the kernel calls themselves are on the order of 1000us (assuming GPU time is given in us).
I would imagine that NVIDIA has some very well optimized driver technology that implements these command (kernel launch) FIFOs very efficiently, so I’m a bit puzzled by the big gaps between my kernels.
What are your experiences? Is 100us a normal delay between kernel launches? Is the visual profiler output even accurate enough to do measurements at this time scale, or could the gaps just be a profiler phenomenon that I don’t have to worry about?
Or is my problem size (kernel runtime) just too small?
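In case it helps the discussion: here is a minimal micro-benchmark sketch I would use to estimate per-launch overhead independently of the profiler, by timing a batch of back-to-back empty-kernel launches in a single stream with CUDA events. The kernel name and loop count are my own choices for illustration, not from my actual application.

```cuda
// Sketch: estimate average kernel launch overhead, independent of the profiler.
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel: any measured time is launch/dispatch overhead, not work.
__global__ void emptyKernel() {}

int main() {
    const int N = 1000;  // number of back-to-back launches to average over
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm up: the first launch carries one-time setup cost (module load etc.).
    emptyKernel<<<1, 1, 0, stream>>>();
    cudaStreamSynchronize(stream);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time N launches batched in the stream with no interleaved syncs,
    // mirroring the launch pattern described above.
    cudaEventRecord(start, stream);
    for (int i = 0; i < N; ++i)
        emptyKernel<<<1, 1, 0, stream>>>();
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg time per launch: %.1f us\n", ms * 1000.0f / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    return 0;
}
```

If this reports a few microseconds per launch but the profiler still shows ~100us gaps between my real kernels, that would suggest the gaps are either a profiler artifact or specific to my kernels, rather than raw launch overhead.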