Reducing GPU Idle Time

Hi,

I’ve attached part of a CUDA profiler time width plot from my application. The plot is taken after a number of iterations in the main loop, so the GPU should be “warmed up”, all kernel code downloaded etc. The kernel launches shown have all been batched together in a stream, without any interleaved H2D memcpys or synchronize() calls, so they should all have been buffered in the stream FIFO.

Trying to optimize my application, I’m worried about all the idle “gaps” between the kernel calls in the attached time width plot. If you measure directly in the plot image, each gap is about 100us, while the kernel calls are on the order of 1000us. (Assuming that GPU time is given in us.)

I would imagine that NVIDIA has some very well optimized driver technology that implements these command (kernel launch) FIFOs very efficiently, so I’m a bit puzzled by the big gaps between my kernels.

What are your experiences? Is 100us a normal delay between kernel launches? Is the visual profiler output even accurate enough to do measurements at this time scale, or could the gaps just be a profiler phenomenon that I don’t have to worry about?

Or is my problem size (kernel runtime) just too small?

/Lars
profiler.png

Do you know what the process scheduling granularity is for your operating system? I wonder if another process is occasionally taking a timeslice from your program (or the driver queue).

I haven’t yet needed to mine the last millisecond out of my code, but I must say that those profiles don’t look that different to what I would expect. I guess the big issue here is that you are really relying on user space process priorities and kernel “preemptibility” (is there such a word?) to get your code into the driver job queue. Seibert has a good point about scheduling granularity. If you have the ability to increase your process priority, that is what I would be looking at first.

Like the profiler itself? :playball:

I had understood that there was some sort of queuing system on the GPU side that allowed you to queue kernel launches. I don’t know if I read that somewhere, or it’s just a figment of my imagination.

Are you syncing after each kernel launch - to check for errors for example? That would break queuing…
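
To illustrate the difference, here is a minimal sketch, not anyone's actual code. It assumes the current runtime API (`cudaDeviceSynchronize`; CUDA 2.3 used `cudaThreadSynchronize`), and the kernel `step` is a hypothetical placeholder. `cudaGetLastError()` reports launch errors without blocking the host, so the queue stays full, whereas a synchronize inside the loop would drain it after every kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; stands in for the real work.
__global__ void step(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    const int launches = 50;   // batch size mentioned in this thread
    float *d = nullptr;

    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d, 0, n * sizeof(float));

    for (int k = 0; k < launches; ++k) {
        step<<<(n + 255) / 256, 256>>>(d, n);

        // Non-blocking check: catches bad launch configurations only.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            printf("launch %d failed: %s\n", k, cudaGetErrorString(err));
            return 1;
        }
        // Calling cudaDeviceSynchronize() here instead would make the host
        // wait for each kernel before queuing the next one.
    }

    // A single sync at the end surfaces errors raised during execution.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("execution failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d);
    return 0;
}
```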

Thanks all for your comments.

I’m using CUDA 2.3 on a GTX 295 in a Core i7 box, running Linux 2.6.30 (scheduling granularity 10ms). Except for my test program, the system is pretty much idle, so I doubt heavy context switching is the cause of the problem.

Also, I’m launching all (50 or so) kernel calls asynchronously in two parallel streams. The actual CPU launch time per batch is just about 0.18ms, while the GPU run time (according to the profiler) is about 60ms, so there should be enough work batched up to keep the GPU busy.

So, what kind of delays do you see in the profiler between kernels that in theory could be executed just after each other?

I never seem to get a smaller gap than about 50us between kernel calls, with average times more like 80us.

/Lars

Did you try to measure the time taken for a batch of 50 without profiling and compare that to the time taken with the profiler (gathering statistics in between kernel calls etc.)?
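
One way to make that comparison is to time the batch from the host with a wall clock and run the same binary with and without the profiler. The following is only a sketch under assumptions: the kernel `work`, the problem size, and the use of the default stream are all placeholders rather than the application's real setup. It reports the CPU time spent queuing the launches separately from the total time until the GPU finishes.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; replace with the real kernels.
__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 1.0001f + 1.0f;
}

int main()
{
    using clock = std::chrono::steady_clock;
    const int n = 1 << 22;
    const int launches = 50;   // batch size mentioned in this thread
    float *d = nullptr;

    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d, 0, n * sizeof(float));

    // Warm-up launch so one-time setup costs are not measured.
    work<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    clock::time_point t0 = clock::now();
    for (int k = 0; k < launches; ++k)
        work<<<(n + 255) / 256, 256>>>(d, n);
    clock::time_point t1 = clock::now();   // all launches queued
    cudaDeviceSynchronize();
    clock::time_point t2 = clock::now();   // GPU has finished the batch

    double queueMs = std::chrono::duration<double, std::milli>(t1 - t0).count();
    double totalMs = std::chrono::duration<double, std::milli>(t2 - t0).count();
    printf("CPU time to queue %d launches: %.3f ms\n", launches, queueMs);
    printf("Total time for the batch:     %.3f ms\n", totalMs);

    cudaFree(d);
    return 0;
}
```

If the total time is clearly lower without the profiler than with it, part of the gaps seen in the plot is likely profiling overhead.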

Not yet. But after spending a day trying to find out why some operations in my streams didn’t seem to overlap, only to discover that the profiler actually serializes all streams to get “more correct” timings, I’m starting to be a bit suspicious of anything that the profiler shows. Like you say, the gaps between kernel calls may very well be a special “statistics collection” kernel that stores the statistics of the last launch into GPU memory, and may not be there at all if you don’t run under the profiler.

Has anyone done any benchmarks trying to find out the real GPU time delay between the execution of batched up kernel calls (as opposed to the CPU time taken to launch, i.e. queue up, a kernel)? I’ve seen some indications on these forums that this delay should be pretty much negligible, but it would be nice to know for sure, since part of my execution time is made up of several short-lived (about 1ms) kernels. Do I need to spend effort trying to merge them into one?
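
A rough way to estimate this without the profiler is sketched below. It is an assumption-laden micro-benchmark, not a definitive measurement: the `spin` kernel, the cycle count, and the helper `timeLaunches` are all made up for illustration, and `clock64()` requires a GPU newer than the one discussed here. It times one spin kernel and then a batch of them with CUDA events, and attributes the difference to the gaps between kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Busy-wait for roughly `cycles` GPU clock ticks so that each kernel has
// a known, non-trivial duration (hypothetical stand-in for a ~1 ms kernel).
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Time `count` back-to-back launches in the default stream with CUDA
// events and return the elapsed GPU time in milliseconds.
static float timeLaunches(int count, long long cycles)
{
    cudaEvent_t begin, end;
    cudaEventCreate(&begin);
    cudaEventCreate(&end);

    cudaEventRecord(begin, 0);
    for (int k = 0; k < count; ++k)
        spin<<<1, 1>>>(cycles);
    cudaEventRecord(end, 0);
    cudaEventSynchronize(end);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, begin, end);
    cudaEventDestroy(begin);
    cudaEventDestroy(end);
    return ms;
}

int main()
{
    const long long cycles = 1000000;  // assumed: about 1 ms near 1 GHz
    const int count = 50;              // batch size mentioned in this thread

    timeLaunches(1, cycles);           // warm-up, result discarded
    float one  = timeLaunches(1, cycles);
    float many = timeLaunches(count, cycles);

    // Extra time in the batch beyond count single kernels, spread over
    // the count - 1 gaps. Noise can make this small or even negative.
    float gapUs = (many - count * one) * 1000.0f / (count - 1);

    printf("single kernel: %.3f ms\n", one);
    printf("batch of %d:   %.3f ms\n", count, many);
    printf("estimated gap per kernel: %.1f us\n", gapUs);
    return 0;
}
```

Averaging over several runs, and repeating with different spin durations, would make the estimate more trustworthy.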

/L

You might save 10us by doing so.
I wouldn’t bother.