I’ve attached part of a CUDA profiler time width plot from my application. The plot is taken after a number of iterations in the main loop, so the GPU should be “warmed up”, all kernel code downloaded etc. The kernel launches shown have all been batched together in a stream, without any interleaved H2D memcpys or synchronize() calls, so they should all have been buffered in the stream FIFO.
Trying to optimize my application, I’m worried about all the idle “gaps” between the kernel calls in the attached time width plot. If you measure directly in the plot image, the gaps are about 100us, while the kernel calls are on the order of 1000us (assuming that GPU time is given in us).
I would imagine that NVIDIA has some very well optimized driver technology implementing these command (kernel launch) FIFOs very efficiently, so I’m a bit puzzled by the big gaps between my kernels.
What are your experiences? Is 100us a normal delay between kernel launches? Is the visual profiler output even accurate enough to do measurements at this time scale, or could the gaps just be a profiler phenomenon that I don’t have to worry about?
Or is my problem size (kernel runtime) just too small?
Do you know what the process scheduling granularity is for your operating system? I wonder if another process is occasionally taking a timeslice from your program (or the driver queue).
I haven’t yet needed to mine the last millisecond out of my code, but I must say that those profiles don’t look that different from what I would expect. I guess the big issue here is that you are really relying on user space process priorities and kernel “preemptibility” (is there such a word?) to get your code into the driver job queue. Seibert has a good point about scheduling granularity. If you have the ability to increase your process priority, that is what I would be looking at first.
I had understood that there was some sort of queuing system on the GPU side that allowed you to queue kernel launches. I don’t know if I read that somewhere, or if it’s just a figment of my imagination.
Are you syncing after each kernel launch - to check for errors, for example? That would break queuing…
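Something along these lines, just a sketch with made-up kernel names and launch configurations, to show the difference between checking for launch errors without synchronizing and forcing a sync after every launch:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* ... */ }

// Keeps the kernels queued back-to-back in the stream: cudaGetLastError()
// only checks that the launch itself was accepted and does NOT synchronize.
void launchBatched(float *d_data, cudaStream_t stream, int n)
{
    for (int i = 0; i < n; ++i) {
        myKernel<<<128, 256, 0, stream>>>(d_data);
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            printf("launch %d failed: %s\n", i, cudaGetErrorString(err));
    }
    cudaStreamSynchronize(stream);   // one sync for the whole batch
}

// Breaks the queuing: the CPU waits for every kernel to finish,
// so each launch pays the full submission latency again.
void launchSerialized(float *d_data, cudaStream_t stream, int n)
{
    for (int i = 0; i < n; ++i) {
        myKernel<<<128, 256, 0, stream>>>(d_data);
        cudaDeviceSynchronize();
    }
}
```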
I’m using CUDA 2.3 on a 295GTX in a Core i7 box, running Linux 2.6.30 (scheduling granularity 10ms). Except for my test program, the system is pretty much idle, so I doubt heavy context switching is the cause of the problem.
Also, I’m launching all (50 or so) kernel calls asynchronously in two parallel streams. The actual CPU launch time per batch is just about 0.18ms, while the GPU run time (according to the profiler) is about 60ms, so there should be enough work batched up to keep the GPU busy.
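A rough sketch of what I mean by launching into two streams and timing only the CPU-side launch cost (kernel name, sizes and launch configuration are placeholders, not my actual code):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void workKernel(float *data) { /* ... */ }

int main()
{
    float *d_data = nullptr;
    cudaMalloc(&d_data, 1 << 20);

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 50; ++i)                          // ~50 launches, alternating streams
        workKernel<<<128, 256, 0, streams[i % 2]>>>(d_data);
    auto t1 = std::chrono::high_resolution_clock::now();  // CPU-side launch cost only

    cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now();  // launch cost + GPU execution

    std::chrono::duration<double, std::milli> launch = t1 - t0, total = t2 - t0;
    printf("CPU launch time: %.3f ms, total batch time: %.3f ms\n",
           launch.count(), total.count());

    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    cudaFree(d_data);
    return 0;
}
```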
So, what kind of delays do you see in the profiler between kernels that in theory could be executed just after each other?
I never seem to get a smaller gap than about 50us between kernel calls, with average times more like 80us.
Did you try to measure the time taken for a batch of 50 without profiling and compare that to the time taken with the profiler (gathering statistics between kernel calls, etc.)?
Not yet, but after spending a day trying to find out why some operations in my streams didn’t seem to overlap, only to discover that the profiler actually serializes all streams to get “more correct” timings, I’m starting to be a bit suspicious of anything the profiler shows. Like you say, the gaps between kernel calls may very well be a special “statistics collection” kernel that stores the statistics of the last launch into GPU memory, and they may not be there at all if you don’t run under the profiler.
Has anyone done any benchmarks trying to find out the real GPU-side delay between the execution of batched-up kernel calls (as opposed to the CPU time taken to launch, i.e. queue, a kernel)? I’ve seen some indications on these forums that this delay should be pretty much negligible, but it would be nice to know for sure, since part of my execution time is made up of several short-lived (about 1ms) kernels. Do I need to spend effort trying to merge them into one?
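One way I could imagine measuring it, just a sketch with a placeholder kernel: bracket each queued launch with events and look at the time between the end of one kernel and the start of the next. Recording the events presumably adds some overhead of its own, so this would only give an upper bound on the real gap:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void shortKernel(float *data) { /* ... */ }

int main()
{
    const int N = 10;                 // number of queued launches to examine
    float *d_data = nullptr;
    cudaMalloc(&d_data, 1 << 20);

    cudaEvent_t start[N], stop[N];
    for (int i = 0; i < N; ++i) { cudaEventCreate(&start[i]); cudaEventCreate(&stop[i]); }

    // Queue everything back-to-back in the default stream, no sync in between.
    for (int i = 0; i < N; ++i) {
        cudaEventRecord(start[i]);
        shortKernel<<<128, 256>>>(d_data);
        cudaEventRecord(stop[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 1; i < N; ++i) {
        float gap_ms = 0.0f;
        // GPU time between the end of kernel i-1 and the start of kernel i.
        cudaEventElapsedTime(&gap_ms, stop[i - 1], start[i]);
        printf("gap before kernel %d: %.3f ms\n", i, gap_ms);
    }

    for (int i = 0; i < N; ++i) { cudaEventDestroy(start[i]); cudaEventDestroy(stop[i]); }
    cudaFree(d_data);
    return 0;
}
```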
Hello, I am seeing the same thing. The kernel execution times are around 50us, but the GPU is idle for 120ms between two kernel calls, and I’m not sure how to get past this. Is it the profiling overhead, or is it the OS not prioritizing execution of my application over its own tasks? Can someone give inputs on this?
One possible solution is to make your kernels run for much longer than 50us. Then the intervening gap will be less significant in terms of performance.
And you would want to inspect any activity that is occurring between kernel launches, such as other CUDA library calls. The profiler should give an indication of this.
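One way to make that visible, sketched here with placeholder kernel names (header and linking details depend on your toolkit version; NVTX v3 is header-only), is to wrap the host code between launches in NVTX ranges so the profiler timeline shows what the CPU is doing during the GPU gaps:

```cpp
#include <nvtx3/nvToolsExt.h>
#include <cuda_runtime.h>

__global__ void stageA(float *d) { /* ... */ }
__global__ void stageB(float *d) { /* ... */ }

void runPipeline(float *d_data, cudaStream_t stream)
{
    stageA<<<128, 256, 0, stream>>>(d_data);

    // This range shows up on the CPU row of the Nsight Systems timeline,
    // lined up against the GPU gap between the two kernels.
    nvtxRangePushA("host work between launches");
    // ... host-side bookkeeping, allocations, other library calls ...
    nvtxRangePop();

    stageB<<<128, 256, 0, stream>>>(d_data);
}
```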
In my profiling I am including the cudaMemcpy calls etc. Even between the last cudaMemcpy to the host and the next kernel call (I am looking at GPU utilization) there is literally 100ms of GPU idle time, while the CPU switches the thread from CPU 8 to CPU 10.
Can this be causing the delay? Would it be any different if I ran this on a dedicated target such as Xavier instead of a GPU laptop?
I’m not sure what would be different on a Jetson device. Your code might not be much, or any, different. I don’t know that it is written anywhere that “if you use a Jetson device, it’s guaranteed that there won’t ever be any gaps in your timelines.”
If your GPU on Ubuntu 20.04 is not configured as an X display, then it is not being shared in any way; it is dedicated to your program. (I’m assuming here you’re not running on a server with many other users also using the GPU in a free-for-all.)
Maybe your cudaMemcpy is taking 100ms. A profiler can answer many of these questions.
Well, I am profiling it via Nsight Systems. The cudaMemcpy finishes in 1ms, and I still see the GPU being idle. So let me see if I can suspend the display, try to run the application, and replay the profile. Thanks.
100ms is half an eternity on modern hardware. I have a hard time coming up with a plausible hypothesis what might trigger such delays (assuming they have been measured correctly and it is not 100 microseconds instead of 100 milliseconds).
From reading along it seems that explicit data copies have been eliminated as the potential reason for the 100ms delay. Are there possibly additional implicit data copies due to the use of managed memory?
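If managed memory is in play, an explicit prefetch before the kernel can help distinguish implicit on-demand migration from a genuine gap. A minimal sketch, assuming device 0 and an arbitrary buffer size (the kernel is a placeholder):

```cpp
#include <cuda_runtime.h>

__global__ void consume(float *data, size_t n) { /* ... */ }

void prefetchExample(size_t n, cudaStream_t stream)
{
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));

    // ... fill 'data' on the host ...

    // Migrate the pages to the GPU up front so the kernel does not stall
    // on page faults / implicit copies while it runs.
    cudaMemPrefetchAsync(data, n * sizeof(float), 0 /* device id */, stream);
    consume<<<256, 256, 0, stream>>>(data, n);

    // Optionally bring the pages back before reading results on the host.
    cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId, stream);
    cudaStreamSynchronize(stream);

    cudaFree(data);
}
```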
Gaps in GPU activity could be caused by the CPU not sending work in a timely fashion. Two things you could try as an experiment is raising the CPU frequency (not sure whether there are any mechanisms for that on Xavier) or raising the process priority of your application.
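On Linux, raising the priority of the launching process could look something like the sketch below (a negative nice value usually requires root or CAP_SYS_NICE, and -10 is just an arbitrary example):

```cpp
#include <cstdio>
#include <cstring>
#include <cerrno>
#include <sys/resource.h>

int main()
{
    // Ask for a higher scheduling priority for this process.
    if (setpriority(PRIO_PROCESS, 0 /* this process */, -10) != 0)
        fprintf(stderr, "setpriority failed: %s\n", strerror(errno));

    // ... launch the CUDA work from here ...
    return 0;
}
```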
Does the host-side code for this application include any locking / synchronization? I have a hard time imagining what might cause host code to spin on a lock for 100ms, though. Maybe something I/O related. Is the app using so much system memory that it is swapping to mass storage?
Generally speaking, when dealing with NVIDIA’s embedded platforms, it is best to ask about issues in the sub-forum dedicated to each embedded platform, as you are likely to receive faster and more plentiful answers there.