I’ve attached part of a CUDA profiler time width plot from my application. The plot is taken after a number of iterations in the main loop, so the GPU should be “warmed up”, all kernel code downloaded etc. The kernel launches shown have all been batched together in a stream, without any interleaved H2D memcpys or synchronize() calls, so they should all have been buffered in the stream FIFO.
Trying to optimize my application, I’m worried about all the idle “gaps” between the kernel calls in the attached time width plot. If you measure directly in the plot image, the gaps are about 100us, while the kernel calls are on the order of 1000us (assuming that GPU time is given in us).
I would imagine that NVIDIA has some very well optimized driver technology implementing these command (kernel launch) FIFOs very efficiently, so I’m a bit puzzled by the big gaps between my kernels.
What are your experiences? Is 100us a normal delay between kernel launches? Is the visual profiler output even accurate enough to do measurements at this time scale, or could the gaps just be a profiler phenomenon that I don’t have to worry about?
Or is my problem size (kernel runtime) just too small?
Do you know what the process scheduling granularity is for your operating system? I wonder if another process is occasionally taking a timeslice from your program (or the driver queue).
I haven’t yet needed to mine the last millisecond out of my code, but I must say that those profiles don’t look that different from what I would expect. I guess the big issue here is that you are really relying on user space process priorities and kernel “preemptibility” (is there such a word?) to get your code into the driver job queue. Seibert has a good point about scheduling granularity. If you have the ability to increase your process priority, that is what I would be looking at first.
I had understood that there was some sort of queuing system on the GPU side that allowed you to queue kernel launches. I don’t know if I read that somewhere, or if it’s just a figment of my imagination.
Are you syncing after each kernel launch - to check for errors, for example? That would break queuing…
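Something along these lines, just a sketch with made-up kernel names and launch configurations, to show the difference between checking for launch errors without synchronizing and forcing a sync after every launch:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* ... */ }

// Keeps the kernels queued back-to-back in the stream: cudaGetLastError()
// only checks that the launch itself was accepted and does NOT synchronize.
void launchBatched(float *d_data, cudaStream_t stream, int n)
{
    for (int i = 0; i < n; ++i) {
        myKernel<<<128, 256, 0, stream>>>(d_data);
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            printf("launch %d failed: %s\n", i, cudaGetErrorString(err));
    }
    cudaStreamSynchronize(stream);   // one sync for the whole batch
}

// Breaks the queuing: the CPU waits for every kernel to finish,
// so each launch pays the full submission latency again.
void launchSerialized(float *d_data, cudaStream_t stream, int n)
{
    for (int i = 0; i < n; ++i) {
        myKernel<<<128, 256, 0, stream>>>(d_data);
        cudaDeviceSynchronize();
    }
}
```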
I’m using CUDA 2.3 on a 295GTX in a Core i7 box, running Linux 2.6.30 (scheduling granularity 10ms). Except for my test program, the system is pretty much idle, so I doubt heavy context switching is the cause of the problem.
Also, I’m launching all (50 or so) kernel calls asynchronously in two parallel streams. The actual CPU launch time per batch is just about 0.18ms, while the GPU run time (according to the profiler) is about 60ms, so there should be enough work batched up to keep the GPU busy.
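A rough sketch of what I mean by launching into two streams and timing only the CPU-side launch cost (kernel name, sizes and launch configuration are placeholders, not my actual code):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void workKernel(float *data) { /* ... */ }

int main()
{
    float *d_data = nullptr;
    cudaMalloc(&d_data, 1 << 20);

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 50; ++i)                          // ~50 launches, alternating streams
        workKernel<<<128, 256, 0, streams[i % 2]>>>(d_data);
    auto t1 = std::chrono::high_resolution_clock::now();  // CPU-side launch cost only

    cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now();  // launch cost + GPU execution

    std::chrono::duration<double, std::milli> launch = t1 - t0, total = t2 - t0;
    printf("CPU launch time: %.3f ms, total batch time: %.3f ms\n",
           launch.count(), total.count());

    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    cudaFree(d_data);
    return 0;
}
```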
So, what kind of delays do you see in the profiler between kernels that in theory could be executed just after each other?
I never seem to get a smaller gap than about 50us between kernel calls, with average times more like 80us.
Did you try to measure the time taken for a batch of 50 without profiling and compare that to the time taken with the profiler (gathering statistics between kernel calls, etc.)?
Not yet, but after spending a day trying to find out why some operations in my streams didn’t seem to overlap, only to discover that the profiler actually serializes all streams to get “more correct” timings, I’m starting to be a bit suspicious of anything the profiler shows. Like you say, the gaps between kernel calls may very well be a special “statistics collection” kernel that stores the statistics of the last launch into GPU memory, and they may not be there at all if you don’t run under the profiler.
Has anyone done any benchmarks trying to find out the real GPU-side delay between the execution of batched-up kernel calls (as opposed to the CPU time taken to launch, i.e. queue, a kernel)? I’ve seen some indications on these forums that this delay should be pretty much negligible, but it would be nice to know for sure, since part of my execution time is made up of several short-lived (about 1ms) kernels. Do I need to spend effort trying to merge them into one?
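One way I could imagine measuring it, just a sketch with a placeholder kernel: bracket each queued launch with events and look at the time between the end of one kernel and the start of the next. Recording the events presumably adds some overhead of its own, so this would only give an upper bound on the real gap:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void shortKernel(float *data) { /* ... */ }

int main()
{
    const int N = 10;                 // number of queued launches to examine
    float *d_data = nullptr;
    cudaMalloc(&d_data, 1 << 20);

    cudaEvent_t start[N], stop[N];
    for (int i = 0; i < N; ++i) { cudaEventCreate(&start[i]); cudaEventCreate(&stop[i]); }

    // Queue everything back-to-back in the default stream, no sync in between.
    for (int i = 0; i < N; ++i) {
        cudaEventRecord(start[i]);
        shortKernel<<<128, 256>>>(d_data);
        cudaEventRecord(stop[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 1; i < N; ++i) {
        float gap_ms = 0.0f;
        // GPU time between the end of kernel i-1 and the start of kernel i.
        cudaEventElapsedTime(&gap_ms, stop[i - 1], start[i]);
        printf("gap before kernel %d: %.3f ms\n", i, gap_ms);
    }

    for (int i = 0; i < N; ++i) { cudaEventDestroy(start[i]); cudaEventDestroy(stop[i]); }
    cudaFree(d_data);
    return 0;
}
```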
Hello, I am seeing the same thing. The kernel execution times are around 50us, but the GPU is idle for 120ms between two kernel calls, and I’m not sure how to get past this. Is it the profiling overhead, or is it the OS not prioritizing execution of my application over its own tasks? Can someone give inputs on this?
One possible solution is to make your kernels run for much longer than 50us. Then the intervening gap will be less significant in terms of performance.
And you would want to inspect any activity that is occurring between kernel launches, such as other CUDA library calls. The profiler should give an indication of this.
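One way to make that visible, sketched here with placeholder kernel names (header and linking details depend on your toolkit version; NVTX v3 is header-only), is to wrap the host code between launches in NVTX ranges so the profiler timeline shows what the CPU is doing during the GPU gaps:

```cpp
#include <nvtx3/nvToolsExt.h>
#include <cuda_runtime.h>

__global__ void stageA(float *d) { /* ... */ }
__global__ void stageB(float *d) { /* ... */ }

void runPipeline(float *d_data, cudaStream_t stream)
{
    stageA<<<128, 256, 0, stream>>>(d_data);

    // This range shows up on the CPU row of the Nsight Systems timeline,
    // lined up against the GPU gap between the two kernels.
    nvtxRangePushA("host work between launches");
    // ... host-side bookkeeping, allocations, other library calls ...
    nvtxRangePop();

    stageB<<<128, 256, 0, stream>>>(d_data);
}
```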
In my profiling I am including the cudaMemcpy calls etc. Even between the last cudaMemcpy to the host and the next kernel call (I am looking at GPU utilization) there is literally 100ms of GPU idle time, while the CPU switches the thread from CPU 8 to CPU 10.
Can this be causing the delay? Would it be any different if I ran this on a dedicated target such as Xavier instead of a GPU laptop?
I’m not sure what would be different on a Jetson device. Your code might not be much, or any, different. I don’t know that it is written anywhere that “if you use a Jetson device, it’s guaranteed that there won’t ever be any gaps in your timelines.”
If your GPU on Ubuntu 20.04 is not configured as an X display, then it is not being shared in any way; it is dedicated to your program. (I’m assuming here you’re not running on a server with many other users also using the GPU in a free-for-all.)
Maybe your cudaMemcpy is taking 100ms. A profiler can answer many of these questions.
Well, I am profiling it via Nsight Systems. The cudaMemcpy finishes in 1ms, and I still see the GPU being idle. So let me see if I can suspend the display, try to run the application, and replay the profile. Thanks.
100ms is half an eternity on modern hardware. I have a hard time coming up with a plausible hypothesis what might trigger such delays (assuming they have been measured correctly and it is not 100 microseconds instead of 100 milliseconds).
From reading along it seems that explicit data copies have been eliminated as the potential reason for the 100ms delay. Are there possibly additional implicit data copies due to the use of managed memory?
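If managed memory is in play, an explicit prefetch before the kernel can help distinguish implicit on-demand migration from a genuine gap. A minimal sketch, assuming device 0 and an arbitrary buffer size (the kernel is a placeholder):

```cpp
#include <cuda_runtime.h>

__global__ void consume(float *data, size_t n) { /* ... */ }

void prefetchExample(size_t n, cudaStream_t stream)
{
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));

    // ... fill 'data' on the host ...

    // Migrate the pages to the GPU up front so the kernel does not stall
    // on page faults / implicit copies while it runs.
    cudaMemPrefetchAsync(data, n * sizeof(float), 0 /* device id */, stream);
    consume<<<256, 256, 0, stream>>>(data, n);

    // Optionally bring the pages back before reading results on the host.
    cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId, stream);
    cudaStreamSynchronize(stream);

    cudaFree(data);
}
```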
Gaps in GPU activity could be caused by the CPU not sending work in a timely fashion. Two things you could try as an experiment is raising the CPU frequency (not sure whether there are any mechanisms for that on Xavier) or raising the process priority of your application.
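On Linux, raising the priority of the launching process could look something like the sketch below (a negative nice value usually requires root or CAP_SYS_NICE, and -10 is just an arbitrary example):

```cpp
#include <cstdio>
#include <cstring>
#include <cerrno>
#include <sys/resource.h>

int main()
{
    // Ask for a higher scheduling priority for this process.
    if (setpriority(PRIO_PROCESS, 0 /* this process */, -10) != 0)
        fprintf(stderr, "setpriority failed: %s\n", strerror(errno));

    // ... launch the CUDA work from here ...
    return 0;
}
```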
Does the host-side code for this application include any locking / synchronization? I have a hard time imagining what might cause host code to spin on a lock for 100ms, though. Maybe something I/O related. Is the app using so much system memory that it is swapping to mass storage?
Generally speaking, when dealing with NVIDIA’s embedded platforms, it is best to ask about issues in the sub-forum dedicated to each embedded platform, as you are likely to receive faster and more plentiful answers there.