Gap between some thread calls


I am using a Geforce770GTX and Cuda 6.0. I have been experiencing some performance issues and I would like to know if anyone knows what might be happening. When analyzing the performance (Visual Profiler) of some kernels calls I find that there is a gap repeating over time between them.

A couple of kernels are called one after the other in a for loop and a “big gap” (about 0.25ms) is happening time to time, for instance after 20 kernel executions. However, this issue does not happen with other kernels. I have been experiencing this issue with kernels that are quite fast (about 20us). Any guess?

Here is a screenshot that shows the issue

Thank you!

To my untrained eye it would seem that the kernel launches are getting queued by the driver, and then sent off all at once. there is one cudaLaunch API call inbetween that seems to take a long time.

Are your kernels taking large amounts of texture references, or big kernel call parameter sets? These take some time to initialize and transfer.

Are you on Linux or Windows? can’t quite tell that from the screenshot you posted.

Hi cbuchner1, thanks for your reply

I was guessing something similar, it looks like the cudaLaunch API is sent in batches or it is too slow for calling the kernels. Each cudaLaunch block appears to have a duration of 25us, which is actually larger than the kernel itself, 20us. I dont know if it is a Visual Profiler issue or it makes sense.

Actually, if the kernel is faster, the gaps are also wider:

The kernels are quite simple, no textures are used, and the parameters are just pointers and few integers. Here is the header:

__global__ void calculateRightOcclusionKernel(const unsigned char* disp, unsigned char* occ_map, int ncols, int nrows, int num_disp, int colStepOcc)

I am working on Windows 8.1 with the GeForce 320.52 driver

Thank you very much!

OK, just tried with an empty kernel and got the same behaviour.

Maybe I just have to design the kernels increasing their work to reduce the kernel launch overhead. For the experiments I have done, it seems like it worth launching a kernel if it is at least 25us.

Does anyone support this with a theoretical explanation?


It’s very likely a WDDM batching effect that you are observing.

In a WDDM system, the GPU is managed by the OS and the graphics driver must interact with the OS to schedule GPU activity. This presents considerable overhead, at least when compared to the behavior e.g. under linux or under a GPU managed as a TCC device under windows.

To mitigate the overhead, the NVIDIA GPU driver (in WDDM mode) will collect together sequences of activity intended for the GPU, and deliver them in batches, so that the cost of the communication overhead can be amortized over the batch.

I can’t give you detailed heuristics of the mechanism, but if you search around these forums you’ll find various discussions of WDDM batching including comments from Greg@NV who has given some descriptions, and methods to manipulate the behavior.

You may be able to tweak a specific interaction, but I’m doubtful that you can yield a net benefit over a variety of interactions from a busy app.

Thank you txbob, great topic to consider. I didn’t know there were such a difference for CUDA between platforms. I am strongly considering to port the application to Linux… If I find something interesting on this topic I will write you back with the results.

Thank you guys for your help!



Just in case anyone is interested, I have just ported the application to linux Ubuntu 14.04 and this “gap” effect has totally dissapeared. Now kernels are lunched one after the other as they should. In windows my application performed one iteration in 0.48ms and in the same computer, same implementation, in linux obtained 0.35ms. That is quite an important performance boost in my opinion. As txbob noted, it was a WDDM issue.

Just in case anyone is interested in how Visual Profiler looks like now: