Many kernels executed in streams cause driver timeout

Hi All

I’m trying to execute several CUDA kernels one after another using streams. When the number of kernels becomes too high, I get a driver timeout, even though an individual kernel has no problems executing.

My code is very similar to this:

for( n = 0; n < nMax; n++ ){
  cudaStreamCreate( &pStream[n] );
}

for( n = 0; n < nMax; n++ ){
  cudaKernel<<<dimGrid, dimBlock, nSharedMemorySize, pStream[n]>>>( pParam );
}

for( n = 0; n < nMax; n++ ){
  cudaStreamDestroy( pStream[n] );
}


When I set nMax below 10, everything works fine. When I set it to 20, I get a timeout.

My understanding was that as long as each individual kernel finishes in time, all is fine. Is there some default way I should handle this, or should I simply insert “breaks” in my execution to allow for display updates?

Best regards


You are actually launching each kernel in a separate stream, not one after another. So you have no control over the order in which they run.

I have no clue, but maybe there is a maximum number of streams?

Shouldn’t you wait for the streams to finish before destroying them?

I was under the assumption that the cudaStreamDestroy call would wait for the full stream to finish processing before returning to the CPU.

Also, I don’t really care in which order they are processed, as long as they all finish eventually.


well, never assume anything without verification ;)

I think I read it someplace though ;)

Anyway, I inserted a cudaStreamSynchronize call before the destroy command, and I get the same result.
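To see which call actually reports the timeout, it may help to check every return code. The following is only a minimal sketch: dummyKernel, the buffer size, and the ok() helper are hypothetical stand-ins, not the original poster's code. As far as I know, current CUDA documentation describes cudaStreamDestroy as returning immediately when work is still pending, so the explicit cudaStreamSynchronize is the safer way to wait.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel.
__global__ void dummyKernel(float *data, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) data[i] = 2.0f * data[i] + 1.0f;
}

// Report a failed call together with the stream index it belongs to.
static bool ok(cudaError_t err, const char *what, int n) {
    if (err == cudaSuccess) return true;
    fprintf(stderr, "%s (n = %d): %s\n", what, n, cudaGetErrorString(err));
    return false;
}

int main() {
    const int nMax = 20;        // number of streams, as in the question
    const int count = 1 << 16;  // elements per stream (arbitrary choice)
    cudaStream_t pStream[nMax];
    float *dData = NULL;

    if (!ok(cudaMalloc((void **)&dData, nMax * count * sizeof(float)), "cudaMalloc", -1)) return 1;
    if (!ok(cudaMemset(dData, 0, nMax * count * sizeof(float)), "cudaMemset", -1)) return 1;

    for (int n = 0; n < nMax; n++)
        if (!ok(cudaStreamCreate(&pStream[n]), "cudaStreamCreate", n)) return 1;

    // Each stream works on its own slice so the kernels do not race.
    for (int n = 0; n < nMax; n++) {
        dummyKernel<<<(count + 255) / 256, 256, 0, pStream[n]>>>(dData + n * count, count);
        if (!ok(cudaGetLastError(), "kernel launch", n)) return 1;
    }

    // Wait for each stream explicitly; a watchdog reset would surface here.
    for (int n = 0; n < nMax; n++)
        if (!ok(cudaStreamSynchronize(pStream[n]), "cudaStreamSynchronize", n)) return 1;

    for (int n = 0; n < nMax; n++)
        if (!ok(cudaStreamDestroy(pStream[n]), "cudaStreamDestroy", n)) return 1;

    if (!ok(cudaFree(dData), "cudaFree", -1)) return 1;
    printf("all %d streams finished\n", nMax);
    return 0;
}
```

If the watchdog fires, the first failing call and its stream index are printed, which narrows down where the driver gave up.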


Are you using a Fermi-class GPU? Is a display connected?
You might run into an ordinary timeout because a kernel starts with fewer SMs available than it would if it ran on its own.

Hi Tera

I’m using a GTX 480, and yes, the display is connected.

Each kernel runs for well under a second, closer to 0.1 s. So an individual kernel should not time out, unless it is somehow hindered by the multitude of streams?

My understanding of how streams work is still limited, but I would guess that each stream waits its turn? Or do they all compete to run simultaneously?


20*0.1s ≈ 2s, so that could explain why the watchdog triggers with 20 streams, but not with 10. However that would require a kernel to somehow get “stuck”.

Don’t know how that could happen, unless Nvidia has implemented either the scheduler or the watchdog timer in a weird way.

It could be explained however if the GUI is not treated like another stream with kernel invocations, but requires the whole GPU for itself. Because all the individual kernels overlap, the GUI would not get access in between and the watchdog triggers.

In that case you would have to insert a cudaThreadSynchronize() before you reach one timeout’s worth of kernels.

EDIT: Checking the documentation, it should be enough to call cudaStreamQuery() on a stream. That would stop new kernels from launching until all previously scheduled kernels are finished, but it would not hog the CPU.
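A rough sketch of the cudaStreamQuery() idea, for anyone who wants to try it. The call itself returns immediately with either cudaSuccess or cudaErrorNotReady; whether it also gets the queued work submitted early enough to keep the watchdog quiet depends on the driver, so treat that part as an assumption to test. dummyKernel and the sizes are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel.
__global__ void dummyKernel(float *data, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) data[i] = 2.0f * data[i] + 1.0f;
}

int main() {
    const int nMax = 20;        // number of streams, as in the question
    const int count = 1 << 16;  // elements per stream (arbitrary choice)
    cudaStream_t pStream[nMax];
    float *dData = NULL;

    if (cudaMalloc((void **)&dData, nMax * count * sizeof(float)) != cudaSuccess) return 1;
    for (int n = 0; n < nMax; n++)
        if (cudaStreamCreate(&pStream[n]) != cudaSuccess) return 1;

    for (int n = 0; n < nMax; n++) {
        dummyKernel<<<(count + 255) / 256, 256, 0, pStream[n]>>>(dData + n * count, count);

        // Non-blocking poll. "Not ready" is an expected answer, not an error.
        cudaError_t q = cudaStreamQuery(pStream[n]);
        if (q != cudaSuccess && q != cudaErrorNotReady) {
            fprintf(stderr, "stream %d: %s\n", n, cudaGetErrorString(q));
            return 1;
        }
    }

    // Block once at the end so results are complete before cleanup.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "sync: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int n = 0; n < nMax; n++) cudaStreamDestroy(pStream[n]);
    cudaFree(dData);
    printf("done\n");
    return 0;
}
```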

This would be an odd watchdog implementation. From what I gather, it is possible to make long calculations by splitting up the calls into multiple kernels. What I’m trying to do is in my mind the same, having 2 seconds of calculations spread over several kernel launches.


Changing the driver from version 270.32 to version 263.06 does not help.


You will see the same behavior even when launching kernels in the same stream, one after another. All kernels are put into a queue and control is not returned until the last kernel is finished. So just splitting the code into multiple kernels is not enough, you have to stick calls to cudaThreadSynchronize() in between.
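For illustration, here is a minimal sketch of synchronizing every few launches. It uses cudaDeviceSynchronize(), which replaced the deprecated cudaThreadSynchronize() from CUDA 4.0 onward. The kernel, the launch count, and the syncEvery value are assumptions; the right interval depends on how long the real kernels run relative to the watchdog limit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel.
__global__ void dummyKernel(float *data, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) data[i] = 2.0f * data[i] + 1.0f;
}

int main() {
    const int nKernels = 20;    // total launches, as in the question
    const int syncEvery = 4;    // assumed interval; tune to stay under the timeout
    const int count = 1 << 16;  // elements per launch (arbitrary choice)
    float *dData = NULL;

    if (cudaMalloc((void **)&dData, count * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMemset(dData, 0, count * sizeof(float)) != cudaSuccess) return 1;

    for (int n = 0; n < nKernels; n++) {
        // Same (default) stream, so the launches run one after another.
        dummyKernel<<<(count + 255) / 256, 256>>>(dData, count);

        // Drain the queue after every syncEvery launches and after the last one.
        if ((n + 1) % syncEvery == 0 || n == nKernels - 1) {
            cudaError_t err = cudaDeviceSynchronize();
            if (err != cudaSuccess) {
                fprintf(stderr, "sync after launch %d: %s\n", n, cudaGetErrorString(err));
                return 1;
            }
        }
    }

    cudaFree(dData);
    printf("all %d kernels finished\n", nKernels);
    return 0;
}
```

The cost is that the CPU blocks at each synchronization point, so a larger syncEvery trades display responsiveness for less host-side stalling.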

Is there any way to set up the CUDA environment so that the GUI always has access to, e.g., 3 SMs, and then have the remaining (in my case 12) SMs process my program?


Not that I know of. But you can disable the timeout through the registry, as described here

This would be wonderful, but CUDA can’t split the device between two processes. I’m certain that this ability has to be in the development pipeline, although I worry it might require new hardware to achieve this. (Much as Fermi-class hardware was required for multiple kernels in the same process to co-execute.)

Hi All

I think that the timeout error occurs because you can only execute up to 16 kernels at the same time. If you try to execute 20 kernels, it returns the error.

No, the driver will happily queue hundreds or even thousands of kernel launches in a stream without any sort of error. The underlying cause is probably that the original poster was using Windows with the WDDM driver model, and more recent toolkits and drivers have switched to a batched execution model, so that all the work in a given stream becomes a single batch which gets fused into a single execution action on the GPU without any yielding to the display manager. This can trigger the driver watchdog timer and result in the batched commands getting killed by the display manager, even though none of the kernels themselves run anywhere near long enough to trigger the watchdog timer individually.