Will cudaThreadSynchronize() truly break up kernel launches to avoid the WDDM timeout?

My total kernel time is far in excess of the WDDM timeout. Disabling the timeout through a registry change is not an option, as I cannot ask my client to mess with his registry. So I must break my single large task into multiple small tasks, each guaranteed to finish under the timeout limit. But I saw in another post that the scheduler might submit multiple successive kernel launches all at once, which would subvert my solution! If I call cudaThreadSynchronize() between launches, will that solve the problem? If not, what do I need to do? Thanks!

Since cuda{Thread,Device,Event}Synchronize and cudaEventQuery require CPU/GPU synchronization, the current behavior on WDDM is to flush the CUDA driver software queue. cudaEventQuery(0) is the primary method used to flush the software queue, as this operation does not require a full CPU/GPU synchronization.

The software queue and operations on the software queue are not documented as part of the CUDA runtime. Performing a full CPU/GPU synchronization using cuda{Thread,Device}Synchronize will always break up work, but has higher overhead.
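To make the two options above concrete, here is a minimal sketch of the non-blocking flush pattern. The kernel name, launch configuration, and chunk count are placeholders; the key point is the cudaEventQuery(0) call after each launch, which (as an undocumented side effect on WDDM) pushes any queued launches out to the GPU without stalling the CPU.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: one bounded-duration chunk of the larger task.
__global__ void workChunk(int chunk) { /* ... per-chunk work ... */ }

void runChunked(int numChunks)
{
    for (int i = 0; i < numChunks; ++i) {
        workChunk<<<128, 256>>>(i);
        // Querying the NULL event does not block the CPU, but forces the
        // WDDM driver to flush its software queue so this launch is
        // actually submitted to the GPU rather than batched with the next.
        (void)cudaEventQuery((cudaEvent_t)0);
    }
    cudaDeviceSynchronize();  // one blocking wait at the very end
}
```

The alternative is to replace cudaEventQuery(0) with cudaDeviceSynchronize() inside the loop: that guarantees separation of launches at the cost of a full CPU/GPU round trip per iteration.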

Greg - Thanks! Just to confirm I’ve got this right before I go to all the work of breaking up my large kernel, the recommended approach is this:

for iterations until task is done {
    launch kernel
}
As long as each ‘launch kernel’ is under the WDDM timeout, it should be fine, even though the total time for this loop may be very long. Correct? Thanks!
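For what it's worth, the loop above with a blocking sync between launches might look like this. The kernel, grid dimensions, and chunk count are illustrative only; what matters is that each chunk runs well under the WDDM timeout (about 2 seconds by default) and that the synchronize call prevents two launches from being batched into one long command buffer.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder: each invocation does one chunk of the total task.
__global__ void doWorkChunk(int chunk, int totalChunks) { /* ... */ }

int main()
{
    const int totalChunks = 1000;  // illustrative value
    for (int chunk = 0; chunk < totalChunks; ++chunk) {
        doWorkChunk<<<128, 256>>>(chunk, totalChunks);
        // Blocking sync: this launch is submitted and fully finished
        // before the next is queued, so each WDDM command buffer stays
        // under the timeout even though the whole loop runs for minutes.
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            fprintf(stderr, "chunk %d failed: %s\n",
                    chunk, cudaGetErrorString(err));
            return 1;
        }
    }
    return 0;
}
```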


Correct. Nsight Visual Studio Edition 3.0 supports two different trace features that can be helpful in understanding the CUDA driver software queue and WDDM KMD driver.

1. Run Visual Studio
2. Execute Nsight | Start Performance Analysis…
3. Set Activity Type to Trace Application
4. In Trace Settings, enable the System provider and enable the sub-option WDDM Base Events
5. In Trace Settings, enable the CUDA provider and enable the sub-option Driver Queue Latency

When you enable WDDM Base Events a new set of timeline rows will be displayed under the System Row. This will show the WDDM hardware queue, command buffer execution, and memory paging.

When you enable CUDA Driver Queue Latency, a new row will appear under the Processes\Process\CUDA\Context\Counters row showing the depth of the CUDA software queue and the hardware queue. You can use this depth graph and the CUDA API call row to make sure that you are submitting work to the GPU after each launch in the for loop you have shown.

Greg - Thank you!!! This was an extremely useful response. I had no idea one could do this.


I’m new to CUDA programming and I have the same problem. My for loop works slowly but correctly with cudaThreadSynchronize(), but times out with cudaEventQuery(0).
Can you tell me what I am doing wrong?

(I use a notebook with NVS4200M and CUDA Toolkit 6.0)


Perhaps post some code, and perhaps create a new topic altogether.