Have you seen this error message while running a CUDA program under Windows Vista or Windows 7?
This message is telling you that the Timeout Detection and Recovery (TDR) feature of the Windows Vista driver model (WDDM) has been triggered because the CUDA kernel (or batch of kernels*) you were running took longer to complete than the configured timeout period allows. By default, this timeout period is two seconds in Windows Vista and Windows 7.
*This can happen even with very short-running kernels if many such launches are queued up on each other's heels, because in some cases the driver batches several kernel launches and submits them all at once, at which point WDDM requires every launch in the batch to run to completion within a single timeout period.
In Windows XP, there was a similar (though longer) timeout; exceeding it would typically bugcheck (blue screen) the machine, which would appear frozen or hung until the timeout was reached. To make this failure mode more user-friendly, Microsoft reduced the timeout to two seconds starting in Windows Vista and introduced a driver recovery process instead. While this works well for typical interactive graphics applications, it can be problematic for non-graphics (compute) kernels, especially when a kernel that runs in far less than two seconds on a higher-end GPU takes longer on a lower-end GPU.
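You can check at run time whether a given GPU is subject to this kind of run-time limit: the CUDA runtime reports it in the kernelExecTimeoutEnabled field of cudaDeviceProp. A minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }

    // kernelExecTimeoutEnabled is nonzero when the OS imposes a run-time
    // limit on kernels launched on this device (e.g., a WDDM display GPU).
    if (prop.kernelExecTimeoutEnabled) {
        printf("Device %d (%s): kernel execution timeout is ENABLED; "
               "keep individual launches short.\n", device, prop.name);
    } else {
        printf("Device %d (%s): no kernel execution timeout reported.\n",
               device, prop.name);
    }
    return 0;
}
```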
Microsoft has more information about the TDR mechanism and how to configure it on their website at http://www.microsoft.com/whdc/device/displ…dm_timeout.mspx . Note that if you change the registry keys described on that page (e.g., to increase the timeout period or to disable the timeout mechanism entirely), you MUST REBOOT before the registry setting changes take effect.
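For illustration, the values described there include TdrDelay (the timeout in seconds) and TdrLevel, stored under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers. Most people would simply edit them with regedit; the rough Win32 sketch below shows the same idea programmatically (it must run as Administrator, the 10-second value is only an example, and Microsoft's page remains the authoritative reference):

```cpp
#include <windows.h>
#include <cstdio>

// Link against advapi32.lib for the registry APIs.
int main()
{
    HKEY hKey;
    LONG status = RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                                "SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers",
                                0, KEY_SET_VALUE, &hKey);
    if (status != ERROR_SUCCESS) {
        fprintf(stderr, "RegOpenKeyExA failed (run as Administrator?): %ld\n", status);
        return 1;
    }

    // TdrDelay: timeout in seconds before TDR triggers (default is 2).
    DWORD tdrDelaySeconds = 10;  // example value only; choose what your workload needs
    status = RegSetValueExA(hKey, "TdrDelay", 0, REG_DWORD,
                            (const BYTE *)&tdrDelaySeconds, sizeof(tdrDelaySeconds));
    if (status != ERROR_SUCCESS) {
        fprintf(stderr, "RegSetValueExA failed: %ld\n", status);
    }

    RegCloseKey(hKey);
    // Remember: the new value does not take effect until the machine is rebooted.
    return 0;
}
```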
If changing registry keys is not an option for you, you will need to split your kernels into pieces that you can be certain will complete within two seconds even on the lowest-end GPUs your application is expected to run on. The split can differ per GPU, for example by scaling the amount of work per launch with the number of SMs in the GPU, as sketched below.
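As a rough illustration of that idea, the sketch below (the processChunk kernel and the per-SM calibration factor are hypothetical placeholders) processes a large array in chunks whose size scales with the device's SM count, synchronizing after each launch so that no single submission runs too long:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that processes elements [offset, offset + count).
__global__ void processChunk(float *data, int offset, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        data[offset + i] *= 2.0f;   // stand-in for the real per-element work
}

int main()
{
    const int n = 1 << 24;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Scale the chunk size with the number of SMs so a low-end GPU gets
    // proportionally smaller launches; the per-SM factor is a tuning knob
    // you would calibrate so each launch stays well under two seconds.
    const int elementsPerSM = 1 << 18;                // assumed calibration value
    const int chunk = prop.multiProcessorCount * elementsPerSM;

    for (int offset = 0; offset < n; offset += chunk) {
        int count = (n - offset < chunk) ? (n - offset) : chunk;
        int threads = 256;
        int blocks = (count + threads - 1) / threads;
        processChunk<<<blocks, threads>>>(d_data, offset, count);

        // Synchronizing between launches also keeps the driver from batching
        // many short launches into a single WDDM submission.
        cudaDeviceSynchronize();
    }

    cudaFree(d_data);
    return 0;
}
```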
[This issue is also mentioned in the Known Issues section of the CUDA Toolkit’s release notes.]