I recently made a C++ library (http://sourceforge.net/projects/gxlibrary/) for Visual Studio 2012+ that lets a programmer easily write long-running simulations or computations that run on any available GPU (CUDA for NVIDIA, C++ AMP for others) and on the CPU.
It lets you write a kernel function only once and then compiles that same kernel for CUDA, AMP, and the CPU. It can even run the same kernel on all GPUs of remote PCs by adding only two lines of code.
The library also insulates the programmer from needing to know CUDA (or AMP, or multithreading) at all. Among other things, it automatically distributes work among all GPUs, taking care that no single GPU call exceeds 0.5 sec, to avoid TDRs (Windows Timeout Detection and Recovery resets).
The reason I am asking the question in the title is that, even though gxLibrary measures kernel speeds and gives each call only about 0.5 sec of work, it can still happen (rarely) that a GPU hits a TDR, usually when some other application is using the display intensively.
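To make the "measure speed, then submit about 0.5 sec of work" idea concrete, here is a minimal sketch of how such sizing could look. This is not gxLibrary's actual code: `ChunkSizer`, its fields, and the 0.75 headroom factor are hypothetical names and values chosen only for illustration.

```cpp
#include <cstddef>

// Hypothetical helper (illustration only, not part of gxLibrary).
// Picks how many work items to submit in the next kernel call so that the
// call should finish within a time budget, based on the throughput that was
// measured on the previous call.
struct ChunkSizer {
    double budget_sec = 0.5;            // target upper bound for one GPU call
    double safety = 0.75;               // assumed headroom factor (tunable)
    std::size_t min_items = 1;          // never submit an empty call
    std::size_t max_items = 1u << 30;   // arbitrary cap for the sketch

    std::size_t next(std::size_t last_items, double last_elapsed_sec) const {
        // No usable measurement yet: start with the smallest chunk.
        if (last_items == 0 || last_elapsed_sec <= 0.0) return min_items;

        double items_per_sec = static_cast<double>(last_items) / last_elapsed_sec;
        double target = items_per_sec * budget_sec * safety;

        // Clamp before converting back to an integer count.
        if (target < static_cast<double>(min_items)) return min_items;
        if (target > static_cast<double>(max_items)) return max_items;
        return static_cast<std::size_t>(target);
    }
};
```

A sizing rule like this only uses past measurements, so it cannot see sudden load from another application on the same GPU, which is the situation where the occasional TDR still slips through.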
I have already made gxLibrary resilient enough to survive such a TDR crash: it saves the work that GPU completed before the crash, and the remaining work is finished by the other GPUs. It even reactivates the failed GPU if it was running in AMP mode.
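The save-and-redistribute behaviour can be pictured roughly as a shared queue of work ranges, where a failed device hands back the part of its chunk it did not finish. This is only a sketch under my own assumptions: `Range`, `WorkQueue`, `take` and `give_back` are hypothetical names, not gxLibrary's API.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <mutex>

// Half-open range [begin, end) of work items.
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical shared work queue (illustration only, not gxLibrary's API).
class WorkQueue {
public:
    explicit WorkQueue(std::size_t total) {
        if (total > 0) pending_.push_back(Range{0, total});
    }

    // Hands out up to max_items items. Returns false when nothing is left.
    bool take(std::size_t max_items, Range& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (pending_.empty() || max_items == 0) return false;
        Range& front = pending_.front();
        std::size_t count = std::min(max_items, front.end - front.begin);
        out = Range{front.begin, front.begin + count};
        front.begin = out.end;
        if (front.begin == front.end) pending_.pop_front();
        return true;
    }

    // Called when a device fails in the middle of `chunk`. Items before
    // `completed_upto` were already saved; the rest goes back to the front
    // of the queue so that another device can pick it up.
    void give_back(Range chunk, std::size_t completed_upto) {
        std::lock_guard<std::mutex> lock(m_);
        std::size_t resume = std::max(chunk.begin, std::min(completed_upto, chunk.end));
        if (resume < chunk.end) pending_.push_front(Range{resume, chunk.end});
    }

private:
    std::mutex m_;
    std::deque<Range> pending_;
};
```

With a structure like this, the per-GPU threads that are still healthy keep calling `take` and finish the returned range without knowing which device failed.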
But I would like to be able to reactivate a CUDA GPU as well if it times out in the middle of a long-running calculation or simulation. I tried reallocating all needed resources on the failed CUDA GPU, both with and without calling cudaDeviceReset() first, but in either case I got error 46 (“all CUDA-capable devices are busy or unavailable”).
I tested the same approach from a completely new thread, and in that case it works. I already have a separate thread for each GPU (and for the host side too), BUT … creating a new thread is out of the question for many reasons (connected remote GPUs, resources allocated for the host side, …).
So I wonder: is CUDA really unable to recover from a TDR within the same thread? So far this is one of the rare situations where Microsoft C++ AMP shows a clear advantage over CUDA (since it recovers in the same thread without issue).