I am wondering if there is a way to implement a timeout on a single CUDA call, such that if it has not returned after n seconds it can be ‘stopped’ (forced to return or throw), so that the device can be reset and used for another task.
Obviously, just calling cudaDeviceReset() on the main thread after n seconds is a bad idea, since it would pull resources (e.g. allocated memory) out from under a thread that is still using them in CUDA calls, leading to a memory fault and a probable crash.
I can implement a solution where I keep track of time on a particular task between CUDA calls. My question is: is it possible to time out a single call that is taking too long to return, without killing the whole process?
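A minimal sketch of that between-calls bookkeeping, assuming the work was launched asynchronously and can be polled without blocking. `wait_with_deadline`, `WaitResult`, and the `is_done` callable are hypothetical names used for illustration, not part of CUDA. With the CUDA runtime, `is_done` could wrap `cudaStreamQuery`, which returns `cudaErrorNotReady` while work on the stream is still pending.

```cpp
#include <chrono>
#include <thread>

// Outcome of waiting on asynchronous work with a host-side deadline.
enum class WaitResult { Completed, TimedOut };

// Polls `is_done` until it returns true or `timeout` has elapsed.
// `is_done` is a caller-supplied (hypothetical) callable. With the CUDA
// runtime it could be something like:
//   [&] { return cudaStreamQuery(stream) != cudaErrorNotReady; }
// in which case the caller should still inspect the returned error code.
template <typename Poll>
WaitResult wait_with_deadline(
    Poll is_done,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds poll_interval = std::chrono::milliseconds(1)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        if (is_done()) {
            return WaitResult::Completed;
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return WaitResult::TimedOut;
        }
        // Sleep briefly so the polling loop does not spin a CPU core.
        std::this_thread::sleep_for(poll_interval);
    }
}
```

Hitting the deadline here only means the host has stopped waiting. The device work itself is not cancelled, which is the part the question is really asking about.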
Not possible.
I’m not sure why cudaDeviceReset is not an option, but so be it. If your intent is to enable the device to be reset and used for another task, that is exactly what cudaDeviceReset does.
OK, thanks.
My thought was that doing a device reset while another thread was running on the device would/could cause the process to crash. That seemed to be the case when I tried it.
I’m not sure what sort of activity would cause the CPU process to crash just because the device context becomes invalid. However, I would think that if you are doing proper CUDA error checking at all times, the sudden loss of the device context would be quickly discovered by any thread, before bad things happen (at the moment I can’t think of what those bad things would be).
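As a rough illustration of that kind of error checking, here is a sketch in which a worker checks the status of every call and stops at the first failure. `check` and `run_steps` are hypothetical helpers, and each step is modelled as a callable returning an `int` status, standing in for a CUDA runtime call that returns a `cudaError_t` (where `cudaSuccess` is 0). It assumes the failing call actually returns an error code rather than faulting, which this thread does not confirm.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Throws if `code` is non-zero. cudaSuccess is 0 in the CUDA runtime, so the
// same test would apply to a cudaError_t converted to int.
inline void check(int code, const std::string& what) {
    if (code != 0) {
        throw std::runtime_error(what + " failed with error code " +
                                 std::to_string(code));
    }
}

// Runs the steps in order and returns how many completed successfully.
// Stops at the first failing step instead of continuing to use resources
// that may no longer be valid.
inline std::size_t run_steps(const std::vector<std::function<int()>>& steps) {
    std::size_t completed = 0;
    try {
        for (const auto& step : steps) {
            check(step(), "step " + std::to_string(completed));
            ++completed;
        }
    } catch (const std::runtime_error&) {
        // A real worker would log the error and release host-side state here.
    }
    return completed;
}
```

With this pattern, a thread whose context was reset by another thread would stop issuing further calls after the first reported error, rather than carrying on with stale pointers.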
I guess I would need to see a counterexample where that is not sufficient/effective.