I am currently taking my first steps with CUDA, but I have some OpenCL experience.
The program I have written so far uses the CUDA driver API - mostly because this way I can dynamically load CUDA if it is available on the system and fall back to other compute methods otherwise.
I have observed that when my application hits a GPU that is undervolted too aggressively, my kernel sometimes fails with error 700, sometimes 716 - i.e. memory access errors. This is rare enough not to be a problem, but I would like to know how to recover from these errors. I have already learned that when this happens the context becomes unusable - in fact not only the context of the card where the error occurred, but all active contexts. My application is usually run on systems with multiple cards, each with its own runner thread.
Now I observed: when I close my application and restart it immediately, all devices come up fine, so the problem is not so deep that the driver completely crashed. So I tried destroying my thread-local contexts and also called cuDevicePrimaryCtxReset at the end of each card's thread. I got back error code 0, so I assume the calls worked. Still, I was not able to spawn fresh threads with new contexts for my cards - whenever I tried to create a fresh context, I still got error 700 or 716 back, in the sense of the API note: "Note that this function may also return error codes from previous, asynchronous launches."
So my question is: how do I clear this error so I can continue with a fresh kernel, without needing to restart my application completely? As long as I keep my host memory structures, the previously computed results are not lost, and it would be fairly easy to continue with a fresh thread - if I were able to set one up ;)