In CUDA, how to deal with loss of GPU and/or driver state, e.g. due to the system cycling through a suspend/resume?

I’m currently developing an in-house interferometry signal processing and visualization tool for a university. One recurring feature request / “bug” report is that if the system is put into some suspend mode (suspend to RAM, hibernation), the program only displays and computes garbage after resuming.

Now I understand that a system suspend may put the GPU into an undefined state. However, at least as far as OpenGL is concerned, driver-assisted state recovery is indeed possible; OpenGL recovers just fine.

DirectX, of course, supports a device-loss event. However, I’m not aware of a CUDA API that explicitly reports GPU loss events. My best guess would be to test for a “device deinitialized” error flag and perform a full reinitialization when that happens.

Is this the right way to do it, or is there a better way?

(You seem to be talking about Windows (only?) here.)

I think you’re suggesting an “inferential” approach based strictly on CUDA. You might be able to get that to work, but CUDA itself is unaware of any of this, so you’d have to decide to interpret some catastrophic CUDA context-corruption error as an indicator to do some kind of cleanup/restart (I guess).
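Something like the following (untested) could be a starting point. The helper names are made up, and the idea that cudaDeviceReset() followed by a full re-allocation is a sufficient recovery path is an assumption on my part, not a documented suspend/resume mechanism:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Application-specific re-initialization: re-allocate device buffers,
// re-upload data, re-create streams/events. Stubbed here for illustration.
static bool reinit_application_gpu_state() { return true; }

// Cheap probe: a context corrupted across suspend/resume should report a
// "sticky" error on any subsequent runtime call that touches the device.
static bool context_looks_healthy()
{
    return cudaDeviceSynchronize() == cudaSuccess;
}

// Tear everything down and rebuild when the context appears corrupted.
static bool try_recover()
{
    // cudaDeviceReset() destroys the primary context; all device memory,
    // streams, events and module state must be recreated afterwards.
    if (cudaDeviceReset() != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed, cannot recover\n");
        return false;
    }
    return reinit_application_gpu_state();
}
```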

Another possible approach at the application level (outside of CUDA) might be to observe an impending hibernate/suspend directly, before it happens, and take appropriate action. The answer here:

Is Windows entering sleep mode or hibernating with C++? - Stack Overflow

suggests that you can capture a windows message telling you of an impending suspend/hibernate:

“The PBT_APMQUERYSUSPEND message is sent when a suspend or hibernate is about to occur”

At a minimum, after capturing that message, you could set a flag in your application, which should remove any “inferential” character from how you behave.
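For example, something along these lines in the window procedure (untested sketch; note that PBT_APMQUERYSUSPEND is no longer delivered on Vista and later, so this watches WM_POWERBROADCAST with PBT_APMSUSPEND instead, and the flag name is made up):

```cpp
#include <windows.h>
#include <atomic>

// Set before suspend; the CUDA side should treat its context as suspect
// until it has been rebuilt after resume.
std::atomic<bool> g_gpuStateSuspect{false};

LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
    if (msg == WM_POWERBROADCAST) {
        switch (wParam) {
        case PBT_APMSUSPEND:            // system is about to suspend
            g_gpuStateSuspect = true;   // stop issuing CUDA work
            break;
        case PBT_APMRESUMESUSPEND:      // user-triggered resume completed
        case PBT_APMRESUMEAUTOMATIC:    // unattended resume completed
            // Leave the flag set until the CUDA context has been rebuilt.
            break;
        }
        return TRUE;
    }
    return DefWindowProc(hwnd, msg, wParam, lParam);
}
```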

No, actually it’s cross-platform. And as a matter of fact, the majority of the machines it’s been used on run some flavour of Linux. The problem here is that on some machines (cough mine cough) the system suspend mode is activated through an acpid handler that simply does echo mem > /sys/power/state. No, I don’t want to rely on upower or any other of the DBus-powered freedesktop.org madness.

We have a similar problem with our Linux applications. When thawing from hibernation, the GPU memory contains corrupted textures and shaders. The only fix is to restart the application if it can’t reload everything into GPU memory.

Is there even a kernel mechanism to do this transparently, so that no application support is required?

BillRyderAtWetaFX, it seems that you’ve made some other postings in the Linux forum as well, related to suspend/resume observations.

You might want to file bug reports at developer.nvidia.com. It seems you already know about the bug report log tool.

If you can file a bug report for this problem you are having with a detailed set of instructions about how to reproduce it, it may help. If you can let me know the bug report number when you file it, I’ll take a look.

The Programming Guide has always had a vague reference to the CUDA_ERROR_INVALID_CONTEXT error being returned if a mode switch occurs… which I suspect is similar to a GPU suspend/resume?

After a resume occurs, would probing the context with something like a cuCtxGetFlags(…) faithfully return an invalid context error?
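Untested sketch of such a probe (whether the driver actually reports CUDA_ERROR_INVALID_CONTEXT here after a resume is exactly the open question; nothing below is a documented guarantee):

```cpp
#include <cuda.h>
#include <cstdio>

// Probe the current driver-API context after resume and treat
// CUDA_ERROR_INVALID_CONTEXT (or any other failure) as "needs rebuild".
static bool context_still_valid()
{
    unsigned int flags = 0;
    CUresult res = cuCtxGetFlags(&flags);
    if (res == CUDA_ERROR_INVALID_CONTEXT) {
        fprintf(stderr, "context reported invalid after resume\n");
        return false;
    }
    return res == CUDA_SUCCESS;
}
```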

Showing how a CUDA app can cleanly survive a suspend/resume might be a good example for NVIDIA to add to the CUDA Samples.