In CUDA, how to deal with loss of GPU and/or driver state, e.g. due to the system cycling through a suspend/resume?

I’m currently developing an in-house interferometry signal processing and visualization tool for a university. One recurring feature request / “bug” report is that if the system is put into some suspend mode (suspend to RAM, hibernation), the program only displays and computes garbage after resuming.

Now I understand that a system suspend may put the GPU into an undefined state. However, at least as far as OpenGL is concerned, driver-assisted state recovery is indeed possible; OpenGL recovers just fine.

DirectX, of course, supports a device-loss event. However, I’m not aware of a CUDA API that explicitly reports GPU loss events. My best guess would be to test for a “device deinitialized” error flag and perform a full reinitialization when that happens.

Is this the right way to do it, or is there a better way?

(You seem to be talking about Windows (only?) here.)

I think you’re suggesting an “inferential” approach based strictly on CUDA. You might be able to get that to work, but CUDA itself is unaware of any of this, so you’d have to decide to interpret some catastrophic CUDA context-corruption error as an indicator to do some kind of cleanup/restart (I guess).
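Something like the following (untested) could be a starting point. The helper names are made up, and the idea that cudaDeviceReset() followed by a full re-allocation is a sufficient recovery path is an assumption on my part, not a documented suspend/resume mechanism:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Application-specific re-initialization: re-allocate device buffers,
// re-upload data, re-create streams/events. Stubbed here for illustration.
static bool reinit_application_gpu_state() { return true; }

// Cheap probe: a context corrupted across suspend/resume should report a
// "sticky" error on any subsequent runtime call that touches the device.
static bool context_looks_healthy()
{
    return cudaDeviceSynchronize() == cudaSuccess;
}

// Tear everything down and rebuild when the context appears corrupted.
static bool try_recover()
{
    // cudaDeviceReset() destroys the primary context; all device memory,
    // streams, events and module state must be recreated afterwards.
    if (cudaDeviceReset() != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed, cannot recover\n");
        return false;
    }
    return reinit_application_gpu_state();
}
```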

Another possible approach at the application level (outside of CUDA) might be to observe an impending hibernate/suspend directly, before it happens, and take appropriate action. The answer here:

Is Windows entering sleep mode or hibernating with C++? - Stack Overflow

suggests that you can capture a windows message telling you of an impending suspend/hibernate:

“The PBT_APMQUERYSUSPEND message is sent when a suspend or hibernate is about to occur”

At a minimum, after capturing that message, you could set a flag in your application, which should remove any “inferential” character from how you behave.
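For example, something along these lines in the window procedure (untested sketch; note that PBT_APMQUERYSUSPEND is no longer delivered on Vista and later, so this watches WM_POWERBROADCAST with PBT_APMSUSPEND instead, and the flag name is made up):

```cpp
#include <windows.h>
#include <atomic>

// Set before suspend; the CUDA side should treat its context as suspect
// until it has been rebuilt after resume.
std::atomic<bool> g_gpuStateSuspect{false};

LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
    if (msg == WM_POWERBROADCAST) {
        switch (wParam) {
        case PBT_APMSUSPEND:            // system is about to suspend
            g_gpuStateSuspect = true;   // stop issuing CUDA work
            break;
        case PBT_APMRESUMESUSPEND:      // user-triggered resume completed
        case PBT_APMRESUMEAUTOMATIC:    // unattended resume completed
            // Leave the flag set until the CUDA context has been rebuilt.
            break;
        }
        return TRUE;
    }
    return DefWindowProc(hwnd, msg, wParam, lParam);
}
```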

No, actually it’s cross-platform. And as a matter of fact, the majority of the machines it’s been used on run some flavour of Linux. The problem here is that on some machines (cough mine cough) the system suspend mode is activated through an acpid handler that simply does echo mem > /sys/power/state. No, I don’t want to rely on upower or any other of the DBus-powered freedesktop.org madness.

We have a similar problem with our Linux applications. When thawing from hibernation, the GPU memory contains corrupted textures and shaders. The only fix is to restart the application if it can’t reload everything into GPU memory.

Is there even a kernel mechanism to do this transparently, so that no application support is required?

BillRyderAtWetaFX, it seems that you’ve made some other postings in the Linux forum as well, related to suspend/resume observations.

You might want to file bug reports at developer.nvidia.com. It seems you already know about the bug report log tool.

If you can file a bug report for this problem you are having with a detailed set of instructions about how to reproduce it, it may help. If you can let me know the bug report number when you file it, I’ll take a look.

The Programming Guide has always had a vague reference to the CUDA_ERROR_INVALID_CONTEXT error being returned if a mode switch occurs… which I suspect is similar to a GPU suspend/resume?

After a resume occurs, would probing the context with something like a cuCtxGetFlags(…) faithfully return an invalid context error?
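Untested sketch of such a probe (whether the driver actually reports CUDA_ERROR_INVALID_CONTEXT here after a resume is exactly the open question; nothing below is a documented guarantee):

```cpp
#include <cuda.h>
#include <cstdio>

// Probe the current driver-API context after resume and treat
// CUDA_ERROR_INVALID_CONTEXT (or any other failure) as "needs rebuild".
static bool context_still_valid()
{
    unsigned int flags = 0;
    CUresult res = cuCtxGetFlags(&flags);
    if (res == CUDA_ERROR_INVALID_CONTEXT) {
        fprintf(stderr, "context reported invalid after resume\n");
        return false;
    }
    return res == CUDA_SUCCESS;
}
```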

Showing how a CUDA app can cleanly survive a suspend/resume might be a good example for NVIDIA to add to the CUDA Samples.