Recovering after CUDA crashes GPU

It's fairly common for the Windows display to freeze completely when a bug in one of my CUDA kernels crashes the GPU. Sometimes Windows manages to recover (the screen will freeze for a second, but it comes back with a message about the display driver failing), but more often than not the display just hangs permanently. Currently I'm using a GeForce GTX 580 on Windows 7 with driver version 8.17.12.7081.

Is there any way to recover from this situation without rebooting (which makes the iteration time for debugging the kernel annoyingly long)? The machine is clearly still running; it's just the display that has crashed. When I try to remote-desktop to a box in this state I get a username/password window, but the remote login hangs when it tries to log in.
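One thing that can shorten the iteration loop is catching launch errors on the host before a bad kernel ever takes down the display. A minimal sketch of the usual checking pattern (the `CUDA_CHECK` macro and `myKernel` are illustrative, not from any particular codebase):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: check the result of a runtime call or kernel launch.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void myKernel(float *data) { /* ... */ }

int main() {
    float *d_data;
    CUDA_CHECK(cudaMalloc(&d_data, 1024 * sizeof(float)));

    myKernel<<<4, 256>>>(d_data);
    CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised during execution

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```

This won't stop a kernel that scribbles over memory from hanging the driver, but it does surface launch failures and execution errors immediately instead of leaving you to discover them when the screen freezes.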

Is there some kind of non-GUI remote console I can use to log in and reset the display, as you would on a UNIX-based system? Or is there some way to make the driver more fault tolerant (some kind of slow debug mode in which bad kernels are less likely to crash it)?
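For later readers: the "display driver has stopped responding and has recovered" behavior is Windows' Timeout Detection and Recovery (TDR). The watchdog timeout can be lengthened via the registry so that long-running debug kernels aren't killed after the default two seconds; a sketch, with an illustrative value (a reboot is required for the change to take effect):

```
reg add "HKLM\System\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 30 /f
```

Note this only delays the watchdog; it doesn't make the driver recover more gracefully from a kernel that has genuinely wedged the GPU.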

I guess that, on Windows, the easiest way is to install a second card and debug the code on that card. If a crash occurs, your display won't freeze.

Hi, I have the same problem. Usually the screen freezes permanently, but once a "Blue Screen" appeared and the computer restarted. I have only one GTX 580 and I can debug it remotely, but this problem is very annoying. Isn't there any way to force Windows (Windows Server 2008) not to use the GPU for the screen? I mean, theoretically the OS could use only the CPU for its purposes; then, if the GPU crashes, it won't freeze the screen. Does anyone know a solution to this problem (other than buying a second graphics card, which costs $500)?

Same problem here: just the occasional random freeze, and I have to reboot… annoying.

Your graphics card doesn’t have to be anywhere near as powerful as your compute card.

Just a thought…

Why do you not use debuggers?

I didn’t get it.

When I say buying a second video card, does that mean I have to buy the same card as I already have? Doesn't Nvidia SLI need both cards to be the same? Or is it possible to have another video card that isn't as good as the GTX 580? Can they work together? I mean, just to run the OS on one card and run the computation on the GTX 580.

SLI configuration is by no means necessary to support CUDA operations. It’s intended to provide increases in graphics processing power for the desktop.

The WDDM driver model used by both Vista and Win7 allows discrete, independent cards of the same brand to operate together as long as they use the same driver, and cards from separate brands with their own drivers are fine as well. You could get a cheap $40 GT 240 to handle your graphics, work your way up the line all the way to the GTX 580 until you hit as strong a card as you need, and be just fine.

The GTX 580 you have can be moved to the secondary slot and still drive a second monitor when it isn't dedicated to GPGPU work, and recovery is much less painful if no monitor is attached while it's computing (and crashing) during debugging.

Shoot, some folks even use on-board graphics to display their output if the demand’s not too high…

Though it seems like that would be a pain if you need to switch back to your main card for graphics-intensive activities. It seems like there should be some way to remotely log into a Windows box after the graphics card has crashed and restart it, or a way of running the driver in a fault-tolerant (but slow) mode that would safely recover from errors.

Granted, it's far from the most elegant of solutions and barely serves as a stop-gap. But my answer was merely addressing the concern about needing a high-priced card to drive the display independently, not offering a solution to the lack of remote-recovery capability, should things go that way…

Thanks for your reply! I have an old 8800 GT video card, so I'll test your solution. I was told that only identical Nvidia cards could be put together :(

Thank you once again :)