GPU in a bad state - only power cycle helps

e.ping · March 23, 2011, 4:16pm

I am running many different CUDA-based programs repeatedly over many days. This works for a while, until something eventually crashes with a vague CUDA error. Once this has happened, the GPU is left in a compromised state in which subsequent CUDA-based code does not run properly. For instance, when I run the scalarprod executable that is part of the SDK, I get errors ranging from 3e-3 up to INF, or even QNAN.
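
A quick self-test run between jobs is one way to notice this state early instead of discovering it in real results. The following is a minimal sketch, not the SDK's scalarprod check: `multiply` and `gpuSelfTest` are hypothetical names, the problem size and tolerance are arbitrary assumptions, and it assumes a toolkit recent enough to accept C++11.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Element-wise product; the host does the final sum to keep the kernel trivial.
__global__ void multiply(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] * b[i];
}

// Hypothetical helper: true if the GPU dot product matches a CPU reference.
// The size and tolerance are assumptions chosen for illustration only.
bool gpuSelfTest(int n = 1 << 16, double tolerance = 1e-5) {
    std::vector<float> a(n), b(n), prod(n);
    double reference = 0.0;
    for (int i = 0; i < n; ++i) {
        a[i] = 0.001f * (i % 1000);
        b[i] = 1.0f - 0.0005f * (i % 2000);
        reference += static_cast<double>(a[i] * b[i]);  // same float product as the kernel
    }

    float *dA = nullptr, *dB = nullptr, *dOut = nullptr;
    size_t bytes = n * sizeof(float);
    cudaError_t err = cudaMalloc(&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dOut, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        multiply<<<(n + 255) / 256, 256>>>(dA, dB, dOut, n);
        err = cudaGetLastError();  // reports launch/configuration errors
    }
    // The blocking copy also surfaces errors raised while the kernel ran.
    if (err == cudaSuccess) err = cudaMemcpy(prod.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dOut);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return false;
    }

    double gpuSum = 0.0;
    for (int i = 0; i < n; ++i) gpuSum += prod[i];
    if (!std::isfinite(gpuSum)) return false;  // catches the INF / QNAN case
    return std::fabs(gpuSum - reference) <= tolerance * std::fabs(reference);
}

int main() {
    bool ok = gpuSelfTest();
    std::printf("GPU self-test %s\n", ok ? "passed" : "FAILED");
    return ok ? 0 : 1;
}
```

A non-zero exit code from a check like this could be used by a batch script to stop launching further jobs until the machine has been power cycled.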

I’d like to think that whatever caused the initial crash wouldn’t just linger until a reboot… Any thoughts on how to deal with this, or on what the problem may be? For the moment, I am less concerned with the single crash itself and more concerned with the fact that nothing works thereafter.

My system is a 64-bit Windows 7 laptop with an NVIDIA 485M GPU, running version 265.77 of the driver. I am using version 3.2 of the CUDA toolkit.

On a related note - if someone from NVIDIA is checking this out - when will a new driver for the 485M become available? At this point, I only have the original driver from Sager, the manufacturer of the laptop - the NVIDIA driver download section doesn’t even make mention of the 485M.

Thanks!

Eddie

Sarnath · March 23, 2011, 4:52pm

Welcome to GPU computing! Never run more than one app at any given time and be happy…

e.ping · March 23, 2011, 6:59pm

Well, thanks for the welcome… I guess… Turns out I haven’t been simultaneously running any CUDA programs - it has all been sequential. But I will certainly keep your advice in mind. But that still leaves my fundamental question - any guess as to either how/why I entered this state, or, perhaps more importantly, how to get out of it (without rebooting)?

adamjmac · March 24, 2011, 2:33am

I have noticed that after working on CUDA code all day, with a few crashes along the way, games can sometimes appear corrupted. I solve it by restarting, but I agree it shouldn’t be this way. My guess is that the GPU is just not as good at isolating processes, so when something crashes it cannot completely clean up after itself the way an operating system does (or should).

pium · March 24, 2011, 10:58am

Sometimes I need to reboot when I write to wrong indices in global memory. It doesn’t take long before I either get a blue screen (the only way I have found to get one on Win7!) or see strange noise on the screen in every application.
Maybe your first crashing thread corrupts memory in the same way.

e.ping · March 24, 2011, 6:10pm

Thanks, everyone, for the comments. The suggestion about writing at bad indices does sound plausible - but I have run the code thousands of times with no problem, and when the problem occurs, there is no blue screen and there are no screen artifacts. Just a GPU that gives bogus results. I am considering writing code that purposefully performs out-of-bounds writes to see what happens, but if memory serves, when I have inadvertently done this in the past, CUDA gave me “Unknown Error”, and the GPU wasn’t disabled as a result…
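
Such an experiment could look roughly like the sketch below. It is only an illustration: `writeOutOfBounds`, `runOutOfBoundsExperiment` and the offset are hypothetical, and it uses `cudaDeviceSynchronize` and `cudaDeviceReset`, which assumes a toolkit newer than the 3.2 release mentioned above (the older equivalents are `cudaThreadSynchronize` and `cudaThreadExit`). An out-of-bounds write is undefined behavior, so the runtime may report an error, report nothing, or silently corrupt other memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Deliberately writes far past the end of the buffer (undefined behavior).
__global__ void writeOutOfBounds(int* data, size_t offset) {
    data[offset + threadIdx.x] = 42;
}

// Hypothetical experiment: returns whatever error the runtime reports
// once the faulty kernel has finished.
cudaError_t runOutOfBoundsExperiment() {
    int* d = nullptr;
    cudaError_t err = cudaMalloc(&d, 256 * sizeof(int));
    if (err != cudaSuccess) return err;

    // Arbitrary offset, chosen only to land well outside the allocation.
    writeOutOfBounds<<<1, 32>>>(d, static_cast<size_t>(1) << 40);

    cudaError_t launchErr = cudaGetLastError();     // launch/configuration errors
    cudaError_t syncErr = cudaDeviceSynchronize();  // errors raised during execution
    std::printf("launch: %s\n", cudaGetErrorString(launchErr));
    std::printf("sync:   %s\n", cudaGetErrorString(syncErr));

    // After a fault, later calls in the same process may keep failing.
    int* other = nullptr;
    cudaError_t afterErr = cudaMalloc(&other, 16);
    std::printf("after:  %s\n", cudaGetErrorString(afterErr));

    // Frees are best effort here; they may fail too if the context is broken.
    cudaFree(other);
    cudaFree(d);
    return syncErr;
}

int main() {
    runOutOfBoundsExperiment();
    // Tears down this process's context only. Whether it clears the
    // cross-process state described in this thread is not something
    // this sketch can promise.
    cudaDeviceReset();
    return 0;
}
```

Running the self-contained SDK samples afterwards, from a fresh process, would show whether the fault outlives the program that caused it.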

DarkRoom · March 24, 2011, 8:55pm

I have the same problem on a GTX 580. After an ordinary “launch failed” error caused by a bug in code I am developing, even known-good code produces errors. Usually it’s corrupt data, followed eventually by a hard “launch failed”. A reset won’t help. Only a full shutdown will do.

I guess that’s normal in the GPU world. An interesting fact though: I never got this problem on my older GTX 480!