How to reset/unlock wedged card?

tachyon_john · February 22, 2007, 1:52am

Is there a way to reset or otherwise clear out a GPU that has gotten into a state where CUDA apps hang while attempting to allocate the card? I wrote a small test code that performs a ridiculously long loop to see what would happen, and as a result the GPU locked up and I had to reboot the machine in order to get subsequent CUDA apps to run. Until I rebooted the machine, the CUDA apps would seemingly spin (eat 100% CPU on the host) in a CUDA call. Until the reboot, none of my own CUDA apps nor the examples from the SDK would run, they all got wedged spinning in some CUDA API call. For a single developer such as myself this is not a big deal, but it concerns me as we’ll have several other people developing on this system eventually and so the frequency of a wedged GPU may increase as more of us are using it for CUDA testing. This is on a RHEL4u4 test system.

netllama · February 22, 2007, 2:59am

This sounds like a potential bug. Can you provide this test code (and build/run) instructions?

tachyon_john · February 22, 2007, 6:09am

I’ll test the same code again tomorrow when I’m at the lab and if it wedges reliably, I’ll package it up for you and post it here or in the NV developer area for you, whichever works best for you.

John

netllama · February 22, 2007, 3:56pm

Providing everything here would be fine.

Thanks,
Lonni

mstock · February 22, 2007, 5:07pm

I am seeing the same problem, and my app has some pretty large loops (though not ridiculous). I can only trigger it by trying to CTRL-C the app when the GPU part is running. I submitted a bug report already. I can continue working as long as I let the GPU part run to completion. I am on RHEL4u3.
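
The workaround described above (letting the GPU part run to completion instead of interrupting it) can be sketched on the host side by catching CTRL-C and acting on it only between chunks of GPU work. This is a minimal sketch, not code from the thread: `run_gpu_chunk`, `run_until_interrupted`, and `g_stop_requested` are hypothetical names, and `run_gpu_chunk` is only a stand-in for a real kernel launch followed by a blocking wait. It assumes the work can be split into chunks that each finish in reasonable time, and it does nothing for a single kernel that never returns.

```cpp
#include <csignal>
#include <cstdio>

// Set by the signal handler; read only between chunks of work.
static volatile std::sig_atomic_t g_stop_requested = 0;

// Handler does the minimum that is safe in a signal context: set a flag.
extern "C" void on_sigint(int) { g_stop_requested = 1; }

// Hypothetical stand-in for one kernel launch plus a blocking wait
// for that launch to finish. A real app would do its GPU work here.
static void run_gpu_chunk(int chunk) {
    std::printf("chunk %d finished\n", chunk);
}

// Runs up to total_chunks chunks and stops early only at a chunk
// boundary, so no chunk is abandoned halfway through.
// Returns the number of chunks that ran to completion.
int run_until_interrupted(int total_chunks) {
    std::signal(SIGINT, on_sigint);
    int completed = 0;
    for (int i = 0; i < total_chunks && !g_stop_requested; ++i) {
        run_gpu_chunk(i);
        ++completed;
    }
    return completed;
}
```

Called from `main`, `run_until_interrupted(1000)` would let the chunk in flight finish after CTRL-C and then return, so the app can release its device resources and exit normally.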

netllama · February 22, 2007, 5:18pm

Yes, I’ve reproduced the bug you submitted. Thanks.

tachyon_john · February 23, 2007, 7:38am

In reading the release notes, it appears that the machine lockup I observed was probably the result of X being active on the GPU at the same time I was using it for CUDA. I’ll run more experiments with the simple test code after I put in another video board so that the CUDA board isn’t being touched by X. If I can still wedge the GPU or hang the machine, I’ll post the test code here. It’s nothing special, just a quadruply nested loop that counts and sets an output value (to prevent the compiler from optimizing it out of existence…)

John
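
The actual test code does not appear in this thread, so the following is only a hypothetical reconstruction of the kind of loop described: four nested loops that count, with the result written to an output location. The name `nested_count` and the parameter `n` are assumptions. It is written as plain C++ so it builds without a GPU; in a CUDA build the same body would presumably sit inside a `__global__` kernel.

```cpp
// Hypothetical reconstruction of a "quadruply nested loop that counts".
// Storing the count through `out` keeps the result observable, so the
// compiler cannot discard the loops as dead code (it may still
// simplify them).
void nested_count(int n, unsigned long long *out) {
    unsigned long long count = 0;
    for (int a = 0; a < n; ++a)
        for (int b = 0; b < n; ++b)
            for (int c = 0; c < n; ++c)
                for (int d = 0; d < n; ++d)
                    ++count;  // runs n * n * n * n times in total
    *out = count;
}
```

The work grows with the fourth power of `n`, which is why a modest increase in the loop bound can turn a quick test into a very long-running one.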

JohnW · February 19, 2009, 11:47pm

I get this problem on XP 64 bit too. Currently I am power cycling after every failed test and it is getting a little frustrating.

Cheers,
John

snoo · May 13, 2009, 8:59pm

long shot…

have any of you made any progress on this? i’m working on a remote machine (ubuntu 8.04) that I can’t reboot. I experience exactly the same problem: a kernel is terminated during a somewhat long loop and the device becomes inaccessible, i.e. I cannot allocate memory on it any more.

any help?

tmurray · May 13, 2009, 9:02pm

have you upgraded to 185.18.08? you can kill kernels on dedicated compute cards now

snoo · May 13, 2009, 9:28pm

no, we’re at 180.22

i’ll chat with my administrators and see if we can’t upgrade. thank you

tmurray · May 13, 2009, 9:33pm

185.18.08 gives you exclusive mode too, which is also pretty great (although you have to leave nvidia-smi looping in the background if you don’t have X running)
