How to reset/unlock a wedged card?

Is there a way to reset or otherwise clear out a GPU that has gotten into a state where CUDA apps hang while attempting to allocate the card? I wrote a small test program that runs a ridiculously long loop to see what would happen. The GPU locked up, and I had to reboot the machine to get subsequent CUDA apps to run. Until the reboot, neither my own CUDA apps nor the examples from the SDK would run: they all got wedged, seemingly spinning (eating 100% CPU on the host) in some CUDA API call. For a single developer such as myself this is not a big deal, but it concerns me because several other people will eventually be developing on this system, so a wedged GPU may become more frequent as more of us use it for CUDA testing. This is on a RHEL4u4 test system.

This sounds like a potential bug. Can you provide this test code (and build/run instructions)?

I’ll test the same code again tomorrow when I’m at the lab. If it wedges reliably, I’ll package it up and post it here or in the NV developer area, whichever works best for you.

John

Providing everything here would be fine.

Thanks,
Lonni

I am seeing the same problem, and my app has some pretty large loops (though not ridiculous). I can only trigger it by trying to CTRL-C the app when the GPU part is running. I submitted a bug report already. I can continue working as long as I let the GPU part run to completion. I am on RHEL4u3.

Yes, I’ve reproduced the bug you submitted. Thanks.

In reading the release notes, it appears that the machine lockup I observed was probably the result of X being active on the GPU at the same time I was using it for CUDA. I’ll run more experiments with the simple test code after I put in another video board so that the CUDA board isn’t being touched by X. If I can still wedge the GPU or hang the machine, I’ll post the test code here. It’s nothing special, just a quadruply nested loop that counts and sets an output value (to prevent the compiler from optimizing it out of existence).

John
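For readers who want something concrete to experiment with, here is a minimal sketch of the kind of test described above: a quadruply nested loop whose result is written to an output value so the compiler keeps it. This is not John's actual code, which was never posted. The names `churn`, `long_loop_kernel` and `ok`, the arithmetic inside the loop, and the default size are all assumptions made for illustration, and the sketch uses the current CUDA runtime API rather than the toolkit of that era.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical loop body: n^4 dependent integer updates. Each step depends
// on the previous one, so the compiler is unlikely to collapse the loops.
// Also callable on the host so small inputs can be checked.
__host__ __device__ unsigned int churn(unsigned int n)
{
    unsigned int acc = 0;
    for (unsigned int a = 0; a < n; ++a)
        for (unsigned int b = 0; b < n; ++b)
            for (unsigned int c = 0; c < n; ++c)
                for (unsigned int d = 0; d < n; ++d)
                    acc = acc * 1664525u + 1013904223u; // arbitrary constants
    return acc;
}

// Storing the result to global memory keeps the loops from being
// optimized out of existence.
__global__ void long_loop_kernel(unsigned int n, unsigned int *out)
{
    *out = churn(n);
}

// Report a CUDA error, if any, and say whether the call succeeded.
static bool ok(cudaError_t err, const char *what)
{
    if (err != cudaSuccess)
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return err == cudaSuccess;
}

int main(int argc, char **argv)
{
    // Small default so the kernel finishes quickly; pass a larger n on the
    // command line to lengthen the run (assumed interface, not from the thread).
    unsigned int n = (argc > 1)
        ? (unsigned int)std::strtoul(argv[1], nullptr, 10) : 32u;

    unsigned int *d_out = nullptr;
    if (!ok(cudaMalloc(&d_out, sizeof(unsigned int)), "cudaMalloc")) return 1;

    long_loop_kernel<<<1, 1>>>(n, d_out);
    if (!ok(cudaGetLastError(), "kernel launch")) return 1;
    if (!ok(cudaDeviceSynchronize(), "kernel execution")) return 1;

    unsigned int h_out = 0;
    if (!ok(cudaMemcpy(&h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost),
            "cudaMemcpy")) return 1;

    std::printf("n=%u device result=%u\n", n, h_out);
    if (n <= 64u) // the host reference is only practical for small n
        std::printf("host reference=%u\n", churn(n));

    cudaFree(d_out);
    return 0;
}
```

Built with something like `nvcc long_loop.cu -o long_loop` (the file name is made up), the default size should finish quickly. Because the work grows as n^4, a modest increase in n makes the kernel run for a very long time, so start small on a machine you cannot reboot.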

I get this problem on XP 64 bit too. Currently I am power cycling after every failed test and it is getting a little frustrating.

Cheers,
John

long shot…

have any of you made any progress on this? i’m working on a remote machine (ubuntu 8.04) that I can’t reboot. I experience exactly the same problem - a kernel is terminated during a somewhat long loop and the device becomes inaccessible, i.e. I cannot allocate memory on it any more.

any help?

have you upgraded to 185.18.08? you can kill kernels on dedicated compute cards now

no, we’re at 180.22

i’ll chat with my administrators and see if we can’t upgrade. thank you

185.18.08 gives you exclusive mode too, which is also pretty great (although you have to leave nvidia-smi looping in the background if you don’t have X running)
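If you want to check from inside a program which compute mode a device is in, and whether the driver enforces a run time limit on kernels (which ties in with the earlier point about X being active on the same GPU), a rough sketch follows. It assumes a reasonably recent CUDA toolkit: `cudaDeviceGetAttribute` and the two attribute names are from later runtime releases, not from the 185.18.08 era, and `compute_mode_name` is a made-up helper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: map the integer compute-mode attribute to a label.
const char *compute_mode_name(int mode)
{
    switch (mode) {
    case cudaComputeModeDefault:          return "default (shared)";
    case cudaComputeModeProhibited:       return "prohibited";
    case cudaComputeModeExclusiveProcess: return "exclusive process";
    default:                              return "other (e.g. legacy exclusive)";
    }
}

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        int mode = 0;
        int timeout = 0;
        // Both queries read attributes only; they do not launch any work.
        if (cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, dev) != cudaSuccess ||
            cudaDeviceGetAttribute(&timeout, cudaDevAttrKernelExecTimeout, dev) != cudaSuccess) {
            std::fprintf(stderr, "device %d: attribute query failed\n", dev);
            continue;
        }
        std::printf("device %d: compute mode = %s, kernel run time limit = %s\n",
                    dev, compute_mode_name(mode), timeout ? "yes" : "no");
    }
    return 0;
}
```

A device that reports a run time limit is usually one that is also driving a display, so a long test kernel on it is more likely to be cut off than on a dedicated compute card.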