How can I detect a vanished GPU card programmatically, and any suggestions on root-causing it?

Hello,

I’ve inherited some CUDA code in several DLLs that, in my months of working with it, proved fairly solid until recently. I have only anecdotal history rather than a detailed diary or configuration log. I’m running Windows 8.1 x64 with an NVIDIA GTX 580 that is not hooked to a display. I should mention that the workload has become more complicated: the DLLs are now being called by a third-party app, and I’m still gaining experience with that.

After some usage of my CUDA workload, the NVIDIA card vanished. I can’t say it’s purely isolated to the CUDA workload at this point, but I’m trying to establish whether that’s the case. I’m also considering hardware issues.

After seeing cudaSetDevice fail strangely (the code doesn’t currently log the return code, but I’ll fix that), I finally realized the card was gone. You can see the nvidia-smi outputs below. Also, in the debugger, in the vicinity of the cudaSetDevice failure, I seemed to see some access violations in either nvcuda.dll or cudart64_42_9.dll; unfortunately, I was not able to record the details before a BSOD. This was before I realized the card had gone astray. I had to cold reboot to get the card back, and it’s fine at present.

So my questions:

  1. It would be easier on me to fail out earlier than where the code currently calls cudaSetDevice. Is there a good r-u-there API I can call, one without side effects? I was thinking of adding the check to the DLL attach handler.
  2. Any suggestions for finding the root cause? I looked through eventvwr, but it didn’t show anything suggestive. I expect this to happen again, in which case I’ll check Device Manager too.
  3. Are there any logging options in CUDA that I can turn on while running my workload? By the way, in my other test scenario, which doesn’t involve the third-party app, I ran the code through cuda-memcheck and no issue was found, but I plan to add some code to make it loop and stress it more.

(after cold reboot and AC unplug)
Thu Apr 23 19:53:23 2015
+------------------------------------------------------+
| NVIDIA-SMI 340.62     Driver Version: 340.62         |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 580    WDDM  | 0000:02:00.0     N/A |                  N/A |
| 40%   41C    P0    N/A /  N/A |   1497MiB /  1535MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
+-----------------------------------------------------------------------------+

(after warm reset … did not come back)
Thu Apr 23 19:29:37 2015
+------------------------------------------------------+
| NVIDIA-SMI 340.62     Driver Version: 340.62         |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  ERR!               WDDM  | ERR!            ERR! |                 ERR! |
|ERR!  ERR!  ERR!   ERR! / ERR! |   1497MiB /  1535MiB |    ERR!         ERR! |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            ERROR: GPU is lost                                          |
+-----------------------------------------------------------------------------+

Thanks.

cudaError_t stat = cudaFree(0);

should have no side effects beyond initializing the CUDA runtime on the current device, and the error code in stat will tell you if there is a problem with the CUDA system.
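
To make the check above concrete, here is a minimal sketch of an "are you there" probe, assuming the CUDA runtime API (cuda_runtime.h) and a build with nvcc or a host compiler linked against cudart. The names GpuProbeResult and probeGpu are hypothetical and not part of the original code. It first calls cudaGetDeviceCount and only then cudaFree(0), which is commonly used to force runtime initialization on the current device. Whether a lost GPU is already reported by the first call is an assumption that would need testing on the affected machine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical result type for illustration; not part of any CUDA API.
struct GpuProbeResult {
    cudaError_t error;   // last CUDA runtime error seen by the probe
    int deviceCount;     // number of CUDA devices the runtime reports
    bool ok;             // true only if a device exists and init succeeded
};

// Hypothetical helper: checks that the runtime sees at least one device
// and that the runtime can be initialized on the current device.
GpuProbeResult probeGpu() {
    GpuProbeResult r = {cudaSuccess, 0, false};

    // Step 1: ask how many devices the runtime can see.
    r.error = cudaGetDeviceCount(&r.deviceCount);

    // Step 2: only if a device is reported, force runtime initialization.
    // cudaFree(0) frees nothing, but it does trigger initialization.
    if (r.error == cudaSuccess && r.deviceCount > 0) {
        r.error = cudaFree(0);
    }

    r.ok = (r.error == cudaSuccess && r.deviceCount > 0);
    if (!r.ok) {
        // Log the return code, which the current code path does not do.
        std::fprintf(stderr, "GPU probe failed: %s (code %d, devices=%d)\n",
                     cudaGetErrorString(r.error),
                     static_cast<int>(r.error), r.deviceCount);
    }
    return r;
}
```

One caution on placement: Microsoft's guidance for DllMain advises against doing non-trivial work while the loader lock is held, so it may be safer to call a probe like this from an explicit exported init function that the host app calls, rather than from the DLL attach handler itself.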

I would suggest monitoring the GPU temperature up until the point of failure.
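
One way to follow the temperature up to the point of failure is to poll nvidia-smi from a small logger. The sketch below assumes nvidia-smi is on the PATH and that the installed driver's nvidia-smi accepts --query-gpu=temperature.gpu --format=csv,noheader; the function names are hypothetical. I don't know what nvidia-smi prints in this mode once the GPU is lost, so anything that is not a plain integer is simply logged as unreadable.

```cpp
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <string>

#ifdef _WIN32
#define GPU_POPEN _popen
#define GPU_PCLOSE _pclose
#else
#define GPU_POPEN popen
#define GPU_PCLOSE pclose
#endif

// Parse one output line such as "41\r\n" into degrees C.
// Returns false for anything that is not a plain integer (e.g. "ERR!", "N/A").
bool parseTemperatureC(const std::string& line, int& tempC) {
    char* end = nullptr;
    const long value = std::strtol(line.c_str(), &end, 10);
    if (end == line.c_str()) return false;            // no digits found
    while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n') ++end;
    if (*end != '\0') return false;                   // trailing junk
    tempC = static_cast<int>(value);
    return true;
}

// Run nvidia-smi and return the first line of its output ("" on failure).
// Assumption: the query flags below are supported by the installed driver.
std::string readGpuTemperatureLine() {
    std::FILE* pipe = GPU_POPEN(
        "nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader", "r");
    if (!pipe) return "";
    char buf[256] = {0};
    std::string out;
    if (std::fgets(buf, sizeof buf, pipe)) out = buf;
    GPU_PCLOSE(pipe);
    return out;
}

// Append one timestamped reading to the log; returns false if unreadable.
bool logGpuTemperature(std::FILE* log) {
    const std::string line = readGpuTemperatureLine();
    int tempC = 0;
    const bool ok = parseTemperatureC(line, tempC);
    const long now = static_cast<long>(std::time(nullptr));
    if (ok) {
        std::fprintf(log, "%ld temp=%dC\n", now, tempC);
    } else {
        std::fprintf(log, "%ld temperature unreadable: '%s'\n", now, line.c_str());
    }
    std::fflush(log);  // flush each reading so less is lost if the machine crashes
    return ok;
}
```

Calling logGpuTemperature every few seconds from a separate process while the workload runs would give a record of the last readings before the card disappears.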

Thanks very much. I’ll try that in my code. I’m not sure, but I think I saw the card at 61C using MSI Afterburner. I can’t get NV System Monitor to run. The card is alive at the moment, but lately it doesn’t always activate, even after a cold boot with the AC unplugged for a while. When it works, it is as fine as ever.