How to proceed / reset a device after an error 700 or 716 with the CUDA driver API?

Hi!

I am currently taking my first steps with CUDA, but I have some OpenCL experience.
The program I wrote so far uses the CUDA driver API, mostly because this way I can dynamically load CUDA if it is available on the system and fall back to other compute methods otherwise.

I observed that sometimes, when my application hits a GPU with too much undervolting, my kernel fails with an error 700, sometimes 716, i.e. memory access errors. This is rare enough not to be a problem, but I would like to know how to recover from these errors. I already learned that when this happens the context becomes unusable - in fact not only the context on the card where the error occurred, but all active contexts. My application is usually used in systems running multiple cards, each with its own runner thread.

Now I observed: when I close my application and restart it immediately, all devices come up fine, so the problem is not so deep that the driver completely crashed. So I tried to destroy my thread-local contexts and also called cuDevicePrimaryCtxReset at the end of the thread for each of the cards. I got back error code 0, so I assume the call worked. Still, I was not able to spawn fresh threads with new contexts for my cards - whenever I tried to create a fresh context I still got error 700 or 716 back, just in the sense of the API note “Note that this function may also return error codes from previous, asynchronous launches.”

So my question is: how do I clear this error so I can continue with a fresh kernel, without needing to restart my application completely? As long as I keep my host memory structures, the previously achieved results are not lost and it would be rather easy to continue them with a fresh thread - if I were able to set one up ;)
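
To make that concrete, this is roughly where the error surfaces in a worker loop - a minimal sketch, not my real code, and sync_and_check is just an illustrative helper: the launch itself may still report success, and the sticky 700/716 only shows up on the next synchronization.

```cpp
#include <cuda.h>
#include <cstdio>

// Returns false once the context has hit a sticky error such as
// CUDA_ERROR_ILLEGAL_ADDRESS (700) or CUDA_ERROR_MISALIGNED_ADDRESS (716).
bool sync_and_check(CUstream stream)
{
    CUresult rc = cuStreamSynchronize(stream);
    if (rc != CUDA_SUCCESS) {
        const char *name = nullptr;
        cuGetErrorName(rc, &name);
        std::fprintf(stderr, "stream sync failed: %s (%d)\n",
                     name ? name : "unknown", (int)rc);
        return false; // from here on, further calls on this context keep returning the error
    }
    return true;
}
```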

Increase the voltage and reset the device (a power-on cold boot would be best). Lowering the voltage means transistors switch more slowly. They may switch slowly enough to violate setup and hold requirements of flip-flops, causing data to be corrupted. In the best-case scenario, this will trigger an error of some sort pretty much right away. If you are not so lucky, you won’t notice, the program will continue with incorrect data, and time and energy will be wasted on useless computation. Worse things may happen as a final consequence. Faulty hardware states may persist until the root cause is eliminated and a known-good state is enforced.

see here

Either something changed in the meantime or I am not doing it right, because, as I mentioned above, I did the following (in that order):

  • freed all memory allocations in all the contexts (not only the failed one, but all running ones)
  • destroyed the contexts
  • called cuDevicePrimaryCtxReset for each of the cards - with return code 0

Then I even added a barrier, waiting for all running threads / active contexts to reach this point.
Still, the call to cuCtxCreate to get a new context gave me return code 700. So what am I doing wrong? According to the Stack Overflow post this should suffice, but it does not.
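
In code, the per-card sequence looks roughly like this - a simplified sketch rather than my actual code, with teardown_and_recreate and d_buffers as illustrative stand-ins and the barrier only indicated as a comment:

```cpp
#include <cuda.h>
#include <vector>

// Simplified to a single device; d_buffers stands in for the real per-thread allocations.
CUresult teardown_and_recreate(CUdevice dev, CUcontext &ctx,
                               std::vector<CUdeviceptr> &d_buffers)
{
    for (CUdeviceptr p : d_buffers)   // 1. free all device allocations
        cuMemFree(p);
    d_buffers.clear();

    cuCtxDestroy(ctx);                // 2. destroy the (broken) context
    ctx = nullptr;

    cuDevicePrimaryCtxReset(dev);     // 3. this returns CUDA_SUCCESS (0) for me

    // <barrier: wait for all worker threads to reach this point>

    return cuCtxCreate(&ctx, 0, dev); // 4. ...yet this still comes back as 700 / 716
}
```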

@njuffa : Well, of course I can do so on my end … but the settings of my customers are not under my control. I am just testing extreme settings here to make sure nothing badly surprising happens when the card is not run under ideal conditions.

When you test semiconductor devices in extreme conditions, overclocking and undervolting are roughly equivalent in terms of their failure modes. In order to guarantee that the entire chip (i.e. every last flip-flop) is back into a correct and consistent state you must restore the vendor-specified nominal operating conditions and power cycle the device.

You are, of course, free to ignore this advice. In which case, all bets are off.

I am fully aware of the implications of undervolting - but I would like to stick to my original question. Consider the situation that the application kernel very rarely (less than 1 out of >100B cases) indeed calculates a defective memory address - for example because of approximate math that is still preferred over exact math for performance reasons.
In this case the kernel call - or rather the next stream sync - would also trigger an error 700. How to recover then?

My observation is that NVIDIA and AMD cards behave in rather opposite ways.
When such an error happens on one NVIDIA card, all CUDA cards running in my application stop working, which is bad. On the other hand, a simple exit(0) and an external bash script restarting the application get them all running again, which is good.
AMD cards under Linux are … well, there only the failing card gets stuck and all others continue working fine, which is good. But the failed one only works again after a full system reboot, which is pretty ugly.

My question is: what does a full program restart do differently from calling cuDevicePrimaryCtxReset? How can I achieve the full reset without tearing down my program structures completely? This is what I would like to know.

You have to terminate the owning process. That is the final point of the Stack Overflow post:

"Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the “owning” process must also terminate. See here."

I don’t think that is documented anywhere, and I won’t be able to answer that question.

The only method I am aware of is linked above. It involves writing a multi-process application. And I recognize that may not meet your goals or be what you had in mind.
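
Just to sketch the general shape of such a design - this is not the code linked above, and run_gpu_worker is a hypothetical placeholder for the per-device work (POSIX fork assumed): the parent process launches one child per device, and when a child dies on a sticky error, only that child needs to be replaced.

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

// Placeholder for the real worker: it would call cuInit, create its context,
// run the kernels, and return nonzero after a sticky 700/716.
int run_gpu_worker(int device_ordinal);

// Each GPU gets its own child process, so a broken context only takes down
// that child, and a fresh "owning" process can be started for the device
// without touching the others.
void supervise_device(int device_ordinal)
{
    for (;;) {
        pid_t pid = fork();
        if (pid == 0)                          // child: owns the CUDA context
            _exit(run_gpu_worker(device_ordinal));

        int status = 0;
        waitpid(pid, &status, 0);              // parent: wait for the worker
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return;                            // clean finish
        std::fprintf(stderr, "GPU %d worker failed, restarting it\n", device_ordinal);
    }
}
```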

You can always request changes in CUDA behavior by filing bugs. The method to do so is linked in a sticky post at the top of this forum.

Good Luck!

Ahh, now we come closer to my issue.
My application is already multithreaded, but it seems that was not good enough.
Currently the structure is that the primary thread calls cuInit() and then the worker threads create their own contexts, one per card to run on. These worker threads do end - but the one that called the init does not. Maybe that is the issue here. Thanks for your answer, I will try changing this.
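
For clarity, the current layout is roughly this - a stripped-down sketch using std::thread instead of my actual runner abstraction, with error checking omitted:

```cpp
#include <cuda.h>
#include <thread>
#include <vector>

int main()
{
    cuInit(0);                               // done once, in the primary thread

    int count = 0;
    cuDeviceGetCount(&count);

    std::vector<std::thread> workers;
    for (int i = 0; i < count; ++i) {
        workers.emplace_back([i] {
            CUdevice dev;  cuDeviceGet(&dev, i);
            CUcontext ctx; cuCtxCreate(&ctx, 0, dev);
            // ... run kernels on this card ...
            cuCtxDestroy(ctx);               // the worker threads do end here
        });
    }
    for (std::thread &t : workers) t.join(); // but the thread that called cuInit lives on
    return 0;
}
```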

Based on the testing I have done, I wouldn’t conflate thread and process, and specifically the thing that I was able to get working was the multi-process case. But you’re welcome to experiment with your ideas to see if you can get it to work on the driver API side using just threads. But on the runtime API side I’m fairly certain that the owning process must terminate.


Can confirm that it happens on the process level. This will make things rather complicated and indeed needs some rework … but it also adds the opportunity to protect the other cards when they each run in their own process. Well, that’s going to be some work…

Thanks anyways :)