Avoiding driver timeouts

How do I avoid driver timeouts?

It appears that if a kernel keeps the device busy for more than a couple of seconds, Windows or the Nvidia driver resets the device and trashes everything. Sigh.

Obviously, I have to split my process into multiple smaller units that will execute in sequence. Because of the enormous complexity of the algorithm, preserving state between kernel launches is going to be a major PITA. So, I have a few questions before I bite the bullet:

  1. Is there any way I can disable this timeout, or set it to a larger value?

  2. What, exactly, do I need to do between kernel launches to reset the timeout counter? I tried an experiment in which I just repeatedly launched a smaller kernel. After several dozen consecutive back-to-back launches, each running for about a quarter of a second, the driver crashed on a timeout. So obviously just letting the kernel finish and then re-launching it is not the answer. I cannot do a cudaThreadExit because that would clear global memory and force me to transmit a boatload of data between kernel launches. What do I need to do between launches?
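
Stripped to its essentials, that experiment looked something like the sketch below (the kernel body, sizes, and launch configuration are placeholders, not my real code):

    #include <stdio.h>
    #include <cuda_runtime.h>

    // Stand-in for the real quarter-second kernel.
    __global__ void smallKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;               // placeholder work
    }

    int main(void)
    {
        int n = 1 << 20;
        float *d_data = 0;
        cudaMalloc((void **)&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        // Several dozen consecutive back-to-back launches, each one allowed
        // to finish before the next is issued.
        for (int launch = 0; launch < 100; ++launch)
        {
            smallKernel<<<(n + 255) / 256, 256>>>(d_data, n);
            cudaThreadSynchronize();              // wait for this launch to complete
            cudaError_t err = cudaGetLastError();
            if (err != cudaSuccess) {
                printf("launch %d failed: %s\n", launch, cudaGetErrorString(err));
                break;
            }
        }

        cudaFree(d_data);
        return 0;
    }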

Tim

The timeout is supposed to be per-kernel launch (it certainly worked that way on any other system I’ve used), so if launching short kernels is still triggering driver crashes, you have a different problem.
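
One quick sanity check: the runtime reports whether the driver enforces a watchdog on a given device at all, via the kernelExecTimeoutEnabled field of cudaDeviceProp. A minimal sketch, assuming device 0:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);        // device 0 assumed
        printf("%s: kernelExecTimeoutEnabled = %d\n",
               prop.name, prop.kernelExecTimeoutEnabled);
        return 0;
    }

If that prints 0, the driver is not imposing a run-time limit on that device, and the crash is probably coming from somewhere else.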

Seibert - I think you are right. That test was for a large, complex kernel. I just wrote a small custom kernel and could not reproduce my prior results. So it appears to be per-kernel. Thanks! But this still leaves me with the horrendous problem of rewriting my gigantic, complex kernel in such a way that it can execute in sequential launch segments. That’s so silly! Here I am in an intensely parallel environment, and I need to serialize my code!
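
If I understand it right, the restructuring boils down to something like the following sketch (every name in it is a placeholder): the intermediate state lives in device global memory, which persists across launches within the same context, so only the control flow has to be split into segments, not the data.

    #include <cuda_runtime.h>

    // Placeholder for one segment of the real algorithm: advance the
    // persistent state by stepsPerLaunch steps of the outer loop.
    __global__ void processChunk(float *state, int n, int firstStep, int stepsPerLaunch)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        for (int s = firstStep; s < firstStep + stepsPerLaunch; ++s)
            state[i] += 1.0f;                     // stand-in for the real per-step update
    }

    void runInSegments(int totalSteps, int stepsPerLaunch, int n)
    {
        float *d_state = 0;
        cudaMalloc((void **)&d_state, n * sizeof(float));   // allocated once, reused by every launch
        cudaMemset(d_state, 0, n * sizeof(float));

        // One launch per segment; each launch must finish well under the
        // watchdog limit, but the state never leaves the card.
        for (int first = 0; first < totalSteps; first += stepsPerLaunch)
        {
            processChunk<<<(n + 255) / 256, 256>>>(d_state, n, first, stepsPerLaunch);
            cudaThreadSynchronize();              // the timeout is counted per launch
        }

        // ... cudaMemcpy the final result back to the host here ...
        cudaFree(d_state);
    }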

While the kernel is running, if I move the mouse the cursor moves on the screen, so Windows is obviously communicating with the video card just fine during the computation. So what is its problem??? Why does it insist on shutting me down? I hope I can find a way to disable that timeout, or at least raise the limit.

Tim

Maybe this will help. It’s for an IBM Linux-based OS, but the example indicates it’s possible. I’m thinking it’s very OS- and platform-dependent, so a specific solution is going to take you deep into the OS driver layer.

http://publib.boulder.ibm.com/infocenter/l…pmiwatchdog.htm

Al

Sounds like your problem is that you have a monitor attached to the card… if I recall correctly, you can only run for 5 seconds at a time on a card with a monitor attached (as an overheating precaution, I believe). The best solution is either to set it up as a headless node or to get a second card for your monitor.

  1. Use a card with no monitor attached.
  2. Run your program in Unix/Linux with no X Windows launched.