Kernel start timeout with n-body SDK demo (2 Tesla C1060s, Windows 7 64-bit)

Hello everyone,

I’d appreciate any fixes, pointers, or tips that would resolve the following issue:

Running the suggested n-body simulation mentioned on (with the exact arguments specified on that page) on either of my Tesla C1060 cards causes the driver to crash and restart.

The specific error I’m getting from the nbody.exe program itself is:

cudaSafeCall() Runtime API error in file <./nbody.cpp>, line 291 : the launch timed out and was terminated.

As a quick sanity check, I ran the same demo with the number-of-bodies argument reduced from --n=131072 to --n=13107 (roughly a factor of ten fewer bodies), and that completes in around 1100 ms. I then tried --n=31072 (still well below the original, but about 2.4x the reduced count), and that run also timed out.
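A rough back-of-the-envelope check on those numbers, as a minimal Python sketch. It assumes the demo does all-pairs work, so runtime grows roughly with the square of the body count, and it treats the 1100 ms figure as the reference timing. The helper name `estimate_ms` is made up for illustration, and the measured time may well cover several kernel launches rather than one, so these are order-of-magnitude estimates only.

```python
# Rough scaling estimate, ASSUMING an all-pairs O(n^2) n-body computation.
# ref_n / ref_ms are the one run that completed (--n=13107, ~1100 ms).
def estimate_ms(n, ref_n=13107, ref_ms=1100.0):
    """Scale the measured runtime quadratically with the body count."""
    return ref_ms * (n / ref_n) ** 2


if __name__ == "__main__":
    for n in (13107, 31072, 131072):
        print(f"--n={n}: ~{estimate_ms(n):.0f} ms (estimated)")
```

Under that assumption, --n=31072 lands at roughly six seconds and --n=131072 at well over a minute, which would fit a watchdog with a timeout of a few seconds killing both of the larger runs.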

I’m hosting these two cards in an Intel i7 920 machine running Windows 7 Ultimate 64-bit, using the most current (at least, as of September 27, 2009) 64-bit NVIDIA WHQL CUDA drivers. My display card is a PNY GeForce 8400 (512 MB) installed in a PCI slot. I am not running the n-body demo on the PNY PCI card.

I’ve done some digging, and the issue seems to be that the CUDA kernel is not starting up on the Teslas in time (i.e. within five seconds, or so I’ve read), so the Windows watchdog timer expires and the watchdog kills and restarts the driver. I’ve seen some information pointing to a registry key change that might work but is ill-advised; I don’t know whether that information is current for Windows 7. I also read something suggesting that cards that are not driving any displays should not be subject to the watchdog timer. However, since the system uses the same driver for all three cards, it would seem that Windows will kill and restart the driver even if the kernels are only set to run on the Tesla cards.

My apologies in advance for the lack of more detail in this initial post. I’ll post logs and excerpts from tests I’ve run, as well as more complete system information later on, if needed or helpful.

Thanks in advance.

No, the problem is that the kernel doesn’t complete within the TDR (Timeout Detection and Recovery) window, so TDR triggers and the driver is reset (killing the app).

Set TdrLevel to 0 as described in this article: …dm_timeout.mspx
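For anyone who can't reach the article, here is a minimal Python sketch that builds the text of a .reg file for that change. The key path (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers) and the REG_DWORD type are taken from Microsoft's TDR registry documentation rather than from this thread, so check them against the article before importing anything. The function name `tdr_reg_file` is made up for illustration. Importing the file needs administrator rights, and the setting is expected to take effect only after a reboot.

```python
# Sketch: generate the contents of a .reg file that sets TdrLevel.
# ASSUMPTION: key path and value type follow Microsoft's TDR registry
# documentation; verify against the linked article before use.
GRAPHICS_DRIVERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
)


def tdr_reg_file(level=0):
    """Return .reg file text that sets TdrLevel (0 turns detection off)."""
    if not 0 <= level <= 0xFFFFFFFF:
        raise ValueError("TdrLevel must fit in a 32-bit DWORD")
    return (
        "Windows Registry Editor Version 5.00\r\n"
        "\r\n"
        f"[{GRAPHICS_DRIVERS_KEY}]\r\n"
        f'"TdrLevel"=dword:{level:08x}\r\n'
    )


if __name__ == "__main__":
    # Print the text; save it as e.g. tdr_off.reg and import it by hand.
    print(tdr_reg_file(0))
```

Keep in mind that with detection turned off, a kernel that genuinely hangs can leave the display frozen until a hard reset, so it is worth keeping a copy of the original settings.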

Thanks for the prompt reply! I’ll implement this in a few hours when I get back to the workstation and report back.

Is this documented somewhere that I missed?

Thanks again!