I’d love any fixes, help, pointers, tips, or tricks anyone has that would resolve the following key issue:
Running the suggested n-body simulation mentioned on http://www.nvidia.com/object/tesla_build_your_own.html (with the exact arguments specified on that page) on either of my Tesla C1060 cards causes the driver to crash and restart.
The specific error I’m getting from the nbody.exe program itself is:
cudaSafeCall() Runtime API error in file <./nbody.cpp>, line 291 : the launch timed out and was terminated.
As a quick sanity check, I ran the same demo program with the number-of-bodies argument changed from --n=131072 to --n=13107 (roughly a tenth of the original), and that completes in around 1100 ms. I then changed it to --n=31072 (the original with its leading digit dropped, so roughly 2.4x the previous reduced value), and that run also timed out.
I’m hosting these two cards in an Intel i7 920 machine running Windows 7 Ultimate 64-bit, using the most current (at least, as of September 27, 2009) 64-bit NVIDIA WHQL CUDA drivers. My display card is a PNY GeForce 8400 (512 MB) installed in a PCI slot. I am not running the n-body demo on the PNY PCI card.
I’ve done some digging, and the issue seems to be that the CUDA kernel is not starting up on the Teslas in time (i.e. within five seconds, or so I’ve read), so the Windows watchdog timer expires and the watchdog kills and restarts the driver. I’ve seen some information pointing to a registry key change that might work but is ill-advised, and I don’t know whether that information is current for Windows 7. I also read something suggesting that cards that are not driving any displays should not be subject to the watchdog timer. On the other hand, since the system uses the same driver for all three cards, it would seem that Windows will kill and restart the driver even if the kernels are only set to run on the Tesla cards.
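For what it’s worth, the timings from my reduced runs look roughly consistent with a limit of about five seconds. Here is a back-of-the-envelope sketch in Python. It rests on assumptions I have not verified: that the demo’s run time grows with the square of the body count (all-pairs interactions), that the ~1100 ms I measured at --n=13107 is representative, and that the whole measured time can be treated as if it were one unit of work subject to the watchdog. The function name and the 5000 ms constant are just mine for illustration.

```python
# Rough estimate only. Assumes run time scales as n**2 (all-pairs n-body)
# and that ~1100 ms at n=13107 is a representative reference timing.

WATCHDOG_MS = 5000.0  # assumed limit, taken from the "five seconds" figure I read


def estimate_ms(n, ref_n=13107, ref_ms=1100.0):
    """Scale the reference timing quadratically to a new body count."""
    return ref_ms * (n / ref_n) ** 2


if __name__ == "__main__":
    for n in (13107, 31072, 131072):
        ms = estimate_ms(n)
        status = "over" if ms > WATCHDOG_MS else "under"
        print(f"n={n:>6}: ~{ms:,.0f} ms ({status} the assumed limit)")
```

Under those assumptions, --n=31072 comes out at roughly 6 seconds and --n=131072 at well over a minute, which would match what I’m seeing (the smallest run finishes, the other two time out). It doesn’t tell me whether the watchdog should apply to the Teslas at all, though.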
My apologies in advance for the lack of more detail in this initial post. I’ll post logs and excerpts from tests I’ve run, as well as more complete system information later on, if needed or helpful.
Thanks in advance.