I am having the same problem with a Dell Precision 690, using the same software versions you mention. Any time code runs on the GPU for more than about 7.5 seconds, the CUDA call returns prematurely and the following error is emitted to the system log:
NVRM: Xid (000a:00): 8, Channel 00000001
Depending on the code following the GPU call, I may get:
terminate called after throwing an instance of ‘bool’
This is a documented issue if you read the driver release notes. I believe the solution is either to run on a GPU that's not being managed by X, or to break long-running kernels into shorter ones. I myself just changed my kernels so that no single kernel
runs for more than a few seconds at a time, which was actually better for me anyway as it made my code easier to multithread for multiple GPUs later on…
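The chunking workaround described above can be sketched roughly like this. The kernel name, chunk size, and data layout here are all hypothetical, not from the original post; the point is simply that each launch covers a slice of the data small enough to finish well under the watchdog limit:

```cuda
// Hypothetical sketch: splitting one long-running launch into many short
// ones so each stays well under the X watchdog limit (~5-7 s). myKernel,
// N, and CHUNK are illustrative names, not from the original post.
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int offset, int count) {
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + count)
        data[i] *= 2.0f;            // stand-in for the real per-element work
}

int main() {
    const int N = 1 << 24;
    const int CHUNK = 1 << 20;      // sized so one launch runs a few seconds at most
    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));

    for (int offset = 0; offset < N; offset += CHUNK) {
        int count = (N - offset < CHUNK) ? N - offset : CHUNK;
        int threads = 256;
        int blocks = (count + threads - 1) / threads;
        myKernel<<<blocks, threads>>>(d_data, offset, count);
        cudaThreadSynchronize();    // finish this chunk before launching the next
    }
    cudaFree(d_data);
    return 0;
}
```

Synchronizing between launches keeps any single burst of GPU work short, at the cost of some launch overhead per chunk.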
Q. My kernel log contains messages that are prefixed with “Xid”; what do these messages mean?
A. “Xid” messages indicate that a general GPU error occurred, most often due
to the driver misprogramming the GPU or to corruption of the commands sent
to the GPU. These messages provide diagnostic information that can be used
by NVIDIA to aid in debugging reported problems.
I use X on this machine, if at all, only through a non-8800 card.
Like you say, the current solution is to run smaller amounts of work per chunk…
Any Windows or Linux folks running more than 7 or 8 seconds of happy computing on the GPU in a single CUDA call?
Google suggests that Ye olde NVRM: Xid errors may be related to a larger context of issues… and I believe the system is convinced that the 8800 is not associated with an X display.
It may be useful to know whether the Windows software stack on the same box runs fine. I might give this a whirl at some point. I’m not sure having a Windows box around with one or more 8800s is conducive to productivity, however :D .
We are avoiding the timeout problem by invoking our kernel on smaller data chunks, each taking about 2-4 seconds. However, after running like this for 12 minutes on a very large data set, we get an “unspecified driver error”. Has anyone seen problems like this yet, or run number crunching that takes as long as 12 minutes?
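One thing that helps with late-appearing failures like this is checking for an error after every chunked launch, so the first failing invocation is pinpointed instead of surfacing minutes later on an unrelated call. This is a hypothetical helper, not the poster's code:

```cuda
// Sketch (hypothetical helper): check the CUDA error state after every
// chunked launch so the first failing chunk is identified immediately,
// rather than as an "unspecified driver error" minutes later.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static void checkLastError(const char *where) {
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error after %s: %s\n",
                where, cudaGetErrorString(err));
        exit(1);
    }
}

// Usage inside the chunk loop (myKernel and its arguments are illustrative):
//     myKernel<<<blocks, threads>>>(d_data, offset, count);
//     cudaThreadSynchronize();        // make the launch's errors visible
//     checkLastError("chunk launch");
```

Kernel launches are asynchronous, so synchronizing before the check is what makes an error from that specific chunk show up there and not later.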
We’re running some huge averaged Coulombic potential jobs on 3 GPUs at a time for a few hours. They currently end up crashing eventually because CUDA appears to have a memory deallocation bug that shows up over a long period when thousands of kernels have been run. Eventually, no kernels will run anymore and the machine has to be rebooted. Until the progressive memory leak runs the cards out of memory (usually takes about 2 days) they run fine. Each kernel invocation only runs for about 3 seconds, but we keep them cranking along until one of the issues I’ve described occurs, requiring a reboot.
I’ve had the card in use for nearly 4 days solid, but that was divided into separate jobs which would run for 6-12 hours and exit. Within each job, the kernel calls were very short (tens of milliseconds), but there would be 6 million calls per job. So far I have not been able to jam anything up, even when I abort the jobs in the middle of running.
Yes, I actually have two bugs filed. One for the slowly occurring leak that builds up over several days and requires a reboot to cure, and another for what appears to be a bug with cudaFree not deallocating blocks of memory from within child host threads in our multi-GPU runs. I’m working on a simpler test case so you guys can reproduce it more easily.
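For anyone hitting the same thing: the multi-GPU pattern involved looks roughly like the sketch below (names hypothetical, not the poster's code). In CUDA 1.x the runtime context is tied to the host thread, so device memory allocated inside a worker thread must also be freed inside that same thread; the reported bug is that even then the blocks are not actually deallocated:

```cuda
// Sketch of the per-thread multi-GPU pattern being described. In CUDA 1.x
// the runtime context is bound to the host thread, so each worker thread
// must do its own cudaSetDevice, cudaMalloc, and cudaFree; calling cudaFree
// from a different thread than the one that allocated does not work.
#include <pthread.h>
#include <cuda_runtime.h>

static void *worker(void *arg) {
    int device = *(int *)arg;
    cudaSetDevice(device);          // bind this thread's context to one GPU
    float *d_buf;
    cudaMalloc((void **)&d_buf, 1 << 20);
    // ... launch kernels on this GPU ...
    cudaFree(d_buf);                // free in the SAME thread that allocated
    return 0;
}

int main() {
    int devices[2] = {0, 1};
    pthread_t t[2];
    for (int i = 0; i < 2; ++i)
        pthread_create(&t[i], 0, worker, &devices[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], 0);
    return 0;
}
```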