Problems with CUDA on Linux

Hi everyone.

I’m experiencing some problems with CUDA on Linux (Fedora 6).

I’m using CUDA version 0.8 and the NVIDIA-Linux-x86-1.0-9751 driver.

The result is:

This problem occurs only when I put real load on the device. For small amounts of data everything is fine.

Any suggestion?

Thanks!

I am having the same problem with a Dell Precision 690, using the same software versions you mention. Any time code runs on the GPU for more than about 7.5 seconds, the CUDA call returns prematurely and the following error is emitted to the system log:

NVRM: Xid (000a:00): 8, Channel 00000001

Depending on the code following the GPU call, I may get:

terminate called after throwing an instance of ‘bool’
Aborted

from my code…

This is a documented issue if you read the driver release notes. I believe the solution is to either run on a GPU that’s not being managed by X or to keep individual kernel launches short. I myself just changed my kernels so that no single kernel runs for more than a few seconds at a time, which was actually better for me anyway, as it made my code easier to multithread for multiple GPUs later on…
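
To illustrate what I mean, here is a rough sketch (the kernel and names are made up, not my actual code) of launching the same kernel repeatedly over slices of the data, so that no single launch runs long enough to hit the limit:

    #include <cuda_runtime.h>

    // Hypothetical kernel: processes one slice of a large array.
    __global__ void process_chunk(float *d_data, int offset, int n)
    {
        int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
        if (i < offset + n)
            d_data[i] *= 2.0f;          // stand-in for the real work
    }

    // Launch the kernel once per slice instead of once over everything,
    // so each launch finishes well under the watchdog limit.
    void run_in_chunks(float *d_data, int total, int chunk)
    {
        for (int offset = 0; offset < total; offset += chunk) {
            int n = (total - offset < chunk) ? (total - offset) : chunk;
            int threads = 256;
            int blocks  = (n + threads - 1) / threads;

            process_chunk<<<blocks, threads>>>(d_data, offset, n);
            cudaThreadSynchronize();    // wait before issuing the next slice
        }
    }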

John

Looks to me like the Driver README.txt says:

Q. My kernel log contains messages that are prefixed with “Xid”; what do these
messages mean?

A. “Xid” messages indicate that a general GPU error occurred, most often due
to the driver misprogramming the GPU or to corruption of the commands sent
to the GPU. These messages provide diagnostic information that can be used
by NVIDIA to aid in debugging reported problems.

I use X on this machine, if at all, only through a non-8800 card.

Like you say, the current solution is to run smaller amounts of work per chunk…

Any Windows or Linux folks running more than 7 or 8 seconds of happy computing on the GPU in a single CUDA call?

Sorry, the note I was referring to was actually in the CUDA release notes, not the driver. It says:

o Individual GPU program launches are limited to a run time of less than 5 seconds on the device. Exceeding this time limit usually causes a launch failure reported through the CUDA driver or the CUDA runtime, but in some cases hangs the entire machine, requiring a hard reset. For this reason it is recommended that CUDA is run on a G80 that is NOT attached to an X display.

o While X does not need to be running in order to use CUDA, X must have been initialized at least once after booting in order to properly load the NVIDIA kernel module. The NVIDIA kernel module remains loaded even after X shuts down, allowing CUDA to continue to function.

Hope that helps.

John

Google suggests that Ye olde NVRM: Xid errors may be related to a larger context of issues… and I believe the system is convinced that the 8800 is not associated with an X display.

It may be useful to know whether the Windows software stack on the same box runs fine. I might give this a whirl at some point. I’m not sure having a Windows box around with one or more 8800s is conducive to productivity, however :D .

We are avoiding the time-out problem by invoking our kernel on smaller data chunks, each taking about 2-4 seconds. However, after running like this for 12 minutes on a very large data set, we get an “unspecified driver error”. Has anyone else seen problems like this, or run number crunching that takes as long as 12 minutes?
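
For reference, the loop is nothing fancy; something like the sketch below (the kernel and sizes are placeholders, not our real code), with an error check after every launch so the “unspecified driver error” at least gets pinned to a particular chunk:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel standing in for our real number crunching.
    __global__ void crunch(float *d_data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d_data[i] += 1.0f;
    }

    // Run the data set as a series of short launches, stopping at the
    // first chunk that reports an error.
    int crunch_all(float *d_data, int total, int chunk)
    {
        for (int c = 0; c * chunk < total; c++) {
            int n = total - c * chunk;
            if (n > chunk) n = chunk;

            crunch<<<(n + 255) / 256, 256>>>(d_data + c * chunk, n);
            cudaThreadSynchronize();

            cudaError_t err = cudaGetLastError();
            if (err != cudaSuccess) {
                // This is where the "unspecified driver error" turns up.
                fprintf(stderr, "chunk %d failed: %s\n",
                        c, cudaGetErrorString(err));
                return -1;
            }
        }
        return 0;
    }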

Hi,

We’re running some huge averaged Coulombic potential jobs on 3 GPUs at a time for a few hours. They currently end up crashing eventually, because CUDA appears to have a memory deallocation bug that builds up over a long period of time once thousands of kernels have been run. Eventually, no kernels will run anymore and the machine has to be rebooted. Until the progressive memory leak runs the cards out of memory (which usually takes about 2 days), they run fine. Each kernel invocation only runs for about 3 seconds, but we keep them cranking along until one of the issues I’ve described occurs, requiring a reboot.

Cheers,

John Stone

I’ve had the card in use for nearly 4 days solid, but that was divided into separate jobs which would run for 6-12 hours and exit. Within each job, the kernel calls were very short (tens of milliseconds), but there would be 6 million calls per job. So far I have not been able to jam anything up, even when I abort the jobs in the middle of running.

John, have you submitted a bug report on this on the registered developer site?

Thanks,
Mark

Mark,

Yes, I actually have two bugs filed: one for the slowly occurring leak that builds up over several days and requires a reboot to cure, and another for what appears to be a bug with cudaFree not deallocating blocks of memory from within child host threads in our multi-GPU runs. I’m working on making a simpler test case so you guys can reproduce it more easily.
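
Roughly, each child thread in the test case does something like the sketch below (heavily simplified, and the kernel is just a stand-in); one host thread is created per GPU, and each thread repeatedly allocates, launches, and frees:

    #include <cstdio>
    #include <pthread.h>
    #include <cuda_runtime.h>

    __global__ void dummy_kernel(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] = 0.0f;
    }

    // One of these runs per GPU, in its own host thread.
    void *gpu_worker(void *arg)
    {
        int dev = *(int *)arg;
        cudaSetDevice(dev);               // bind this host thread to one GPU

        const int n = 1 << 20;
        for (int iter = 0; iter < 1000; iter++) {
            float *d_buf = 0;
            cudaMalloc((void **)&d_buf, n * sizeof(float));
            dummy_kernel<<<(n + 255) / 256, 256>>>(d_buf, n);
            cudaThreadSynchronize();
            cudaFree(d_buf);              // the free that appears not to
                                          // return memory when called here
        }
        return 0;
    }

    int main()
    {
        int ids[3] = {0, 1, 2};           // three GPUs, as in our runs
        pthread_t threads[3];
        for (int i = 0; i < 3; i++)
            pthread_create(&threads[i], 0, gpu_worker, &ids[i]);
        for (int i = 0; i < 3; i++)
            pthread_join(threads[i], 0);
        return 0;
    }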

John

This is really annoying.

Is it a bug that will be fixed?

Or are there technical reasons that make this unavoidable?

Or are there political reasons, i.e. will a production version of CUDA without this restriction cost money? That would be good to know before spending money on a G80 card.

Regards

Markus

The system hang bug will be fixed. The time limit is avoidable by using a non-display GPU for CUDA computations.
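
If it helps, picking the compute card is just a matter of selecting the right device index before making any other CUDA calls; a minimal sketch (the device index is, of course, system-specific):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        // List the installed GPUs so you can spot the one X is not using.
        for (int i = 0; i < count; i++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("device %d: %s\n", i, prop.name);
        }

        // Select the non-display GPU (index 1 is only an example) before
        // any other CUDA calls; launches on that device are not subject
        // to the display watchdog.
        cudaSetDevice(1);
        return 0;
    }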

Mark

Which time limit? The 5 seconds per kernel invocation, eceflyboy’s 12-minute unspecified driver error, or John’s several-hour memory dealloc bug?

I am at runlevel 3 with no monitor attached, running CUDA, and I still experience the 5-7-second bug (in addition to another “terminate called after throwing an instance of ‘bool’” problem).