I am having the same problem with a Dell Precision 690, using the same software versions you mention. Any time code runs on the GPU for more than about 7.5 seconds, the CUDA call returns prematurely and the following error is emitted to the system log:
NVRM: Xid (000a:00): 8, Channel 00000001
Depending on the code following the GPU call, I may get:
terminate called after throwing an instance of ‘bool’
This is a documented issue if you read the driver release notes. I believe the solution is either to run on a GPU that's not being managed by X, or to break long-running kernels into shorter ones. I myself just changed my kernels so that no single kernel
runs for more than a few seconds at a time, which was actually better for me anyway as it made my code easier to multithread for multiple GPUs later on…
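The chunking workaround described above can be sketched roughly like this. The kernel name, chunk size, and data layout here are all hypothetical, not from the original post; the point is simply that each launch covers a slice of the data small enough to finish well under the watchdog limit:

```cuda
// Hypothetical sketch: splitting one long-running launch into many short
// ones so each stays well under the X watchdog limit (~5-7 s). myKernel,
// N, and CHUNK are illustrative names, not from the original post.
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int offset, int count) {
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + count)
        data[i] *= 2.0f;            // stand-in for the real per-element work
}

int main() {
    const int N = 1 << 24;
    const int CHUNK = 1 << 20;      // sized so one launch runs a few seconds at most
    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));

    for (int offset = 0; offset < N; offset += CHUNK) {
        int count = (N - offset < CHUNK) ? N - offset : CHUNK;
        int threads = 256;
        int blocks = (count + threads - 1) / threads;
        myKernel<<<blocks, threads>>>(d_data, offset, count);
        cudaThreadSynchronize();    // finish this chunk before launching the next
    }
    cudaFree(d_data);
    return 0;
}
```

Synchronizing between launches keeps any single burst of GPU work short, at the cost of some launch overhead per chunk.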
Q. My kernel log contains messages that are prefixed with “Xid”; what do these messages mean?
A. “Xid” messages indicate that a general GPU error occurred, most often due
to the driver misprogramming the GPU or to corruption of the commands sent
to the GPU. These messages provide diagnostic information that can be used
by NVIDIA to aid in debugging reported problems.
I use X on this machine, if at all, only through a non-8800 card.
Like you say, the current solution is to run smaller amounts of work per chunk…
Any Windows or Linux folks running more than 7 or 8 seconds of happy computing on the GPU in a single CUDA call?
Google suggests that Ye olde NVRM: Xid errors may be related to a larger context of issues… and I believe the system is convinced that the 8800 is not associated with an X display.
It may be useful to know whether the Windows software stack on the same box runs fine. I might give this a whirl at some point. I’m not sure having a Windows box around with one or more 8800s is conducive to productivity, however :D .
We are avoiding the timeout problem by invoking our kernel on smaller data chunks, each taking about 2-4 seconds. However, after running like this for 12 minutes on a very large data set, we get an “unspecified driver error”. Has anyone seen problems like this yet, or run number crunching that takes as long as 12 minutes?
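One thing that helps with late-appearing failures like this is checking for an error after every chunked launch, so the first failing invocation is pinpointed instead of surfacing minutes later on an unrelated call. This is a hypothetical helper, not the poster's code:

```cuda
// Sketch (hypothetical helper): check the CUDA error state after every
// chunked launch so the first failing chunk is identified immediately,
// rather than as an "unspecified driver error" minutes later.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static void checkLastError(const char *where) {
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error after %s: %s\n",
                where, cudaGetErrorString(err));
        exit(1);
    }
}

// Usage inside the chunk loop (myKernel and its arguments are illustrative):
//     myKernel<<<blocks, threads>>>(d_data, offset, count);
//     cudaThreadSynchronize();        // make the launch's errors visible
//     checkLastError("chunk launch");
```

Kernel launches are asynchronous, so synchronizing before the check is what makes an error from that specific chunk show up there and not later.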
We’re running some huge averaged Coulombic potential jobs on 3 GPUs at a time for a few hours. They currently end up crashing eventually because CUDA appears to have a memory deallocation bug that shows up over a long period when thousands of kernels have been run. Eventually, no kernels will run anymore and the machine has to be rebooted. Until the progressive memory leak runs the cards out of memory (usually takes about 2 days) they run fine. Each kernel invocation only runs for about 3 seconds, but we keep them cranking along until one of the issues I’ve described occurs, requiring a reboot.
I’ve had the card in use for nearly 4 days solid, but that was divided into separate jobs which would run for 6-12 hours and exit. Within each job, the kernel calls were very short (tens of milliseconds), but there would be 6 million calls per job. So far I have not been able to jam anything up, even when I abort the jobs in the middle of running.
Yes, I actually have two bugs filed. One for the slowly occurring leak that builds up over several days and requires a reboot to cure, and another for what appears to be a bug with cudaFree not deallocating blocks of memory from within child host threads in our multi-GPU runs. I’m working on a simpler test case so you guys can reproduce it more easily.
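For anyone hitting the same thing: the multi-GPU pattern involved looks roughly like the sketch below (names hypothetical, not the poster's code). In CUDA 1.x the runtime context is tied to the host thread, so device memory allocated inside a worker thread must also be freed inside that same thread; the reported bug is that even then the blocks are not actually deallocated:

```cuda
// Sketch of the per-thread multi-GPU pattern being described. In CUDA 1.x
// the runtime context is bound to the host thread, so each worker thread
// must do its own cudaSetDevice, cudaMalloc, and cudaFree; calling cudaFree
// from a different thread than the one that allocated does not work.
#include <pthread.h>
#include <cuda_runtime.h>

static void *worker(void *arg) {
    int device = *(int *)arg;
    cudaSetDevice(device);          // bind this thread's context to one GPU
    float *d_buf;
    cudaMalloc((void **)&d_buf, 1 << 20);
    // ... launch kernels on this GPU ...
    cudaFree(d_buf);                // free in the SAME thread that allocated
    return 0;
}

int main() {
    int devices[2] = {0, 1};
    pthread_t t[2];
    for (int i = 0; i < 2; ++i)
        pthread_create(&t[i], 0, worker, &devices[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], 0);
    return 0;
}
```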