"dead" reset cuda device when debugging

little_jimmy · March 26, 2014, 7:35am

Hello,

Is there a way to completely and entirely reset a cuda device in software?

cudaDeviceReset() does not seem to fully do the job

I suspect my program has a major bug still, and I am still debugging

I am running multiple kernels in multiple streams, so although I am stepping through my program, at this point it is difficult (impossible) to prevent a (major) crash

Now, when I exit debugging (to make fixes) and later restart debugging, I am unable to properly use the cuda device again (I can’t re-enter my program and step through it once it has used the device), until I either reboot, or let the host sleep for a second and then wake it up again - both cases of course switch the cuda device off, thereby resetting it

linux fedora 20
geforce gtx 780ti

geoffg · March 31, 2014, 6:09pm

Hi,

Would you be able to post the code, and the steps that reproduce this problem?

When you are letting the host sleep for a second, are you able to check if any of your processes are still running?

Regarding cudaDeviceReset(), the API will destroy the primary context that the calling host thread is operating on, and allows for outstanding buffers to flush; however, it does not guarantee that the GPU fully recovers in all situations as you have observed here.
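
As a minimal sketch of where cudaDeviceReset() usually sits in a program, the example below checks every runtime call and resets the device both on the error path and at normal exit. The helper name check and the kernel dummy_kernel are hypothetical and only there for illustration; as noted above, the reset only tears down this process's context and is not a guarantee that a badly wedged GPU recovers.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: report a CUDA runtime error, reset, and exit.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        // Destroy this process's primary context before exiting.
        cudaDeviceReset();
        exit(EXIT_FAILURE);
    }
}

// Hypothetical placeholder kernel: each thread writes its own index.
__global__ void dummy_kernel(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

int main()
{
    int *d_out = nullptr;
    check(cudaMalloc(&d_out, 32 * sizeof(int)), "cudaMalloc");

    dummy_kernel<<<1, 32>>>(d_out);
    // Launch errors are reported by cudaGetLastError(); errors raised
    // while the kernel runs surface at the next synchronizing call.
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "kernel execution");

    check(cudaFree(d_out), "cudaFree");
    check(cudaDeviceReset(), "cudaDeviceReset");
    return 0;
}
```

Checking cudaDeviceSynchronize() after each launch while debugging tends to surface a faulting kernel close to where it happened, rather than at some later, unrelated call.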

little_jimmy · April 1, 2014, 6:53am

“Would you be able to post the code, and the steps that reproduce this problem?”

The project containing all the source is spread over more than 10k lines of code

When debugging, I can note when the program is misbehaving and going off track (mostly by noting execution times), and can then terminate it; the cuda view in the debugger would then generally show no more cuda kernels, etc. running
But when I change code and then immediately debug again without sleeping/ rebooting, the execution path at and around the cuda kernel calls is different (the program execution path is determined by cuda kernel results/ outcomes, and outcomes that initially succeeded would then fail, for instance)

It MIGHT be because, at that time, I ran cuda5-5 on fedora 20 with gcc 4.8, but I am not altogether convinced, given that debugging itself is not truly erratic - it seems more “device-side” than “host-side”
I have since moved to cuda-6
(I installed fedora prior to cuda, and can not get myself to move from fedora 20 to 18/ 19)
(I have also made progress in terms of debugging, hence crashes are less severe; and there were indeed serious bugs)

What I am equally considering at this point is that my cuda code sections are too long by far
I need to perform numerous steps/ calculations at a time, and at this point, instead of implementing it as a set of relatively small kernels (less than 100 or 200 lines of code each), I have implemented it as a single kernel with inlined functions; the latter approach has less overhead, but perhaps it is not the best practical approach. I am not sure how the device stores/ buffers code to execute, and whether the device actually has a “program/ code memory” constraint (memory used to store code on the device/ SM when scheduling work for warps/ blocks)
Some indications suggest that this is becoming an issue…
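
One way to get some numbers on a large kernel is cudaFuncGetAttributes(), sketched below. It does not report instruction/ code size, so it does not answer the code memory question directly, but it does show registers and local memory per thread, which may be worth watching for a big inlined kernel. The kernel container_kernel and the helpers step_a and step_b are hypothetical stand-ins, not the actual code discussed here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for inlined device functions.
__device__ __forceinline__ float step_a(float x) { return x * 2.0f + 1.0f; }
__device__ __forceinline__ float step_b(float x) { return x * x - 3.0f; }

// Hypothetical "container" kernel that calls the inlined steps.
__global__ void container_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = step_b(step_a(in[i]));
}

int main()
{
    // Query the compiled kernel's resource usage from the runtime.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, container_kernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread   : %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    printf("static shared per block: %zu bytes\n", attr.sharedSizeBytes);
    printf("constant memory        : %zu bytes\n", attr.constSizeBytes);
    printf("max threads per block  : %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```

A non-zero local memory figure on a kernel that was meant to work only from shared memory and registers can also be a hint that an array ended up per-thread rather than shared.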

little_jimmy · April 1, 2014, 7:36am

For the sake of clarity, maybe I should elaborate a bit more

The major bugs I eventually found were the use of local memory as shared memory - perceiving and using memory as shared, but forgetting to explicitly declare it as such
The majority of my memory was declared correctly, but isolated instances of improper declaration were enough to cause bad values to be used, seriously throwing off the program execution
It seems as if the host and its allies lost full control of the device under these circumstances, and were unable to fully recover the device for subsequent use without sleeping/ rebooting
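
To illustrate the kind of bug described above, here is a minimal sketch with two hypothetical kernels, reverse_buggy and reverse_fixed, which differ only in the __shared__ qualifier. It is not the original code; the names, the block size and the reversal task are assumptions chosen to keep the example small. Error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 64

// Buggy: without __shared__, every thread gets its own private
// scratch array, so reading another thread's slot reads memory
// that this thread never initialized.
__global__ void reverse_buggy(const int *in, int *out)
{
    int scratch[BLOCK];               // per-thread local array
    int t = threadIdx.x;
    scratch[t] = in[t];
    __syncthreads();
    out[t] = scratch[BLOCK - 1 - t];  // uninitialized read
}

// Fixed: one scratch array per block, visible to all its threads.
__global__ void reverse_fixed(const int *in, int *out)
{
    __shared__ int scratch[BLOCK];
    int t = threadIdx.x;
    scratch[t] = in[t];
    __syncthreads();
    out[t] = scratch[BLOCK - 1 - t];
}

// Copy the result back and count elements that are not the reversal.
static int count_mismatches(const int *d_out, const int *h_in)
{
    int h_out[BLOCK];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    int bad = 0;
    for (int i = 0; i < BLOCK; ++i)
        if (h_out[i] != h_in[BLOCK - 1 - i]) ++bad;
    return bad;
}

int main()
{
    int h_in[BLOCK];
    for (int i = 0; i < BLOCK; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_in));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    reverse_buggy<<<1, BLOCK>>>(d_in, d_out);
    printf("buggy: %d mismatches\n", count_mismatches(d_out, h_in));

    reverse_fixed<<<1, BLOCK>>>(d_in, d_out);
    printf("fixed: %d mismatches\n", count_mismatches(d_out, h_in));

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The buggy variant compiles without error, and what it prints depends on whatever happens to be in the uninitialized local memory, which is what makes this class of bug hard to spot.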

With regards to the program and its structuring - the program does span more than 10k lines of code in its entirety; the majority of this being cuda code
However:
Input data is read from a database by the host, consists of arrays, and on average spans about 10 - 100 KB
Input data array length can vary significantly, and the program uses 4 branches - 4 algorithm variants - to adapt to this
The variant I debugged when I experienced the ‘proper reset issue’ is contained in a cuda section that spans no more than 1500 lines of code, contained in no more than 10 functions, mostly inlined in a single container kernel

REPoore · April 3, 2014, 7:41pm

On Linux, I’ve had good luck doing “sudo /sbin/rmmod nvidia.ko” when my GPU state gets messed up. This will unload the Nvidia driver module. The next time a Cuda program wants to run, the driver will get restarted.
