"dead" reset cuda device when debugging

Hello,

Is there a way to completely and entirely reset a CUDA device in software?

cudaDeviceReset() does not entirely manage it.

I suspect my program still has a major bug, and I am still debugging.

I am running multiple kernels in multiple streams, so although I am stepping through my program, at this point it is difficult (impossible) to prevent a (major) crash.

Now, when I exit debugging (for fixes) and later restart debugging, I am unable to properly use the CUDA device again (I can’t re-enter my program and step past the point where it uses the device) until I either reboot, or let the host sleep for a second and then wake it up again. Both cases of course switch the CUDA device off, thereby resetting it.

Linux, Fedora 20
GeForce GTX 780 Ti

Hi,

Would you be able to post the code, and the steps that reproduce this problem?

When you are letting the host sleep for a second, are you able to check if any of your processes are still running?

Regarding cudaDeviceReset(), the API will destroy the primary context that the calling host thread is operating on, and allows for outstanding buffers to flush; however, it does not guarantee that the GPU fully recovers in all situations as you have observed here.
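To make that concrete, here is a minimal sketch of a shutdown path that drains outstanding work and then calls cudaDeviceReset(). It assumes the process still gets to run its own exit path, which is not the case when it is killed from the debugger. The helper names cudaOk and drainAndReset are made up for illustration; only cudaDeviceSynchronize, cudaDeviceReset and cudaGetErrorString are runtime API calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print a CUDA runtime error, return true on success.
static bool cudaOk(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical helper: wait for all outstanding work on the current device,
// report any fault, then destroy this process's state on that device.
// Returns true only if both the synchronize and the reset succeeded.
bool drainAndReset()
{
    // Surfaces errors from kernels that are still running in any stream.
    bool clean = cudaOk(cudaDeviceSynchronize(), "cudaDeviceSynchronize");

    // Reset even after a fault: an error such as an illegal address typically
    // leaves the context unusable for further work in this process.
    bool reset = cudaOk(cudaDeviceReset(), "cudaDeviceReset");

    return clean && reset;
}
```

Calling drainAndReset() at the end of main(), and on every error path that still returns normally, at least ensures the process does not exit with work in flight. It does not help when the process is terminated from outside.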

“Would you be able to post the code, and the steps that reproduce this problem?”

The project containing all the source is spread over more than 10k lines of code

When debugging, I can tell when the program is misbehaving and going off track (mostly from execution times), and can then terminate it; the CUDA view in the debugger then generally shows no more CUDA kernels etc. running.
But when I change code and then immediately debug again without sleeping/rebooting, the execution path at and around the CUDA kernel calls is different (the program's execution path is determined by kernel results/outcomes, and outcomes that initially succeeded would then fail, for instance).

It MIGHT be because, at that time, I ran CUDA 5.5 on Fedora 20 with gcc 4.8, but I am not altogether convinced, given that debugging itself is not truly erratic - it seems more “device-side” than “host-side”.
I have since moved to CUDA 6.
(I installed Fedora prior to CUDA, and cannot bring myself to move from Fedora 20 back to 18/19.)
(I have also made progress in terms of debugging, hence crashes are less severe; and there were indeed serious bugs.)

What I am equally considering at this point is that my CUDA code sections are far too long.
I need to perform numerous steps/calculations at a time, and at this point, instead of implementing them as a set of relatively small kernels (fewer than 100 or 200 lines of code each), I have implemented them as a single kernel with inlined functions. The latter approach has less overhead, but perhaps it is not the most practical one. I am not sure how the device stores/buffers the code it executes, and whether the device actually has a “program/code memory” constraint (memory used to store code on the device/SM when scheduling work for warps/blocks).
Some indications suggest that this is becoming an issue…
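One way to get some numbers on what a large, heavily inlined kernel costs is to ask the runtime for its resource usage with cudaFuncGetAttributes. This does not answer the code-memory question directly; it only shows registers, per-thread local memory and static shared memory for the compiled kernel. The kernel containerKernel and the helper reportKernelResources below are hypothetical stand-ins, not code from the project.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the large container kernel.
__global__ void containerKernel(float *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

// Hypothetical helper: print the resources the compiled kernel needs.
// Returns true if the attributes could be queried.
bool reportKernelResources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, containerKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return false;
    }
    printf("registers per thread   : %d\n", attr.numRegs);
    // Nonzero local memory can hint at per-thread arrays or register spills.
    printf("local bytes per thread : %zu\n", attr.localSizeBytes);
    printf("static shared bytes    : %zu\n", attr.sharedSizeBytes);
    printf("max threads per block  : %d\n", attr.maxThreadsPerBlock);
    return true;
}
```

Comparing these figures for the single container kernel against a version split into smaller kernels would show whether the inlined variant pays for its lower launch overhead with higher register or local memory use.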

For the sake of clarity, maybe I should elaborate a bit more

The major bugs I eventually found were uses of local memory as shared memory - perceiving and using memory as shared, but forgetting to explicitly declare it as such.
The majority of my memory was declared correctly, but isolated instances of improper declaration were enough to cause bad values to be used, seriously derailing the program execution.
It seems as if the host and everything on its side lost full control of the device under these circumstances, and could not fully recover the device for subsequent use without sleeping/rebooting.
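For anyone hitting the same thing, the bug pattern can be sketched with a small block-wide sum. This is an illustrative example, not code from the project; blockSum, runBlockSum and kBlock are made-up names. The comment inside the kernel marks the one-word difference between the broken and the working declaration.

```cuda
#include <cuda_runtime.h>

constexpr int kBlock = 64;  // threads per block (assumed for this sketch)

__global__ void blockSum(const int *in, int *out)
{
    // Bug pattern: writing `int buf[kBlock];` here (no __shared__) gives every
    // thread its own private array, so thread 0 would see only the element it
    // wrote itself and read uninitialized values for all the others.
    __shared__ int buf[kBlock];

    buf[threadIdx.x] = in[threadIdx.x];
    __syncthreads();  // make all writes to buf visible to the whole block

    if (threadIdx.x == 0) {
        int s = 0;
        for (int i = 0; i < kBlock; ++i) s += buf[i];
        *out = s;
    }
}

// Hypothetical host wrapper: sums 1..kBlock on one block.
// Returns the sum, or -1 on any CUDA error.
int runBlockSum()
{
    int h_in[kBlock];
    for (int i = 0; i < kBlock; ++i) h_in[i] = i + 1;

    int *d_in = nullptr, *d_out = nullptr;
    int h_out = -1;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    blockSum<<<1, kBlock>>>(d_in, d_out);
    // The blocking copy also reports errors from the kernel launch/execution.
    if (cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess)
        h_out = -1;

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With the __shared__ qualifier in place the wrapper returns 2080 (the sum of 1 to 64). Without it the kernel still compiles and runs, which is what makes this kind of mistake easy to miss.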

With regard to the program and its structure: the program does span more than 10k lines of code in its entirety, the majority of it CUDA code.
However:
Input data is read from a database by the host, consists of arrays, and on average spans about 10 - 100 KB.
Input data array length can vary significantly, and the program uses 4 branches - 4 algorithm variants - to adapt to this.
The variant I was debugging when I experienced the ‘proper reset issue’ is contained in a CUDA section that spans no more than 1500 lines of code, in no more than 10 functions, mostly inlined in a single container kernel.

On Linux, I’ve had good luck doing “sudo /sbin/rmmod nvidia.ko” when my GPU state gets messed up. This unloads the NVIDIA kernel driver. The next time a CUDA program wants to run, the driver gets loaded again.