I have an XP 64-bit system with a Tesla C1060 card. The latest version of my code crashes the card after the principal kernel call with a launch timeout. My bad, and I will sort it out, but I do call __syncthreads before leaving the kernel and cannot work out what is taking the time. The next call in the main execution path is to cudaGLUnmapBufferObject, which fails with the launch timeout, reported courtesy of the CUDA_SAFE_CALL macro. Any subsequent kernel call crashes the card and freezes the PC, and I have to reboot.
I have tried reinitializing the card (well, calling cudaSetDevice for the card at any rate) and then calling cudaThreadExit. Neither reports any errors, but the card is still frozen. Running a kernel again freezes the PC.
Is there any way to reset the card other than rebooting?
Since there is NO VM (virtual memory) on CUDA, there is no way to stop a rogue kernel. So if your CUDA program goes out of control and starts destroying other areas of device memory, there is little that can be done.
It is possible that the buggy CUDA kernel is destroying a sensitive piece of information stored in other areas of device memory, causing the crash. But storing such sensitive information in device memory may NOT be a great idea (probably the NV driver does… and hence pays for it)!
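To narrow down where a kernel like this goes wrong, it may help to check for errors right at the launch. Kernel launches are asynchronous, so a failure inside the kernel tends to show up at the next runtime call (cudaGLUnmapBufferObject in the post above) rather than at the launch itself. Note also that __syncthreads is only a barrier among the threads of one block, so calling it before the kernel returns does not bound the kernel's run time. Below is a minimal sketch, not the poster's code: the `CHECK` macro and the `scale` kernel are hypothetical placeholders, and `cudaDeviceSynchronize` assumes CUDA 4.0 or later (older toolkits use `cudaThreadSynchronize` for the same purpose).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): print the error and exit
// if a runtime call did not return cudaSuccess.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Placeholder kernel. The bounds guard keeps threads past the end of
// the array from writing outside the allocation.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = NULL;
    CHECK(cudaMalloc((void **)&d_data, n * sizeof(float)));
    CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    scale<<<blocks, threads>>>(d_data, n, 2.0f);

    // Catches errors in the launch itself (e.g. an invalid configuration).
    CHECK(cudaGetLastError());
    // Blocks until the kernel finishes, so errors raised while it ran
    // (including a launch timeout) are reported here, not at a later call.
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaFree(d_data));
    printf("kernel completed without error\n");
    return 0;
}
```

Synchronizing after every launch costs performance, so this is probably best kept as a debugging aid. The `if (i < n)` guard is the cheap part worth keeping: it is one simple way to stop a kernel from writing into areas of device memory it does not own.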