Reset crashed CUDA card

I have a 64-bit XP system with a Tesla C1060 card. The latest version of my code crashes the card with a launch timeout after the principal kernel call. My bad, and I will sort it out; however, I do call __syncthreads() before leaving the kernel and cannot work out what is taking the time. The next call in the main execution is to cudaGLUnmapBufferObject, which fails with the launch timeout, as reported by the CUDA_SAFE_CALL macro. Any subsequent kernel call crashes the card and freezes the PC, and I have to reboot.

I have tried reinitializing the card (well, calling cudaSetDevice on the card at any rate) and then calling cudaThreadExit. Neither reports any errors, but the card is still frozen. Running a kernel call again freezes the PC.

Is there any way to reset the card other than rebooting?

Thanks,
John

__syncthreads() is NOT needed before leaving the kernel.
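
To illustrate the point, here is a minimal sketch (not the original poster's code). __syncthreads() is a barrier between the threads of one block; it is needed between a shared-memory write and a read of another thread's element, and it does nothing useful as the last statement of a kernel. The kernel name reverseBlock, the helper runReverseDemo and the size N are made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 64  // illustrative block size, not taken from the original code

// Hypothetical kernel: reverses N ints using a single block.
__global__ void reverseBlock(int *d)
{
    __shared__ int s[N];
    int t = threadIdx.x;
    s[t] = d[t];
    // Needed: every thread must finish writing s[] before any thread
    // reads an element that a different thread wrote.
    __syncthreads();
    d[t] = s[N - 1 - t];
    // Not needed here: nothing after this point reads shared data.
}

// Returns 0 if the array came back reversed, 1 otherwise.
int runReverseDemo()
{
    int h[N];
    for (int i = 0; i < N; ++i) h[i] = i;

    int *d = 0;
    if (cudaMalloc((void **)&d, N * sizeof(int)) != cudaSuccess) return 1;
    cudaMemcpy(d, h, N * sizeof(int), cudaMemcpyHostToDevice);

    reverseBlock<<<1, N>>>(d);

    // This blocking copy waits for the kernel to finish and also
    // returns any error the kernel raised while it was running.
    cudaError_t err = cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < N; ++i)
        if (h[i] != N - 1 - i) return 1;
    return 0;
}

int main()
{
    int failed = runReverseDemo();
    printf("%s\n", failed ? "reverse failed" : "reverse ok");
    return failed;
}
```

Waiting for the kernel as a whole is the host's job (here the blocking cudaMemcpy does it), so a trailing __syncthreads() neither speeds up nor protects the exit.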

Please tell us which CUDA and driver versions you are using. If you are not on the latest (2.1), try upgrading and see if your problem goes away.

Thanks for the tip on __syncthreads().

I found the problem that caused the crash: I was overrunning an array. I would still like to know if it is possible to reset the card using software rather than a reboot.
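
For anyone landing here with the same overrun, a rough sketch of the usual guard follows. It assumes a simple 1D array; the kernel name scale, the helper runScaleDemo and the sizes are invented for illustration and are not the code from this thread. It also checks for errors right after the launch, since kernel launches are asynchronous and a failure is otherwise only reported by some later call (which is how cudaGLUnmapBufferObject ended up returning the timeout). Note that cudaDeviceSynchronize is the name used from CUDA 4.0 onwards; on the 2.x toolkits discussed here, cudaThreadSynchronize played that role.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each of the n elements by factor.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The grid is rounded up to whole blocks, so the last block can hold
    // threads past the end of the array. Without this guard they would
    // write out of bounds.
    if (i < n)
        data[i] *= factor;
}

// Returns 0 if every element was scaled correctly, 1 otherwise.
int runScaleDemo()
{
    const int n = 1000;        // deliberately not a multiple of the block size
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = 0;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // rounds up to 4 blocks
    scale<<<blocks, threads>>>(d, n, 2.0f);

    // Errors in the launch itself (e.g. a bad configuration) show up here.
    cudaError_t err = cudaGetLastError();
    // Errors raised while the kernel runs show up at the next sync point.
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(d);
        return 1;
    }

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    for (int i = 0; i < n; ++i)
        if (h[i] != 2.0f) return 1;
    return 0;
}

int main()
{
    int failed = runScaleDemo();
    printf("%s\n", failed ? "scale failed" : "scale ok");
    return failed;
}
```

Checking right after the launch will not un-freeze a card, but it does pin the failure on the kernel that caused it instead of on whichever call happens to come next.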

Since there is NO VM (virtual memory) on CUDA, there is no way to stop a rogue kernel. So if your CUDA program goes out of control and starts destroying other areas of device memory, there is little that can be done.

It is possible that the buggy kernel is destroying some sensitive piece of information stored in other areas of device memory, causing the crash. Storing such sensitive information in device memory may NOT be a great idea (the NV driver probably does so, and hence pays for it)!

Try using Device Manager. Disable the graphics card, which may only work if you have another graphics chip (say, onboard graphics) to run the primary display.

Then re-enable the device and see whether that implicitly resets CUDA.

Christian

Good idea, but Device Manager requires a restart when enabling or disabling the card!