CUDA program crashes PC - sometimes!

Dear all

I’m having some trouble with a CUDA program I’ve written. It compiles, and sometimes it runs, but most of the time it crashes the PC: the screen goes black and the graphics card fan immediately spins up to full speed. I have to do a hard power-off, and even after the PC is turned back on, it will not run any 3D application such as a game; starting one results in the same black screen and full-speed fan. I got functionality back by moving the card to another PCIe slot on the motherboard (and later back to the original one). Simply removing and reinstalling the driver was not enough.

I am worried that my program is exceeding some hardware limitation and causing these crashes. I launch the kernel with 1440 blocks of 64 threads:
myKernelOS<<<1440, 64>>>(arguments);

I have checked the ptxas info when compiling, and if I add up the values it reports, I am using 176 registers and about 1200 bytes of cmem. The card I’m using is a GeForce RTX 2080. So does this exceed any hardware limitation?
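For reference, here is a small check against the device limits that I can run on the card (a minimal sketch using the runtime API; device 0 is assumed to be the RTX 2080):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0: the RTX 2080

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size (x):     %d\n", prop.maxGridSize[0]);
    printf("Registers per block:   %d\n", prop.regsPerBlock);
    printf("Total constant memory: %zu bytes\n", prop.totalConstMem);

    // My launch is 1440 blocks of 64 threads at 176 registers per thread,
    // i.e. 64 * 176 = 11264 registers per block, which should sit well
    // under the reported per-block register limit.
    return 0;
}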

You may be hitting a WDDM TDR timeout. Just google that or read this:

https://docs.nvidia.com/gameworks/content/developertools/desktop/timeout_detection_recovery.htm

Thank you for the suggestion, but I have already disabled the TDR timeout.

Checking the ptxas info again, I tried commenting out parts of the kernel code that call another device function, which I would have expected to reduce the number of registers used considerably, but instead the count went up from 176 to 188…
Does this tool properly count the registers used by device functions called from inside the kernel code?
Edit: Never mind, that device function was also called elsewhere in the kernel. When I comment out all calls to it, the number of registers used does go down considerably.
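In case I do need to cut register usage later, my understanding is that __launch_bounds__ lets ptxas budget registers for a known block size (a sketch only; the signature and body of myKernelOS here are placeholders):

// Telling the compiler the block size (64 threads) lets it allocate
// registers for that configuration; too low a cap can force spills
// to local memory, so it is a trade-off.
__global__ void __launch_bounds__(64) myKernelOS(const float *in, float *out)
{
    // ... kernel body ...
}

The -maxrregcount nvcc option does something similar for a whole compilation unit.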

It’s possible that you didn’t disable TDR correctly.
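One way to verify is to ask the runtime whether it still sees a kernel execution timeout on the device (a minimal sketch; kernelExecTimeoutEnabled is the documented cudaDeviceProp field):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // 1 means the watchdog (TDR on Windows) is still active for this device.
    printf("kernelExecTimeoutEnabled = %d\n", prop.kernelExecTimeoutEnabled);
    return 0;
}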

And even if you have completely disabled TDR, Windows may still become unstable if your CUDA kernel runs for a long time. There is no solution for this when running CUDA codes on a WDDM GPU. The correct approach is to design your code to have relatively short kernel runtimes (e.g. 0.1 second or less) so that Windows GUI operation is not adversely affected. You should understand, based on the design of your code, how to use a profiler to measure kernel runtimes and determine the factors (such as launch configuration) that affect them.
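For a first measurement, before reaching for a profiler, CUDA event timing around the launch is enough (a sketch to drop into your host code; the kernel arguments are elided):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
myKernelOS<<<1440, 64>>>(/* arguments */);
cudaEventRecord(stop);

cudaEventSynchronize(stop);            // wait for the kernel to finish
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);  // aim for roughly 100 ms or less under WDDM

cudaEventDestroy(start);
cudaEventDestroy(stop);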

Codes running on Linux where the X GUI is not using the GPU, or on Windows in TCC mode, don’t have these issues. Your RTX 2080 cannot be placed into TCC mode and will always be in WDDM mode on Windows.

The register usage and cmem usage don’t exceed any hardware limitations: the per-thread limit on your GPU is 255 registers, and constant memory totals 64 KB, so 176 registers and ~1200 bytes of cmem are both comfortably within bounds.

You should start by measuring your actual kernel runtimes. Also, it’s generally a good idea to build a release project rather than a debug project on Windows, unless you are actively debugging the code.
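For reference, the device-debug flag is usually what makes the biggest difference between the two (a sketch of the relevant nvcc invocations; the file names are placeholders):

# debug build: -G disables most device-code optimization, inflating runtimes
nvcc -G -o app_debug app.cu

# release build, also printing the ptxas register/cmem usage at compile time
nvcc -O3 -Xptxas -v -o app_release app.cu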

Alright, thanks Robert.
I am getting a Titan V card; running on my RTX 2080 was only preparation. I’m hoping the issue will be resolved once the program runs in TCC mode on Windows.