I am getting blue screens with certain combinations of blocks and threads as kernel launch parameters.
I am posting because I am wondering whether this is normal or whether it means the kernel has some deeper issue.
Previously, when I launched kernels with different block and thread settings, it only affected performance, not stability.
You may be running into a WDDM TDR timeout. Just google that.
I am not saying it cannot happen, but I have never seen a WDDM timeout cause a blue screen, in dozens of instances of hitting Windows’ GUI watchdog timer timeout. With the timeout, the screen will go black for a few seconds, then the GUI desktop will be restored and a message will be shown that the driver recovered. Any GPU-accelerated app (CUDA or OpenCL) will die in the process, since its GPU context will be destroyed as the driver resets.
It is likely worthwhile noting the exact error message displayed when the blue screen occurs (such as report of a machine-check exception), as well as the exact hardware / software machine configuration. Remote diagnosis of blue screen failures is difficult.
I have not encountered a blue screen on Windows (or the equivalent “kernel panic” on Linux) in many years: at least five, possibly ten. These are events that are not supposed to happen and are indicative of serious problems somewhere.
Yes there are TDR notices popping up when it runs. I just read the WDDM timeout thread that is pinned in the forum.
I will try putting a thread sleep on the host and possibly a cudaDeviceSynchronize call to see if that helps.
I take it the term “blue screen” was used loosely then? If you are hitting the watchdog timer limit, here are your potential remedies:
(1) Reduce the run time of your kernels to less than ~2 seconds
(2) Increase (or disable) the watchdog timer limit (OS specific; Google will be your friend)
(3) Use a faster GPU
Item (1) may be addressable, up to a certain point, through software optimization. Use the CUDA profiler to identify the bottlenecks and address them as much as possible. The CUDA Best Practices Guide may be helpful in that.
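When a kernel cannot be made fast enough through optimization alone, one way to approach item (1) is to split the work across several shorter launches, so that no single launch runs long enough to trip the watchdog. The following is a minimal sketch under stated assumptions, not a drop-in fix: `scale_chunk`, `num_chunks`, `run_chunked`, and the scaling operation itself are hypothetical and chosen only for illustration, and a suitable chunk size has to be found by profiling on your own GPU.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: scales the elements [offset, offset + count) of data.
__global__ void scale_chunk(float *data, size_t offset, size_t count, float factor)
{
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < count) {
        data[offset + i] *= factor;
    }
}

// Number of launches needed to cover n elements in pieces of `chunk` elements.
// Assumes chunk > 0.
static size_t num_chunks(size_t n, size_t chunk)
{
    return (n + chunk - 1) / chunk;
}

// Processes n device-resident floats in several short launches instead of one
// long one. Returns the first CUDA error encountered, or cudaSuccess.
static cudaError_t run_chunked(float *d_data, size_t n, size_t chunk, int threads)
{
    for (size_t c = 0; c < num_chunks(n, chunk); ++c) {
        size_t offset = c * chunk;
        size_t count = (n - offset < chunk) ? (n - offset) : chunk;
        unsigned int blocks =
            static_cast<unsigned int>((count + threads - 1) / threads);

        scale_chunk<<<blocks, threads>>>(d_data, offset, count, 2.0f);

        // Report an invalid launch configuration immediately.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) return err;

        // Wait for this chunk so a failure is attributed to the chunk that caused it.
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

A caller would allocate `d_data` with `cudaMalloc`, call something like `run_chunked(d_data, n, 1 << 20, 256)`, and print `cudaGetErrorString` of the result on failure. The per-chunk synchronization costs some throughput, so it is a trade of speed for shorter individual launches.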
Just my personal experience with the TDR timeout: I raised it to somewhere between 10 and 15 seconds, because cuda-memcheck can take more than 2 seconds to do its job even if the kernel normally runs much faster than that.
I’m yet to see a BSOD due to wrong CUDA code, and boy have I done a lot of sh**t…
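To judge how much headroom a kernel has before the watchdog fires, it can help to check whether a run-time limit applies to the device at all and to time the kernel with CUDA events. A rough sketch follows; `busy_kernel`, `watchdog_enabled`, `time_kernel_ms`, and `near_watchdog_limit` are made-up names for illustration, and the limit is passed in by the caller (for example the ~2 seconds discussed in this thread) rather than read from Windows.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel.
__global__ void busy_kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = sqrtf(static_cast<float>(i));
    }
}

// True if the measured time has reached `margin` (e.g. 0.8 = 80%) of the limit.
static bool near_watchdog_limit(float elapsed_ms, float limit_ms, float margin)
{
    return elapsed_ms >= limit_ms * margin;
}

// Returns 1 if the device enforces a kernel run-time limit, 0 if not, -1 on error.
static int watchdog_enabled(int device)
{
    int value = 0;
    if (cudaDeviceGetAttribute(&value, cudaDevAttrKernelExecTimeout, device)
            != cudaSuccess) {
        return -1;
    }
    return value ? 1 : 0;
}

// Times one launch of busy_kernel on the default stream.
// Returns milliseconds, or a negative value on error.
static float time_kernel_ms(float *d_out, int n, int threads)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    int blocks = (n + threads - 1) / threads;
    cudaEventRecord(start);
    busy_kernel<<<blocks, threads>>>(d_out, n);
    cudaEventRecord(stop);

    float ms = -1.0f;
    if (cudaEventSynchronize(stop) != cudaSuccess ||
        cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) {
        ms = -1.0f;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

If `watchdog_enabled(0)` returns 1 and `near_watchdog_limit(time_kernel_ms(d_out, n, 256), 2000.0f, 0.8f)` is true, the kernel is close enough to a 2-second limit that running it under a slower tool such as cuda-memcheck could plausibly push it over.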
It is a full BSOD with the blue screen. What doesn’t make sense is that TDR should make the computer more ‘stable’, yet when I switch it off the blue screens don’t occur.
So now I’m thinking that TDR resetting the graphics or blocking the application from the graphics hardware could be causing the BSOD.
If Windows resets the GPUs, does that mean my application now holds an invalid reference to the cudaDevice?
Or is it possible that the application tried to access unprocessed data because the kernel failed to launch?
Just looking to get some ideas for things to investigate, any help is greatly appreciated!
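On the question of reading unprocessed data: kernel launches are asynchronous, and neither a failed launch nor a fault during execution stops the host program unless the return codes are checked. A hedged sketch of the usual checking pattern is below; `fill_kernel`, `all_equal`, and `run_and_verify` are illustrative names, and nothing here assumes which specific error code appears after a driver reset.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: writes `value` to every element.
__global__ void fill_kernel(float *out, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = value;
    }
}

// Host-side check that every result matches what the kernel should produce.
static bool all_equal(const std::vector<float> &v, float expected)
{
    for (float x : v) {
        if (x != expected) return false;
    }
    return true;
}

// Launches the kernel and reads results back only if every step succeeded.
// Assumes n > 0.
static bool run_and_verify(int n, int blocks, int threads)
{
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) return false;

    fill_kernel<<<blocks, threads>>>(d_out, n, 1.0f);

    // Errors in the launch itself (e.g. a bad block size) show up here.
    cudaError_t err = cudaGetLastError();

    // Errors during execution show up once the host waits for the kernel.
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    std::vector<float> h_out(static_cast<size_t>(n), 0.0f);
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    }
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_out);
    return err == cudaSuccess && all_equal(h_out, 1.0f);
}
```

For example, `run_and_verify(1 << 20, 4096, 256)` covers every element, while a call with an out-of-range block size should fail at the `cudaGetLastError` check and return `false` rather than go on to read data the kernel never wrote.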
Did you increase the timeout value in Windows? There are videos showing how to do it.
Can you run the same program on another computer with an NVIDIA card to see if the same happens?
Are the kernel launch parameters within the limits of your device, like maximum number of threads?
Read Robert Crovella’s answer here: https://devtalk.nvidia.com/default/topic/978550/cuda-programming-and-performance/maximum-number-of-threads-on-thread-block/
I have this in my bookmarks.
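The launch-parameter check can also be done in code before the kernel is launched. This is a small sketch assuming only the standard fields of `cudaDeviceProp`; the function names `launch_config_ok` and `launch_config_ok_for_device` are my own.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Checks a grid/block configuration against the limits reported for a device.
static bool launch_config_ok(const cudaDeviceProp &prop, dim3 grid, dim3 block)
{
    // Every dimension must be at least 1.
    if (grid.x == 0 || grid.y == 0 || grid.z == 0) return false;
    if (block.x == 0 || block.y == 0 || block.z == 0) return false;

    // Total threads per block (64-bit arithmetic avoids overflow).
    unsigned long long threads_per_block =
        static_cast<unsigned long long>(block.x) * block.y * block.z;
    if (threads_per_block >
        static_cast<unsigned long long>(prop.maxThreadsPerBlock)) {
        return false;
    }

    // Per-dimension block limits.
    if (block.x > static_cast<unsigned int>(prop.maxThreadsDim[0])) return false;
    if (block.y > static_cast<unsigned int>(prop.maxThreadsDim[1])) return false;
    if (block.z > static_cast<unsigned int>(prop.maxThreadsDim[2])) return false;

    // Per-dimension grid limits.
    if (grid.x > static_cast<unsigned int>(prop.maxGridSize[0])) return false;
    if (grid.y > static_cast<unsigned int>(prop.maxGridSize[1])) return false;
    if (grid.z > static_cast<unsigned int>(prop.maxGridSize[2])) return false;

    return true;
}

// Convenience wrapper that queries the limits of the given device first.
static bool launch_config_ok_for_device(int device, dim3 grid, dim3 block)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
    return launch_config_ok(prop, grid, block);
}
```

Calling `launch_config_ok_for_device(0, grid, block)` before a launch catches oversized blocks and grids. It does not cover per-kernel resource limits such as registers or shared memory, so checking `cudaGetLastError()` after the launch is still needed.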