Block + Thread parameters causing blue screens on Windows

jungle · October 8, 2018, 4:12pm

I am getting blue screens with certain combinations of blocks and threads as kernel launch parameters.
The reason for the post is that I am wondering whether this is a normal thing to happen, or whether it means the kernel has some deeper issue?

Previously, when I launched kernels with different block and thread settings, it only affected performance, not stability.

Robert_Crovella · October 8, 2018, 4:30pm

You may be running into a WDDM TDR timeout. Just google that.

njuffa · October 8, 2018, 6:25pm

I am not saying it cannot happen, but I have never seen a WDDM timeout cause a blue screen, in dozens of instances of hitting Windows’ GUI watchdog timer timeout. With the timeout, the screen will go black for a few seconds, then the GUI desktop will be restored and a message will be shown that the driver recovered. Any GPU accelerated app (CUDA or OpenCL) will die in the process since their respective GPU contexts will be destroyed as the driver resets.
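
Whether the watchdog applies at all depends on the device and its driver mode. A minimal sketch, assuming the CUDA runtime API, that queries the `cudaDevAttrKernelExecTimeout` device attribute; the helper name `watchdogActive` is hypothetical and only for illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns 1 if a run-time limit (watchdog) applies to kernels on `device`,
// 0 if it does not, and -1 if the query itself failed (for example, no
// usable CUDA device or driver).
int watchdogActive(int device)
{
    int enabled = 0;
    cudaError_t err =
        cudaDeviceGetAttribute(&enabled, cudaDevAttrKernelExecTimeout, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "attribute query failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return enabled ? 1 : 0;
}
```

A result of 1 for the GPU in question would mean that long-running kernels on it are subject to the timeout described above.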

It is likely worthwhile noting the exact error message displayed when the blue screen occurs (such as report of a machine-check exception), as well as the exact hardware / software machine configuration. Remote diagnosis of blue screen failures is difficult.

I have not encountered a blue screen on Windows (or the equivalent “kernel panic” on Linux) in many years: at least five, possibly ten. These are events that are not supposed to happen and are indicative of serious problems somewhere.

jungle · October 8, 2018, 8:05pm

Yes, there are TDR notices popping up when it runs. I just read the WDDM timeout thread that is pinned in the forum.

I will try putting a thread sleep on the host and possibly a cudaDeviceSynchronize to see if that helps.

njuffa · October 8, 2018, 8:12pm

I take it the term “blue screen” was used loosely then? If you are hitting the watchdog timer limit, here are your potential remedies:

(1) Reduce the run time of your kernels to less than ~2 seconds
(2) Increase (or disable) the watchdog timer limit (OS specific; Google will be your friend)
(3) Use a faster GPU

Item (1) may be addressable, up to a certain point, through software optimization. Use the CUDA profiler to identify the bottlenecks and address them as much as possible. The CUDA Best Practices Guide may be helpful in that.
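
Besides optimization, one common way to approach item (1) is to split the work across several shorter launches. The following is a minimal sketch under stated assumptions: the kernel `processChunk`, the helpers `numChunks` and `processInChunks`, and the per-element arithmetic are all hypothetical stand-ins for the real workload, and it assumes the elements can be processed independently.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical workload: each thread updates one element of the current chunk.
__global__ void processChunk(float *data, size_t offset, size_t count)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        data[offset + i] = data[offset + i] * 2.0f + 1.0f;
    }
}

// Number of launches needed to cover `total` elements, `chunk` at a time.
size_t numChunks(size_t total, size_t chunk)
{
    return chunk == 0 ? 0 : (total + chunk - 1) / chunk;
}

// Processes `total` elements in pieces of at most `chunk` elements, so that
// no single launch has to run for long. The caller is assumed to pick `chunk`
// small enough that the resulting block count fits the device's grid limits.
cudaError_t processInChunks(float *d_data, size_t total, size_t chunk,
                            int threadsPerBlock)
{
    if (chunk == 0 || threadsPerBlock <= 0) {
        return cudaErrorInvalidValue;
    }
    for (size_t offset = 0; offset < total; offset += chunk) {
        size_t count = std::min(chunk, total - offset);
        unsigned int blocks =
            (unsigned int)((count + threadsPerBlock - 1) / threadsPerBlock);
        processChunk<<<blocks, threadsPerBlock>>>(d_data, offset, count);
        cudaError_t err = cudaGetLastError();   // launch errors
        if (err != cudaSuccess) return err;
        err = cudaDeviceSynchronize();          // wait, so launches are not batched
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

The chunk size is a tuning knob: smaller chunks keep each launch further from the limit at the cost of more launch and synchronization overhead.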

saulocpp · October 9, 2018, 8:46am

Just my personal experience with the TDR timeout: I raised it to somewhere between 10 and 15 seconds because cuda-memcheck can take more than 2 seconds to do its job, even if the kernel itself runs much faster than that.
I have yet to see a BSOD caused by wrong CUDA code, and boy have I done a lot of sh**t…

jungle · October 9, 2018, 3:23pm

It is a full BSOD with the blue screen. What doesn’t make sense is that TDR should make the computer more ‘stable’, yet when I switch it off the blue screens don’t occur.

So now I’m thinking that TDR resetting the graphics or blocking the application from the graphics hardware could be causing the BSOD.

If Windows resets the GPUs, does that mean my application now holds an invalid reference to the cudaDevice?

Or is it possible that the application tried to access unprocessed data because the kernel failed to launch, etc.?

Just looking to get some ideas for things to investigate; any help is greatly appreciated!
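
One thing worth investigating along these lines is whether the host code checks for errors before it touches results. As noted earlier in the thread, a driver reset destroys the GPU context, so later CUDA calls should report an error rather than succeed. A minimal sketch, assuming the CUDA runtime API; the kernel `scale` and the helpers `checkCuda` and `scaleOnDevice` are hypothetical names used only for illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real one (1D launch assumed).
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Logs a failure and reports whether the call succeeded.
bool checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Launches the kernel and copies results back only if both the launch and
// the execution succeeded, so the host never reads unprocessed data.
bool scaleOnDevice(float *d_data, float *h_out, int n, float factor,
                   dim3 grid, dim3 block)
{
    scale<<<grid, block>>>(d_data, n, factor);
    if (!checkCuda(cudaGetLastError(), "kernel launch")) return false;
    if (!checkCuda(cudaDeviceSynchronize(), "kernel execution")) return false;
    return checkCuda(cudaMemcpy(h_out, d_data, (size_t)n * sizeof(float),
                                cudaMemcpyDeviceToHost),
                     "copy to host");
}
```

If `scaleOnDevice` returns false after a reset, the safest assumption is that device pointers and other resources from the old context are no longer usable.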

saulocpp · October 9, 2018, 8:32pm

Did you increase the timeout value in Windows? There are videos showing how to do it.
Can you run the same program on another computer with an NVIDIA card to see whether the same thing happens?
Are the kernel launch parameters within the limits of your device, like maximum number of threads?
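
The last check can be sketched in code. This is a minimal example, assuming the CUDA runtime API and its `cudaDeviceProp` fields `maxThreadsPerBlock`, `maxThreadsDim` and `maxGridSize`; the helper names `launchConfigFits` and `launchConfigFitsDevice` are hypothetical. It covers only the dimension limits, not per-kernel constraints such as register or shared memory usage.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// True if the grid and block dimensions fit within the limits in `prop`.
bool launchConfigFits(const cudaDeviceProp &prop, dim3 grid, dim3 block)
{
    unsigned long long threads =
        (unsigned long long)block.x * block.y * block.z;
    if (threads == 0 || threads > (unsigned long long)prop.maxThreadsPerBlock) {
        return false;
    }
    if (block.x > (unsigned int)prop.maxThreadsDim[0] ||
        block.y > (unsigned int)prop.maxThreadsDim[1] ||
        block.z > (unsigned int)prop.maxThreadsDim[2]) {
        return false;
    }
    if (grid.x == 0 || grid.y == 0 || grid.z == 0) {
        return false;
    }
    if (grid.x > (unsigned int)prop.maxGridSize[0] ||
        grid.y > (unsigned int)prop.maxGridSize[1] ||
        grid.z > (unsigned int)prop.maxGridSize[2]) {
        return false;
    }
    return true;
}

// Convenience wrapper that queries the limits of `device` first.
// Returns false if the properties cannot be read.
bool launchConfigFitsDevice(int device, dim3 grid, dim3 block)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    return launchConfigFits(prop, grid, block);
}
```

Calling `launchConfigFitsDevice(0, grid, block)` before each launch, with the same `grid` and `block` that are passed to the kernel, would show whether the failing combinations are simply outside what the device accepts.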

Read Robert Crovella’s answer here: https://devtalk.nvidia.com/default/topic/978550/cuda-programming-and-performance/maximum-number-of-threads-on-thread-block/
I have this in my bookmarks.
