Crashes - display driver recovers Cuda program causes card to give up.

Stu2000 · June 20, 2011, 9:03pm

Hello all,
I recently discovered that I could run more than 65535 blocks at a time by using a second dimension, e.g.

dim3 grid(65535, 12);
TestKernel<<<grid, 1>>>(dev_testArray);

instead of
TestKernel<<<65535, 1>>>(dev_testArray);

This meant that I could run more blocks with less work for each thread (12x more blocks for 12x less work per thread).
When I originally developed this program, I noticed that the graphics card would crash (with the "driver successfully recovered" message) if I ran more than about 12 or 15 permutations per thread (512 threads per block), apparently because the thread would take too long; at least that is what people told me.
This is why I was so happy to discover I could run more blocks, so that each thread would run for a shorter period of time. However, I now get the same crash when using 12x more blocks instead of running each thread 12x longer. (I get the feeling the overall program is actually much less efficient now, for various reasons.)
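
For readers following along, here is a minimal sketch of how a kernel launched with a two-dimensional grid like the one above might turn its block coordinates into a single work index. TestKernel2D, linearBlockId and the output array are hypothetical names for illustration, not code from the posted program.

```cuda
#include <cuda_runtime.h>

// Flatten a 2D block coordinate into one linear block number.
// __host__ __device__ so the same arithmetic can also be checked on the CPU.
__host__ __device__ inline unsigned int linearBlockId(unsigned int bx,
                                                      unsigned int by,
                                                      unsigned int gridDimX)
{
    return by * gridDimX + bx;
}

// Hypothetical stand-in for TestKernel: each thread writes its own global index.
__global__ void TestKernel2D(unsigned int *out, unsigned int n)
{
    unsigned int block = linearBlockId(blockIdx.x, blockIdx.y, gridDim.x);
    unsigned int idx = block * blockDim.x + threadIdx.x;
    if (idx < n)   // guard in case the grid covers more than n elements
        out[idx] = idx;
}
```

Assuming dev_out holds 65535 * 12 elements, this would be launched as TestKernel2D<<<dim3(65535, 12), 1>>>(dev_out, 65535u * 12u).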

For a look at the code, you can download it here: http://www.putfile2.com/f/1259/ujrndf
I put it into two folders so that you can see the code that worked and the code that doesn’t. There may be just a logical error that is causing problems (or many errors combined).
I am hoping that by being able to run many more blocks, I should be able to push past 12-node TSP to 13+.

Any help appreciated.

Stu

tera · June 21, 2011, 12:00am

The watchdog timer kicks in if a kernel takes too long for a full grid, not for an individual thread. So it does not matter how you divide the work between threads and blocks.

An easy way to run more blocks without triggering the watchdog is to divide them between multiple kernel launches (i.e., instead of launching one grid of 65535×3 blocks, launch three grids of 65535 blocks each).
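
A rough sketch of that idea, under some assumptions: the kernel takes an extra blockOffset argument so each launch knows which slice of the work it owns, and the host synchronizes after every launch so the launches are not submitted to the driver as one batch. ChunkedKernel, runInChunks and launchCount are hypothetical names, and devOut is assumed to hold totalBlocks * threadsPerBlock elements.

```cuda
#include <cuda_runtime.h>

// Launches needed to cover totalBlocks with at most maxPerLaunch blocks each.
inline unsigned int launchCount(unsigned int totalBlocks, unsigned int maxPerLaunch)
{
    return (totalBlocks + maxPerLaunch - 1) / maxPerLaunch;  // ceiling division
}

// Hypothetical kernel: blockOffset says where this launch's slice of work starts.
__global__ void ChunkedKernel(unsigned int *out, unsigned int blockOffset)
{
    unsigned int block = blockOffset + blockIdx.x;  // block number across all launches
    out[block * blockDim.x + threadIdx.x] = block;
}

// Run totalBlocks blocks as a series of shorter launches of at most 65535 blocks.
cudaError_t runInChunks(unsigned int *devOut, unsigned int totalBlocks,
                        unsigned int threadsPerBlock)
{
    const unsigned int maxPerLaunch = 65535;
    const unsigned int launches = launchCount(totalBlocks, maxPerLaunch);
    for (unsigned int i = 0; i < launches; ++i) {
        unsigned int offset = i * maxPerLaunch;
        unsigned int remaining = totalBlocks - offset;
        unsigned int blocks = remaining < maxPerLaunch ? remaining : maxPerLaunch;
        ChunkedKernel<<<blocks, threadsPerBlock>>>(devOut, offset);
        cudaError_t err = cudaGetLastError();   // catches bad launch configurations
        if (err != cudaSuccess) return err;
        err = cudaDeviceSynchronize();          // wait, so each launch finishes on its own
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

Each individual launch still has to finish within the watchdog limit, so if a single grid of 65535 blocks is already too slow, the chunk size would need to be smaller.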

Skybuck · June 21, 2011, 5:51am

Yeah, you could try disabling the watchdog :)

Stu2000 · June 21, 2011, 8:39am

Ooooh, I really like this idea.

I found this thread about the timeout issue, and they mentioned the ‘no monitor plugged in’ concept, which I have already tried. I have a crappy ATI card plus an NVIDIA GTX 260, so I can still see.

Can someone confirm whether I need to manually disable the Windows driver watchdog, or whether having no monitors plugged in should be good enough?

Does it have to be a Tesla gfx card?
The following screenshots show the error message, and that I have tried without any monitors plugged into the gfx card.
http://www.uploadscreenshot.com/image/361307/990296
http://www.uploadscreenshot.com/image/361306/8515921

Ideally I would like to be able to run a CUDA application or kernel for more than 15 seconds, probably more like 1+ minutes…

I will try splitting it up over several kernels for now, thank you.

Best wishes,

Stu

Stu2000 · June 21, 2011, 9:37am

I have since put some more time into this and followed the instructions on this website:
http://www.blog-gpgpu.com/index.php?post/2010/07/22/Windows-Vista-7-%3A-How-to-disable-the-Timeout-Detection-and-Recovery-of-GPUs-through-WDDM

It works! It turns out that it was indeed an issue with the watchdog. Unplugging the monitors from that card is not enough; you do have to change the Win7 registry no matter what.
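
One way to see whether the driver reports a run time limit for a given card is the kernelExecTimeoutEnabled field of cudaDeviceProp in the CUDA runtime. The sketch below only reads that flag; the helper names watchdogEnabled and reportWatchdog are hypothetical, and I have not verified how the registry change described above affects what the flag reports.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns 1 if a kernel run time limit is reported for `device`,
// 0 if not, and -1 if the device properties could not be read.
int watchdogEnabled(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    return prop.kernelExecTimeoutEnabled ? 1 : 0;
}

// Print the reported status for every CUDA device in the system.
void reportWatchdog()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        printf("Could not query CUDA devices\n");
        return;
    }
    for (int d = 0; d < count; ++d)
        printf("device %d: run time limit flag = %d\n", d, watchdogEnabled(d));
}
```

Calling reportWatchdog() from main before launching a long kernel gives a quick hint about whether that card is subject to the timeout.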

What I found odd was that the old program, which works with the watchdog on, is actually a fraction slower (0.45% according to the Compute Visual Profiler). In that program the threads ran for 15x longer instead of there being 15x as many blocks. Since I am already maxing out the blocks/threads that can run on the gfx card, I had thought that running the threads for longer would be, if anything, faster, not slower.

Anyway, problem solved. Thanks guys.
