instead of
TestKernel<<<65535, 1>>>(dev_testArray);
This meant that I could run more blocks with less work per thread (12x more blocks, with each thread doing 1/12th of the work).
When I originally developed this program, I noticed that the graphics card would crash (with a "driver successfully recovered" message) if I ran more than about 12 or 15 permutations per thread (at 512 threads per block), apparently because each thread took too long, or at least that is what people told me.
This is why I was so happy to discover I could run more blocks: each thread should then run for a shorter period of time. However, I now get the same crash using 12x more blocks instead of running each thread 12x longer. (I get the feeling the overall program is actually much less efficient now, for various reasons.)
For a look at the code, you can download it here: http://www.putfile2.com/f/1259/ujrndf
I put it into two folders so that you can see the code that worked and the code that doesn't. There may be just a logical error causing the problems (or several errors combined).
I am hoping that by being able to run many more blocks, I can push past 12-node TSP to 13+ nodes.
The watchdog timer kicks in if a kernel takes too long for a full grid, not for an individual thread. So it does not matter how you divide the work between threads and blocks.
An easy way to run more blocks without triggering the watchdog is to divide them between multiple kernel launches (i.e., instead of launching one grid of 65535×3 blocks, launch three grids of 65535 blocks each).
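A minimal sketch of that approach, assuming a kernel indexed by global block number (the offset parameter, `blocksPerLaunch`, and `numLaunches` are my own names, not from the posted code): pass each launch a block offset so the kernel can compute which unit of work it owns, and loop over several smaller grids. Each launch then stays under the watchdog limit on its own.

```
#include <cuda_runtime.h>

// Hypothetical kernel matching the TestKernel/dev_testArray names above;
// the blockOffset parameter is an assumption added for the split launches.
__global__ void TestKernel(int *testArray, unsigned int blockOffset)
{
    // Global work index: this block's position across all launches combined.
    unsigned int globalBlock = blockOffset + blockIdx.x;
    testArray[globalBlock] = globalBlock; // placeholder for the real work
}

void runSplitLaunches(int *dev_testArray)
{
    const unsigned int blocksPerLaunch = 65535;
    const unsigned int numLaunches = 3; // 3 grids instead of one 65535*3 grid

    for (unsigned int i = 0; i < numLaunches; ++i) {
        // Each launch is short enough that the watchdog never fires;
        // the driver can service the display between launches.
        TestKernel<<<blocksPerLaunch, 1>>>(dev_testArray,
                                           i * blocksPerLaunch);
    }
    cudaDeviceSynchronize(); // wait for all launches to finish
}
```

The kernel code itself barely changes; only the host loop and the offset parameter are new.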
I found this thread about the timeout issue, and it mentioned the "no monitor plugged in" approach, which I have already tried. I have a cheap ATI card alongside my NVIDIA GTX 260, so I can still see the display.
Can someone confirm whether I need to manually disable the Windows watchdog, or whether having no monitors plugged into the card is good enough?
Does it have to be a Tesla card?
The following screenshots show that I have tried it without any monitors plugged into the graphics card, along with the error message.
It works! It turns out it was indeed a watchdog issue: unplugging the monitors from that card is not enough, and you do have to change the Windows 7 registry setting no matter what.
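For anyone finding this later, the setting involved is Windows TDR (Timeout Detection and Recovery). A sketch of the registry change (this is the documented `GraphicsDrivers` key; the exact values you want may differ): `TdrLevel=0` disables the watchdog entirely, or you can instead raise `TdrDelay` (in seconds) to allow longer kernels. A reboot is required, and with TDR off a hung kernel can freeze the whole machine.

```
Windows Registry Editor Version 5.00

; Disable the GPU watchdog (TdrLevel = 0). Alternatively, uncomment
; TdrDelay to keep the watchdog but lengthen the timeout (value in seconds).
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrLevel"=dword:00000000
;"TdrDelay"=dword:0000000a
```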
What I found odd was that the old program, which works with the watchdog on, is actually a fraction slower (0.45% according to the Compute Visual Profiler). In that program the threads ran 15x longer instead of there being 15x as many blocks. Since I am already maxing out the blocks/threads that can run on the card, I had thought that running each thread for longer would be, if anything, faster, not slower.