I recently discovered that I could run more than 65535 blocks at a time by using a second dimension, e.g.
dim3 grid(65535, 12);
This meant that I could run more blocks with less work for each thread (12x more blocks, with each thread doing 1/12 of the work).
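To make the indexing concrete, here is a minimal sketch of how a 2D grid can be flattened back into a single block number so each thread still gets a unique global index. The names linearBlockId and writeGlobalIndex are made up for illustration and are not taken from my program, and the grid is deliberately small (1000 x 3 instead of 65535 x 12) so the test buffer fits on any card.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: flatten a 2D block position into one block number.
__host__ __device__ inline unsigned int linearBlockId(unsigned int bx,
                                                      unsigned int by,
                                                      unsigned int gridDimX)
{
    return by * gridDimX + bx;
}

// Hypothetical kernel: each thread stores its own global index.
__global__ void writeGlobalIndex(unsigned int *out, unsigned int total)
{
    unsigned int block = linearBlockId(blockIdx.x, blockIdx.y, gridDim.x);
    unsigned int tid   = block * blockDim.x + threadIdx.x;
    if (tid < total)            // guard against running past the buffer
        out[tid] = tid;
}

int main()
{
    const dim3 grid(1000, 3);               // small stand-in for grid(65535, 12)
    const unsigned int threads = 128;       // threads per block (assumed value)
    const unsigned int total = grid.x * grid.y * threads;

    unsigned int *dOut = NULL;
    if (cudaMalloc(&dOut, total * sizeof(unsigned int)) != cudaSuccess)
        return 1;

    writeGlobalIndex<<<grid, threads>>>(dOut, total);

    // Copy back only the last element; the copy waits for the kernel to finish.
    unsigned int last = 0;
    cudaError_t err = cudaMemcpy(&last, dOut + (total - 1), sizeof(last),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess)
        return 1;

    printf("last index written: %u, highest index expected: %u\n",
           last, total - 1);
    return (last == total - 1) ? 0 : 1;
}
```

With grid(65535, 12) the same arithmetic gives block numbers from 0 to 786419, which still fits comfortably in an unsigned int even after multiplying by 512 threads per block.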
When I originally developed this program, I noticed that the graphics card would crash (with the "driver successfully recovered" message) if I ran more than about 12 to 15 permutations per thread (at 512 threads per block). Apparently this was because each thread took too long, or at least that is what people told me.
This is why I was so happy to discover that I could run more blocks: each thread should then run for a shorter period of time. However, I now get the same crash with 12x more blocks as I did when running each thread 12x longer. (I also get the feeling that the overall program is now much less efficient, for various reasons.)
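To confirm whether the crash really is a timeout, a check like the one below could be placed right after the kernel launch. This is only a sketch and not code from my program: dummyKernel stands in for the real permutation kernel, and isTimeout and checkKernel are hypothetical helper names. It assumes a CUDA version that provides cudaDeviceSynchronize (older toolkits used cudaThreadSynchronize instead).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real permutation kernel.
__global__ void dummyKernel(int *out)
{
    if (blockIdx.x == 0 && blockIdx.y == 0 && threadIdx.x == 0)
        *out = 1;
}

// Hypothetical helper: true if the error is the launch-timeout code.
bool isTimeout(cudaError_t err)
{
    return err == cudaErrorLaunchTimeout;
}

// Hypothetical helper: reports launch and execution errors for the last kernel.
bool checkKernel(const char *label)
{
    cudaError_t err = cudaGetLastError();       // bad launch configuration, etc.
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();          // errors raised while running
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s%s\n", label, cudaGetErrorString(err),
                isTimeout(err) ? " (kernel ran too long)" : "");
        return false;
    }
    return true;
}

int main()
{
    int *dOut = NULL;
    if (cudaMalloc(&dOut, sizeof(int)) != cudaSuccess)
        return 1;

    dim3 grid(65535, 12);                       // same grid shape as above
    dummyKernel<<<grid, 512>>>(dOut);
    bool ok = checkKernel("dummyKernel");

    cudaFree(dOut);
    return ok ? 0 : 1;
}
```

If the reported error is the timeout one, that would suggest the limit applies to how long the whole kernel launch runs rather than to how long any single thread runs, which would explain why spreading the same work over more blocks did not help.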
For a look at the code, you can download it here: http://www.putfile2.com/f/1259/ujrndf
I put it into two folders so that you can see the code that worked and the code that doesn’t. There may be just a logical error causing the problems (or several errors combined).
I am hoping that being able to run many more blocks will let me push past a 12-node TSP to 13 nodes or more.
Any help appreciated.