Kernel launches fail after moving to Fermi?

I just put a GTX 470 into a box that had been running a GTX 260 with no issues (Ubuntu 10.04 x64 with gcc 4.3, the latest drivers, and the 3.0 tools; a mostly identical box is now running the GTX 260 without issue). I rebuilt my program, and nothing works! The profiler shows none of the kernels being launched on the 470, only the initial memcpy operations. cudaGetLastError() reports no error throughout execution. The 470 is correctly recognized by my program, and all the SDK examples run fine. Both boxes link against the same libraries, so I don’t think that’s the problem.

I must be overlooking something painfully obvious. Any ideas? I was fully expecting to be able to just drop in the 470 and run without issue, so you could imagine my panic!

Did you build with -arch sm_20?

Yes, sorry forgot to mention that. Tried that and this from the compatibility guide:

-gencode=arch=compute_10,code=sm_10
-gencode=arch=compute_10,code=compute_10
-gencode=arch=compute_20,code=sm_20
-gencode=arch=compute_20,code=compute_20

Check your shared memory access pattern. If you are going out of bounds, it will fail to launch on Fermi (but it will appear to work fine on previous generations).

I’ll double-check that stuff.

But the shared memory in my kernel calls is based on a static problem size in all cases, and the 470 has up to 3x the shared memory of the 260 per SM (48 KB versus 16 KB), so anything that worked on the 260 should work on the 470, yes? And wouldn’t cudaGetLastError return something in the event of over-using shared mem?

Do you call cudaGetLastError after every kernel launch before you do any other calls? If the kernel fails to launch at all (instead of crashing) for some reason, that’s the only way you’ll see an error.
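For what it's worth, here is a minimal sketch of the kind of per-launch checking described above. The helper `checkCuda`, the macro `CHECK_CUDA`, the placeholder kernel `scale`, and the wrapper `launchScaleChecked` are all made-up names for illustration, not anything from the original program. It also assumes a toolkit that has `cudaDeviceSynchronize()` (CUDA 4.0 and later); on the 3.0 tools the equivalent call was `cudaThreadSynchronize()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print a CUDA error with its location.
// Returns true when the call succeeded.
static bool checkCuda(cudaError_t err, const char *what,
                      const char *file, int line)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s:%d: %s failed: %s\n",
                file, line, what, cudaGetErrorString(err));
        return false;
    }
    return true;
}
#define CHECK_CUDA(call) checkCuda((call), #call, __FILE__, __LINE__)

// Placeholder kernel standing in for the real ones.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launch the kernel and check for both kinds of failure.
bool launchScaleChecked(float *d_data, int n, float factor)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    scale<<<blocks, threads>>>(d_data, n, factor);

    // Failures to launch at all (e.g. an invalid configuration)
    // are reported here, right after the launch.
    if (!CHECK_CUDA(cudaGetLastError())) return false;

    // Errors raised while the kernel is running only surface at a
    // later synchronizing call, so synchronize and check again.
    return CHECK_CUDA(cudaDeviceSynchronize());
}
```

Calling `launchScaleChecked(d_data, n, 2.0f)` on a device buffer returns false and prints the file and line as soon as something goes wrong. Synchronizing after every launch costs performance, so this is probably best kept to debug builds.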

Looks like there’s an “unspecified launch error” propagating through, but only on the second iteration, after every kernel has been called at least once without error … although not even consistently: re-running with only 2 iterations doesn’t always produce an error, but 3 will … !

I’ll go back through and put in proper error capturing like I should’ve to begin with and see what comes up, thanks.

He’s referring to the better detection of shared memory access violations. Running off the end of a shared memory array has always had undefined behavior (i.e. “bad”), but on earlier GPUs it would not trigger a kernel failure. On Fermi it does. It’s not so much about how much shared memory you request as about accidentally accessing addresses you should not.

Ha! This was it. I should be very glad Fermi complains about that, as I found a tiny bug that never showed up on the 260 (guess that means I have to re-run some cases!) I had a piece of code in the transfer from global to shared memory like

array[ shared mem index ] = array[ global mem index ]

that should have been

array[ shared mem index ] = d_array[ global mem index ]
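To make the difference concrete, here is a rough sketch of what such a global-to-shared copy might look like. Only the names `array` and `d_array` come from the lines above; the tile size `TILE`, the kernel `stageAndDouble`, and the host wrapper `runStageAndDouble` are assumptions invented for the example. With the buggy line, every block after the first indexes the shared array with a global index, which runs past its end.

```cuda
#include <cuda_runtime.h>

#define TILE 256  // hypothetical tile size; must equal the block size

// Each block stages TILE elements in shared memory, then writes
// twice each staged value to the output.
__global__ void stageAndDouble(const float *d_array, float *d_out, int n)
{
    __shared__ float array[TILE];
    int s = threadIdx.x;                            // shared mem index
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global mem index

    // Buggy version: array[s] = array[g];
    // That reads shared memory at a global index, out of bounds
    // as soon as g >= TILE.
    if (g < n) array[s] = d_array[g];  // corrected: read from global memory
    __syncthreads();

    if (g < n) d_out[g] = 2.0f * array[s];
}

// Hypothetical host wrapper: copy in, launch, check, copy out.
bool runStageAndDouble(const float *h_in, float *h_out, int n)
{
    float *d_array = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(float);
    bool ok = cudaMalloc(&d_array, bytes) == cudaSuccess
           && cudaMalloc(&d_out, bytes) == cudaSuccess
           && cudaMemcpy(d_array, h_in, bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int blocks = (n + TILE - 1) / TILE;
        stageAndDouble<<<blocks, TILE>>>(d_array, d_out, n);
        // The blocking copy back also reports errors raised while
        // the kernel was running.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_out, d_out, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_array);  // cudaFree on a null pointer is a no-op
    cudaFree(d_out);
    return ok;
}
```

Swapping the buggy line back in is a quick way to see how a given card and driver react to the out-of-bounds read.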

Greatly appreciated, guys. Always PEBKAC error :">