Kernel launches fail after moving to Fermi?

mlohry · June 24, 2010, 5:59pm

I just put a GTX 470 into a box that had been running a GTX 260 with no issues (Ubuntu 10.04 x64 with gcc 4.3, latest drivers, and the 3.0 tools; a mostly identical box is now running the GTX 260 without issue). I rebuilt my program, and nothing works! The profiler shows none of the kernels being launched on the 470, only the initial memcpy operations. cudaGetLastError() reports no error throughout execution. The 470 is correctly recognized by my program, and all the SDK examples run fine. Both boxes link against the same libraries, so I don’t think that’s the problem.

I must be overlooking something painfully obvious. Any ideas? I was fully expecting to be able to just drop in the 470 and run without issue, so you can imagine my panic!

tmurray · June 24, 2010, 6:12pm

Did you build with -arch sm_20?

mlohry · June 24, 2010, 6:13pm

Yes, sorry, I forgot to mention that. I tried that, and also these flags from the compatibility guide:

-gencode=arch=compute_10,code=sm_10
-gencode=arch=compute_10,code=compute_10
-gencode=arch=compute_20,code=sm_20
-gencode=arch=compute_20,code=compute_20

mfatica · June 24, 2010, 6:40pm

Check your shared memory access pattern. If you are going out of bounds, it will fail to launch on Fermi (but it will work fine on previous generations).

mlohry · June 24, 2010, 6:53pm

I’ll double-check that stuff.

But my shared memory in kernel calls is based on a static problem size in all cases, and the 470 has up to 3x the shared memory of the 260 per SM (48 KB vs. 16 KB), so anything that worked on the 260 should work on the 470, yes? And wouldn’t cudaGetLastError return something in the event of over-using shared memory?

tmurray · June 24, 2010, 7:02pm

Do you call cudaGetLastError after every kernel launch before you do any other calls? If the kernel fails to launch at all (instead of crashing) for some reason, that’s the only way you’ll see an error.
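A minimal sketch of that checking pattern follows. The `CUDA_TRY` macro, the `scale` kernel, and `run_scale_checked` are hypothetical names made up for illustration; only the runtime calls come from CUDA itself. It assumes a toolkit that provides `cudaDeviceSynchronize()` (CUDA 4.0 or later); on the 3.0 tools discussed here, `cudaThreadSynchronize()` played the same role.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print where a CUDA call failed and return the error
// from the enclosing function. For brevity, the early return does not free
// device memory that was already allocated.
#define CUDA_TRY(call)                                                \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            return err_;                                              \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for a real one.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launches the placeholder kernel and checks for errors at each step.
cudaError_t run_scale_checked(int n)
{
    float *d_data = NULL;
    CUDA_TRY(cudaMalloc(&d_data, n * sizeof(float)));
    CUDA_TRY(cudaMemset(d_data, 0, n * sizeof(float)));

    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    // Check immediately after the launch, before any other runtime call:
    // this is where a kernel that could not be launched at all is reported.
    CUDA_TRY(cudaGetLastError());
    // Errors raised while the kernel executes (for example an out-of-bounds
    // access) only surface once the host waits for the kernel to finish.
    CUDA_TRY(cudaDeviceSynchronize());

    CUDA_TRY(cudaFree(d_data));
    return cudaSuccess;
}
```

Calling `run_scale_checked(1 << 20)` from `main` and comparing the result with `cudaSuccess` is enough to try it. Checking both right after the launch and after a synchronization separates the two failure modes, at the cost of serializing host and device while debugging.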

mlohry · June 24, 2010, 7:53pm

Looks like there’s an “unspecified launch failure” propagating through, but only on the second iteration, after every kernel has been called at least once without error. It isn’t even consistent, though: re-running with only 2 iterations doesn’t always produce an error, but 3 will!

I’ll go back through and put in proper error capturing like I should’ve to begin with and see what comes up. Thanks.

seibert · June 24, 2010, 8:24pm

He’s referring to Fermi’s better detection of shared memory access violations. Running off the end of a shared memory array has always been undefined behavior (i.e. “bad”), but on earlier GPUs it would not trigger a kernel failure. On Fermi it does. It’s not about how much shared memory you request so much as accidentally accessing addresses you should not.

mlohry · June 24, 2010, 8:54pm

Ha! This was it. I should be very glad Fermi complains about that, as I found a tiny bug that never showed up on the 260 (guess that means I have to re-run some cases!). I had a line in the transfer from global to shared memory like

array[ shared mem index ] = array[ global mem index ]

that should have been

array[ shared mem index ] = d_array[ global mem index ]

Greatly appreciated, guys. Always a PEBKAC error.
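For readers hitting the same failure, here is a hedged sketch of the corrected global-to-shared load. The names `reverse_tiles`, `tile`, `TILE`, and `run_reverse_tiles_demo` are assumptions invented for this example, and the real code in this thread certainly looked different. The key point is that the shared array is indexed only by the thread index, while the global index is applied only to the global pointer (`d_array`). In the buggy form, the global index was applied to the shared array, which runs off its end in every block after the first.

```cuda
#include <cuda_runtime.h>

#define TILE 256  // hypothetical tile size; launch with TILE threads per block

// Reverses each TILE-sized chunk of d_array into d_out via shared memory.
__global__ void reverse_tiles(const float *d_array, float *d_out, int n)
{
    __shared__ float tile[TILE];

    int s = threadIdx.x;            // shared-memory index, always < TILE
    int g = blockIdx.x * TILE + s;  // global-memory index, may reach n - 1

    // Correct: read global memory through d_array with the global index.
    // The buggy form read tile[g], an out-of-bounds shared-memory access.
    tile[s] = (g < n) ? d_array[g] : 0.0f;
    __syncthreads();

    if (g < n) d_out[g] = tile[TILE - 1 - s];
}

// Runs the kernel on the values 0 .. 2*TILE-1 and returns the first output
// element, or -1.0f if any CUDA call fails.
float run_reverse_tiles_demo()
{
    const int n = 2 * TILE;
    float h_in[2 * TILE], h_out[2 * TILE];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_array = NULL, *d_out = NULL;
    if (cudaMalloc(&d_array, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_array);
        return -1.0f;
    }

    cudaError_t err = cudaMemcpy(d_array, h_in, n * sizeof(float),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        reverse_tiles<<<n / TILE, TILE>>>(d_array, d_out, n);
        err = cudaGetLastError();  // catches a failed launch
    }
    if (err == cudaSuccess) {
        // This blocking copy waits for the kernel, so it also reports
        // errors raised while the kernel was running.
        err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }

    cudaFree(d_array);
    cudaFree(d_out);
    return (err == cudaSuccess) ? h_out[0] : -1.0f;
}
```

With the input 0, 1, 2, ..., the first output element is the last element of the first tile, so `run_reverse_tiles_demo()` returns 255.0f on a working device.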