kernel runs fine under CUDA 1.0, fails under 1.1

System info:

  • Ubuntu Feisty
  • CUDA 1.0 and 1.1
  • SDK is not applicable
  • gcc-4.1
  • Dual dual-core Opteron
  • 8 GB RAM
  • GeForce 8800 Ultra

The title of this topic (a kernel that works under CUDA 1.0 but not under 1.1) describes the easiest-to-summarize symptom, but I believe the underlying problem exists under 1.0 as well, under certain circumstances.

I’m not sure of the best way to exhibit the problem without disclosing all my source code, so I’ll describe the symptoms generally and then hopefully I can get some feedback on where to look or what further info to provide.

The symptom first appeared when I unrolled my innermost for-loop: the kernel then ran a lot faster (5x!), but the output was garbage. That innermost loop simply repeated the same function call 7 times, so the unrolling was trivial. In the unrolled version, if I comment out 5 of those 7 calls the kernel at least appears to execute, but with 4 or fewer commented out it doesn't.
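
To make the structure concrete, here is a hypothetical sketch of what I mean -- accumulate() and all the names are placeholders, not my real code, and the real per-term work is of course different:

    // accumulate() stands in for my real __device__ helper; per-thread work only.
    __device__ void accumulate(float *s, int tid, int term)
    {
        s[tid] += (float)(term + 1) * 0.5f;
    }

    // Rolled version: works under CUDA 1.0.
    // (Both kernels are launched with blockDim.x * sizeof(float) bytes
    //  of dynamic shared memory.)
    __global__ void kernel_rolled(float *out)
    {
        extern __shared__ float s[];
        int tid = threadIdx.x;
        s[tid] = (float)tid;

        for (int term = 0; term < 7; ++term)
            accumulate(s, tid, term);

        out[blockIdx.x * blockDim.x + tid] = s[tid];
    }

    // Unrolled version: ~5x faster, but the output is garbage.
    __global__ void kernel_unrolled(float *out)
    {
        extern __shared__ float s[];
        int tid = threadIdx.x;
        s[tid] = (float)tid;

        accumulate(s, tid, 0);
        accumulate(s, tid, 1);
        accumulate(s, tid, 2);
        accumulate(s, tid, 3);
        accumulate(s, tid, 4);
        accumulate(s, tid, 5);
        accumulate(s, tid, 6);

        out[blockIdx.x * blockDim.x + tid] = s[tid];
    }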

I replaced the function call with a macro and verified that each instantiation doesn't use any additional memory, but that didn't help.
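
For reference, the macro variant looked roughly like this (again a placeholder, matching the sketch above):

    // Placeholder macro corresponding to the accumulate() sketch above;
    // it touches only the calling thread's own shared-memory slot.
    #define ACCUMULATE(s, tid, term)  ((s)[(tid)] += (float)((term) + 1) * 0.5f)

    // in the kernel body, in place of the function calls:
    ACCUMULATE(s, tid, 0);
    ACCUMULATE(s, tid, 1);
    ACCUMULATE(s, tid, 2);
    ACCUMULATE(s, tid, 3);
    ACCUMULATE(s, tid, 4);
    ACCUMULATE(s, tid, 5);
    ACCUMULATE(s, tid, 6);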

I tried CUDA 1.1, but with that version even the original, non-unrolled (rolled) for-loop version doesn't run at all.

The emulation-mode build works in all of these cases …
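
For what it's worth, the emulation build is just the usual flag (file names here are placeholders, other flags omitted):

    # emulation-mode build: the kernel runs on the CPU
    nvcc -deviceemu -o myprog_emu mykernel.cu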

The only unusual thing that I [think I] am doing is using a lot of shared memory on every multiprocessor: practically all 16 KB.
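
Concretely, the shared-memory buffer is declared roughly like this; the size below is illustrative, chosen to sit just under the 16 KB per multiprocessor on the 8800 Ultra while leaving a little room for the space the runtime itself takes out of shared memory for kernel arguments and built-ins:

    // Illustrative only: 3968 floats * 4 bytes = 15872 bytes.
    __global__ void kernel(float *out)
    {
        __shared__ float s[3968];
        int tid = threadIdx.x;
        s[tid] = 0.0f;              // the real code fills and uses the whole buffer
        out[blockIdx.x * blockDim.x + tid] = s[tid];
    }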

Any input would be very helpful.

Thank you,
Glen Mabey

Just today I realized that CU_SAFE_CALL and friends don't do anything unless you have defined _DEBUG … ouch.
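
So I've switched to a hand-rolled guarded call that reports errors even in a release build -- not the cutil.h macro, just a plain wrapper around the driver-API return code:

    #include <cuda.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Active regardless of whether _DEBUG is defined.
    #define CHECK_CU(call)                                              \
        do {                                                            \
            CUresult err = (call);                                      \
            if (err != CUDA_SUCCESS) {                                  \
                fprintf(stderr, "CUDA driver error %d at %s:%d\n",      \
                        (int)err, __FILE__, __LINE__);                  \
                exit(EXIT_FAILURE);                                     \
            }                                                           \
        } while (0)

    // usage:  CHECK_CU(cuCtxSynchronize());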