EmuDebug / Debug profile behaviors

Should I assume that any difference in behavior between the EmuDebug and Debug compilation profiles is a compiler bug?

I’ve been getting extremely frustrated trying to debug a kernel I’ve been working on. It runs in EmuDebug with absolutely no problems, no exceptions, no memory exceptions. In regular Debug mode, which uses the hardware, I get billions of these in the output log:

First-chance exception at 0x7c812a5b in cudatest.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0012fae0…

and the kernel is obviously aborting early because it is finishing much quicker than it normally would. What’s the point of even having a “debug” profile if you can’t use it to debug anything, and what’s the point of the EmuDebug profile if it isn’t even going to mimic the runtime behavior on the hardware?

FYI, the very worst part of it is I’ve narrowed it down to this:

		p = make_float3(p0.x * bv.x + p1.x * bv.y + p2.x * bv.z,
		                p0.y * bv.x + p1.y * bv.y + p2.y * bv.z,
		                p0.z * bv.x + p1.z * bv.y + p2.z * bv.z);

If I comment THAT out, no errors, everything runs as it should in Debug mode. Anyone see any way that code can cause an exception? Yeah, I didn’t think so. Those are all initialized variables.

Never mind, I got it. It turns out this is the extremely descriptive error you get when the compiler can’t figure out (because it is hopeless, mind you) how to allocate storage for all your local variables, either in registers or in local memory, the latter of which is extremely plentiful.

I didn’t really understand what the problem might be until I looked at the PTX intermediate code. Hey NVIDIA, have you guys heard of register renaming? It’s a really cool idea some guy had back in the day, you should look it up some time.

Register reuse is performed in the conversion from PTX to cubin, so the register count you see in the PTX is not what ends up being used on the hardware.

Okay, great, but .cubin files aren’t human readable. How am I supposed to figure out exactly what the compiler is doing to make optimizations if I can’t even see how it’s allocating precious registers?

I apologize for my attitude, I’ve spent a long time porting a complex piece of code and I’m really feeling frustrated with CUDA in general at this point.

You can use decuda to disassemble the cubin if you want, but that usually isn’t necessary. The most you usually need from the cubin is the number of registers and the shared memory usage, so that you can determine the maximum block size you can run.
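For anyone who wants those numbers without opening the cubin, here is a rough sketch of one way to get at them from the host side. It assumes a reasonably recent runtime where `cudaFuncGetAttributes` is available; `placeholder_kernel`, `max_block_size` and `report_kernel_resources` are made-up names for illustration, and the register arithmetic is only an estimate because the hardware allocates registers in coarser units than one thread at a time. Compiling with `nvcc --ptxas-options=-v` should also print the per-kernel register and shared memory usage at build time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Rough upper bound on threads per block from register use alone.
// Real hardware allocates registers in coarser units, so treat this as an estimate.
int max_block_size(int regs_per_thread, int regs_per_block,
                   int warp_size, int hw_limit)
{
    if (regs_per_thread <= 0) return hw_limit;
    int by_regs = regs_per_block / regs_per_thread;
    by_regs -= by_regs % warp_size;            // round down to whole warps
    return by_regs < hw_limit ? by_regs : hw_limit;
}

// Placeholder kernel (hypothetical), standing in for the real one.
__global__ void placeholder_kernel(float *out)
{
    out[threadIdx.x] = 0.0f;
}

// Print what the runtime knows about the kernel's resource usage on device 0.
void report_kernel_resources()
{
    cudaFuncAttributes attr;
    cudaDeviceProp prop;
    if (cudaFuncGetAttributes(&attr, placeholder_kernel) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("could not query kernel or device attributes\n");
        return;
    }
    std::printf("registers/thread: %d\n", attr.numRegs);
    std::printf("static shared memory: %zu bytes\n", attr.sharedSizeBytes);
    std::printf("local memory/thread: %zu bytes\n", attr.localSizeBytes);
    std::printf("estimated max block size: %d\n",
                max_block_size(attr.numRegs, prop.regsPerBlock,
                               prop.warpSize, prop.maxThreadsPerBlock));
    std::printf("runtime's own limit for this kernel: %d\n",
                attr.maxThreadsPerBlock);
}
```

As a worked case, a kernel using 40 registers per thread on a device with 8192 registers per block gives 8192 / 40 = 204 threads, which rounds down to 192 (six warps). Launching that kernel with 256 threads per block would be expected to fail.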

Thanks. NVIDIA could probably save a lot of people a lot of trouble by documenting all of this stuff, or, if it is documented, by making it a little easier to find.

+1

What is the difference between EmuDebug and Debug? Why would we want to use one over the other?

Ha. I bet you still haven’t found out about the SDK error-checking macros. With those you wouldn’t see an uncaught exception, just an uninformative error message.
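For reference, the general idea behind that kind of macro looks roughly like the sketch below. This is not the SDK’s own code: `CHECK_CUDA`, `scale` and `run_scale` are hypothetical names, and it assumes a runtime that has `cudaDeviceSynchronize` (older toolkits used `cudaThreadSynchronize` for the same purpose). The point is only that every runtime call and every kernel launch gets its return status checked, so a failed launch turns into a readable message rather than a stream of first-chance exceptions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper, not the SDK's macro: report the failing call and exit.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s failed: %s\n",            \
                         __FILE__, __LINE__, #call,                   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Placeholder kernel (hypothetical), standing in for the real one.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launch the kernel on device memory and check both stages for errors.
void run_scale(float *d_data, int n)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, 2.0f, n);
    CHECK_CUDA(cudaGetLastError());       // errors from the launch itself
    CHECK_CUDA(cudaDeviceSynchronize());  // errors raised while the kernel ran
}
```

A launch that asks for more registers than the device can supply for the chosen block size would be expected to show up at the `cudaGetLastError()` check, before the kernel runs at all.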

Anyway, I think you’re probably just running out of registers. CUDA’s got many problems, but I don’t think you’ve discovered the correct ones yet.