Does SASS correspond to the real binary that runs on the GPU, without further optimization?

Hi all, I’m just curious about SASS and what really runs on the GPU.
I used cuobjdump to extract the SASS code of my NVCC-compiled program, and some of the code surprised me.
My environment is CUDA 5.5, sm_30 (GTX 650Ti BOOST).

For example,

/*00d0*/ MOV R0, R0; /* 0x2800000000001de4 */

is a typical anomaly from my point of view, because I think it is a dead statement that the compiler should have eliminated.

Now here comes the question. I respect NVIDIA, and I don’t think NVCC is too naive to eliminate this. So I suspect there are further optimizations that remove it later. Moving the kernel from host to device, for instance, seems a likely moment at which the host could easily eliminate dead code instruction by instruction.

Do optimizations of this kind really exist, or can I trust SASS to be the real binary code?

Thank you!

Shunning Jiang


(1) Could you show complete, compilable source code that leads to this SASS?
(2) Is this a release build with full optimizations?

Sorry for going into too much detail. I just want to know whether this MOV R0, R0 statement really executes on the GPU, and whether it executes in just the same way as MOV R0, R1 would. Overall, I’m just curious whether I can trust SASS to be the final code executed on the GPU.

Generally speaking, the SASS code you extract with cuobjdump --dump-sass is what executes on the machine. The compiler also eliminates dead code aggressively, so in general you should not see instructions like MOV R0, R0. Without examining the code, I cannot tell why this effective NOP is there. Assuming this is from an optimized build, the instruction could be there on purpose, similar to how padding with NOPs is/was sometimes used in x86 code to improve performance. Or it could simply be an artifact of the compiler’s optimization process, where a MOV instruction wound up with identical source and destination registers late in compilation and no further dead code elimination pass was run after that.

In most contexts, an isolated instruction of that nature should not impact performance, but if you have indications to the contrary you could file a bug, attaching self-contained code for repro.
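
If you want to see how often such self-moves occur before deciding whether a bug report is worthwhile, a small script can scan the text produced by cuobjdump --dump-sass. The following is only a rough sketch: the function name find_self_moves and the regular expression are my own assumptions, derived from the single SASS line quoted in this thread, and real dumps may contain operand forms or modifiers that it does not handle.

```python
import re

# Matches lines such as: /*00d0*/ MOV R0, R0; /* 0x2800000000001de4 */
# Assumptions (not taken from any NVIDIA specification):
#   - the address is a hex number between slashes; the asterisks are
#     treated as optional because forum formatting sometimes strips them
#   - an optional predicate such as @P0 or @!P0 may precede the opcode
SASS_LINE = re.compile(
    r"/\*?(?P<addr>[0-9a-fA-F]+)\*?/\s+"
    r"(?:@!?P\w+\s+)?"
    r"(?P<op>[A-Z0-9_.]+)\s+"
    r"(?P<operands>[^;]*);"
)


def find_self_moves(sass_text):
    """Return (address, instruction) pairs for MOVs whose destination equals their source."""
    hits = []
    for line in sass_text.splitlines():
        m = SASS_LINE.search(line)
        if m is None or m.group("op") != "MOV":
            continue
        operands = [o.strip() for o in m.group("operands").split(",")]
        if len(operands) == 2 and operands[0] == operands[1]:
            hits.append((m.group("addr"), f"MOV {operands[0]}, {operands[1]}"))
    return hits


if __name__ == "__main__":
    # The first line is made up for contrast; the second is the one from this thread.
    sample = (
        "/*00c8*/ MOV R1, R0; /* hypothetical line, encoding omitted */\n"
        "/*00d0*/ MOV R0, R0; /* 0x2800000000001de4 */\n"
    )
    for addr, instr in find_self_moves(sample):
        print(addr, instr)
```

To use it on a real build, redirect the cuobjdump --dump-sass output for your executable into a file and pass the file contents to find_self_moves. The reported addresses should help narrow down which part of the source to include in a self-contained repro.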