Is the SASS disassembly shown in Visual Studio real?

andras.rafas · September 29, 2022, 8:59am

Hello!

I am trying to develop CUDA kernels in PTX, and I see strange, unnecessary MOV instructions being generated (on my laptop’s GTX 1060, a Pascal GPU).

For example this code:
".reg.f32 a;\n\t"
"mov.f32 a, 1.0;\n\t"
"fma.rn.ftz.f32 a, a, a, a;\n\t"
"fma.rn.ftz.f32 a, a, a, a;\n\t"
"fma.rn.ftz.f32 a, a, a, a;\n\t"
"st.global.f32 [dst_ptr], a;\n\t"

is compiled to the SASS below:
0x003b9908 MOV32I R3, 0x3f800000
0x003b9910 FFMA.FTZ R3, R3, R3, R3
0x003b9918 FFMA.FTZ R3, R3, R3, R3
0x003b9928 FFMA.FTZ R3, R3, R3, R3
0x003b9930 MOV R4, R0
0x003b9938 MOV R5, R2
0x003b9948 MOV R4, R4
0x003b9950 MOV R5, R5
0x003b9958 STG.E [R4], R3

Now the question is: are those MOV instructions really what the GPU executes? If so, is it because I am running in DEBUG mode (otherwise breakpoints are not hit)? In other words: is there another peephole optimization step, applied when the code is loaded onto the GPU, that gets rid of those unnecessary MOV instructions and is disabled in DEBUG mode? Or is the listing real?

Another explanation would be that the extra instructions are inserted to cover instruction dependencies, but the FFMA instructions clearly depend on each other and no MOVs are inserted between them. Also, an FFMA is supposed to take 6 clocks, so there would not be enough MOVs to cover that latency either (unless I am wrong about that).

Thanks,
Andrew

njuffa · September 29, 2022, 10:40am

Precisely that. Seeing unnecessary MOV instructions in the SASS output is a surefire sign of a debug build. For debug builds, the CUDA toolchain disables all optimizations, and (by observation) even introduces certain “anti-optimizations”. The apparent reason is to make variables trackable at all times and allow matching of source lines to instructions. If you have ever looked at the SASS generated by an optimized release build for even a moderately complex piece of code, you would agree that it is utterly impossible to do either of those two things reliably with a release build.

Side remark: Outside of investigating a compiler issue, looking at PTX code is rarely worthwhile, because PTX is both a virtual ISA and a compiler intermediate format, and the code is in something like an SSA format (each new result is assigned to a new virtual register). This gets compiled down to SASS (machine code) by ptxas, which is an optimizing compiler.

andras.rafas · September 29, 2022, 1:00pm

Thanks, now I used
ptxas test1.ptx --gpu-name sm_61
cuobjdump --dump-sass elf.o
There are no unnecessary MOVs in that output. Even constant memory accesses are folded directly into the instruction operands. Perfect.

However, I now have no idea how to set breakpoints in code compiled from a PTX file. As long as the code is compiled in Visual Studio using inline assembly in the .cu file, the breakpoint is hit when I start CUDA debugging, but the disassembly is deoptimized.

On the other hand, I tried to link the separately compiled elf.o into the executable, but a lot of forum threads say that this is impossible. And when I load the elf.o code via cuModuleLoadData(), there is no source file in VS in which I could set the breakpoint.
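For readers following along, loading a separately assembled cubin through the driver API might look roughly like the sketch below. It is only an illustration under assumptions: the entry name test1_kernel and its single pointer parameter are hypothetical (the thread does not show the .entry in test1.ptx), and elf.o is simply the default ptxas output name used above.

```cuda
// load_cubin.cu: host-side sketch using the CUDA driver API.
// Build (assumed): nvcc load_cubin.cu -o load_cubin -lcuda
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iterator>
#include <vector>
#include <cuda.h>

// Abort with a readable message if a driver API call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        CUresult rc_ = (call);                                       \
        if (rc_ != CUDA_SUCCESS) {                                   \
            const char *msg_ = nullptr;                              \
            cuGetErrorString(rc_, &msg_);                            \
            fprintf(stderr, "%s failed: %s\n", #call,                \
                    msg_ ? msg_ : "unknown error");                  \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

int main()
{
    CHECK(cuInit(0));

    // Use the device's primary context (the one the runtime API also uses).
    CUdevice dev;
    CUcontext ctx;
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK(cuCtxSetCurrent(ctx));

    // Read the cubin produced by ptxas into memory.
    std::ifstream file("elf.o", std::ios::binary);
    std::vector<char> image((std::istreambuf_iterator<char>(file)),
                            std::istreambuf_iterator<char>());
    if (image.empty()) {
        fprintf(stderr, "could not read elf.o\n");
        return EXIT_FAILURE;
    }

    // Load the module and look up the kernel by its (hypothetical) entry name.
    CUmodule mod;
    CUfunction fn;
    CHECK(cuModuleLoadData(&mod, image.data()));
    CHECK(cuModuleGetFunction(&fn, mod, "test1_kernel"));

    // Assumed signature: one 64-bit pointer to a float in global memory.
    CUdeviceptr dst;
    CHECK(cuMemAlloc(&dst, sizeof(float)));
    void *args[] = { &dst };

    // Launch a single thread and wait for it to finish.
    CHECK(cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, nullptr, args, nullptr));
    CHECK(cuCtxSynchronize());

    float result = 0.0f;
    CHECK(cuMemcpyDtoH(&result, dst, sizeof(float)));
    printf("result = %f\n", result);

    CHECK(cuMemFree(dst));
    CHECK(cuModuleUnload(mod));
    CHECK(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

The primary context is used here instead of creating a new one so that the sketch can coexist with runtime API code in the same process.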

So I am a little confused about how to debug the optimized assembly. Or could it be that, no matter what I try, CUDA debugging will deoptimize even the optimized assembly of a RELEASE build?

andras.rafas · September 29, 2022, 1:03pm

Ok, sorry, disregard this last question.

I have figured out that the RELEASE version is fine, but I have to enable !!!ONLY!!! the generation of line number debug info. Requesting full debug info produces a DEBUG build.
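For anyone reproducing this outside Visual Studio, here is a minimal sketch of the test as a standalone .cu file. The file name, the kernel name fma_chain and the use of an inline-asm operand for the destination pointer are my own choices for illustration. The assumption is that the two Visual Studio settings correspond to the nvcc options -G (full device debug info, optimizations disabled) and -lineinfo (line-number info only, optimizations kept).

```cuda
// fma_chain.cu: minimal sketch for comparing debug and release SASS.
//
// Build variants (sm_61 matches the GTX 1060 from this thread):
//   nvcc -arch=sm_61 -G        fma_chain.cu -o fma_debug    // deoptimized, extra MOVs expected
//   nvcc -arch=sm_61 -lineinfo fma_chain.cu -o fma_release  // optimized, line info only
// Inspect the embedded machine code with:
//   cuobjdump --dump-sass fma_debug
//   cuobjdump --dump-sass fma_release
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_chain(float *dst)
{
    asm volatile(
        "{\n\t"
        ".reg .f32 a;\n\t"
        ".reg .u64 p;\n\t"
        "cvta.to.global.u64 p, %0;\n\t"   // generic address -> global address
        "mov.f32 a, 1.0;\n\t"
        "fma.rn.ftz.f32 a, a, a, a;\n\t"  // 1*1 + 1 = 2
        "fma.rn.ftz.f32 a, a, a, a;\n\t"  // 2*2 + 2 = 6
        "fma.rn.ftz.f32 a, a, a, a;\n\t"  // 6*6 + 6 = 42
        "st.global.f32 [p], a;\n\t"
        "}"
        :
        : "l"(dst)
        : "memory");
}

int main()
{
    float *d_dst = nullptr;
    float h_dst = 0.0f;

    if (cudaMalloc(&d_dst, sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    fma_chain<<<1, 1>>>(d_dst);

    // cudaMemcpy waits for the kernel and reports launch or execution errors.
    cudaError_t err = cudaMemcpy(&h_dst, d_dst, sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("result = %f\n", h_dst);  // prints "result = 42.000000"
    cudaFree(d_dst);
    return 0;
}
```

Both builds should print the same result; only the SASS differs, so comparing the two cuobjdump listings is a quick way to confirm which kind of build is actually being run.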
