Hello!
I am trying to develop CUDA kernels in PTX, and I see strange, unnecessary MOV instructions being generated (on my laptop's GTX 1060, a Pascal GPU).
For example, this code:
".reg.f32 a;\n\t"
"mov.f32 a, 1.0;\n\t"
"fma.rn.ftz.f32 a, a, a, a;\n\t"
"fma.rn.ftz.f32 a, a, a, a;\n\t"
"fma.rn.ftz.f32 a, a, a, a;\n\t"
"st.global.f32 [dst_ptr], a;\n\t"
is compiled to the following SASS:
0x003b9908 MOV32I R3, 0x3f800000
0x003b9910 FFMA.FTZ R3, R3, R3, R3
0x003b9918 FFMA.FTZ R3, R3, R3, R3
0x003b9928 FFMA.FTZ R3, R3, R3, R3
0x003b9930 MOV R4, R0
0x003b9938 MOV R5, R2
0x003b9948 MOV R4, R4
0x003b9950 MOV R5, R5
0x003b9958 STG.E [R4], R3
Now the question is: are those MOV instructions really what the GPU executes? If so, is it because I am building in debug mode (otherwise my breakpoints are not hit)? In other words: is there another peephole optimization step, applied when the code is loaded onto the GPU, that removes those unnecessary MOV instructions and is disabled in debug mode? Or are they real?
Another explanation would be that the extra instructions are inserted because of instruction dependencies, but the FFMA instructions clearly depend on each other and no MOVs are inserted between them. Also, an FFMA is supposed to take 6 clocks, so there would not be enough MOVs to cover that latency anyway (unless I am wrong about that).
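In case it helps to reproduce this, here is a minimal sketch of the kind of test I mean, so that the SASS of a debug build and a non-debug build can be compared side by side. The kernel name fma_chain, the file name fma_chain.cu, the output names and the host code are illustrative assumptions, not my exact code. I also added a cvta.to.global conversion so that the st.global gets a proper global-space address, which adds instructions that are not in the listing above.

```cuda
// Hypothetical reproducer (names and host code are illustrative only).
// Build variants to compare, assuming a toolkit that still supports sm_61:
//   nvcc -arch=sm_61 -G -o fma_debug   fma_chain.cu   (device debug build)
//   nvcc -arch=sm_61    -o fma_release fma_chain.cu   (default optimized build)
// Dump the SASS embedded in each binary:
//   cuobjdump -sass fma_debug
//   cuobjdump -sass fma_release
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_chain(float *dst)
{
    asm volatile(
        "{\n\t"
        ".reg .f32 a;\n\t"
        ".reg .u64 p;\n\t"
        "cvta.to.global.u64 p, %0;\n\t"   // generic pointer -> global address
        "mov.f32 a, 0f3F800000;\n\t"      // 1.0f as a hex float literal
        "fma.rn.ftz.f32 a, a, a, a;\n\t"  // 1*1 + 1 = 2
        "fma.rn.ftz.f32 a, a, a, a;\n\t"  // 2*2 + 2 = 6
        "fma.rn.ftz.f32 a, a, a, a;\n\t"  // 6*6 + 6 = 42
        "st.global.f32 [p], a;\n\t"
        "}\n\t"
        :
        : "l"(dst)
        : "memory");
}

int main()
{
    float *d_dst = nullptr;
    float h_dst = 0.0f;

    if (cudaMalloc(&d_dst, sizeof(float)) != cudaSuccess) return 1;

    fma_chain<<<1, 1>>>(d_dst);

    // The blocking copy also waits for the kernel to finish.
    if (cudaMemcpy(&h_dst, d_dst, sizeof(float),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(d_dst);

    printf("%f\n", h_dst);  // the FMA chain above goes 1 -> 2 -> 6 -> 42
    return 0;
}
```

If the extra MOVs appear only in the SASS of the -G build, that would point to the debug build rather than to the hardware.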
Thanks,
Andrew