I was looking through some PTX code generated by nvcc (with /O2 optimization), and there seemed to be a lot of redundant instructions loading immediate values into registers. For instance, in my C/CUDA code I had a float3 with an overloaded multiply operator, used like this:
item = item * 0.5;
and the PTX listing looked like this:
mov.f32 %f95, 0f3f000000; // 0.5
mul.f32 %f43, %f43, %f95; //
mov.f32 %f96, 0f3f000000; // 0.5
mul.f32 %f45, %f45, %f96; //
mov.f32 %f97, 0f3f000000; // 0.5
mul.f32 %f47, %f47, %f97; //
So there are three separate instructions loading 0.5 into three different registers. Is there a reason the code is generated like this? Would the PTX compiler do any further optimization when it generates the final machine code?
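For context, a minimal sketch of the kind of operator involved might look like the following. It is not the original code: `Vec3` is a hypothetical stand-in for float3, and `HD`, `halve`, and `check_halve` are made-up names for illustration. The scalar is used once per component, which matches the three mul.f32 instructions in the listing above.

```cpp
#include <cassert>

// Lets the same file build with nvcc or with a plain host C++ compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical stand-in for float3 (name and layout are assumptions).
struct Vec3 {
    float x, y, z;
};

// Component-wise scale: the scalar s is used once per component,
// so a single call produces three float multiplies.
HD inline Vec3 operator*(const Vec3& v, float s) {
    return Vec3{v.x * s, v.y * s, v.z * s};
}

// Equivalent of `item = item * 0.5`, written with a float literal.
HD inline Vec3 halve(Vec3 item) {
    item = item * 0.5f;
    return item;
}

// Host-side sanity check; all values are exactly representable in float.
inline void check_halve() {
    Vec3 r = halve(Vec3{2.0f, 4.0f, 6.0f});
    assert(r.x == 1.0f && r.y == 2.0f && r.z == 3.0f);
}
```

Compiling something like this with nvcc's `-ptx` option and comparing the listing against the final machine code (for example with `cuobjdump -sass` on the built binary) would show whether the repeated constant loads survive past the PTX stage.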
Kalle