PTX optimization

I was looking through some PTX code generated by nvcc (with /O2 optimization) and it seemed like there were lots of redundant instructions loading immediate values into registers. For instance, in the C/CUDA code I had a float3 with an overloaded multiply operator, like this…
item = item * 0.5

and the PTX code listing looked like this …
mov.f32 %f95, 0f3f000000; // 0.5
mul.f32 %f43, %f43, %f95; //
mov.f32 %f96, 0f3f000000; // 0.5
mul.f32 %f45, %f45, %f96; //
mov.f32 %f97, 0f3f000000; // 0.5
mul.f32 %f47, %f47, %f97; //

So, there are 3 separate instructions loading 0.5 into a register. Is there a reason the code is generated like this? Would the PTX compiler do any further optimization?
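For anyone who wants to reproduce this, here is a minimal sketch of the kind of code involved. The operator body and the helper name `halve` are assumptions for illustration, since the actual operator is not shown above. The point is only that a component-wise operator, once inlined, uses the scalar three times, which matches the three mov/mul pairs in the listing.

```cpp
// Plain C++ fallback so the sketch also builds without nvcc. Under nvcc the
// built-in float3 and the real __host__/__device__ qualifiers are used.
#ifndef __CUDACC__
#define __host__
#define __device__
struct float3 { float x, y, z; };
#endif

// Hypothetical component-wise scale operator (an assumption; the real
// operator from the post is not shown).
__host__ __device__ inline float3 operator*(const float3& v, float s) {
    float3 r;
    r.x = v.x * s;  // each component is its own scalar multiply,
    r.y = v.y * s;  // so after inlining the constant is referenced
    r.z = v.z * s;  // three times
    return r;
}

// Hypothetical helper mirroring `item = item * 0.5` from the post.
__host__ __device__ inline float3 halve(float3 item) {
    item = item * 0.5f;
    return item;
}
```

Calling `halve` on a value with components 2, 4, 8 gives 1, 2, 4. Compiling a kernel that calls it with `nvcc -ptx` should let you compare against the listing above, though the exact register numbers and instruction order will likely differ between toolkit versions.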


PTX is not the final cubin code the processor runs. It is an intermediate form, and further optimization happens when the PTX is compiled down to cubin.

To look at your real code, you may try decuda.

Try this code for comparison:

volatile float pointfive = 0.5f;

item *= pointfive;
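A slightly fuller version of that comparison, as a sketch only. The `operator*=` overload and the function name `halve_volatile` are assumptions, since the thread only mentions an overloaded multiply. The `volatile` qualifier means the compiler has to treat `pointfive` as a real variable that is read, not as a literal it can fold into each multiply. What that does to the generated PTX is something to check by compiling, not something this sketch guarantees.

```cpp
// Plain C++ fallback so the sketch also builds without nvcc. Under nvcc the
// built-in float3 and the real __host__/__device__ qualifiers are used.
#ifndef __CUDACC__
#define __host__
#define __device__
struct float3 { float x, y, z; };
#endif

// Hypothetical compound-assignment overload (an assumption; not shown in
// the thread).
__host__ __device__ inline float3& operator*=(float3& v, float s) {
    v.x *= s;
    v.y *= s;
    v.z *= s;
    return v;
}

// Hypothetical helper wrapping the two lines suggested above.
__host__ __device__ inline float3 halve_volatile(float3 item) {
    volatile float pointfive = 0.5f;  // read once, passed by value below
    item *= pointfive;
    return item;
}
```

The numerical result is the same as the plain `item * 0.5` version. Only the generated code is expected to differ.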


Thanks, that’s a useful utility. Looks like those were turned into multiply instructions with 0.5f as the immediate value.