In the PTX code I see lots of st.local instructions that store a register value to a local variable, without a single ld.local anywhere. The value is always taken from the register when needed.
Why is this happening? Why am I getting this (single) local variable instead of a register? The compiler says I'm using only 19 registers.
The variable is declared as: .local .align 4 .b8 __cuda_col_0[4];
Impossible to say without seeing the complete source code and the build command. Is this a debug build by any chance (nvcc -g -G)?
In all likelihood you can completely ignore this dead code issue, it is probably just an implementation artifact. Remember that PTX is only an intermediate format that the compiler backend (PTXAS) translates into the actual machine code that executes on the GPU. PTXAS will eliminate dead code, allocate physical registers, and schedule machine instructions. You can examine the generated machine code (called SASS) by invoking cuobjdump --dump-sass on your object file or executable. I would predict that local stores without corresponding local loads are not present in the SASS produced by a release build.
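To make the PTX versus SASS comparison concrete, here is a minimal sketch of how one might check it. The file name probe.cu, the kernel scale, and the sm_80 target are all assumptions made for illustration; this is a stand-in, not your code, and it is not guaranteed to reproduce the st.local pattern. The point is only the workflow given in the comments at the top.

```cuda
// probe.cu (hypothetical file name): a stand-in kernel, not the original poster's code.
//
// Build and inspect (sm_80 is an assumed target; substitute your GPU's architecture):
//   nvcc -arch=sm_80 -ptx probe.cu -o probe.ptx                  (intermediate PTX)
//   nvcc -arch=sm_80 -Xptxas -v -cubin probe.cu -o probe.cubin   (PTXAS reports register and local memory use)
//   cuobjdump --dump-sass probe.cubin                            (machine code, SASS)
// In the SASS listing, local-memory stores and loads appear as STL and LDL.

#include <cstdio>
#include <cuda_runtime.h>

// Doubles each input element; "col" only mirrors the name seen in __cuda_col_0.
__global__ void scale(float *out, const float *in, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < n)
        out[col] = 2.0f * in[col];
}

int main()
{
    const int n = 256;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i)
        h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // 128 threads per block, enough blocks to cover all n elements.
    scale<<<(n + 127) / 128, 128>>>(d_out, d_in, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[10] = %f\n", h_out[10]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

If the SASS from a build without -g -G contains no STL instructions and PTXAS reports 0 bytes of local memory for the kernel, the stores you see in the PTX were removed by the backend and cost nothing at run time.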