I think that when you remove the assignment, nvcc simply removes the whole if { } block as dead code. That block contains two global memory reads (uncoalesced, so very slow), which is why the difference is so noticeable.
I found that my CUDA program isn’t faster than plain C++ code that iterates over the matrix, so I debugged it by commenting out some lines, and found that this one line affects the overall speed.
You can check whether that’s true by adding "-keep" to your nvcc command line and examining the code in the generated .ptx file.
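To make the check concrete, here is a minimal sketch of the kind of kernel I have in mind. It is not your code: the kernel name add_if_positive, the STORE_RESULT macro, and the file name dce_demo.cu are all made up for illustration. Without the store, I would expect the global loads to be missing from the PTX, but please verify that on your own toolchain.

```cuda
// dce_demo.cu: hypothetical kernel, NOT the original poster's code.
// Build it twice (saving the .ptx in between) and compare the two files:
//   nvcc -keep -o dce_demo dce_demo.cu
//   nvcc -keep -DSTORE_RESULT -o dce_demo dce_demo.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_if_positive(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && a[i] > 0.0f) {
        float v = a[i] + b[i];   // global memory reads inside the if { }
#ifdef STORE_RESULT
        out[i] = v;              // the assignment: the block's only visible effect
#else
        (void)v;                 // result unused, so the compiler is free to drop the reads
#endif
    }
}

// Runs the kernel once and returns out[0] (error checking omitted for brevity).
float run_once()
{
    const int n = 1024;
    float *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; out[i] = 0.0f; }

    add_if_positive<<<(n + 255) / 256, 256>>>(a, b, out, n);
    cudaDeviceSynchronize();

    float first = out[0];
    cudaFree(a); cudaFree(b); cudaFree(out);
    return first;
}

int main()
{
    printf("out[0] = %f\n", run_once());
    return 0;
}
```

In the PTX, look for ld.global instructions inside the kernel body; if they appear only in the STORE_RESULT build, the compiler did remove the block.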
To improve performance, you need to make your memory accesses coalesced. Check the CUDA Programming Guide for more details.
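As a rough sketch of what coalescing means for a matrix, assuming a row-major layout (I don't know how yours is stored): the two kernels below do the same work and differ only in which index threadIdx.x is mapped to. The names scale_strided, scale_coalesced and run_scale are made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Uncoalesced: neighbouring threads of a warp handle neighbouring ROWS,
// so their addresses are `width` floats apart.
__global__ void scale_strided(float *m, int width, int height, float s)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        m[row * width + col] *= s;
}

// Coalesced: neighbouring threads of a warp handle neighbouring COLUMNS,
// so a warp touches 32 consecutive floats of one row.
__global__ void scale_coalesced(float *m, int width, int height, float s)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        m[row * width + col] *= s;
}

// Fills a matrix with 1.0, doubles it on the GPU, and checks every element.
// Error checking of the CUDA calls is omitted for brevity.
bool run_scale(bool coalesced)
{
    const int width = 1024, height = 512;
    float *m;
    cudaMallocManaged(&m, width * height * sizeof(float));
    for (int i = 0; i < width * height; ++i) m[i] = 1.0f;

    dim3 block(32, 8);  // threadIdx.x varies fastest inside a warp
    if (coalesced) {
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        scale_coalesced<<<grid, block>>>(m, width, height, 2.0f);
    } else {
        dim3 grid((height + block.x - 1) / block.x, (width + block.y - 1) / block.y);
        scale_strided<<<grid, block>>>(m, width, height, 2.0f);
    }
    cudaDeviceSynchronize();

    bool ok = true;
    for (int i = 0; i < width * height; ++i) ok = ok && (m[i] == 2.0f);
    cudaFree(m);
    return ok;
}

int main()
{
    printf("strided ok: %d, coalesced ok: %d\n", run_scale(false), run_scale(true));
    return 0;
}
```

Both versions produce the same result; only the access pattern differs. Time the two kernels in a profiler to see how much the pattern matters on your card.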
You should also examine possible thread divergence caused by if { } blocks: all threads in a warp should execute the same instruction. If, for example, only one of the 32 threads in a warp takes a branch, then all the others have to wait for it, slowing down overall execution.
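Here is a small sketch of the difference, assuming 1D blocks of 1024 threads and a warp size of 32. The function slow_path and both kernel names are invented for the example. The two kernels send the same number of threads down the slow path, but the first spreads them one per warp, while the second packs them all into one warp per block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the expensive body of an if { } block.
__device__ float slow_path(float x)
{
    for (int k = 0; k < 256; ++k) x = x * 1.0001f + 0.5f;
    return x;
}

// Divergent: in EVERY warp exactly one lane takes the branch,
// and the other 31 lanes wait until it has finished.
__global__ void one_lane_per_warp(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && (threadIdx.x % 32) == 0)
        data[i] = slow_path(data[i]);
}

// Uniform: the condition depends only on the warp index, so all 32 lanes
// of a warp agree. Warp 0 of each block takes the branch, the rest skip it.
__global__ void one_warp_per_block(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && (threadIdx.x / 32) == 0)
        data[i] = slow_path(data[i]);
}

// Runs one of the kernels and returns how many elements it modified.
// Error checking of the CUDA calls is omitted for brevity.
int count_updated(bool uniform)
{
    const int n = 1 << 20;
    const int threads = 1024;  // 32 warps per block
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    if (uniform)
        one_warp_per_block<<<n / threads, threads>>>(data, n);
    else
        one_lane_per_warp<<<n / threads, threads>>>(data, n);
    cudaDeviceSynchronize();

    int updated = 0;
    for (int i = 0; i < n; ++i) updated += (data[i] != 1.0f);
    cudaFree(data);
    return updated;
}

int main()
{
    printf("divergent: %d updated, uniform: %d updated\n",
           count_updated(false), count_updated(true));
    return 0;
}
```

Since the amount of useful work is the same, any timing gap between the two kernels comes from how the branch is distributed across warps. Measure it on your own hardware rather than trusting a rule of thumb.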