Yeah, it looks like you’re right. Using the same operations on variables residing in shared or global memory works just fine. Unfortunately, accessing 64-bit variables through 32-bit pointers leads to bank conflicts in shared memory and is only “pseudo coalesced” in global memory.
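To make the access pattern concrete, here is a minimal sketch. It is my own illustration, not code from this thread, and the names `sum_halves_32`, `sum_halves_64` and `count_mismatches` are hypothetical. The first kernel reads each 64-bit element as two 32-bit words, so neighbouring threads touch every second word per load; the second reads the same element with a single 64-bit load. Error checking is left out for brevity.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Each thread reads one 64-bit element as two separate 32-bit loads.
// Thread i touches 32-bit words 2*i and 2*i+1, so each of the two loads
// is a stride-2 access across the warp.
__global__ void sum_halves_32(const uint32_t* words, uint32_t* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uint32_t lo = words[2 * i];
        uint32_t hi = words[2 * i + 1];
        out[i] = lo + hi;
    }
}

// Same arithmetic, but each thread issues a single 64-bit load (as uint2).
__global__ void sum_halves_64(const uint2* pairs, uint32_t* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uint2 v = pairs[i];   // .x is the low word, .y the high word
        out[i] = v.x + v.y;
    }
}

// Runs both kernels on the same input and returns how many results
// differ from the value computed on the host.
int count_mismatches(int n)
{
    std::vector<uint64_t> h_in(n);
    for (int i = 0; i < n; ++i)
        h_in[i] = (static_cast<uint64_t>(i) << 32) | static_cast<uint32_t>(3 * i + 1);

    uint64_t* d_in = nullptr;
    uint32_t* d_a = nullptr;
    uint32_t* d_b = nullptr;
    cudaMalloc(&d_in, n * sizeof(uint64_t));
    cudaMalloc(&d_a, n * sizeof(uint32_t));
    cudaMalloc(&d_b, n * sizeof(uint32_t));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(uint64_t), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    sum_halves_32<<<blocks, threads>>>(reinterpret_cast<const uint32_t*>(d_in), d_a, n);
    sum_halves_64<<<blocks, threads>>>(reinterpret_cast<const uint2*>(d_in), d_b, n);

    std::vector<uint32_t> a(n), b(n);
    cudaMemcpy(a.data(), d_a, n * sizeof(uint32_t), cudaMemcpyDeviceToHost);
    cudaMemcpy(b.data(), d_b, n * sizeof(uint32_t), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        uint32_t expected = static_cast<uint32_t>(i) + static_cast<uint32_t>(3 * i + 1);
        if (a[i] != expected || b[i] != expected) ++bad;
    }
    return bad;
}
```

Both kernels should give identical results; any difference would be in how the loads are issued, which is worth checking in a profiler on the target hardware rather than taking on faith.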
You mean that nvcc was really letting you take the address of an in-register variable? That is surely a bug! It should have placed the variable in local memory (lmem), which would’ve butchered performance but given you the right result.
I would suggest filing a bug report. (Make a simple self-contained program that exhibits the problem.) My guess is that it’s related to the fact that you’re using integers; integer code paths seem to be less thoroughly tested in CUDA.
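A self-contained test case could be shaped roughly like the sketch below. This is only a skeleton under assumptions: I don't have the original code, so the kernel body is a stand-in that takes the address of a local integer and modifies it through the pointer, and the names `bump`, `addr_of_local` and `run_repro` are hypothetical. Substitute the operations that actually fail for you.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Kept out of line so the pointer really is passed to another function.
__device__ __noinline__ void bump(int* p, int delta)
{
    *p += delta;
}

// Stand-in for the failing code: take the address of a local integer
// and modify it through the pointer.
__global__ void addr_of_local(const int* in, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int v = in[i];   // local variable, normally a register candidate
        bump(&v, i);     // address taken here
        out[i] = v;      // expected: in[i] + i
    }
}

// Runs the kernel and returns the number of elements that differ from
// the value computed on the host (0 means the bug did not show up).
int run_repro(int n)
{
    std::vector<int> h_in(n), h_out(n, -1);
    for (int i = 0; i < n; ++i) h_in[i] = 7 * i + 3;

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 64;
    int blocks = (n + threads - 1) / threads;
    addr_of_local<<<blocks, threads>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i] + i) ++bad;
    return bad;
}
```

Calling `run_repro(256)` from `main` and printing the returned count gives a single file you can attach to the report, along with the nvcc version and the exact compile flags you used.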