Apologies if this is documented somewhere, but I could not find it:
Accessing global memory costs between 400 and 600 cycles, accessing shared memory costs 4 cycles, and registers add no overhead to the cost of a computation. Where do the built-in variables such as threadIdx, blockDim, blockIdx and gridDim reside, and what does it cost to access them?
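To make the question concrete, here is a minimal sketch of the kind of access I mean. The kernel name `fill_global_index` and the host helper `run_fill_global_index` are made up for illustration. The code only shows where the built-ins get read; it assumes nothing about where they are stored or what a read costs.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its own global index into out[].
// The only inputs to the index computation are the built-in variables.
__global__ void fill_global_index(int *out, int n)
{
    // Read the built-ins once and keep the results in locals
    // (normally held in ordinary registers by the compiler).
    const int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    const int stride = gridDim.x * blockDim.x;

    // Grid-stride loop so any launch configuration covers all n elements.
    for (int i = tid; i < n; i += stride)
        out[i] = i;
}

// Hypothetical host helper: launches the kernel and copies the result back.
std::vector<int> run_fill_global_index(int n, int blocks, int threads)
{
    std::vector<int> host(n, -1);
    int *dev = nullptr;

    cudaError_t err = cudaMalloc(&dev, n * sizeof(int));
    assert(err == cudaSuccess);

    fill_global_index<<<blocks, threads>>>(dev, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    err = cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);

    cudaFree(dev);
    return host;
}
```

Called as, say, `run_fill_global_index(1000, 4, 64)`, every thread reads each built-in exactly once. What I would like to know is whether each of those reads is as cheap as a register access or closer to a shared-memory access.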