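Constant memory access with a known offset is essentially free. Fermi is a RISC-style processor: instruction operands come from registers, immediate constants, or a constant buffer at a fixed offset, so a constant can be used directly as an operand, as c2[0xc8] is in the add below. If the offset is not known at compile time, a separate load instruction is needed.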
set $p0 ne u32 $r4 -0x1
add b32 $r12 shl $r13 0x2 c2[0xc8]
ld b32 $r4 ca g[$r12(null)+0]
A constant buffer access with an unknown offset needs its own load:
ld b32 $r17 c2[$r17(null)+0x20]
A texture load can be faster than a normal global memory load. Global memory load granularity is 128 bytes: if you load 4 bytes, the hardware fetches the 124 neighboring bytes too. Texture load granularity is also 128 bytes at the L1 level, but only 32 bytes at the L2 and memory level, so scattered small reads should waste less bandwidth through the texture path.
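The granularity argument can be checked with a little arithmetic. The sketch below is only a rough model, not a measurement: it takes the 128-byte and 32-byte figures from the text, assumes every distinct granule touched is fetched exactly once, ignores L1 caching entirely, and uses two made-up access patterns (32 threads reading 4 bytes each, either 1024 bytes apart or back to back). The function names are hypothetical.

```python
# Granularities come from the text; the access patterns are assumptions.
GLOBAL_GRANULARITY = 128      # bytes fetched per global memory load
TEXTURE_L2_GRANULARITY = 32   # bytes fetched per texture load at L2/memory level


def bytes_fetched(addresses, access_size, granularity):
    """Bytes moved if each distinct granule touched is fetched once."""
    granules = set()
    for addr in addresses:
        first = addr // granularity
        last = (addr + access_size - 1) // granularity
        granules.update(range(first, last + 1))
    return len(granules) * granularity


def efficiency(addresses, access_size, granularity):
    """Useful bytes divided by fetched bytes (addresses assumed distinct)."""
    useful = len(addresses) * access_size
    return useful / bytes_fetched(addresses, access_size, granularity)


# Hypothetical scattered pattern: 32 threads, 4 bytes each, 1024 bytes apart.
scattered = [i * 1024 for i in range(32)]
# Hypothetical coalesced pattern: 32 threads reading consecutive 4-byte words.
coalesced = [i * 4 for i in range(32)]

if __name__ == "__main__":
    for name, pattern in (("scattered", scattered), ("coalesced", coalesced)):
        for label, gran in (("global", GLOBAL_GRANULARITY),
                            ("texture", TEXTURE_L2_GRANULARITY)):
            print(name, label,
                  bytes_fetched(pattern, 4, gran),
                  efficiency(pattern, 4, gran))
```

Under these assumptions the scattered pattern moves 4096 bytes at 128-byte granularity but only 1024 bytes at 32-byte granularity, for 128 useful bytes in both cases. The coalesced pattern moves 128 bytes either way, so in this model the smaller granularity only pays off for scattered reads.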