Evaluating the global memory access trade-off


This perhaps sounds like a stupid question…
I’m trying to find a reasonable explanation for the behaviour I observe.

I have a kernel with a lot of arithmetic inside.
If I launch a normal grid of 64K blocks (each with 128 threads), the running time is about 33 ms.

The data transferred from host to GPU global memory is ~128 Mb.

Then I wanted to evaluate how much the global memory reads cost, so I replaced them with dummy statements.
To be precise, I replaced every ‘g_mem[thid]’ with ‘(unsigned&)g_mem + thid’, so that no global memory access is performed but the compiler still does not optimize away the now-trivial statements.
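
To make the substitution concrete, here is a minimal sketch of the two variants. It is not my actual kernel: the names heavy_math, kernel_with_reads, kernel_without_reads and g_out are made up for illustration, and the arithmetic is just a placeholder. I also assume here that the result is still stored to global memory in both variants, so only the reads differ.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the arithmetic-heavy part of the kernel.
__device__ unsigned heavy_math(unsigned x)
{
    for (int i = 0; i < 256; ++i)
        x = x * 1664525u + 1013904223u;
    return x;
}

// Variant 1: every thread loads its operand from global memory.
__global__ void kernel_with_reads(unsigned *g_mem, unsigned *g_out)
{
    unsigned thid = blockIdx.x * blockDim.x + threadIdx.x;
    g_out[thid] = heavy_math(g_mem[thid]);
}

// Variant 2: the load is replaced by a dummy value built from the
// pointer itself. (unsigned&)g_mem reinterprets the pointer variable
// as an unsigned, so no global memory is read, but the value still
// depends on a kernel argument and on thid, so the compiler cannot
// fold the arithmetic away at compile time.
__global__ void kernel_without_reads(unsigned *g_mem, unsigned *g_out)
{
    unsigned thid = blockIdx.x * blockDim.x + threadIdx.x;
    g_out[thid] = heavy_math((unsigned&)g_mem + thid);
}
```

Both kernels would be launched with the same grid, e.g. kernel_with_reads<<<blocks, 128>>>(d_in, d_out), where d_in and d_out are device buffers with at least blocks * 128 elements.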

Strangely enough, the execution time (with the same grid configuration) does not change significantly: it is about 33.8 ms, although I would naturally have expected a noticeable speed-up.

Can I then assume that all memory reads are “perfectly” hidden, or is there something wrong with my timer?
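
To rule out the timer, one option is to time the launch with CUDA events, which are recorded on the GPU side. Kernel launches are asynchronous, so a host timer only gives a meaningful number if the host waits for the kernel to finish before reading it. The sketch below is only an illustration: my_kernel is a hypothetical placeholder for the real kernel, and time_launch_ms is a helper name I made up.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute the real one here.
__global__ void my_kernel(unsigned *g_mem, unsigned *g_out)
{
    unsigned thid = blockIdx.x * blockDim.x + threadIdx.x;
    g_out[thid] = g_mem[thid] * 3u + 1u;
}

// Returns the GPU time of one launch in milliseconds,
// or -1.0f if the launch or the synchronization failed.
float time_launch_ms(unsigned *d_in, unsigned *d_out, int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    my_kernel<<<blocks, threads>>>(d_in, d_out);
    cudaEventRecord(stop, 0);

    // Block the host until the stop event (and thus the kernel) is done.
    cudaError_t err = cudaEventSynchronize(stop);
    if (err == cudaSuccess)
        err = cudaGetLastError();   // catches launch configuration errors

    float ms = -1.0f;
    if (err == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Calling time_launch_ms once as a warm-up and then averaging several further calls should give a steadier figure than a single measurement, since the first launch usually includes one-time setup cost.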


P.S. My card is a GTX 280.