The code is entirely memory-bandwidth limited, and on pre-Fermi GPUs it makes poor use of that bandwidth because it reloads the same data from global memory for every output value it computes.
For good results on compute capability 1.x devices, either use a texture or preload the data for the whole block into shared memory and work from there.
Yes, Fermi GPUs cache global memory accesses. However, if you are willing to do the extra programming to use a texture backed by a CUDA array, it may be even faster, because the texture cache takes advantage of 2D spatial locality.
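As a rough sketch of what that looks like (using the legacy texture-reference API of that era; the kernel, names, and the 4-point stencil are illustrative assumptions, not your actual code):

```cuda
#include <cuda_runtime.h>

// File-scope texture reference bound to a 2D cudaArray.
texture<float, 2, cudaReadModeElementType> tex;

// Hypothetical 4-point stencil kernel; reads go through the texture
// cache, which exploits 2D spatial locality among neighbouring threads.
__global__ void stencilTex(float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float v = 0.25f * (tex2D(tex, x - 1, y) + tex2D(tex, x + 1, y) +
                       tex2D(tex, x, y - 1) + tex2D(tex, x, y + 1));
    out[y * w + x] = v;
}

void run(const float *hostData, float *devOut, int w, int h)
{
    // Allocate a CUDA array, copy the input into it, bind the texture.
    cudaArray *arr;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMallocArray(&arr, &desc, w, h);
    cudaMemcpyToArray(arr, 0, 0, hostData, w * h * sizeof(float),
                      cudaMemcpyHostToDevice);
    cudaBindTextureToArray(tex, arr);

    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    stencilTex<<<grid, block>>>(devOut, w, h);

    cudaUnbindTexture(tex);
    cudaFreeArray(arr);
}
```

The CUDA array gives the hardware a layout optimized for 2D locality, which a plain pitched pointer does not.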
If you really want good performance, though, use shared memory: preload each block's data once, then have all of the block's threads reuse it from there instead of re-reading global memory.
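A minimal sketch of that tiling pattern, again assuming a hypothetical 4-point stencil (tile size and names are my own, not from your code):

```cuda
#define TILE 16

// Each block preloads a (TILE+2) x (TILE+2) tile, including a
// one-element halo, into shared memory; every thread then computes
// its stencil from the preloaded tile.
__global__ void stencilShared(const float *in, float *out, int w, int h)
{
    __shared__ float tile[TILE + 2][TILE + 2];

    int gx = blockIdx.x * TILE + threadIdx.x; // global coordinates
    int gy = blockIdx.y * TILE + threadIdx.y;
    int lx = threadIdx.x + 1;                 // local coordinates in tile
    int ly = threadIdx.y + 1;

    // Load the centre element, clamping reads at the image borders.
    int cx = min(max(gx, 0), w - 1);
    int cy = min(max(gy, 0), h - 1);
    tile[ly][lx] = in[cy * w + cx];

    // Threads on the tile edges also load the halo elements.
    if (threadIdx.x == 0)
        tile[ly][0] = in[cy * w + max(gx - 1, 0)];
    if (threadIdx.x == TILE - 1)
        tile[ly][TILE + 1] = in[cy * w + min(gx + 1, w - 1)];
    if (threadIdx.y == 0)
        tile[0][lx] = in[max(gy - 1, 0) * w + cx];
    if (threadIdx.y == TILE - 1)
        tile[TILE + 1][lx] = in[min(gy + 1, h - 1) * w + cx];

    __syncthreads(); // wait until the whole tile is loaded

    if (gx < w && gy < h)
        out[gy * w + gx] = 0.25f * (tile[ly][lx - 1] + tile[ly][lx + 1] +
                                    tile[ly - 1][lx] + tile[ly + 1][lx]);
}
```

With this layout each input element is fetched from global memory roughly once per block rather than once per output value, which is exactly what a bandwidth-limited kernel needs.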