Memory Tiling

I recently saw an article discussing how memory tiling can be used to save time when you need to access neighboring locations. Does anybody have an example of anything like this?

Thanks!

The particular article is "Optimizing CUDA" by Paulius Micikevicius:
http://mc.stanford.edu/cgi-bin/images/0/0a/M02_4.pdf

Halfway through the presentation is a section on 3D Finite Difference. My application has a similar need to access the neighboring array values. My first few attempts to implement this concept have failed. I would be extremely grateful if anyone could point me towards an example application that demonstrates this concept.

Thanks!
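
For reference, the general idea of tiling through shared memory can be sketched on a 1D stencil, which is simpler than the 3D finite difference case in the slides. This is only a minimal illustration, not code from the presentation: the kernel name `stencil1d`, the constants `RADIUS` and `BLOCK`, and the choice to treat out-of-range neighbors as zero are all assumptions made for the example. Each block copies its elements plus a halo of `RADIUS` neighbors on each side into shared memory once, and every thread then reads its neighbors from that tile instead of from global memory.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define RADIUS 3     // hypothetical stencil radius
#define BLOCK  256   // threads per block; the kernel assumes blockDim.x == BLOCK

// Sums the 2*RADIUS+1 values centered on each element, reading from a shared tile.
__global__ void stencil1d(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index inside the tile

    // Interior of the tile. Out-of-range elements are treated as zero (assumption).
    tile[l] = (g < n) ? in[g] : 0.0f;

    // The first RADIUS threads also load the left and right halo cells.
    if (threadIdx.x < RADIUS) {
        int left  = g - RADIUS;
        int right = g + BLOCK;
        tile[l - RADIUS] = (left >= 0 && left < n) ? in[left]  : 0.0f;
        tile[l + BLOCK]  = (right < n)             ? in[right] : 0.0f;
    }

    // Every thread must see the complete tile before reading neighbors.
    __syncthreads();

    if (g < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[l + k];
        out[g] = sum;
    }
}

int main()
{
    const int n = 1000;  // deliberately not a multiple of BLOCK
    const size_t bytes = n * sizeof(float);

    float *h_in  = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK - 1) / BLOCK;
    stencil1d<<<blocks, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // With an all-ones input, each output equals the number of in-range neighbors.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        int lo = (i - RADIUS < 0) ? 0 : i - RADIUS;
        int hi = (i + RADIUS > n - 1) ? n - 1 : i + RADIUS;
        if (h_out[i] != (float)(hi - lo + 1)) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return mismatches != 0;
}
```

The same pattern extends to 2D and 3D by making the tile and the halo multi-dimensional. The point of the halo is that neighboring threads reuse each other's loads, so each global value is read roughly once per block rather than once per stencil point.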