L2-cache-sensitive CUDA benchmarks

Hello,

I am a student researching memory-intensive CUDA code. For my performance analysis, I need CUDA applications that exploit temporal (or spatial) locality in the L2 cache, i.e. applications that repeatedly access data that is already cached in L2. Can someone point me to a few such applications?