Why is my CUDA code slower than my CPU code?

Hi guys, I wrote a CUDA code for 2D convolution;
the code is very simple, as attached.

However, when I tested my code on a Tesla, it got no misses compared with the CPU result, but it's much slower than the CPU code:

setting device 0 with name Tesla C1060
GPU Runtime: 0.009131s
CPU Runtime: 0.001287s
Number of misses: 0

But if I run my code on a Fermi card, it's two times faster.
Can anybody tell me why?
2DConvolution.cu (4.08 KB)

The code is entirely memory-bandwidth limited, and on pre-Fermi GPUs it makes poor use of memory bandwidth because it reloads the data for every value it calculates.
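The attachment isn't shown inline, but a kernel with this problem presumably looks something like the following naive sketch (hypothetical reconstruction, assuming a 3×3 kernel): every thread fetches its entire neighborhood straight from global memory, so adjacent threads reload the same input values over and over.

```cuda
// Naive 2D convolution (hypothetical sketch of the attached code):
// each thread reads its whole KxK neighborhood from global memory,
// so neighboring threads re-fetch the same values repeatedly.
#define K 3

__global__ void conv2d_naive(const float *in, float *out,
                             const float *kernel, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float sum = 0.0f;
    for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx) {
            int ix = min(max(x + kx - K / 2, 0), w - 1);  // clamp at borders
            int iy = min(max(y + ky - K / 2, 0), h - 1);
            sum += in[iy * w + ix] * kernel[ky * K + kx];
        }
    out[y * w + x] = sum;
}
```

On a compute capability 1.x device those redundant global loads are uncached, which is why the Tesla C1060 result is so slow; Fermi's L1/L2 caches absorb much of the re-fetching.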

For good results on compute capability 1.x devices either use a texture, or pre-load data for the whole block to shared memory and work from there.

thank you so much for your reply!

Are you saying that Fermi cards are able to use memory more efficiently?

Lifan Xu,

(sm_11) GPU : 0 ms
(x2) CPU : 16 ms

Are you sure that your CPU computes this in 1 ms?

Yes, Fermi GPUs cache global memory. However, if you are willing to do the extra programming for using a texture backed by a CUDA array, it might even be faster because it takes advantage of the 2D spatial locality.
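As a rough illustration, a texture-based version might look like the sketch below, using the legacy texture-reference API that was current for pre-Fermi cards (kernel size and setup details are assumptions, not taken from the attached code). The texture cache is optimized for 2D locality, and the clamp address mode handles image borders for free.

```cuda
// Sketch of a texture-backed 2D convolution (legacy texture-reference
// API, as used on compute capability 1.x). Assumes a 3x3 kernel.
#define K 3

texture<float, 2, cudaReadModeElementType> texIn;

__global__ void conv2d_tex(float *out, const float *kernel, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float sum = 0.0f;
    for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx)
            // +0.5f addresses the texel center with point filtering
            sum += tex2D(texIn, x + kx - K / 2 + 0.5f,
                                y + ky - K / 2 + 0.5f)
                 * kernel[ky * K + kx];
    out[y * w + x] = sum;
}

// Host-side setup (error checking omitted):
//   cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
//   cudaArray *arr;
//   cudaMallocArray(&arr, &desc, w, h);
//   cudaMemcpyToArray(arr, 0, 0, h_in, w * h * sizeof(float),
//                     cudaMemcpyHostToDevice);
//   texIn.addressMode[0] = texIn.addressMode[1] = cudaAddressModeClamp;
//   texIn.filterMode = cudaFilterModePoint;
//   cudaBindTextureToArray(texIn, arr);
```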

If you really want to get good performance you should look into using shared memory to preload the data in a block and then reuse the preloaded data from shared memory.
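A minimal sketch of that idea, assuming a 3×3 kernel and 16×16 thread blocks (both assumptions, not values from the attached code): each block stages its tile plus a one-pixel halo into shared memory once, then every thread reuses the staged data instead of hitting global memory per tap.

```cuda
// Shared-memory tiled 2D convolution sketch. Each block loads its
// TILE x TILE region plus a halo of radius R once, then all threads
// convolve out of shared memory.
#define K    3
#define R    (K / 2)   // halo radius
#define TILE 16        // assumed block dimensions: TILE x TILE

__global__ void conv2d_shared(const float *in, float *out,
                              const float *kernel, int w, int h)
{
    __shared__ float tile[TILE + 2 * R][TILE + 2 * R];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Cooperatively load tile + halo; some threads load several values.
    for (int ty = threadIdx.y; ty < TILE + 2 * R; ty += blockDim.y)
        for (int tx = threadIdx.x; tx < TILE + 2 * R; tx += blockDim.x) {
            int ix = min(max((int)(blockIdx.x * TILE) + tx - R, 0), w - 1);
            int iy = min(max((int)(blockIdx.y * TILE) + ty - R, 0), h - 1);
            tile[ty][tx] = in[iy * w + ix];
        }
    __syncthreads();

    if (x >= w || y >= h) return;   // safe: after the barrier

    float sum = 0.0f;
    for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx)
            sum += tile[threadIdx.y + ky][threadIdx.x + kx]
                 * kernel[ky * K + kx];
    out[y * w + x] = sum;
}
```

With this structure each input value is read from global memory roughly once per block instead of up to K×K times per pixel, which is the difference that the Fermi caches were papering over.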