The code is entirely memory-bandwidth limited, and on pre-Fermi GPUs it makes poor use of that bandwidth because it reloads the same data from global memory for every output value it computes.
For good results on compute capability 1.x devices, either use a texture or preload the data for the whole block into shared memory and work from there.
Yes, Fermi GPUs cache global memory accesses. However, if you are willing to do the extra programming to use a texture backed by a CUDA array, it may be even faster, because the texture cache takes advantage of 2D spatial locality.
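As a rough sketch of what that looks like (using the legacy texture-reference API of that era; the kernel, names, and the 4-point stencil are illustrative assumptions, not your actual code):

```cuda
#include <cuda_runtime.h>

// File-scope texture reference bound to a 2D cudaArray.
texture<float, 2, cudaReadModeElementType> tex;

// Hypothetical 4-point stencil kernel; reads go through the texture
// cache, which exploits 2D spatial locality among neighbouring threads.
__global__ void stencilTex(float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float v = 0.25f * (tex2D(tex, x - 1, y) + tex2D(tex, x + 1, y) +
                       tex2D(tex, x, y - 1) + tex2D(tex, x, y + 1));
    out[y * w + x] = v;
}

void run(const float *hostData, float *devOut, int w, int h)
{
    // Allocate a CUDA array, copy the input into it, bind the texture.
    cudaArray *arr;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMallocArray(&arr, &desc, w, h);
    cudaMemcpyToArray(arr, 0, 0, hostData, w * h * sizeof(float),
                      cudaMemcpyHostToDevice);
    cudaBindTextureToArray(tex, arr);

    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    stencilTex<<<grid, block>>>(devOut, w, h);

    cudaUnbindTexture(tex);
    cudaFreeArray(arr);
}
```

The CUDA array gives the hardware a layout optimized for 2D locality, which a plain pitched pointer does not.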
If you really want good performance, though, use shared memory: preload each block's data once, then have all of the block's threads reuse it from there instead of re-reading global memory.
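A minimal sketch of that tiling pattern, again assuming a hypothetical 4-point stencil (tile size and names are my own, not from your code):

```cuda
#define TILE 16

// Each block preloads a (TILE+2) x (TILE+2) tile, including a
// one-element halo, into shared memory; every thread then computes
// its stencil from the preloaded tile.
__global__ void stencilShared(const float *in, float *out, int w, int h)
{
    __shared__ float tile[TILE + 2][TILE + 2];

    int gx = blockIdx.x * TILE + threadIdx.x; // global coordinates
    int gy = blockIdx.y * TILE + threadIdx.y;
    int lx = threadIdx.x + 1;                 // local coordinates in tile
    int ly = threadIdx.y + 1;

    // Load the centre element, clamping reads at the image borders.
    int cx = min(max(gx, 0), w - 1);
    int cy = min(max(gy, 0), h - 1);
    tile[ly][lx] = in[cy * w + cx];

    // Threads on the tile edges also load the halo elements.
    if (threadIdx.x == 0)
        tile[ly][0] = in[cy * w + max(gx - 1, 0)];
    if (threadIdx.x == TILE - 1)
        tile[ly][TILE + 1] = in[cy * w + min(gx + 1, w - 1)];
    if (threadIdx.y == 0)
        tile[0][lx] = in[max(gy - 1, 0) * w + cx];
    if (threadIdx.y == TILE - 1)
        tile[TILE + 1][lx] = in[min(gy + 1, h - 1) * w + cx];

    __syncthreads(); // wait until the whole tile is loaded

    if (gx < w && gy < h)
        out[gy * w + gx] = 0.25f * (tile[ly][lx - 1] + tile[ly][lx + 1] +
                                    tile[ly - 1][lx] + tile[ly + 1][lx]);
}
```

With this layout each input element is fetched from global memory roughly once per block rather than once per output value, which is exactly what a bandwidth-limited kernel needs.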