"daxpy" in CUBLAS is slower than that in MKL

Hello!

I’m using both CUBLAS and MKL for my research. “dgemm” in CUBLAS is much faster than MKL’s on my computer, but I found that “daxpy” in CUBLAS is slightly slower than MKL’s on the same machine. Why is that? Can anyone give me a hint?

Thank you in advance!

I’m quite sure it isn’t.

DAXPY is entirely memory-bandwidth bound, and the memory bandwidth of your GPU is likely 5× to 20× higher than that of your CPU. HOWEVER, you might have chosen a vector length that still fits entirely in the CPU’s cache but not in the GPU’s, so that for your particular vector length the CPU has an edge.
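To put a number on that, a quick sanity check like the following (just a sketch I’m adding for illustration; the vector length and iteration style are arbitrary assumptions) times a cublasDaxpy on data that is already resident in device memory and converts the time into effective bandwidth:

```c
/* Rough bandwidth check for cublasDaxpy on device-resident data (sketch only). */
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int n = 1 << 24;                 /* ~16M doubles, well past any CPU cache */
    const double alpha = 2.0;

    double *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(double));
    cudaMalloc((void **)&d_y, n * sizeof(double));
    cudaMemset(d_x, 0, n * sizeof(double));
    cudaMemset(d_y, 0, n * sizeof(double));

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   /* warm-up call */

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   /* y = alpha*x + y */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    /* read x, read y, write y -> 24 bytes moved per element, but only 2 flops */
    printf("effective bandwidth: %.1f GB/s\n", 24.0 * n / (ms * 1.0e6));

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

As long as the host-device transfers stay out of the timed region, you would expect this to land near the card’s memory bandwidth, i.e. well above what the CPU can stream for the same DAXPY.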

In general, try to work at a higher level than vector-vector operations to achieve a better ratio of arithmetic operations to memory accesses.
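As a concrete illustration of that (my own sketch, not something from this thread): if you repeatedly accumulate y += a_j * x_j over many vectors, storing the x_j as columns of a matrix lets a single DGEMV replace all of those DAXPY calls, so each vector is streamed once instead of re-reading and re-writing y every time:

```c
/* Sketch: fold k daxpy calls, y += a[j] * X(:,j), into one gemv: y = X*a + y.
 * Assumes d_X is an n-by-k column-major matrix on the device, d_a holds the k
 * coefficients, and d_y is the length-n accumulator (all names hypothetical). */
#include <cublas_v2.h>

void accumulate_gemv(cublasHandle_t handle, int n, int k,
                     const double *d_X, const double *d_a, double *d_y)
{
    const double one = 1.0;
    /* y = 1.0 * X * a + 1.0 * y  -- one pass over X instead of k passes over y */
    cublasDgemv(handle, CUBLAS_OP_N, n, k, &one, d_X, n, d_a, 1, &one, d_y, 1);
}
```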

What model of GPU are you using? There are some GPUs which are easily slower than a good CPU running MKL.

I’m using Intel Xeon X5679 and Tesla C2070.

Thank you, tera! I remember that you have answered many of my posts here. After I posted my problem, I also googled around and found that the AMD people gave a similar answer: “SAXPY and DAXPY … cannot benefit from GPU acceleration”.

See the URL:

http://forums.amd.com/forum/messageview.cfm?catid=217&threadid=117584

If you need to transfer data to/from the GPU for a single daxpy call, then yes, the GPU will be a net loss. Most people try to keep their entire app on the GPU so that the memory transfer overhead is minimal.
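In other words, the usual pattern looks something like this (a sketch with made-up names): transfer once, run all the BLAS calls on device-resident data, and transfer the result back once:

```c
/* Sketch: pay the PCIe transfer once, then keep the data resident on the GPU
 * across many BLAS calls (function and variable names are hypothetical). */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void iterate_on_device(cublasHandle_t handle, int n, int iters,
                       const double *h_x, double *h_y)
{
    double *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(double));
    cudaMalloc((void **)&d_y, n * sizeof(double));

    /* one transfer in ... */
    cudaMemcpy(d_x, h_x, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(double), cudaMemcpyHostToDevice);

    const double alpha = 0.5;
    for (int i = 0; i < iters; ++i)
        cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  /* no host traffic here */

    /* ... and one transfer out, instead of a round trip per daxpy */
    cudaMemcpy(h_y, d_y, n * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_y);
}
```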

Yes! The data transfer is the bottleneck for this problem!

I don’t suppose using zero-copy would give any significant improvement in this case?
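For reference, a zero-copy attempt would look roughly like this (an untested sketch): the daxpy kernel then reads and writes the vectors directly across PCIe through mapped pinned memory, so it trades the explicit cudaMemcpy for implicit bus traffic rather than removing it.

```c
/* Sketch of a zero-copy (mapped pinned memory) daxpy call -- hypothetical,
 * shown only to illustrate the mechanism. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void zero_copy_daxpy(int n)
{
    /* must be set before any CUDA context exists (i.e. before cublasCreate) */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cublasHandle_t handle;
    cublasCreate(&handle);

    double *h_x, *h_y, *d_x, *d_y;
    cudaHostAlloc((void **)&h_x, n * sizeof(double), cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_y, n * sizeof(double), cudaHostAllocMapped);

    /* ... fill h_x and h_y on the CPU ... */

    /* device-side aliases of the host buffers; the kernel accesses them over PCIe */
    cudaHostGetDevicePointer((void **)&d_x, h_x, 0);
    cudaHostGetDevicePointer((void **)&d_y, h_y, 0);

    const double alpha = 2.0;
    cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
    cudaDeviceSynchronize();   /* results are visible in h_y after the sync */

    cudaFreeHost(h_x);
    cudaFreeHost(h_y);
    cublasDestroy(handle);
}
```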