"daxpy" in CUBLAS is slower than that in MKL


I’m using both CUBLAS and MKL for my research. “dgemm” in CUBLAS is much much faster than that in MKL on my computer, but I found that “daxpy” in CUBLAS is slightly slower than that in MKL on the same machine. Why is that? Can anyone give me a hint?

Thank you in advance!

I’m quite sure it isn’t.

DAXPY is entirely memory bandwidth bound, and the memory bandwidth of your GPU is likely 5× to 20× higher than that of your CPU. However, you might have chosen a vector length at which the operation still fits entirely within the CPU's cache but not in the GPU's, so that for your particular vector length the CPU has an edge.
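As a rough illustration of "bandwidth bound" (the bandwidth figures below are assumptions for this class of hardware, not measured values): DAXPY reads x, reads y, and writes y, so it moves 24 bytes per double-precision element, and its runtime is essentially memory traffic divided by bandwidth.

```python
def daxpy_time_estimate(n, bandwidth_gb_s):
    """Estimate DAXPY runtime in seconds for n-element double vectors."""
    # y = a*x + y touches 3 doubles per element: read x, read y, write y.
    bytes_moved = 24 * n
    return bytes_moved / (bandwidth_gb_s * 1e9)

n = 10_000_000
cpu_t = daxpy_time_estimate(n, 30)   # assumed ~30 GB/s CPU memory bandwidth
gpu_t = daxpy_time_estimate(n, 120)  # assumed ~120 GB/s for a Tesla C2070
print(f"CPU ~{cpu_t * 1e3:.1f} ms, GPU ~{gpu_t * 1e3:.1f} ms")
```

With these assumed numbers the GPU wins by the bandwidth ratio alone; the caveat above is that for small n the CPU run is served from cache at far higher effective bandwidth than the 30 GB/s used here.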

In general, try to work at a higher level than vector-vector operations to achieve a better ratio of arithmetic operations to memory accesses.
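One way to quantify "higher level" is arithmetic intensity (flops per byte of memory traffic). A sketch, ignoring cache blocking: DAXPY does 2 flops per element against 24 bytes of traffic, while DGEMM does about 2n³ flops against roughly 3·8·n² bytes for the three matrices, so its intensity grows with n.

```python
def daxpy_intensity():
    # 2 flops (one multiply, one add) per element; 24 bytes moved.
    return 2 / 24

def dgemm_intensity(n):
    # ~2*n^3 flops over three n-by-n double matrices (3 * 8 * n^2 bytes).
    # This ignores cache blocking, which only improves the effective ratio.
    return (2 * n**3) / (3 * 8 * n**2)

print(f"DAXPY:           {daxpy_intensity():.3f} flop/byte")
print(f"DGEMM (n=1000):  {dgemm_intensity(1000):.1f} flop/byte")
```

This is why dgemm shows a dramatic GPU speedup while daxpy shows essentially none: dgemm is compute-bound at realistic sizes, daxpy never is.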

What model of GPU are you using? There are some GPUs which are easily slower than a good CPU running MKL.

I’m using Intel Xeon X5679 and Tesla C2070.

Thank you, tera! I remember you have answered many of my posts here. After I posted my problem, I also googled and found that the AMD people gave a similar answer: “SAXPY and DAXPY … cannot benefit from GPU acceleration”.

See the URL:


If you need to transfer data to/from the GPU for a single daxpy call, then yes, the GPU will be a net loss. Most people try to write their entire app on the GPU so that the memory transfer overhead is minimal.
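To see why a lone daxpy call is a net loss when the vectors must cross the bus (the ~6 GB/s PCIe figure below is an assumption for a PCIe 2.0 x16 link, and 120 GB/s an assumed device bandwidth): the transfers alone cost many times the kernel time.

```python
def single_daxpy_cost(n, pcie_gb_s=6.0, gpu_bw_gb_s=120.0):
    """Return (transfer_seconds, kernel_seconds) for one isolated daxpy."""
    # Host->device: x and y (16 bytes/element); device->host: y (8 bytes).
    transfer = (24 * n) / (pcie_gb_s * 1e9)
    # The kernel itself is bandwidth-bound in device memory.
    kernel = (24 * n) / (gpu_bw_gb_s * 1e9)
    return transfer, kernel

transfer, kernel = single_daxpy_cost(10_000_000)
print(f"transfer ~{transfer * 1e3:.1f} ms vs kernel ~{kernel * 1e3:.1f} ms")
```

Under these assumptions the copies outweigh the compute by roughly the ratio of device bandwidth to PCIe bandwidth (~20×), which is why keeping the data resident on the GPU across many calls is the usual answer.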

Yes! The data transfer is the bottleneck for this problem!

I don’t suppose using zero-copy would give any significant improvement in this case?