Strange CUBLAS Saxpy result: user-defined kernel faster?

Hello mates,

I believe many of you use CUBLAS functions for iterative methods. Just for fun, I compared CUBLAS Saxpy with my own simple kernel, which follows:

extern "C"
__global__ void SaxpyV1Kern(float alpha, float* vecX, float* vecY) {
    // One thread per element; assumes the vector length is a multiple of the block size.
    int id = (blockIdx.x * blockDim.x) + threadIdx.x;
    vecY[id] = alpha*vecX[id] + vecY[id];
}

My Device: GeForce 8800 GTX

Usage Pattern:

  • CUDA Driver API

  • 1D blocks and 1D grid

  • 256 threads in a block

My strange measurements for vector size = 2^22 = 4194304 elements:

CPU Performance/Throughput: 0.724 GFLOPS/4.04 GB/s

GPU User Defined Kernel Performance/Throughput: 12.652 GFLOPS/70.7035 GB/s

GPU CUBLAS Performance/Throughput: 10.641 GFLOPS/59.463 GB/s
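
As a sanity check on how the two numbers in each line relate: assuming 2 floating-point operations per element (one multiply, one add) and 12 bytes of memory traffic per element (read x, read y, write y, 4 bytes each), the throughput should be 6 bytes per flop. The reported GB/s figures appear to match that ratio only if 1 GB is taken as 2^30 bytes, which is an inference from the numbers and not something stated above.

```latex
\begin{align*}
\text{flops per element} &= 2 \quad (\text{one multiply, one add}) \\
\text{bytes per element} &= 3 \times 4 = 12 \quad (\text{read } x_i,\ \text{read } y_i,\ \text{write } y_i) \\
\frac{\text{bytes}}{\text{flop}} &= \frac{12}{2} = 6 \\
\text{user kernel:}\quad 12.652 \times 10^{9} \times 6 &\approx 75.9 \times 10^{9}\ \text{B/s} \approx 70.7 \times 2^{30}\ \text{B/s} \\
\text{CUBLAS:}\quad 10.641 \times 10^{9} \times 6 &\approx 63.8 \times 10^{9}\ \text{B/s} \approx 59.5 \times 2^{30}\ \text{B/s}
\end{align*}
```

With so little arithmetic per byte moved, both versions are probably limited by memory bandwidth, not by floating-point throughput.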

What am I doing wrong? I thought CUBLAS was highly optimized, so a user-defined SAXPY kernel shouldn't be faster :-( Any help appreciated…

You can look at the CUBLAS source and see what NVIDIA wrote:

The CUBLAS saxpy is more general than your version, including incx and incy options for array stride. That extra indexing arithmetic probably slows down the CUBLAS code relative to your code, which assumes consecutive elements.
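
To illustrate the kind of extra indexing a strided interface needs, here is a minimal sketch of a SAXPY kernel with incx and incy. This is not the CUBLAS source; the names SaxpyStridedKern and runSaxpyStrided, the length parameter n, and the bounds check are my own assumptions, and it only handles positive strides. It uses the runtime API for brevity, not the Driver API from the original post.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical strided SAXPY: y[i*incy] = alpha * x[i*incx] + y[i*incy], i in [0, n).
// Assumes incx > 0 and incy > 0. Not the actual CUBLAS implementation.
__global__ void SaxpyStridedKern(int n, float alpha,
                                 const float* vecX, int incx,
                                 float* vecY, int incy) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {  // guard for lengths that are not a multiple of the block size
        // Two extra integer multiplies per thread compared with the unit-stride kernel.
        vecY[id * incy] = alpha * vecX[id * incx] + vecY[id * incy];
    }
}

// Runs the kernel with unit strides and checks every element on the host.
bool runSaxpyStrided() {
    const int n = 1 << 22;                          // 4194304 elements
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads; // 16384 blocks
    const size_t bytes = n * sizeof(float);

    std::vector<float> hostX(n, 1.0f), hostY(n, 3.0f);
    float *devX = nullptr, *devY = nullptr;
    if (cudaMalloc((void**)&devX, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&devY, bytes) != cudaSuccess) { cudaFree(devX); return false; }
    cudaMemcpy(devX, hostX.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(devY, hostY.data(), bytes, cudaMemcpyHostToDevice);

    SaxpyStridedKern<<<blocks, threads>>>(n, 2.0f, devX, 1, devY, 1);
    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hostY.data(), devY, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devX);
    cudaFree(devY);
    if (err != cudaSuccess) return false;

    // Each element should be 2 * 1 + 3 = 5.
    for (int i = 0; i < n; ++i) {
        if (hostY[i] != 5.0f) return false;
    }
    return true;
}

int main() {
    bool ok = runSaxpyStrided();
    std::printf("strided saxpy %s\n", ok ? "verified" : "FAILED");
    return ok ? 0 : 1;
}
```

With incx = incy = 1 this does the same work as the hand-written kernel plus the index multiplies and the bounds check. Strides larger than 1 would also break memory coalescing, which on this hardware likely costs far more than the arithmetic.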

Whooooow, I take my hat off to you, Seibert. Your reply is much better than I expected. I didn't know the CUBLAS sources were available anywhere for download; I assumed NVIDIA kept them secret. Using your link, I can study the CUBLAS code carefully. THANK YOU!!!