Why is CUBLAS performance not good on Kepler?

Hi,
I have to do an operation that looks like this:

for (int i = 0; i < size; i++)
{
    b[i] += factor * a[i];
}

So I have written a kernel that serves the purpose, shown below, with Size = 512*512, blocks = 512, and ThreadsPerBlock = 512:

// One thread per element: b[i] += factor * a[i]
__global__ void Kernel(float* a, float* b, float factor, int Size)
{
    unsigned int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadId < Size)
    {
        b[threadId] += factor * a[threadId];
    }
}
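
A minimal sketch of the launch and event timing for this kernel, assuming device arrays d_a and d_b and an arbitrary factor (the allocations, the hard-coded factor, and the event setup here are illustrative, not the exact original code):

const int size = 512 * 512;          // 262144 elements, ~1 MB per float array
const int threadsPerBlock = 512;
const int blocks = (size + threadsPerBlock - 1) / threadsPerBlock;  // = 512

float *d_a, *d_b;
cudaMalloc(&d_a, size * sizeof(float));
cudaMalloc(&d_b, size * sizeof(float));

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
Kernel<<<blocks, threadsPerBlock>>>(d_a, d_b, 2.0f, size);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);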

I can replace the same operation with a SAXPY call, like below.

cublasStatus_t status;
cublasHandle_t handle;

// cublasCreate() is all the initialization the v2 API needs
// (the legacy cublasInit() call is not required alongside it).
status = cublasCreate(&handle);

if (status != CUBLAS_STATUS_SUCCESS)
{
    fprintf(stderr, "!!!! CUBLAS initialization error\n");
    exit(0);
}

cudaEventRecord(start); // using CUDA event timer
status = cublasSaxpy(handle, fullSize, &fFactor, a, 1, b, 1);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);

if (status != CUBLAS_STATUS_SUCCESS)
{
    fprintf(stderr, "library call problem\n");
    exit(0);
}

I was expecting good speed from the CUBLAS library, but my naive kernel is performing better. I am using CUDA 6.5 on a K20. Can someone please tell me whether I am using CUBLAS incorrectly? Why is CUBLAS not giving me good performance?

Thanks
Siva Rama Krishna

What are the times reported by the CUDA profiler for your kernel versus the CUBLAS kernel? You can use the simple profiler built into the CUDA driver for your measurements. Simply export CUDA_PROFILE=1 and run the app with the kernels. A log file will be written to the current directory. Remember to unset the environment variable when done profiling.

A vector of 256K elements (1 MB per array) is very small, and the kernel run time will be extremely short, so your host-level measurements may be skewed by the higher overhead of the CUBLAS API call compared to a call to your own kernel. You would want to focus on kernel execution time on the GPU.
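
If you do want to stay with host-side event timing, one way to separate the one-time setup cost from the steady-state cost is an untimed warm-up call before the measured one. A sketch, reusing the names (handle, fullSize, fFactor, a, b, start, stop, milliseconds) from your snippet above:

// Warm-up call: absorbs one-time CUBLAS setup cost, not timed.
cublasSaxpy(handle, fullSize, &fFactor, a, 1, b, 1);
cudaDeviceSynchronize();

// Timed call: now measures just the kernel plus per-call launch overhead.
cudaEventRecord(start);
cublasSaxpy(handle, fullSize, &fFactor, a, 1, b, 1);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);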

Why do you think a library could or should perform better than your kernel?

This is a trivial arithmetic operation, for which your “naive” kernel is already approximately optimal, and it carries none of the library overhead. CUBLAS does a number of things to prepare for the real “work” (try profiling a cublas call) that your kernel doesn’t do or need to do.

SAXPY is provided in CUBLAS for API convenience and completeness, not because the library writers can do a much better job than you can on such a simple operation.

Try writing an SGEMM (matrix multiply) kernel, and then compare your kernel against the SGEMM library call provided by CUBLAS on a reasonably large problem (let’s say 4096x4096 matrices).

The results will shock you.
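
To make that concrete, here is a sketch of such a comparison: a textbook one-thread-per-element kernel against cublasSgemm on 4096x4096 column-major matrices. The naive kernel here is only an illustration, not an optimized implementation, and error checking is omitted for brevity:

#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Textbook one-element-per-thread SGEMM: C = alpha*A*B + beta*C for
// N x N column-major matrices (matching the cuBLAS convention).
__global__ void NaiveSgemm(const float* A, const float* B, float* C,
                           int N, float alpha, float beta)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
    {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[k * N + row] * B[col * N + k];
        C[col * N + row] = alpha * sum + beta * C[col * N + row];
    }
}

int main()
{
    const int N = 4096;
    const float alpha = 1.0f, beta = 0.0f;
    const size_t bytes = (size_t)N * N * sizeof(float);

    float *A, *B, *C;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemset(A, 0, bytes);   // contents don't matter for a timing test
    cudaMemset(B, 0, bytes);
    cudaMemset(C, 0, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Time the naive kernel.
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    cudaEventRecord(start);
    NaiveSgemm<<<grid, block>>>(A, B, C, N, alpha, beta);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("naive SGEMM:  %.1f ms\n", ms);

    // Warm-up call so one-time library setup is not measured.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, A, N, B, N, &beta, C, N);
    cudaDeviceSynchronize();

    // Time the CUBLAS call.
    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, A, N, B, N, &beta, C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("CUBLAS SGEMM: %.1f ms\n", ms);

    cublasDestroy(handle);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}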