Why is CUBLAS performance not good on Kepler?

Hi,
I have to do an operation that looks like this:

for (int i = 0; i < size; i++)
{
    b[i] += factor * a[i];
}

So I have written a kernel that serves the purpose, shown below, with Size = 512*512, blocks = 512, and ThreadsPerBlock = 512:

// One thread per element: b[i] += factor * a[i]
__global__ void Kernel(float* a, float* b, float factor, int Size)
{
    unsigned int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadId < Size)
    {
        b[threadId] += factor * a[threadId];
    }
}
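
A minimal sketch of the launch and event timing for this kernel, assuming device arrays d_a and d_b and an arbitrary factor (the allocations, the hard-coded factor, and the event setup here are illustrative, not the exact original code):

const int size = 512 * 512;          // 262144 elements, ~1 MB per float array
const int threadsPerBlock = 512;
const int blocks = (size + threadsPerBlock - 1) / threadsPerBlock;  // = 512

float *d_a, *d_b;
cudaMalloc(&d_a, size * sizeof(float));
cudaMalloc(&d_b, size * sizeof(float));

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
Kernel<<<blocks, threadsPerBlock>>>(d_a, d_b, 2.0f, size);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);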

I can replace the same operation with a SAXPY call, like below.

cublasStatus_t status;
cublasHandle_t handle;

// cublasCreate() is all the initialization the v2 API needs
// (the legacy cublasInit() call is not required alongside it).
status = cublasCreate(&handle);

if (status != CUBLAS_STATUS_SUCCESS)
{
    fprintf(stderr, "!!!! CUBLAS initialization error\n");
    exit(0);
}

cudaEventRecord(start); // using CUDA event timer
status = cublasSaxpy(handle, fullSize, &fFactor, a, 1, b, 1);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);

if (status != CUBLAS_STATUS_SUCCESS)
{
    fprintf(stderr, "library call problem\n");
    exit(0);
}

I was expecting good speed from the CUBLAS library, but my naive kernel is performing better. I am using CUDA 6.5 on a K20. Can someone please tell me whether I am using CUBLAS incorrectly? Why is CUBLAS not giving me good performance?

Thanks
Siva Rama Krishna

What are the times reported by the CUDA profiler for your kernel versus the CUBLAS kernel? You can use the simple profiler built into the CUDA driver for your measurements. Simply export CUDA_PROFILE=1 and run the app with the kernels. A log file will be written to the current directory. Remember to unset the environment variable when done profiling.

A vector of 256K elements (1 MB per array) is very small, and the kernel run time will be extremely short, so your host-level measurements may be skewed by the higher overhead of the CUBLAS API call compared to a call to your own kernel. You would want to focus on kernel execution time on the GPU.
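
If you do want to stay with host-side event timing, one way to separate the one-time setup cost from the steady-state cost is an untimed warm-up call before the measured one. A sketch, reusing the names (handle, fullSize, fFactor, a, b, start, stop, milliseconds) from your snippet above:

// Warm-up call: absorbs one-time CUBLAS setup cost, not timed.
cublasSaxpy(handle, fullSize, &fFactor, a, 1, b, 1);
cudaDeviceSynchronize();

// Timed call: now measures just the kernel plus per-call launch overhead.
cudaEventRecord(start);
cublasSaxpy(handle, fullSize, &fFactor, a, 1, b, 1);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);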

Why do you think a library could or should perform better than your kernel?

This is a trivial arithmetic operation, for which your “naive” kernel is already approximately optimal, and it carries none of the library overhead. CUBLAS does a number of things to prepare for the real “work” (try profiling a cublas call) that your kernel doesn’t do or need to do.

SAXPY is provided in CUBLAS for API convenience and completeness, not because the library writers can do a much better job than you can on such a simple operation.

Try writing an SGEMM (matrix multiply) kernel, and then compare your kernel against the SGEMM library call provided by CUBLAS on a reasonably large problem (let’s say 4096x4096 matrices).

The results will shock you.
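
To make that concrete, here is a sketch of such a comparison: a textbook one-thread-per-element kernel against cublasSgemm on 4096x4096 column-major matrices. The naive kernel here is only an illustration, not an optimized implementation, and error checking is omitted for brevity:

#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Textbook one-element-per-thread SGEMM: C = alpha*A*B + beta*C for
// N x N column-major matrices (matching the cuBLAS convention).
__global__ void NaiveSgemm(const float* A, const float* B, float* C,
                           int N, float alpha, float beta)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
    {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[k * N + row] * B[col * N + k];
        C[col * N + row] = alpha * sum + beta * C[col * N + row];
    }
}

int main()
{
    const int N = 4096;
    const float alpha = 1.0f, beta = 0.0f;
    const size_t bytes = (size_t)N * N * sizeof(float);

    float *A, *B, *C;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemset(A, 0, bytes);   // contents don't matter for a timing test
    cudaMemset(B, 0, bytes);
    cudaMemset(C, 0, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Time the naive kernel.
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    cudaEventRecord(start);
    NaiveSgemm<<<grid, block>>>(A, B, C, N, alpha, beta);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("naive SGEMM:  %.1f ms\n", ms);

    // Warm-up call so one-time library setup is not measured.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, A, N, B, N, &beta, C, N);
    cudaDeviceSynchronize();

    // Time the CUBLAS call.
    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, A, N, B, N, &beta, C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("CUBLAS SGEMM: %.1f ms\n", ms);

    cublasDestroy(handle);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}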