cuBLAS vs. inline PTX matrix multiplication

Hello

I’ve implemented matrix multiplication with inline PTX code like this:

__global__ void inline_ptx_mm(float *a, float *b, float *c, int m, int n, int k)
{
    int row, col, c_ind;

    // row   = blockIdx.y * blockDim.y + threadIdx.y
    // col   = blockIdx.x * blockDim.x + threadIdx.x
    // c_ind = row * k + col
    asm("{\n\t"
        "mad.lo.u32 %0, %3, %4, %5;\n\t"
        "mad.lo.u32 %1, %6, %7, %8;\n\t"
        "mad.lo.u32 %2, %0, %9, %1;\n\t"
        "}"
        : "=r"(row), "=r"(col), "=r"(c_ind)
        : "r"(blockIdx.y), "r"(blockDim.y), "r"(threadIdx.y),
          "r"(blockIdx.x), "r"(blockDim.x), "r"(threadIdx.x),
          "r"(k));

    float sum = 0.0f;

    if (col < k && row < m)
    {
        // dot product of row `row` of A (m x n) with column `col` of B (n x k)
        for (int e = 0; e < n; e++)
        {
            asm("mad.rn.f32 %0, %1, %2, %0;"
                : "+f"(sum)
                : "f"(a[row * n + e]), "f"(b[e * k + col]));
        }
        c[c_ind] = sum;
    }
}
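For reference, this is my reading of what the PTX above computes, written as a plain CUDA C kernel. The kernel name plain_mm is just for illustration and is not part of my project code.

__global__ void plain_mm(const float *a, const float *b, float *c, int m, int n, int k)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // column of C

    if (col < k && row < m)
    {
        float sum = 0.0f;
        for (int e = 0; e < n; e++)
            sum += a[row * n + e] * b[e * k + col];
        c[row * k + col] = sum;                        // same as c_ind above
    }
}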

I measured the timing with CUDA events for both the inline PTX implementation and cuBLAS.
With matrix dimensions m = 4, n = 256, k = 64, the inline PTX kernel takes less time than cuBLAS. But when I invoke this kernel in place of the cuBLAS SGEMM API call in my project code, the overall CPU frequency (MHz) increases.
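In case it helps, this is roughly how the timing is set up, as a minimal sketch. It assumes row-major A (m x n), B (n x k), C (m x k) already allocated on the device, a 16x16 thread block, and the usual operand-swap trick so the column-major cuBLAS API produces the row-major product; the pointer names d_a, d_b, d_c and the block shape are assumptions for illustration, not taken from my project code.

#include <cuda_runtime.h>
#include <cublas_v2.h>

float time_inline_ptx(float *d_a, float *d_b, float *d_c, int m, int n, int k)
{
    dim3 block(16, 16);
    dim3 grid((k + block.x - 1) / block.x, (m + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    inline_ptx_mm<<<grid, block>>>(d_a, d_b, d_c, m, n, k);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

float time_cublas(cublasHandle_t handle, float *d_a, float *d_b, float *d_c,
                  int m, int n, int k)
{
    // handle is created with cublasCreate() elsewhere; a warm-up call before
    // timing avoids measuring one-time cuBLAS initialization cost.
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // cuBLAS is column-major; computing C^T = B^T * A^T yields the
    // row-major C = A * B without explicit transposes.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                k, m, n, &alpha, d_b, k, d_a, n, &beta, d_c, k);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}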

Can anyone suggest what the reason might be?