cuBLAS vs. inline PTX matrix multiplication

Hello

I’ve implemented matrix multiplication with inline PTX code like this:

__global__ void inline_ptx_mm(float *a, float *b, float *c, int m, int n, int k)
{
    int row, col, c_ind;

    // row   = blockIdx.y * blockDim.y + threadIdx.y
    // col   = blockIdx.x * blockDim.x + threadIdx.x
    // c_ind = row * k + col
    asm("{\n\t"
        "mad.lo.u32 %0, %3, %4, %5;\n\t"
        "mad.lo.u32 %1, %6, %7, %8;\n\t"
        "mad.lo.u32 %2, %0, %9, %1;\n\t"
        "}"
        : "=r"(row), "=r"(col), "=r"(c_ind)
        : "r"(blockIdx.y), "r"(blockDim.y), "r"(threadIdx.y),
          "r"(blockIdx.x), "r"(blockDim.x), "r"(threadIdx.x),
          "r"(k));

    float sum = 0.0f;

    if (col < k && row < m)
    {
        // dot product of row `row` of A (m x n) with column `col` of B (n x k)
        for (int e = 0; e < n; e++)
        {
            asm("mad.rn.f32 %0, %1, %2, %0;"
                : "+f"(sum)
                : "f"(a[row * n + e]), "f"(b[e * k + col]));
        }
        c[c_ind] = sum;
    }
}
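For reference, this is my reading of what the PTX above computes, written as a plain CUDA C kernel. The kernel name plain_mm is just for illustration and is not part of my project code.

__global__ void plain_mm(const float *a, const float *b, float *c, int m, int n, int k)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // column of C

    if (col < k && row < m)
    {
        float sum = 0.0f;
        for (int e = 0; e < n; e++)
            sum += a[row * n + e] * b[e * k + col];
        c[row * k + col] = sum;                        // same as c_ind above
    }
}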

I measured the timing with CUDA events for both the inline PTX implementation and cuBLAS.
With matrix dimensions m = 4, n = 256, k = 64, the inline PTX kernel takes less time than cuBLAS. But when I invoke this kernel in place of the cuBLAS SGEMM API call in my project code, the overall CPU frequency (MHz) increases.
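In case it helps, this is roughly how the timing is set up, as a minimal sketch. It assumes row-major A (m x n), B (n x k), C (m x k) already allocated on the device, a 16x16 thread block, and the usual operand-swap trick so the column-major cuBLAS API produces the row-major product; the pointer names d_a, d_b, d_c and the block shape are assumptions for illustration, not taken from my project code.

#include <cuda_runtime.h>
#include <cublas_v2.h>

float time_inline_ptx(float *d_a, float *d_b, float *d_c, int m, int n, int k)
{
    dim3 block(16, 16);
    dim3 grid((k + block.x - 1) / block.x, (m + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    inline_ptx_mm<<<grid, block>>>(d_a, d_b, d_c, m, n, k);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

float time_cublas(cublasHandle_t handle, float *d_a, float *d_b, float *d_c,
                  int m, int n, int k)
{
    // handle is created with cublasCreate() elsewhere; a warm-up call before
    // timing avoids measuring one-time cuBLAS initialization cost.
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // cuBLAS is column-major; computing C^T = B^T * A^T yields the
    // row-major C = A * B without explicit transposes.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                k, m, n, &alpha, d_b, k, d_a, n, &beta, d_c, k);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}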

Can anyone suggest what the reason might be?