Hello
I’ve implemented matrix multiplication with inline PTX code like this:
__global__ void inline_ptx_mm(float *a, float *b, float *c, int m, int n, int k)
{
    int row, col, c_ind;
    // row   = blockIdx.y * blockDim.y + threadIdx.y
    // col   = blockIdx.x * blockDim.x + threadIdx.x
    // c_ind = row * k + col
    asm("{\n\t"
        "mad.lo.u32 %0, %3, %4, %5;\n\t"
        "mad.lo.u32 %1, %6, %7, %8;\n\t"
        "mad.lo.u32 %2, %0, %9, %1;"
        "}"
        : "=r"(row), "=r"(col), "=r"(c_ind)
        : "r"(blockIdx.y), "r"(blockDim.y), "r"(threadIdx.y),
          "r"(blockIdx.x), "r"(blockDim.x), "r"(threadIdx.x),
          "r"(k));
    float sum = 0;
    if (col < k && row < m)
    {
        for (int e = 0; e < n; e++)
        {
            // sum = a[row*n+e] * b[e*k+col] + sum, fused multiply-add
            asm("mad.rn.f32 %0, %1, %2, %0;"
                : "+f"(sum)
                : "f"(a[row * n + e]), "f"(b[e * k + col]));
        }
        c[c_ind] = sum;
    }
}
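For reference, the inline asm above only computes row = blockIdx.y * blockDim.y + threadIdx.y, col = blockIdx.x * blockDim.x + threadIdx.x, and c_ind = row * k + col. Here is a minimal host-side sketch (plain C++, no CUDA; the function name reference_mm is my own) that mirrors the same row-major indexing, useful for checking the kernel’s output:

```cpp
#include <cassert>
#include <vector>

// Host-side reference matching the kernel's layout:
// a is m x n, b is n x k, c is m x k, all row-major.
// c[row * k + col] corresponds to the asm's c_ind = row * k + col.
static void reference_mm(const std::vector<float> &a,
                         const std::vector<float> &b,
                         std::vector<float> &c,
                         int m, int n, int k)
{
    for (int row = 0; row < m; ++row)
        for (int col = 0; col < k; ++col) {
            float sum = 0.0f;
            for (int e = 0; e < n; ++e)
                sum += a[row * n + e] * b[e * k + col]; // same as the mad.rn.f32 loop
            c[row * k + col] = sum;
        }
}
```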
I measured the timing with CUDA events for both the inline PTX implementation and cuBLAS.
With matrix dimensions (m = 4, n = 256, k = 64), the inline PTX kernel takes less time than cuBLAS. But when I invoke this kernel in place of the cublasSgemm API call in my project code, the overall CPU MHz (clock frequency) increases.
Can anyone suggest what the reason might be?