On a GeForce 9600 GSO, I used cublasSgemm and got a 4× speedup, so I wrote a kernel of my own, but it only gets a 4.5× speedup.
My kernel function is as follows:
// each thread computes one column of the result
__global__ void calc_cuda(float* A, float* B, float* result)
{
    extern __shared__ float shared[];
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    const int columnID = blockDim.x * bid + tid;

    // stage the vector A into shared memory (assumes blockDim.x >= N)
    if (tid < N)
    {
        shared[tid] = A[tid];
    }
    __syncthreads();

    if (columnID < M)
    {
        float ptmp = 0.f;
        for (int i = 0; i < N; i++)
        {
            ptmp += shared[i] * B[i * M + columnID];
        }
        result[columnID] = ptmp;
    }
}
Are there any improper operations above that cause the low efficiency?