How to calculate A*B, where A is a 1×N matrix and B is an N×M matrix, with M much larger than N?

On a GeForce 9600 GSO, I used cublasSgemm and got a 4× speedup, so I wrote a kernel of my own, but it only gets a 4.5× speedup.
My kernel function is as follows:

#define N 89
#define M 48000

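// each thread calculates one column
__global__ void calc_cuda(float* A, float* B, float* result)
{
    // dynamically sized shared memory: launch with at least N*sizeof(float) bytes
    extern __shared__ float shared[];
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    const int colomnID = blockDim.x*bid + tid;
    // copy A into shared memory (requires blockDim.x >= N)
    if(tid < N)
    {
        shared[tid] = A[tid];
    }
    __syncthreads();
    if(colomnID < M)
    {
        int i = 0;
        float ptmp = 0.f;
        // dot product of A with column colomnID of B (row-major, row stride M)
        for(i = 0; i < N; i++)
        {
            ptmp += shared[i]*B[i*M + colomnID];
        }
        result[colomnID] = ptmp;
    }
}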

Is there anything improper in the code above that causes the low efficiency?

gemv is memory-bound; I would suggest loop unrolling.

Decompose N = 16 * 5 + 9:

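// main part: 5 chunks of 16 rows, inner loop fully unrolled
for (int i = 0; i < 5; i++) {
    #pragma unroll
    for (int j = 0; j < 16; j++)
        ptmp += shared[i*16 + j]*B[(16*i + j)*M + colomnID];
}
// remainder: the last 9 rows (80..88)
for (int j = 0; j < 9; j++) {
    ptmp += shared[80 + j]*B[(80 + j)*M + colomnID];
}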

Of course, you can rewrite the index computation so that less work is spent on indexing inside the loop.
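One possible way to simplify the index computation is to keep a running offset into B and add M after each row, instead of recomputing (16*i+j)*M + colomnID every iteration. The following is only a sketch under some assumptions: the kernel name calc_cuda_ptr and the host helper max_abs_error are made up for illustration, A is loaded with a strided loop so any block size works, and "#pragma unroll 16" is used in place of the manual 16*5+9 split. Whether it is actually faster on a given card has to be measured.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define N 89
#define M 48000

// Each thread computes one column. B is walked with a running offset,
// so the loop body contains one addition instead of a multiply-add for indexing.
__global__ void calc_cuda_ptr(const float* A, const float* B, float* result)
{
    __shared__ float shared[N];
    // Strided load of A, valid for any block size (an assumption beyond the original kernel).
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        shared[i] = A[i];
    __syncthreads();

    const int colomnID = blockDim.x * blockIdx.x + threadIdx.x;
    if (colomnID < M)
    {
        int idx = colomnID;          // offset of element (0, colomnID) in row-major B
        float ptmp = 0.f;
        #pragma unroll 16
        for (int i = 0; i < N; i++)
        {
            ptmp += shared[i] * B[idx];
            idx += M;                // move down one row
        }
        result[colomnID] = ptmp;
    }
}

// Hypothetical host helper: runs the kernel on test data and returns the
// largest absolute difference from a plain CPU reference.
float max_abs_error()
{
    std::vector<float> hA(N), hB((size_t)N * M), hR(M);
    for (int i = 0; i < N; i++) hA[i] = 0.01f * (i % 7);
    for (size_t k = 0; k < hB.size(); k++) hB[k] = 0.001f * (k % 13);

    float *dA, *dB, *dR;
    cudaMalloc(&dA, N * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dR, M * sizeof(float));
    cudaMemcpy(dA, hA.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    calc_cuda_ptr<<<(M + threads - 1) / threads, threads>>>(dA, dB, dR);
    cudaMemcpy(hR.data(), dR, M * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dR);

    float worst = 0.f;
    for (int c = 0; c < M; c++)
    {
        float ref = 0.f;
        for (int i = 0; i < N; i++) ref += hA[i] * hB[(size_t)i * M + c];
        worst = fmaxf(worst, fabsf(hR[c] - ref));
    }
    return worst;
}
```

Calling max_abs_error() from main and printing the result gives a quick correctness check before timing the kernel. Consecutive threads still read consecutive addresses of B in each row, so the access pattern stays coalesced, the same as in the original kernel.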