Shared vs Global Memory impl. of vector matrix mulltiplication

I have been implementing a vector matrix c = b*A multiplication using cuda. Well I am simply trying to reconfigure the simple sgemm code from the example. In the first instance I have only worked with global memory. The code is very simple:

int bBegin = __mul24(__mul24(blx,block_dim_x),HEIGHT_A);

	float Csub = 0;

	for (int a = 0, b = bBegin; a < HEIGHT_A; a += BLOCK_DIM_Y, b += BLOCK_DIM_Y) 

	{

 for (int i = 0; i < BLOCK_DIM_Y; i++)

  	Csub += global_b[a + i] * global_A[b + i + _mul24(thx,HEIGHT_A)];	

 __syncthreads();

  

	}

	int c = __mul24(blx,block_dim_x);

	global_c[c + thx] = Csub;

The program works fine if the matrix sizes are a multiple of the block dimensions. I get around 420 ms. Well I thought that extending my code by using shared memory would significantly increase the performance which did not happen! Just the contrary I got 550 ms.

int bBegin = __mul24(__mul24(blx,block_dim_x),HEIGHT_A);

	float Csub = 0;

	for (int a = 0, b = bBegin; a < HEIGHT_A; a += BLOCK_DIM_Y, b += BLOCK_DIM_Y) 

	{

  __shared__ float shared_b[BLOCK_DIM_Y];

 if(thx < BLOCK_DIM_Y)

  	shared_b[thx] = global_b[a + thx];

 __syncthreads();

 for (int i = 0; i < BLOCK_DIM_Y; i++){

  	Csub += shared_b[i] * global_A[b + i + __mul24(thx,HEIGHT_A)];  	

  }

 __syncthreads();

  

	}

	int c = __mul24(blx,block_dim_x);

global_c[c + thx] = Csub;

I tried to increase the block dimension Y, but nothing really works. What is wrong with my code ? Are there any bank conflicts ? Wrong implementation of vector matrix multiplication ?

thanks in advance

Okay I have been trying to optimize my code and what I found out was a little bit weird.

I first implemented the matrix vector multiplication to see if my program is able to provide a speed up and it does provide one around 5, execution time around 4 ms, matrix dimension 5120 x 5120. As the program does not deal with boundary problems and as it simply assumes that the width equals the height of the matrix, it easy to rewrite the code so it does perform a vector matrix multiplication:

#ifndef _MATVECMUL_KERNEL_H_

#define _MATVECMUL_KERNEL_H_

#define THREAD_COUNT (128)

#define XINC (THREAD_COUNT)

#define IINC (4*XINC)

#define IDXA(row,col) (HEIGHT_A*(col)+(row))

__global__ void 

MatVecMul_kernel( float* global_c, float* global_b, float* global_A, const int WIDTH_A, const int HEIGHT_A, const int LEFT_EL_X) 

{	

    __shared__ float shared_b[IINC];

    int jj, i, ii, tid, idx;

	float sdot;	

	tid  = threadIdx.x;

	jj   = blockIdx.x * THREAD_COUNT + tid;

	sdot = 0.0f;

	for (i = 0; i < HEIGHT_A; i += IINC) {

 ii = i + tid;

 __syncthreads();

 shared_b[tid + 0 * XINC] = global_b[ii + 0 * XINC];

  shared_b[tid + 1 * XINC] = global_b[ii + 1 * XINC];

  shared_b[tid + 2 * XINC] = global_b[ii + 2 * XINC];

  shared_b[tid + 3 * XINC] = global_b[ii + 3 * XINC];

 __syncthreads();

 idx = IDXA(jj,i);

  //idx = IDXA(i,jj);     <- uncomment here

  ii = 0;

  while(ii < IINC){

  	

  	sdot += global_A[idx + 0*HEIGHT_A] * shared_b[ii + 0];

  	sdot += global_A[idx + 1*HEIGHT_A] * shared_b[ii + 1];

  	sdot += global_A[idx + 2*HEIGHT_A] * shared_b[ii + 2];

  	sdot += global_A[idx + 3*HEIGHT_A] * shared_b[ii + 3];

  	sdot += global_A[idx + 4*HEIGHT_A] * shared_b[ii + 4];

  	sdot += global_A[idx + 5*HEIGHT_A] * shared_b[ii + 5];

  	sdot += global_A[idx + 6*HEIGHT_A] * shared_b[ii + 6];

  	sdot += global_A[idx + 7*HEIGHT_A] * shared_b[ii + 7];  	

  	ii   += 8;

  	idx  += 8*HEIGHT_A;

  	

  	/*

  	sdot += global_A[idx + 0] * shared_b[ii + 0];  <-- uncomment here

  	sdot += global_A[idx + 1] * shared_b[ii + 1];

  	sdot += global_A[idx + 2] * shared_b[ii + 2];

  	sdot += global_A[idx + 3] * shared_b[ii + 3];

  	sdot += global_A[idx + 4] * shared_b[ii + 4];

  	sdot += global_A[idx + 5] * shared_b[ii + 5];

  	sdot += global_A[idx + 6] * shared_b[ii + 6];

  	sdot += global_A[idx + 7] * shared_b[ii + 7];  	

  	ii   += 8;

  	idx  += 8;

  	*/

  }

  __syncthreads();

  

	}

	global_c[jj] = sdot;

}

#endif // #ifndef _MATVECMUL_KERNEL_H_

If I uncomment the parts and comment the counter parts, then I get correct results but the execution time drops from 4ms to 200ms . :fear: Why is matrix vector multiplication so fast while vector matrix mulitiplication is 50 times slower? Is it the way I acces global memory? :wacko:

Thanks for help.

Cem

You can use the profiler (even better, the visual profiler) to get numbers on how many uncoalesced accesses you have to see if that is the difference between the 2.

Allright, I have tested it with the Cuda Visual Profiler and you are right, I do have uncoalesced access because of the following access:

#define IDXA(row,col) (HEIGHT_A*(col)+(row))

In the first case row is linked with the thread number N, so that the access is performed in a consecutive way: thread number N access the address HalfWarpBaseAddress + N.

In the second case col is linked with the tid N and does meet the requirements of coalesced reading but it jumps like (HalfWarpBaseAddress + N) * HEIGHT_A. I hope this is the right explanation.

Hmm do you guys have a suggestion do meet the requirements while still accessing the matrix elements in the above mentioned way (using texture memory maybe ?? :ermm: ) or do I have to restructure the matrix so that coealesced reading can be done ?

Thanks a lot for your help.

Cem