Shared vs Global Memory impl. of vector matrix multiplication

sicb0161 · February 5, 2008, 4:52pm

I have been implementing a vector-matrix multiplication c = b*A using CUDA. I am simply trying to adapt the simple sgemm code from the example. As a first step I have worked only with global memory. The code is very simple:

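int bBegin = __mul24(__mul24(blx, block_dim_x), HEIGHT_A);
float Csub = 0;

// Walk through b and through this thread's column of A in chunks of BLOCK_DIM_Y elements.
for (int a = 0, b = bBegin; a < HEIGHT_A; a += BLOCK_DIM_Y, b += BLOCK_DIM_Y)
{
    // Neighboring threads (thx, thx + 1) read addresses HEIGHT_A elements apart.
    for (int i = 0; i < BLOCK_DIM_Y; i++)
        Csub += global_b[a + i] * global_A[b + i + __mul24(thx, HEIGHT_A)];

    __syncthreads();
}

int c = __mul24(blx, block_dim_x);
global_c[c + thx] = Csub;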

The program works correctly as long as the matrix sizes are multiples of the block dimensions, and it runs in around 420 ms. I expected that extending the code to use shared memory would significantly improve performance, but that did not happen. On the contrary, the run time went up to 550 ms.

int bBegin = __mul24(__mul24(blx, block_dim_x), HEIGHT_A);
float Csub = 0;

for (int a = 0, b = bBegin; a < HEIGHT_A; a += BLOCK_DIM_Y, b += BLOCK_DIM_Y)
{
    __shared__ float shared_b[BLOCK_DIM_Y];

    // The first BLOCK_DIM_Y threads stage one chunk of b in shared memory.
    if (thx < BLOCK_DIM_Y)
        shared_b[thx] = global_b[a + thx];

    __syncthreads();

    // The reads of A are unchanged: they still go to global memory,
    // with neighboring threads HEIGHT_A elements apart.
    for (int i = 0; i < BLOCK_DIM_Y; i++) {
        Csub += shared_b[i] * global_A[b + i + __mul24(thx, HEIGHT_A)];
    }

    __syncthreads();
}

int c = __mul24(blx, block_dim_x);
global_c[c + thx] = Csub;

I tried increasing the block dimension Y, but nothing really helps. What is wrong with my code? Are there bank conflicts? Is this the wrong way to implement vector-matrix multiplication?

Thanks in advance.

sicb0161 · February 7, 2008, 6:11pm

Okay, I have been trying to optimize my code, and what I found is a little weird.

I first implemented the matrix-vector multiplication to see whether my program can provide a speedup, and it does: a speedup of about 5, with an execution time of around 4 ms for a 5120 x 5120 matrix. Since the program does not handle boundary cases and simply assumes that the width of the matrix equals its height, it is easy to rewrite the code so that it performs a vector-matrix multiplication instead:

#ifndef _MATVECMUL_KERNEL_H_
#define _MATVECMUL_KERNEL_H_

#define THREAD_COUNT (128)
#define XINC (THREAD_COUNT)
#define IINC (4*XINC)
#define IDXA(row,col) (HEIGHT_A*(col)+(row))

__global__ void
MatVecMul_kernel( float* global_c, float* global_b, float* global_A, const int WIDTH_A, const int HEIGHT_A, const int LEFT_EL_X)
{
    __shared__ float shared_b[IINC];

    int jj, i, ii, tid, idx;
    float sdot;

    tid  = threadIdx.x;
    jj   = blockIdx.x * THREAD_COUNT + tid;
    sdot = 0.0f;

    for (i = 0; i < HEIGHT_A; i += IINC) {
        ii = i + tid;

        __syncthreads();
        shared_b[tid + 0 * XINC] = global_b[ii + 0 * XINC];
        shared_b[tid + 1 * XINC] = global_b[ii + 1 * XINC];
        shared_b[tid + 2 * XINC] = global_b[ii + 2 * XINC];
        shared_b[tid + 3 * XINC] = global_b[ii + 3 * XINC];
        __syncthreads();

        idx = IDXA(jj,i);
        //idx = IDXA(i,jj);     <- uncomment here

        ii = 0;
        while(ii < IINC){
            sdot += global_A[idx + 0*HEIGHT_A] * shared_b[ii + 0];
            sdot += global_A[idx + 1*HEIGHT_A] * shared_b[ii + 1];
            sdot += global_A[idx + 2*HEIGHT_A] * shared_b[ii + 2];
            sdot += global_A[idx + 3*HEIGHT_A] * shared_b[ii + 3];
            sdot += global_A[idx + 4*HEIGHT_A] * shared_b[ii + 4];
            sdot += global_A[idx + 5*HEIGHT_A] * shared_b[ii + 5];
            sdot += global_A[idx + 6*HEIGHT_A] * shared_b[ii + 6];
            sdot += global_A[idx + 7*HEIGHT_A] * shared_b[ii + 7];
            ii   += 8;
            idx  += 8*HEIGHT_A;

            /*
            sdot += global_A[idx + 0] * shared_b[ii + 0];  <-- uncomment here
            sdot += global_A[idx + 1] * shared_b[ii + 1];
            sdot += global_A[idx + 2] * shared_b[ii + 2];
            sdot += global_A[idx + 3] * shared_b[ii + 3];
            sdot += global_A[idx + 4] * shared_b[ii + 4];
            sdot += global_A[idx + 5] * shared_b[ii + 5];
            sdot += global_A[idx + 6] * shared_b[ii + 6];
            sdot += global_A[idx + 7] * shared_b[ii + 7];
            ii   += 8;
            idx  += 8;
            */
        }
        __syncthreads();
    }
    global_c[jj] = sdot;
}

#endif // #ifndef _MATVECMUL_KERNEL_H_

If I uncomment the marked parts and comment out their counterparts, I still get correct results, but the execution time goes up from 4 ms to 200 ms. Why is matrix-vector multiplication so fast while vector-matrix multiplication is 50 times slower? Is it the way I access global memory?

Thanks for help.

Cem
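For checking results on the host, the two products can be written as plain loops. This is a minimal CPU sketch, not code from the post: the function names `matVecRef` and `vecMatRef` and the helper `idxA` are made up for illustration, and the matrix is assumed square and stored with the same column-major convention as the `IDXA` macro.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: same column-major layout as IDXA(row,col) in the kernel.
static inline std::size_t idxA(int row, int col, int height) {
    return static_cast<std::size_t>(height) * col + row;
}

// Matrix-vector product c = A * b: c[row] = sum over col of A(row, col) * b[col].
// This matches the kernel variant that starts at IDXA(jj, i) and steps by HEIGHT_A.
std::vector<float> matVecRef(const std::vector<float>& A,
                             const std::vector<float>& b, int n) {
    assert(A.size() == static_cast<std::size_t>(n) * n && b.size() == static_cast<std::size_t>(n));
    std::vector<float> c(n, 0.0f);
    for (int row = 0; row < n; ++row)
        for (int col = 0; col < n; ++col)
            c[row] += A[idxA(row, col, n)] * b[col];
    return c;
}

// Vector-matrix product c = b * A: c[col] = sum over row of b[row] * A(row, col).
// This matches the kernel variant that starts at IDXA(i, jj) and steps by 1.
std::vector<float> vecMatRef(const std::vector<float>& A,
                             const std::vector<float>& b, int n) {
    assert(A.size() == static_cast<std::size_t>(n) * n && b.size() == static_cast<std::size_t>(n));
    std::vector<float> c(n, 0.0f);
    for (int col = 0; col < n; ++col)
        for (int row = 0; row < n; ++row)
            c[col] += b[row] * A[idxA(row, col, n)];
    return c;
}
```

Comparing the GPU output against these loops, with a small tolerance for floating-point summation order, is one way to confirm that both kernel variants compute what they are meant to.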

DenisR · February 7, 2008, 7:13pm

You can use the profiler (or, even better, the visual profiler) to get numbers on how many uncoalesced accesses you have, and see whether that is the difference between the two.

sicb0161 · February 8, 2008, 11:15am

All right, I have tested it with the CUDA Visual Profiler and you are right: I do have uncoalesced accesses, because of the following index mapping:

#define IDXA(row,col) (HEIGHT_A*(col)+(row))

In the first case, row is tied to the thread number N, so the accesses are consecutive: thread N accesses the address HalfWarpBaseAddress + N.

In the second case, col is tied to the thread number N, which does not meet the requirements for coalesced reads: thread N accesses HalfWarpBaseAddress + N * HEIGHT_A, so neighboring threads are HEIGHT_A elements apart. I hope this is the right explanation.
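The difference between the two cases can be made concrete with a small host-side model. This is a simplified sketch, not the exact hardware rule: it only counts how many distinct 64-byte segments the 16 threads of a half-warp touch when each reads one 4-byte float, and the function name `segmentsTouched` is made up for illustration.

```cpp
#include <cassert>
#include <set>

// Counts the distinct 64-byte segments touched when thread tid (0..15) of a
// half-warp reads the float at element index base + tid * strideElems.
// strideElems = 1 models IDXA(jj, i); strideElems = HEIGHT_A models IDXA(i, jj).
int segmentsTouched(long long base, long long strideElems) {
    assert(base >= 0 && strideElems >= 0);
    std::set<long long> segments;
    for (int tid = 0; tid < 16; ++tid) {
        long long byteAddr = (base + tid * strideElems) * 4; // 4 bytes per float
        segments.insert(byteAddr / 64);                      // index of the 64-byte segment
    }
    return static_cast<int>(segments.size());
}
```

With a stride of 1 and an aligned base, all 16 reads fall into a single segment. With a stride of 5120 elements, as in the 5120 x 5120 case, every thread lands in its own segment.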

Do you guys have a suggestion for how to meet the coalescing requirements while still accessing the matrix elements in the way described above (using texture memory, maybe?), or do I have to restructure the matrix so that coalesced reads are possible?

Thanks a lot for your help.

Cem
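One way to read "restructure the matrix" is to transpose it once on the host. Since (b*A)[j] = sum over i of b[i] * A(i, j), which equals (A^T * b)[j], the fast matrix-vector access pattern could then be run on the transposed matrix. The sketch below only shows the transpose itself; it assumes a square matrix in the column-major layout of `IDXA`, the name `transposeSquare` is made up for illustration, and whether the one-off transpose pays for itself depends on how often the matrix is reused.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns T with T(row, col) = A(col, row). Both matrices are n x n and stored
// column-major, i.e. element (row, col) lives at index n * col + row.
std::vector<float> transposeSquare(const std::vector<float>& A, int n) {
    assert(A.size() == static_cast<std::size_t>(n) * n);
    std::vector<float> T(A.size());
    for (int col = 0; col < n; ++col)
        for (int row = 0; row < n; ++row)
            T[static_cast<std::size_t>(n) * col + row] =
                A[static_cast<std::size_t>(n) * row + col];
    return T;
}
```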
