Vector-matrix multiplication

Hi, I wrote a very simple kernel for vector-matrix multiplication. It computes the correct result, but it's very slow. The matrix is stored in global memory and I don't know how to compute the indices so that the global memory accesses are coalesced. I am new to CUDA, so any help would be appreciated.

__global__ static void Multiply(int col, int row, float *dev_mat, float *dev_vec, float *dev_sol)
{
    __shared__ float Xds[BLOCK_SIZE];

    int bx = blockIdx.x;
    int tx = threadIdx.x;
    int Roww = bx * BLOCK_SIZE + tx;

    float Pvalue = 0;

    for (unsigned int m = 0; m < (col-1)/BLOCK_SIZE+1; ++m)
    {
        // stage one tile of the vector in shared memory
        if (m*BLOCK_SIZE + tx < col)
            Xds[tx] = dev_vec[m*BLOCK_SIZE + tx];
        else
            Xds[tx] = 0;
        __syncthreads();  // tile must be complete before any thread reads it

        for (unsigned int k = 0; k < BLOCK_SIZE; k++)
            if (Roww < row && m*BLOCK_SIZE + k < col)
                Pvalue += dev_mat[m*BLOCK_SIZE+Roww*col+k] * Xds[k];
        __syncthreads();  // tile must be consumed before the next iteration overwrites it
    }

    if (Roww < row)
        dev_sol[Roww] = Pvalue;
}


Older versions of the Programming Guide had a Chapter 6 that discussed matrix multiplication. It’s probably just moved elsewhere within the wealth of documentation available, but I don’t know where.

I have read a lot about vector and matrix multiplication, and I think I understand coalesced global memory access in general, but in this particular case I can't figure out how to compute the indices so that the accesses become coalesced.

This line:

Pvalue += dev_mat[m*BLOCK_SIZE+Roww*col+k] * Xds[k];

You need to ensure that both BLOCK_SIZE and col are multiples of 16 (32 for Fermi, but coalescing is less important on Fermi), and that dev_mat, dev_vec and dev_sol are suitably aligned (they are if you obtained them via cudaMalloc()).

BLOCK_SIZE is 32 and col is 8192, so they are multiples of 16. And if coalescing is less important, what should I do to speed it up?

If matrix-vector multiplication really is what you need, there is little you can do. Matrix-vector multiplication is entirely memory bandwidth bound, so once you have sorted out the coalescing, you are close to optimal performance.

To achieve higher performance, the GPU needs to perform higher-level operations. E.g. matrix-matrix multiplication (when performed with a tiling algorithm) is more memory-bandwidth efficient than multiple matrix-vector multiplications.