# Vector matrix multiplication

Hi, I wrote a very simple kernel for vector-matrix multiplication. It computes the correct result, but it's very slow. The matrix is stored in global memory and I can't work out how to compute the indices so that global memory accesses are coalesced. I'm new to CUDA, so any help would be appreciated.

```
__global__ static void Multiply(int col, int row, float *dev_mat, float *dev_vec, float *dev_sol)
{
    __shared__ float Xds[BLOCK_SIZE];

    int tx = threadIdx.x;
    int bx = blockIdx.x;
    int Roww = bx * BLOCK_SIZE + tx;

    float Pvalue = 0;

    for (unsigned int m = 0; m < (col - 1) / BLOCK_SIZE + 1; ++m)
    {
        // Stage one tile of the vector in shared memory
        if (m * BLOCK_SIZE + tx < col)
            Xds[tx] = dev_vec[m * BLOCK_SIZE + tx];
        else
            Xds[tx] = 0;
        __syncthreads();

        for (unsigned int k = 0; k < BLOCK_SIZE; k++)
            if (Roww < row && m * BLOCK_SIZE + k < col)
                Pvalue += dev_mat[m * BLOCK_SIZE + Roww * col + k] * Xds[k];
        __syncthreads();
    }

    if (Roww < row)
        dev_sol[Roww] = Pvalue;
}
```
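For reference, this is the plain CPU version I compare the kernel's output against (a minimal sketch; it assumes the same row-major layout the kernel indexes with `Roww*col + k`):

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for sol = mat * vec, with mat stored row-major (row x col),
   matching the kernel's dev_mat[Roww*col + k] layout. */
static void multiply_ref(int col, int row, const float *mat,
                         const float *vec, float *sol)
{
    for (int r = 0; r < row; ++r) {
        float acc = 0.0f;
        for (int k = 0; k < col; ++k)
            acc += mat[(size_t)r * col + k] * vec[k];
        sol[r] = acc;
    }
}
```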

Older versions of the Programming Guide had a Chapter 6 that discussed matrix multiplication. It’s probably just moved elsewhere within the wealth of documentation available, but I don’t know where.

I've read a lot about vector and matrix multiplication, and I think I understand coalescing of global memory accesses, but in this particular case I can't figure out how to compute the indices so that the accesses are coalesced.
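One way to see the problem is to write out, on the CPU, which global-memory offsets the threads of one warp touch in the same inner-loop iteration. With the kernel's index `m*BLOCK_SIZE + Roww*col + k`, threads with consecutive `tx` land `col` floats apart; if the matrix were stored transposed (column-major), consecutive threads would touch consecutive floats. A small sketch (the transposed layout is my own suggestion of one possible fix, not something from this thread):

```c
#include <assert.h>

#define BLOCK_SIZE 32

/* Offset thread tx of block bx reads in the kernel's inner loop
   (row-major matrix, as in the posted code). */
static long offset_row_major(int col, int bx, int tx, int m, int k)
{
    long roww = (long)bx * BLOCK_SIZE + tx;
    return (long)m * BLOCK_SIZE + roww * col + k;  /* stride col between threads */
}

/* Offset of the same matrix element if the matrix were stored transposed
   (column-major): consecutive tx read consecutive addresses -> coalesced. */
static long offset_col_major(int row, int bx, int tx, int m, int k)
{
    long roww = (long)bx * BLOCK_SIZE + tx;
    return ((long)m * BLOCK_SIZE + k) * row + roww;  /* stride 1 between threads */
}
```

With the transposed layout the inner-loop read would become `dev_mat[(m*BLOCK_SIZE + k)*row + Roww] * Xds[k]`, so each warp issues one coalesced transaction per iteration instead of one transaction per thread.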

This line:

```
Pvalue += dev_mat[m*BLOCK_SIZE+Roww*col+k] * Xds[k];
```

You need to ensure that both `BLOCK_SIZE` and `col` are multiples of 16 (32 for Fermi, but coalescing is less important on Fermi), and that `dev_mat`, `dev_vec` and `dev_sol` are suitably aligned (they are if you obtained them via `cudaMalloc()`).

`BLOCK_SIZE` is 32 and `col` is 8192, so they are multiples of 16. And when coalescing is less important, what should I do to speed it up?

If matrix-vector multiplication really is what you need, there is little you can do: it is entirely memory-bandwidth bound, so once you have sorted out the coalescing, you are close to optimal performance.
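To see why it's bandwidth bound: every matrix element is read exactly once and used for a single multiply-add, so the best achievable runtime is roughly (bytes moved) / (memory bandwidth), regardless of how fast the arithmetic is. A back-of-the-envelope sketch (the 100 GB/s bandwidth figure below is an assumption for illustration, not a measurement of any particular card):

```c
#include <assert.h>

/* Rough lower bound on matrix-vector runtime: traffic is dominated by
   the matrix, which must be read once in full from global memory. */
static double matvec_seconds(long row, long col, double bandwidth_bytes_per_s)
{
    double bytes = (double)row * col * sizeof(float)   /* matrix, read once */
                 + (double)col * sizeof(float)          /* input vector      */
                 + (double)row * sizeof(float);         /* result vector     */
    return bytes / bandwidth_bytes_per_s;
}
```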

To achieve higher performance, the GPU needs to perform higher-level operations. E.g. matrix-matrix multiplication (when performed with a tiling algorithm) is more memory-bandwidth efficient than multiple matrix-vector multiplications.
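The arithmetic behind that claim can be sketched on the CPU: multiplying an N×N matrix by M vectors one at a time re-reads the matrix M times, while a tiled matrix-matrix product with tile size T reads about 2·N²·M/T input elements in total, cutting traffic by roughly a factor of T/2 (a simplified model that ignores the output and edge tiles):

```c
#include <assert.h>

/* Global-memory reads of the large operands, in elements (simplified model). */

/* M separate matrix-vector products: the N x N matrix is re-read per vector. */
static double reads_repeated_matvec(double n, double m)
{
    return m * n * n;
}

/* Tiled N x N by N x M matrix product: each T x T output tile reads a
   T x N strip of A and an N x T strip of B; there are (n/t)*(m/t) tiles. */
static double reads_tiled_matmul(double n, double m, double t)
{
    return (n / t) * (m / t) * 2.0 * t * n;
}
```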