Hi, I created a very simple function for matrix-vector multiplication. It computes the correct result, but it is very slow. The matrix is stored in global memory, and I don't know how to compute the index so that the global memory accesses are coalesced. I am new to CUDA, so any help would be appreciated.
// Computes dev_sol = dev_mat * dev_vec, where dev_mat is row x col, row-major.
// Each thread computes one element of the result; the vector is staged through
// shared memory one BLOCK_SIZE tile at a time.
__global__ static void Multiply(int col, int row, const float* dev_mat, const float* dev_vec, float* dev_sol)
{
    __shared__ float Xds[BLOCK_SIZE];

    int bx = blockIdx.x;
    int tx = threadIdx.x;
    int Roww = bx * BLOCK_SIZE + tx;   // matrix row handled by this thread

    float Pvalue = 0.0f;
    for (unsigned int m = 0; m < (col - 1) / BLOCK_SIZE + 1; ++m)
    {
        // Load one tile of the vector into shared memory (this read is coalesced).
        if (m * BLOCK_SIZE + tx < col)
            Xds[tx] = dev_vec[m * BLOCK_SIZE + tx];
        else
            Xds[tx] = 0.0f;
        __syncthreads();

        // Accumulate the partial dot product for this tile. Note that consecutive
        // threads read dev_mat with a stride of col floats, so these reads are
        // NOT coalesced.
        for (unsigned int k = 0; k < BLOCK_SIZE; k++)
            if (Roww < row && m * BLOCK_SIZE + k < col)
                Pvalue += dev_mat[Roww * col + m * BLOCK_SIZE + k] * Xds[k];

        __syncthreads();   // make sure all threads are done with Xds before the next tile overwrites it
    }

    if (Roww < row)
        dev_sol[Roww] = Pvalue;
}
Older versions of the Programming Guide had a Chapter 6 that discussed matrix multiplication. It has probably just been moved elsewhere within the wealth of available documentation, but I don't know where.
I have read a lot about vector and matrix multiplication, and I think I understand coalesced global memory accesses in general, but in this particular case I can't figure out how to compute the index so that the accesses become coalesced.
You need to ensure that both BLOCK_SIZE and col are multiples of 16 (32 for Fermi, although coalescing is less important on Fermi), and that dev_mat, dev_vec and dev_sol are suitably aligned (they are if you obtained them via cudaMalloc()).
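One common way to make the matrix reads themselves coalesced is to store the matrix transposed, i.e. column-major, so that at each inner-loop step consecutive threads read consecutive addresses. A minimal sketch of that idea, assuming dev_mat now holds element (r, c) of the logical row x col matrix at dev_mat[c * row + r] (the kernel name MultiplyColMajor and this layout are my own, for illustration only):

__global__ static void MultiplyColMajor(int col, int row,
                                        const float* dev_mat,
                                        const float* dev_vec,
                                        float* dev_sol)
{
    __shared__ float Xds[BLOCK_SIZE];

    int Roww = blockIdx.x * BLOCK_SIZE + threadIdx.x;   // output element for this thread

    float Pvalue = 0.0f;
    for (unsigned int m = 0; m < (col - 1) / BLOCK_SIZE + 1; ++m)
    {
        // Stage one tile of the vector in shared memory.
        Xds[threadIdx.x] = (m * BLOCK_SIZE + threadIdx.x < col)
                         ? dev_vec[m * BLOCK_SIZE + threadIdx.x] : 0.0f;
        __syncthreads();

        for (unsigned int k = 0; k < BLOCK_SIZE; k++)
        {
            unsigned int c = m * BLOCK_SIZE + k;
            if (Roww < row && c < col)
                Pvalue += dev_mat[c * row + Roww] * Xds[k];  // stride-1 across the warp: coalesced
        }
        __syncthreads();   // Xds is reused by the next tile
    }

    if (Roww < row)
        dev_sol[Roww] = Pvalue;
}

For full coalescing you would also want row padded to a multiple of 16 so that every column starts on an aligned boundary; cudaMallocPitch() can handle that padding for you.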
If matrix-vector multiplication really is what you need, there is little more you can do. Matrix-vector multiplication is entirely memory bandwidth bound, so once you have sorted out the coalescing, you are close to optimal performance.
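As a rough back-of-the-envelope illustration (my numbers, purely for scale): a 4096x4096 float matrix-vector multiply must read at least 4096 * 4096 * 4 bytes, about 64 MB, while performing only 2 * 4096 * 4096, about 33 million, floating-point operations. On a GPU with, say, 100 GB/s of memory bandwidth, streaming the matrix alone takes roughly 0.64 ms, capping the achievable rate at about 50 GFLOP/s no matter how cleverly the arithmetic is organized.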
To achieve higher performance, the GPU needs to perform higher-level operations. For example, matrix-matrix multiplication (when performed with a tiling algorithm) is far more bandwidth-efficient than a series of matrix-vector multiplications.
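For reference, here is a minimal sketch of such a tiled matrix-matrix kernel (square TILE x TILE tiles; to keep the sketch short it assumes n is a multiple of TILE). The bandwidth saving comes from each input element being loaded from global memory once per tile rather than once per output element:

#define TILE 16

// Sketch: C = A * B, all n x n, row-major, n assumed to be a multiple of TILE.
// Each element of A and B is read from global memory n/TILE times instead of
// n times, so the kernel performs O(TILE) FLOPs per global load.
__global__ void MatMulTiled(int n, const float* A, const float* B, float* C)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    float acc = 0.0f;
    for (int m = 0; m < n / TILE; ++m)
    {
        // Each thread loads one element of the A tile and one of the B tile.
        // threadIdx.x varies fastest along a row, so both loads are coalesced.
        As[threadIdx.y][threadIdx.x] = A[row * n + m * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(m * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();   // tiles are reused on the next iteration
    }

    C[row * n + col] = acc;
}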