Vector-matrix multiplication

Hi, I wrote a very simple kernel for vector-matrix multiplication. It computes the correct result, but it's very slow. The matrix is stored in global memory and I don't know how to compute the indices so that the global memory accesses are coalesced. I am new to CUDA, so any help would be appreciated.

__global__ static void Multiply(int col, int row, float *dev_mat, float *dev_vec, float *dev_sol)
{
    __shared__ float Xds[BLOCK_SIZE];

    int bx = blockIdx.x;
    int tx = threadIdx.x;
    int Roww = bx * BLOCK_SIZE + tx;

    float Pvalue = 0;

    for (unsigned int m = 0; m < (col-1)/BLOCK_SIZE+1; ++m)
    {
        // stage one tile of the vector in shared memory
        if (m*BLOCK_SIZE + tx < col)
            Xds[tx] = dev_vec[m*BLOCK_SIZE + tx];
        else
            Xds[tx] = 0;
        __syncthreads();  // tile must be complete before any thread reads it

        for (unsigned int k = 0; k < BLOCK_SIZE; k++)
            if (Roww < row && m*BLOCK_SIZE + k < col)
                Pvalue += dev_mat[m*BLOCK_SIZE+Roww*col+k] * Xds[k];
        __syncthreads();  // tile must be consumed before the next iteration overwrites it
    }

    if (Roww < row)
        dev_sol[Roww] = Pvalue;
}


Older versions of the Programming Guide had a Chapter 6 that discussed matrix multiplication. It’s probably just moved elsewhere within the wealth of documentation available, but I don’t know where.

I have read a lot about vector and matrix multiplication, and I think I understand coalesced global memory access in general, but in this particular case I can't figure out how to compute the indices so that the accesses become coalesced.

This line:

Pvalue += dev_mat[m*BLOCK_SIZE+Roww*col+k] * Xds[k];

You need to ensure that both BLOCK_SIZE and col are multiples of 16 (32 for Fermi, but coalescing is less important on Fermi), and that dev_mat, dev_vec and dev_sol are suitably aligned (they are if you obtained them via cudaMalloc()).

BLOCK_SIZE is 32 and col is 8192, so they are multiples of 16. And if coalescing is less important, what should I do to speed it up?

If matrix-vector multiplication really is what you need, there is little you can do. Matrix-vector multiplication is entirely memory bandwidth bound, so once you have sorted out the coalescing, you are close to optimal performance.

To achieve higher performance, the GPU needs to perform higher-level operations. E.g. matrix-matrix multiplication (when performed with a tiling algorithm) is more memory-bandwidth efficient than multiple matrix-vector multiplications.