Uncoalesced access on matrix-by-vector multiplication


I am making an application which needs to multiply a (row) vector by a matrix (y = x*A), so I wrote this code:

__global__ void calc(float* mat, float* in, float* out, int tam) {

  int ix = blockIdx.x*blockDim.x + threadIdx.x;

  float ans = 0.f;
  int j = 0;

  // thread ix walks down column ix of mat (row-major, tam x tam)
  for (int i = ix; i < tam*tam; i += tam) {
    ans += mat[i]*in[j];
    j++;
  }

  out[ix] = ans;
}


It is correct; I’ve tested it against several matrices and vectors and it all works fine.

But when I run cudaprof on a small example (64x64), I get the following result: “gld coalesced = 128” and “gld uncoalesced = 2048”. I don’t know where these uncoalesced accesses come from.

Can someone help me?

Thank you,


PS.: The “in” is declared as constant

You will get uncoalesced access on 1.1 hardware any time threads in a half-warp don’t sequentially access contiguous 32/64/128 byte blocks of global memory which are aligned on 32/64/128 byte boundaries.

So (to my rather inexpert eye) your loads of in will never coalesce (every thread in the half-warp reads the same value at each loop iteration), and your loads of mat will only coalesce when the row stride is aligned to 64-byte boundaries.

Thank you for your answer, but from the NVIDIA CUDA Programming Guide:

That’s what I tried to do. It seems like I need to better manage my cache… any suggestion?

Thank you,


Re-read the complete chapter of the programming guide you quoted from, because you seem to have misunderstood it. Your arrays load from and store to global memory, which is uncached. Constant memory and the constant cache have no bearing on your problem.

You can probably coalesce the loads by having each thread perform staged loads into shared memory ahead of a synchronization barrier, then read from shared memory inside the loop. There is only 16k of shared memory per block, so you will have to think carefully about block sizes and (perhaps) about decomposing the operation into a smaller set of sub-calculations which can work inside the shared memory limit. There is a very good set of slides by Mark Harris from NVIDIA, from SC’08, which discuss memory coalescing strategies in detail. You might find it useful to study.
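A sketch of what that staging might look like (my own reworking of the posted kernel, untested; TILE is an assumed tile size, and I assume tam is a multiple of blockDim.x):

```cuda
#define TILE 128  /* floats of "in" staged per pass; tune to fit the 16k shared memory */

__global__ void calc_staged(float* mat, float* in, float* out, int tam) {
    __shared__ float in_s[TILE];
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    float ans = 0.f;

    for (int base = 0; base < tam; base += TILE) {
        /* Stage a tile of "in" into shared memory with coalesced loads:
           consecutive threads load consecutive elements. */
        for (int k = threadIdx.x; k < TILE && base + k < tam; k += blockDim.x)
            in_s[k] = in[base + k];
        __syncthreads();

        /* Consume the tile: reads of in_s hit shared memory, and the
           mat accesses are the same coalesced column walk as before. */
        for (int k = 0; k < TILE && base + k < tam; k++)
            ans += mat[(base + k)*tam + ix] * in_s[k];
        __syncthreads();
    }

    out[ix] = ans;
}
```

The two __syncthreads() calls are safe here because every thread in the block executes the same number of outer-loop iterations (the loop bounds depend only on tam).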