Coalescing memory accesses: need help with coalescing


I have 6 arrays (up to 16384 elements each) that my kernel reads but never modifies. To coalesce the accesses to them, I copy them into shared memory. But when I profile the kernel, I still see the same number of uncoalesced global memory loads. Can someone please tell me what I am doing wrong? (I'm pretty new to CUDA.)

Example code :

__shared__ float sa[512];
__shared__ float sb[512];
__shared__ float sc[512];
__shared__ float sd[512];
__shared__ float se[512];
__shared__ float sf[512];

int idx = blockIdx.x * blockDim.x + threadIdx.x;

sa[threadIdx.x] = a[idx];
sb[threadIdx.x] = b[idx];
sc[threadIdx.x] = c[idx];
sd[threadIdx.x] = d[idx];
se[threadIdx.x] = e[idx];
sf[threadIdx.x] = f[idx];

NVIDIA GeForce 8600 GT

CUDA 2.1

You don’t need to worry about coalescing when accessing shared memory. Coalescing only has to do with accessing global memory.

int idx = blockIdx.x * blockDim.x + threadIdx.x;
sa[threadIdx.x] = a[idx];

This will coalesce reading from array a if:

  1. array a starts at an address aligned to 16 elements (64 bytes for floats), and

  2. blockDim.x is a multiple of 16
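If you want to sanity-check those two conditions on the host before launching, something like the rough sketch below might help. It is plain C++ that should also compile inside a .cu file. The helper name meets_coalescing_conditions and the constant kHalfWarp are made up for illustration, and the 16 assumes a compute capability 1.x card like the 8600 GT.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 16 threads are serviced together (a half-warp) on
// compute capability 1.x hardware such as the 8600 GT.
constexpr std::size_t kHalfWarp = 16;

// Hypothetical helper, not part of the CUDA API.
// Checks the two conditions for a float array read as
// a[blockIdx.x * blockDim.x + threadIdx.x]:
//   1. the array's base address is aligned to 16 floats (64 bytes)
//   2. blockDim.x is a multiple of 16
inline bool meets_coalescing_conditions(std::uintptr_t base_addr,
                                        unsigned block_dim_x) {
    assert(block_dim_x > 0);  // a launch needs at least one thread per block
    const bool base_aligned = base_addr % (kHalfWarp * sizeof(float)) == 0;
    const bool block_ok = block_dim_x % kHalfWarp == 0;
    return base_aligned && block_ok;
}
```

You would pass it the device pointer cast to an integer, e.g. meets_coalescing_conditions(reinterpret_cast<std::uintptr_t>(a), 512). It only checks the starting address and block size, so it says nothing about other access patterns inside the kernel.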

Arrays allocated with cudaMalloc are always aligned to at least 256 bytes, so your code should coalesce. If you are allocating all six arrays in one big block and slicing it into pieces, then any array after the first might not start on a multiple of 16 elements, and that would make its accesses uncoalesced.
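If you do go the one-big-allocation route, one way to keep every slice aligned is to round each array's length up to a multiple of 16 elements when computing where the next one starts. Here is a minimal host-side sketch in plain C++ (it should compile unchanged in a .cu file). The names round_up, padded_offsets and kAlignElems are hypothetical, not anything from the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: slices must start on a 16-element boundary (64 bytes for floats).
constexpr std::size_t kAlignElems = 16;

// Round n up to the next multiple of `multiple`.
inline std::size_t round_up(std::size_t n, std::size_t multiple) {
    assert(multiple > 0);
    return (n + multiple - 1) / multiple * multiple;
}

// Given the length of each array (in elements), return the element offset
// at which each slice should start inside one big allocation.
// The final entry is the total number of elements to allocate.
inline std::vector<std::size_t> padded_offsets(
        const std::vector<std::size_t>& lengths) {
    std::vector<std::size_t> offsets;
    std::size_t next = 0;
    for (std::size_t len : lengths) {
        offsets.push_back(next);              // this slice starts aligned
        next += round_up(len, kAlignElems);   // pad so the next one does too
    }
    offsets.push_back(next);                  // total padded size
    return offsets;
}
```

With the result in offsets, you would allocate offsets.back() * sizeof(float) bytes with a single cudaMalloc and take base + offsets[i] as the start of array i. Since the base pointer is itself aligned, every slice then starts on a 64-byte boundary.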

Thanks for the information :). I discovered that there was one variable I forgot to move to shared memory, and that was causing the uncoalesced global accesses.