I have 6 arrays (each up to 16384 elements long) that I need to read in my kernel and use without modifying them. To coalesce the memory accesses to them, I copied them into shared memory. But when I profile the kernel, I still see the same number of uncoalesced global memory loads. Can someone please guide me on what I am doing wrong? (I'm pretty new to CUDA.)
array a is aligned to 16 elements (64 byte aligned for floats), and
blockDim.x is a multiple of 16
Arrays allocated with cudaMalloc are always aligned to at least 256 bytes, so if those conditions hold your code should coalesce. If you are allocating all six arrays in one big block and slicing it into pieces, then the arrays other than the first one might not start at a multiple of 16 elements, and this could cause accesses to them to be uncoalesced.
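If you do go the one-big-block route, here is a rough host-side sketch of how the slices could be padded so that each one starts on a 16-element boundary. The helper names `round_up` and `padded_offsets` are made up for illustration and are not part of the CUDA API; the 16-element default just mirrors the alignment mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: round n up to the next multiple of `multiple`.
inline std::size_t round_up(std::size_t n, std::size_t multiple) {
    assert(multiple > 0);
    return (n + multiple - 1) / multiple * multiple;
}

// Hypothetical helper: given the length of each array (in elements),
// return the starting offset of each slice inside one big allocation,
// padded so every offset is a multiple of align_elems.
// The final entry is the total number of elements to allocate.
inline std::vector<std::size_t> padded_offsets(
        const std::vector<std::size_t>& lengths,
        std::size_t align_elems = 16) {
    std::vector<std::size_t> offsets;
    offsets.reserve(lengths.size() + 1);
    std::size_t cursor = 0;
    for (std::size_t len : lengths) {
        offsets.push_back(cursor);                     // slice starts here
        cursor = round_up(cursor + len, align_elems);  // pad up to the boundary
    }
    offsets.push_back(cursor);                         // total padded size
    return offsets;
}
```

You would then pass `offsets.back() * sizeof(float)` to cudaMalloc and use `base + offsets[i]` as the pointer for array i, so each slice keeps the alignment that the base pointer already has.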
Thanks for the information :) I discovered that there was one variable I had forgotten to copy to shared memory, and that was what caused the uncoalesced global accesses.