Global Memory Coalescing Help

Hi everyone,

I’m attempting to coalesce the global memory accesses in a kernel.

The thread block size is 16x16.

v is an array of floats of size 64x64 = 4096.

sv is an array in shared memory, of size 17x17 = 289.

At the beginning of the kernel, I do this:

i = (blockDim.x * blockIdx.x + threadIdx.x) + (blockDim.y * blockIdx.y + threadIdx.y) * 64

j = threadIdx.x + (blockDim.x+1) * threadIdx.y

sv[j] = v[i]

if (threadIdx.x == blockDim.x - 1)

    sv[j+1] = v[i+1]

if (threadIdx.y == blockDim.y - 1)

    sv[j+17] = v[i+64]

Of the 3 memory accesses, the first one and the last one are coalesced. However, I can’t think of any way to coalesce the second access.

Thanks!

It’s hard to explain, but here are my quick thoughts:

Find the formula that relates i and j, and only access global memory when threadIdx.x % 16 == globalIndex % 16, where globalIndex is the index into v that the thread reads. To do this, you just need to offset the writes into shared memory differently, so that a different thread loads the extra column.

The second access only runs when threadIdx.x == blockDim.x - 1, that is, when threadIdx.x == 15.
It is uncoalesced because, for that thread,
threadIdx.x % 16 == 15
and
i % 16 == 15
so the read of v[i] lines up with the thread, but
(i + 1) % 16 == 0
so the read of v[i+1] does not.
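
The arithmetic above can be checked on the host without a GPU. This is only a minimal sketch, assuming the 64-wide array and 16x16 blocks from the question; globalIndex and laneOffset are hypothetical helper names, not part of the original kernel or of CUDA.

```cpp
// Host-side check of the alignment argument (plain C++, also valid under nvcc).
// Assumes the question's layout: a 64-wide array and 16x16 thread blocks.
constexpr int WIDTH = 64;
constexpr int BLOCK = 16;

// Global index i computed by thread (tx, ty) of block (bx, by).
constexpr int globalIndex(int bx, int by, int tx, int ty) {
    return (BLOCK * bx + tx) + (BLOCK * by + ty) * WIDTH;
}

// Position of an index within its aligned group of 16 floats.
constexpr int laneOffset(int index) { return index % BLOCK; }

// Thread 15 reads v[i] from slot 15, but v[i + 1] falls in slot 0.
static_assert(laneOffset(globalIndex(1, 2, 15, 3)) == 15, "v[i] lines up");
static_assert(laneOffset(globalIndex(1, 2, 15, 3) + 1) == 0, "v[i+1] does not");
```

The same holds for every block and row, since both 16 * blockIdx.x and 64 * (row index) are multiples of 16.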

The thread with threadIdx.x == 0 should be the one writing into that position in shared memory, if I understand correctly.
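
One way this suggestion might look in code is sketched below. It is not the original poster's kernel: the name loadTile, the helpers haloGlobal and haloShared, and both edge checks are assumptions added for illustration (the code in the question has no bounds checks). The idea is that thread 0 of each row reads v[i+16], which is the same element thread 15 was reading as v[i+1], and stores it at sv[j+16].

```cpp
// CUDA C++ sketch (build with nvcc). The guards below only exist so the
// index helpers can also be compiled and checked as plain host C++.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

constexpr int WIDTH  = 64;        // row length of v, from the question
constexpr int HEIGHT = 64;        // 4096 / 64 rows
constexpr int TILE   = 16;        // blockDim.x == blockDim.y == 16
constexpr int PITCH  = TILE + 1;  // row length of the 17x17 shared tile

// Hypothetical helpers: halo-column indices derived from thread 0's i and j.
HOST_DEVICE inline int haloGlobal(int i) { return i + TILE; }
HOST_DEVICE inline int haloShared(int j) { return j + TILE; }

#ifdef __CUDACC__
__global__ void loadTile(const float *v)
{
    __shared__ float sv[PITCH * PITCH];

    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y + threadIdx.y;
    int i = x + y * WIDTH;
    int j = threadIdx.x + PITCH * threadIdx.y;

    sv[j] = v[i];

    // Right halo column: loaded by thread 0 of each row instead of thread 15,
    // so the address satisfies (i + 16) % 16 == threadIdx.x % 16 == 0.
    // The x bound is an added assumption; the original has no edge check.
    if (threadIdx.x == 0 && x + TILE < WIDTH)
        sv[haloShared(j)] = v[haloGlobal(i)];

    // Bottom halo row, as in the question, with an assumed edge check.
    if (threadIdx.y == blockDim.y - 1 && y + 1 < HEIGHT)
        sv[j + PITCH] = v[i + WIDTH];

    __syncthreads();
    // ... the rest of the kernel would read sv here ...
}
#endif
```

Whether this counts as coalesced depends on the hardware generation. The rule used in this thread (thread k of a half-warp reads word k of an aligned 16-word segment) is the strict one for early devices; the transaction still fetches a whole segment for a single float per row.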