I’m attempting to coalesce the global memory access for a kernel.
The thread block size is 16x16.
v is an array of floats of size 64x64 = 4096.
sv is an array in shared memory, of size 17x17 = 289.
At the beginning of the kernel, I do this:
    i = (blockDim.x * blockIdx.x + threadIdx.x) + (blockDim.y * blockIdx.y + threadIdx.y) * 64
    j = threadIdx.x + (blockDim.x+1) * threadIdx.y
    sv[j] = v[i]
    if (threadIdx.x == blockDim.x - 1)
        sv[j+1] = v[i+1]
    if (threadIdx.y == blockDim.y - 1)
        sv[j+17] = v[i+64]
Of the 3 memory accesses, the first one and the last one are coalesced. However, I can’t think of any way to coalesce the second access.
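For reference, here is roughly what the load phase looks like as a complete kernel. This is only a sketch: the kernel name `loadTile`, the macros `N`, `TILE` and `PITCH`, and the bounds guards on the two halo loads are assumptions added for illustration and are not part of the snippet above. The guards only keep the reads inside the 64x64 array if the grid covers all of it.

```cuda
#define N     64           // width and height of v (64x64 = 4096 floats)
#define TILE  16           // block is TILE x TILE threads
#define PITCH (TILE + 1)   // row pitch of the shared tile (17)

// Hypothetical kernel name; only the load into shared memory is shown.
__global__ void loadTile(const float *v)
{
    __shared__ float sv[PITCH * PITCH];   // 17x17 = 289 floats

    int x = blockDim.x * blockIdx.x + threadIdx.x;   // global column
    int y = blockDim.y * blockIdx.y + threadIdx.y;   // global row
    int i = x + y * N;                               // index into v
    int j = threadIdx.x + (blockDim.x + 1) * threadIdx.y;   // index into sv

    // Access 1: threads with consecutive threadIdx.x read consecutive
    // addresses of v.
    sv[j] = v[i];

    // Access 2: only the last thread of each row loads the extra column,
    // so the participating threads read addresses N floats apart.
    // The "x + 1 < N" guard is an assumption added for this sketch.
    if (threadIdx.x == blockDim.x - 1 && x + 1 < N)
        sv[j + 1] = v[i + 1];

    // Access 3: the threads of the last row read consecutive addresses
    // from the row below the tile.
    // The "y + 1 < N" guard is an assumption added for this sketch.
    if (threadIdx.y == blockDim.y - 1 && y + 1 < N)
        sv[j + PITCH] = v[i + N];

    __syncthreads();   // tile is loaded; the rest of the kernel would follow
}
```

With a 4x4 grid of 16x16 blocks this would be launched as `loadTile<<<dim3(4, 4), dim3(16, 16)>>>(d_v)`, where `d_v` is a hypothetical device pointer to the 4096 floats.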