I wrote the following code, in which both src and t_src point to global memory. The CUDA profiler reports uncoalesced accesses, but I don't understand why.
int tid = threadIdx.x * 16 + threadIdx.y;
if(tid != 4)
t_src[tid] = src[tid];
By the way, there is only 1 block, which contains 16*16 threads.
I also tried "tid = threadIdx.y * 16 + threadIdx.x", but the profiler still reported uncoalesced accesses.
The warp containing the thread with threadIdx.x == 0 and threadIdx.y == 4 will make no memory access for that thread, so that warp gets reported as having an uncoalesced access.
But according to the programming guide, a global access is still coalesced if the threads of a warp access a contiguous section of memory in order, even if some of the threads make no access at all.
Is your block size 16xN?
Check this from the guide:
A common global memory access pattern is when each thread of thread ID tid
accesses one element of an array located at address BaseAddress of type type*
using the following address:
BaseAddress + tid
To get memory coalescing, type must meet the size and alignment requirements
discussed above. In particular, this means that if type is a structure larger than 16
bytes, it should be split into several structures that meet these requirements and the
data should be laid out in memory as a list of several arrays of these structures
instead of a single array of type type*.
Another common global memory access pattern is when each thread of index
(tx,ty) accesses one element of a 2D array located at address BaseAddress of
type type* and of width width using the following address:
BaseAddress + width * ty + tx
In such a case, one gets memory coalescing for all half-warps of the thread block only if:
- The width of the thread block is a multiple of half the warp size;
- width is a multiple of 16.