confusions about coalesce access

Hello!
I wrote the following code, in which both src and t_src are global memory. And the cuda profiler told me that non-coalesce exists, but I didn’t know why.
int tid = threadIdx.x * 16 + threadIdx.y;
if(tid != 4)
{
t_src[tid] = src[tid];
}
btw, there is only 1 block which contains 16*16 threads.
In addition, I also tried"tid = threadIdx.y * 16 + threadIdx.x" , but failed.
Thanks!

the warp for which threadIdx.x == 0 and threadIdx.y == 4 will have a missing memory access. So that warp will have an uncoalesced access.

But according to programming guide, global access is still coalesced if all threads in a warp access a continuous section of memory in order, even though some threads do not access memory.

Is your blocksize 16xN ??

Check this from the guide:

A common global memory access pattern is when each thread of thread ID tid

accesses one element of an array located at address BaseAddress of type type*

using the following address:

 BaseAddress + tid

To get memory coalescing, type must meet the size and alignment requirements

discussed above. In particular, this means that if type is a structure larger than 16

bytes, it should be split into several structures that meet these requirements and the

data should be laid out in memory as a list of several arrays of these structures

instead of a single array of type type*.

Another common global memory access pattern is when each thread of index

(tx,ty) accesses one element of a 2D array located at address BaseAddress of

type type* and of width width using the following address:

 BaseAddress + width * ty + tx

In such a case, one gets memory coalescing for all half-warps of the thread block

only if:

The width of the thread block is a multiple of half the warp size;

width is a multiple of 16.