Tiny bit of uncoalesced access detected that I can't explain

I am using Nsight Compute from CUDA 11.2 to analyze a kernel.

I get a warning about uncoalesced access. The discrepancy is tiny, 131,072 sectors instead of the expected 131,070, but I wonder where it comes from.

All three lines that trigger the warning access global memory with a well-defined index pattern,
blockIdx.x * blockDim.x + threadIdx.x, so it’s strange.

Why could the sector count be off by just 2?

If I had to guess, it’s the expected value that is off, because it’s not a round number in hex:

>>> hex(131070)
'0x1fffe'
>>> hex(131072)
'0x20000'

I launch 16384 blocks of 128 threads each, and I am on Ubuntu with an RTX 3070. The offending code reads __half values, by the way.
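
As a rough sanity check on those numbers, here is the sector count I would expect for a fully coalesced read. This is only a sketch under two assumptions that are not confirmed by the profiler output: each thread reads exactly one 2-byte __half per flagged line, and global memory is served in 32-byte sectors.

```python
# Assumptions: one 2-byte __half read per thread per flagged line,
# and 32-byte memory sectors.
BLOCKS = 16384            # grid size from the launch
THREADS_PER_BLOCK = 128   # block size from the launch
BYTES_PER_HALF = 2        # sizeof(__half)
SECTOR_BYTES = 32         # assumed sector granularity

threads = BLOCKS * THREADS_PER_BLOCK       # total threads in the grid
bytes_read = threads * BYTES_PER_HALF      # bytes touched by one flagged line
ideal_sectors = bytes_read // SECTOR_BYTES # sectors if perfectly coalesced

print(threads, bytes_read, ideal_sectors)  # 2097152 4194304 131072
```

If those assumptions hold, the measured 131,072 is exactly the fully coalesced count, which would support the guess that the expected 131,070 is the number that is off.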

The value of 131070 sectors appears to be the issue. Could you please share the CUDA code so that we can reproduce it?