1 coalesced global memory load = 16 loads?


The ComputeProf help manual says:
“When simultaneous global memory accesses by threads in a half-warp
(during the execution of a single read or write instruction) can be
combined into a single memory transaction of 32, 64, or 128 bytes it
is called a coalesced access. If the global memory access by all
threads of a half-warp do not fulfill the coalescing requirements it
is called a non-coalesced access and a separate memory transaction is
issued for each thread and throughput is significantly reduced.”
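
To make the quoted rule concrete, here is a rough Python sketch of how I read it. It is a toy model based on my own assumptions, not the profiler's or the hardware's actual logic: the function name `count_transactions` and the 64-byte segment size are hypothetical choices, and the real coalescing requirements (alignment, access order, word size) are stricter and depend on the device's compute capability.

```python
# Toy model (an assumption, not the real hardware rule): a half-warp access
# counts as coalesced when all 16 addresses fall inside one aligned segment.
# Otherwise, following the manual's wording, each thread gets its own
# memory transaction.

HALF_WARP = 16  # threads per half-warp


def count_transactions(addresses, segment_bytes=64):
    """Return the number of memory transactions under this toy model."""
    segments = {addr // segment_bytes for addr in addresses}
    if len(segments) == 1:
        return 1               # all accesses combined into one transaction
    return len(addresses)      # one separate transaction per thread


# 16 threads reading consecutive 4-byte words from an aligned base address
coalesced = [4 * tid for tid in range(HALF_WARP)]

# 16 threads reading 4-byte words that sit 64 bytes apart
strided = [64 * tid for tid in range(HALF_WARP)]

print(count_transactions(coalesced))  # prints 1
print(count_transactions(strided))    # prints 16
```

The sketch only shows how 16 per-thread accesses can map to either 1 or 16 transactions; it says nothing about which of those numbers the profiler counter actually reports.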

When the 16 threads of a half-warp each issue a global memory access (read or write) and the 16
accesses are combined into one coalesced global memory access, I believe that the profiler reports it
as 1 coalesced access rather than 16.

Am I correct?

Thank you.