Profiling coalesced memory accesses confusion

I’m using a simple matrix mult kernel to test global memory access patterns for efficiency.
Based on Kirk & Hwu, Programming Massively Parallel Processors, 3e, they have a criterion for determining if access will be coalesced, which is stated as:

Accesses in a warp are to consecutive locations if the index in an array access is in the form of
A[ (expression with terms independent of threadIdx.x) + threadIdx.x ];

Here’s my mmult kernel, with three different access pattern cases: original, pattern-1, and pattern-2.
I only uncomment one per build, and then profile with Nsight Compute.

global void matrixMulGPU( int * a, int * b, int * c )
int row = threadIdx.y + blockIdx.y * blockDim.y;
int col = threadIdx.x + blockIdx.x * blockDim.x;

int val = 0;
if( row < N && col < N )
    for(int k = 0; k < N; ++k)
      //val += a[row*N + k] * b[k*N + col];  // Original, correct access pattern
  //val += a[row*N + k] * b[row*N + k];  // test-only, Pattern-1
  val += a[k*N + col] * b[k*N + col];    // test-only, Pattern-2
    c[row*N + col] = val;


I’m launching this with threadsPerBlock = {32,16,0}

In Warp State Statistics I see: Avg. Active Threads Per warp = 32
In the Memory Workload Analysis Chart section,
L1/TEX Cache
Global Load, Sectors/Req = 2.5 (for ORIGINAL pattern)
Global Load, Sectors/Req = 1 (for Pattern-1)
Global Load, Sectors/Req = 4 (for Pattern-2)

BUT based on the Kirk/Hwu criterion, I would expect that Pattern-2 would have the best coalesced access,
because the col variable is the one isolating the threadIdx.x term.
This doesnt seem to be what Nsight Compute is saying, which is that Pattern-1 has the best memory access because less requests per sector is better…
Am I overlooking something?

A fully coalesced 4 bytes memory access with 32 active threads per warp, has an expected value of 4 for metric sectors/request. The total amount of data requested is 32*4bytes = 128 bytes. As the access is coalesced and for this discussion aligned with the sector boundaries, 128 bytes equal 4 consecutive 32 byte sectors in memory.

|--------| |--------| |--------| |--------| (sectors)

And as stated, this is indeed the optimal case, if all 32 threads are active and have unique addresses. However, an access can have still fewer sectors/request if A) less threads are active or B) the addresses are not unique. In the given examples, the latter case is true for some of the accesses. First, let us consider the case where all 32 threads are active, and all read from the very same address. The equivalent pattern to the above example would look this this:

|--------| |--------| |--------| |--------| (sectors)
 X                                          (thread accesses)

Now we only need to access a single sector as all 32 threads requested the same index in the data array. Therefore, sectors/request is 1.

In your example code, the accesses with index calculation row*N + k fall in this category. row is defined as threadIdx.y + blockIdx.y * blockDim.y. That is constant for any given warp of 32 threads as your block size is {32,16,0}, i.e., the threads of a warp have different x-indices, but the same y-index. N is likely constant. And the loop counter k is also the same for the 32 threads in a warp. Consequently, the address for all 32 threads in the warp is identical. And Nsight Compute reports the correct 1 sector/request. The index calculation k*N + col leads to a fully coalesced memory pattern with 32 threads active. Sectors/request for that is 4. This leads to the following three cases in your example:

Original access: (a[row*N + k] = 1s/r) + (b[k*N + col] = 4s/r) = 5sectors/2request = 2.5s/r
Pattern-1:       (a[row*N + k] = 1s/r) + (b[row*N + k] = 1s/r) = 2sectors/2request = 1.0s/r
Pattern-2:       (a[k*N + col] = 4s/r) + (b[k*N + col] = 4s/r) = 8sectors/2request = 4.0s/r