Profiling coalesced memory accesses confusion

I’m using a simple matrix multiplication kernel to test how efficient different global memory access patterns are.
Kirk & Hwu, Programming Massively Parallel Processors, 3rd ed., give a criterion for determining whether an access will be coalesced, stated as:

Accesses in a warp are to consecutive locations if the index in an array access is in the form of
A[ (expression with terms independent of threadIdx.x) + threadIdx.x ];
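For instance (not from the book; out and N are just placeholders to illustrate the form), with one thread per output element:

  out[blockIdx.x * blockDim.x + threadIdx.x] = 0;  // fits the form: consecutive addresses across the warp
  out[threadIdx.x * N] = 0;                        // threadIdx.x scaled by N: strided, not coalesced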

Here’s my mmult kernel, with three different access-pattern cases: Original, Pattern-1, and Pattern-2.
I uncomment only one of them per build and then profile with Nsight Compute.

__global__ void matrixMulGPU( int * a, int * b, int * c )
{
  int row = threadIdx.y + blockIdx.y * blockDim.y;
  int col = threadIdx.x + blockIdx.x * blockDim.x;

  int val = 0;
  if( row < N && col < N )
  {
    for( int k = 0; k < N; ++k )
      //val += a[row*N + k] * b[k*N + col];  // Original, correct access pattern
      //val += a[row*N + k] * b[row*N + k];  // test-only, Pattern-1
      val += a[k*N + col] * b[k*N + col];    // test-only, Pattern-2

    c[row*N + col] = val;
  }
}

I’m launching this with threadsPerBlock = {32,16,0}.
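
The host side is roughly this (simplified sketch; N = 1024, managed memory, and the grid math are just for illustration — the block is the 32x16 shape stated above, with z defaulting to 1):

#include <cuda_runtime.h>
#define N 1024                        // assumed matrix dimension

// ... matrixMulGPU kernel from above ...

int main()
{
    int *a, *b, *c;
    size_t bytes = (size_t)N * N * sizeof(int);
    cudaMallocManaged(&a, bytes);     // initialization and error checks omitted for brevity
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);

    dim3 threadsPerBlock(32, 16);     // a warp spans 32 consecutive threadIdx.x values
    dim3 numberOfBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                        (N + threadsPerBlock.y - 1) / threadsPerBlock.y);

    matrixMulGPU<<<numberOfBlocks, threadsPerBlock>>>(a, b, c);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}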

In Warp State Statistics I see: Avg. Active Threads Per warp = 32
In the Memory Workload Analysis chart, under L1/TEX Cache, I see:
Global Load, Sectors/Req = 2.5  (Original pattern)
Global Load, Sectors/Req = 1    (Pattern-1)
Global Load, Sectors/Req = 4    (Pattern-2)

But based on the Kirk/Hwu criterion, I would expect Pattern-2 to have the best coalesced access,
because col is the variable that isolates the threadIdx.x term.
That doesn’t seem to be what Nsight Compute is saying: it suggests Pattern-1 has the best memory access, since fewer sectors per request is supposedly better…
Am I overlooking something?
Thanks.

A fully coalesced 4-byte memory access with 32 active threads per warp has an expected value of 4 for the sectors/request metric. The total amount of data requested is 32 * 4 bytes = 128 bytes. As the access is coalesced and, for this discussion, aligned with the sector boundaries, those 128 bytes span 4 consecutive 32-byte sectors in memory.

|--------| |--------| |--------| |--------| (sectors)
 XXXXXXXX   XXXXXXXX   XXXXXXXX   XXXXXXXX  (thread accesses)

And as stated, this is indeed the optimal case if all 32 threads are active and have unique addresses. However, an access can still have fewer sectors/request if (A) fewer threads are active or (B) the addresses are not unique. In the given examples, the latter is true for some of the accesses. First, let us consider the case where all 32 threads are active and all read from the very same address. The equivalent pattern to the above example would look like this:

|--------| |--------| |--------| |--------| (sectors)
 X                                          (thread accesses)

Now we only need to access a single sector as all 32 threads requested the same index in the data array. Therefore, sectors/request is 1.

In your example code, the accesses with index calculation row*N + k fall into this category. row is defined as threadIdx.y + blockIdx.y * blockDim.y. That is constant for any given warp of 32 threads, as your block size is {32,16,0}, i.e., the threads of a warp have different x-indices but the same y-index. N is likely constant, and the loop counter k is also the same for the 32 threads in a warp. Consequently, the address is identical for all 32 threads in the warp, and Nsight Compute reports the correct 1 sector/request.

The index calculation k*N + col, on the other hand, leads to a fully coalesced memory pattern with 32 active threads; sectors/request for that is 4. This leads to the following three cases in your example:

Original access: (a[row*N + k] = 1 s/r) + (b[k*N + col] = 4 s/r) = 5 sectors / 2 requests = 2.5 s/r
Pattern-1:       (a[row*N + k] = 1 s/r) + (b[row*N + k] = 1 s/r) = 2 sectors / 2 requests = 1.0 s/r
Pattern-2:       (a[k*N + col] = 4 s/r) + (b[k*N + col] = 4 s/r) = 8 sectors / 2 requests = 4.0 s/r
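
If you want to sanity-check these numbers without the profiler, a small host-side sketch (illustration only, not the profiler's exact accounting) can count the distinct 32-byte sectors one warp touches for each index calculation. It assumes 4-byte int elements, sector-aligned arrays, and N = 1024:

#include <cstdio>
#include <set>

int main()
{
    const int N = 1024;               // assumed matrix dimension
    const int row = 0, colBase = 0;   // first warp of the first block
    const int k = 0;                  // one fixed loop iteration

    std::set<long> sectorsA, sectorsB;
    for (int tx = 0; tx < 32; ++tx)   // the 32 threads of one warp differ only in threadIdx.x
    {
        long idxA = (long)row * N + k;             // a[row*N + k] : identical for every thread
        long idxB = (long)k * N + (colBase + tx);  // b[k*N + col] : consecutive per thread
        sectorsA.insert(idxA * (long)sizeof(int) / 32);  // map byte offset to 32-byte sector id
        sectorsB.insert(idxB * (long)sizeof(int) / 32);
    }

    printf("a[row*N + k] : %zu sector(s) for the warp's request\n", sectorsA.size());  // prints 1
    printf("b[k*N + col] : %zu sector(s) for the warp's request\n", sectorsB.size());  // prints 4
    return 0;
}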