I’m using a simple matrix multiplication kernel to test the efficiency of different global memory access patterns.
Kirk & Hwu, Programming Massively Parallel Processors, 3rd ed., give a criterion for determining whether accesses will be coalesced, stated as:
Accesses in a warp are to consecutive locations if the index in an array access is in the form of
A[ (expression with terms independent of threadIdx.x) + threadIdx.x ];
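To make sure I’m reading that correctly, here’s a toy kernel showing how I interpret the rule (my own example and naming, not from the book):

__global__ void accessExample( int * A, int * B, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n )
    {
        // Coalesced per the criterion:
        // index = (terms independent of threadIdx.x) + threadIdx.x
        int coalesced = A[ blockIdx.x * blockDim.x + threadIdx.x ];

        // Not coalesced: threadIdx.x is scaled by n, so adjacent lanes in a
        // warp read locations n elements apart (assumes A holds >= n*n ints)
        int strided = A[ threadIdx.x * n + blockIdx.x ];

        B[i] = coalesced + strided;
    }
}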
Here’s my mmult kernel, with three different access-pattern cases: Original, Pattern-1, and Pattern-2.
I uncomment only one of them per build, then profile with Nsight Compute.
__global__ void matrixMulGPU( int * a, int * b, int * c )
{
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int val = 0;

    if( row < N && col < N )
    {
        for( int k = 0; k < N; ++k )
            //val += a[row*N + k] * b[k*N + col];   // Original, correct access pattern
            //val += a[row*N + k] * b[row*N + k];   // test-only, Pattern-1
            val += a[k*N + col] * b[k*N + col];     // test-only, Pattern-2

        c[row*N + col] = val;
    }
}
I’m launching this with threadsPerBlock = {32, 16, 1}.
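For completeness, the host-side launch is roughly the following (the grid-size arithmetic is my usual boilerplate reproduced from memory; a, b, c are device pointers allocated elsewhere and N is a compile-time constant):

dim3 threadsPerBlock( 32, 16, 1 );   // 512 threads per block = 16 warps
dim3 numberOfBlocks( (N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                     (N + threadsPerBlock.y - 1) / threadsPerBlock.y,
                     1 );
matrixMulGPU<<< numberOfBlocks, threadsPerBlock >>>( a, b, c );
cudaDeviceSynchronize();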
In the Warp State Statistics section I see: Avg. Active Threads Per Warp = 32.
In the Memory Workload Analysis Chart section, under L1/TEX Cache:
Global Load, Sectors/Req = 2.5 (Original pattern)
Global Load, Sectors/Req = 1 (Pattern-1)
Global Load, Sectors/Req = 4 (Pattern-2)
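In case it matters, I’m collecting these numbers with roughly the following invocation, one run per build (written from memory, so the exact section identifiers may differ between ncu versions; mmult is just my test binary):

ncu --section MemoryWorkloadAnalysis --section WarpStateStats ./mmult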
But based on the Kirk/Hwu criterion, I would expect Pattern-2 to have the best coalesced access,
because col is the variable that isolates the threadIdx.x term.
That doesn’t seem to be what Nsight Compute is saying: it suggests Pattern-1 has the best memory access, if fewer sectors per request is better…
Am I overlooking something?
Thanks.